Natural Language Processing Use Cases

Asela Rajapakse edited this page Apr 23, 2018 · 3 revisions

Employing the Stanford Parser and NLTK

Introduction

The GEF ships with at least two demo services: the Stanford syntactic parser and the NLTK part-of-speech tagger. They demonstrate how the GEF can be applied to Natural Language Processing (NLP) tasks. They also illustrate how scientific workflows are constructed within the GEF, including the creation of service images, metadata formatting, and job execution.

Stanford Parser

The Stanford Parser is open source software that produces the grammatical structure of a sentence. Such parsers are widely used in NLP, for example in question answering, text classification, and data linking. The GEF service based on the Stanford Parser does not address any particular problem; it is meant to show how a scientific application can be packaged as a Docker image suitable for the GEF. As input it takes a text file (with the *.txt extension) containing English text, since this particular parser instance includes only an English model. The parser adds a syntactic annotation to the text and prints the results directly to the terminal window. The results are not saved to a file, so the output volume remains empty.

NLTK Part-Of-Speech Tagger

The NLTK Part-Of-Speech Tagger is another example of a containerised open source tool. NLTK is considered one of the most popular NLP libraries, especially in education. It is essentially a collection of parsers, tokenizers, stemmers, corpora, and similar resources. The NLTK-based GEF service uses only a part-of-speech tagger for English (the averaged perceptron tagger) to augment an input text with part-of-speech information, adding one POS tag per token. The service accepts two inputs: a text file with the *.txt extension and a string containing some text. The results are saved in a CSV table that can be downloaded from an output volume, and status information is printed in the terminal window.
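The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual execution script. The function names, the CSV column names, and the way the two inputs are combined (the string is appended to the file's text) are assumptions made for the example. It also assumes NLTK is installed and its tokenizer and tagger data have already been downloaded.

```python
import csv
from pathlib import Path
from typing import Callable, List, Tuple

TaggedToken = Tuple[str, str]


def nltk_tagger(text: str) -> List[TaggedToken]:
    """Tag English text with NLTK's default tagger (hypothetical helper).

    Assumes NLTK and its tokenizer/tagger data are available.
    """
    import nltk  # imported lazily so the rest of the sketch works without NLTK

    return nltk.pos_tag(nltk.word_tokenize(text))


def tag_to_csv(
    file_path: str,
    extra_text: str,
    out_path: str,
    tagger: Callable[[str], List[TaggedToken]] = nltk_tagger,
) -> int:
    """Tag a text file plus an extra string and write one row per token.

    Combining the two inputs by concatenation is an assumption of this sketch.
    Returns the number of tagged tokens.
    """
    text = Path(file_path).read_text(encoding="utf-8")
    if extra_text:
        text = text + "\n" + extra_text

    tagged = tagger(text)

    # One CSV row per token: the token itself and its POS tag.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token", "pos_tag"])  # column names are illustrative
        writer.writerows(tagged)

    # Status information goes to the terminal, results go to the file.
    print(f"Tagged {len(tagged)} tokens; results written to {out_path}")
    return len(tagged)
```

Passing the tagger as a parameter keeps the file handling separate from the NLP step, so the CSV logic can be exercised with a stand-in tagger when NLTK's models are not available.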

Dockerfiles and Execution Scripts

The Dockerfile for the Stanford Parser example can be found at https://github.com/EUDAT-GEF/GEF/blob/master/services/stanford-parser/Dockerfile. It requires no separate execution script because the Dockerfile downloads the script into the GEF service at build time. The Dockerfile for the NLTK example, found at https://github.com/EUDAT-GEF/GEF/blob/master/services/nltk/Dockerfile, does need a separate execution script. This script can be found at https://github.com/EUDAT-GEF/GEF/blob/master/services/nltk/posTagger.py.