720+ new NLP models, 300+ supported languages, translation, summarization, question answering and more with T5 and Marian models! - John Snow Labs NLU 1.1.0
NLU 1.1.0 Release Notes
We are incredibly excited to release NLU 1.1.0!
This release integrates the 720+ new models from the latest Spark NLP 2.7.0+ releases.
You can now achieve state-of-the-art results with Sequence2Sequence transformers on problems like text summarization, question answering, and translation between 192+ languages, and extract named entities in right-to-left written languages like Arabic, Persian, and Urdu, as well as in languages that require segmentation like Korean, Japanese, and Chinese, all in 1 line of code!
These new features are possible thanks to the integration of Google's T5 and Microsoft's Marian transformer models.
NLU 1.1.0 has over 720+ new pretrained models and pipelines while extending the support of multi-lingual models to 192+ languages such as Chinese, Japanese, Korean, Arabic, Persian, Urdu, and Hebrew.
NLU 1.1.0 New Features
- 720+ new models: you can find an overview of all NLU models here and further documentation in the models hub
- NEW: Introducing MarianTransformer annotator for machine translation based on MarianNMT models. Marian is an efficient, free Neural Machine Translation framework mainly being developed by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages)
- NEW: Introducing T5Transformer annotator for Text-To-Text Transfer Transformer (Google T5) models to achieve state-of-the-art results on multiple NLP tasks such as Translation, Summarization, Question Answering, Sentence Similarity, and so on
- NEW: Introducing brand new and refactored language detection and identification models. The new LanguageDetectorDL is faster, more accurate, and supports up to 375 languages (see the sketch after this list)
- NEW: Introducing WordSegmenter model for word segmentation of languages without any rule-based tokenization such as Chinese, Japanese, or Korean
- NEW: Introducing the DocumentNormalizer component for cleaning content from HTML or XML documents, applying either data cleansing with an arbitrary number of custom regular expressions or data extraction, depending on its parameters
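As a quick taste of the new language detection models, here is a minimal sketch; the 'lang' load reference is our assumption for the default LanguageDetectorDL model, not a verbatim excerpt from the docs.
import nlu
# Load the default language identification pipeline (reference name assumed)
lang_pipe = nlu.load('lang')
# Returns one detected language label per input document
lang_pipe.predict(['NLU is an easy to use library', 'Hallo Welt, wie geht es dir?'])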
NLU 1.1.0 New Notebooks for new features
- Translate between 192+ languages with marian
- Try out the 18 tasks like summarization, question answering, and more on T5
- Tokenize, extract POS and NER in Chinese
- Tokenize, extract POS and NER in Korean
- Tokenize, extract POS and NER in Japanese
- Normalize documents
- Aspect-based NER and sentiment classification for restaurants
NLU 1.1.0 New Classifier Training Tutorials
Binary Classifier training Jupyter tutorials
- 2 class Finance News sentiment classifier training
- 2 class Reddit comment sentiment classifier training
- 2 class Apple Tweets sentiment classifier training
- 2 class IMDB Movie sentiment classifier training
- 2 class Twitter classifier training
Multi Class text Classifier training Jupyter tutorials
- 5 class WineEnthusiast Wine review classifier training
- 3 class Amazon Phone review classifier training
- 5 class Amazon Musical Instruments review classifier training
- 5 class Tripadvisor Hotel review classifier training
- 5 class Phone review classifier training
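All of these tutorials follow the same high-level NLU training workflow. The sketch below illustrates it; the 'train.sentiment' reference and the text/y column schema are assumptions based on the tutorials rather than a verbatim excerpt.
import nlu
import pandas as pd
# Minimal training DataFrame: a 'text' column and a 'y' label column (assumed schema)
train_df = pd.DataFrame({
    'text': ['I love this product', 'This is terrible'],
    'y':    ['positive', 'negative']
})
# Load a trainable sentiment classifier and fit it on the DataFrame
fitted_pipe = nlu.load('train.sentiment').fit(train_df)
# Predict with the freshly trained model
fitted_pipe.predict(['I really enjoyed the movie'])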
NLU 1.1.0 New Medium Tutorials
- 1 line to Glove Word Embeddings with NLU with t-SNE plots
- 1 line to Xlnet Word Embeddings with NLU with t-SNE plots
- 1 line to AlBERT Word Embeddings with NLU with t-SNE plots
- 1 line to CovidBERT Word Embeddings with NLU with t-SNE plots
- 1 line to Electra Word Embeddings with NLU with t-SNE plots
- 1 line to BioBERT Word Embeddings with NLU with t-SNE plots
Translation
Translation example
You can translate between more than 192 language pairs with the Marian models.
You need to specify the language your data is in as start_language and the language you want to translate to as target_language. The language references must be ISO language codes.
nlu.load('<start_language>.translate_to.<target_language>')
Translate Turkish to English:
nlu.load('tr.translate_to.en')
Translate English to French:
nlu.load('en.translate_to.fr')
Translate French to Hebrew:
nlu.load('fr.translate_to.he')
Translate English to Chinese:
nlu.load('en.translate_to.zh')
Translate English to Korean:
nlu.load('en.translate_to.ko')
Translate English to Japanese:
nlu.load('en.translate_to.ja')
Translate English to Urdu:
nlu.load('en.translate_to.ur')
import nlu
translate_pipe = nlu.load('en.translate_to.de')
df = translate_pipe.predict('Billy likes to go to the mall every sunday')
df
sentence | translation |
---|---|
Billy likes to go to the mall every sunday | Billy geht gerne jeden Sonntag ins Einkaufszentrum |
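Since the load reference follows the <start_language>.translate_to.<target_language> pattern, a pipeline for any supported pair can be built programmatically. A minimal sketch; the Turkish sample text is hypothetical:
import nlu
start_language, target_language = 'tr', 'en'
# Build the load reference for any supported language pair
translate_pipe = nlu.load(f'{start_language}.translate_to.{target_language}')
# 'Merhaba dünya' is hypothetical Turkish sample text ('Hello world')
translate_pipe.predict(['Merhaba dünya'])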
T5
Overview of every task available with T5
The T5 model is trained on various datasets for 18 different tasks which fall into 8 categories.
- Text summarization
- Question answering
- Translation
- Sentiment analysis
- Natural Language inference
- Coreference resolution
- Sentence Completion
- Word sense disambiguation
Every T5 Task with explanation:
Task Name | Explanation |
---|---|
1.CoLA | Classify if a sentence is grammatically correct |
2.RTE | Classify whether a statement can be deduced from a sentence |
3.MNLI | Classify for a hypothesis and premise whether they contradict each other, entail each other, or neither (3 classes). |
4.MRPC | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent) |
5.QNLI | Classify whether the answer to a question can be deduced from an answer candidate. |
6.QQP | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent) |
7.SST2 | Classify the sentiment of a sentence as positive or negative |
8.STSB | Classify the similarity of two sentences on a scale from 0 to 5 (21 similarity classes) |
9.CB | Classify for a premise and a hypothesis whether they contradict each other or not (binary). |
10.COPA | Classify for a question, premise, and 2 choices which choice is the correct one (binary). |
11.MultiRc | Classify for a question, a paragraph of text, and an answer candidate, if the answer is correct (binary). |
12.WiC | Classify for a pair of sentences and an ambiguous word whether the word has the same meaning in both sentences. |
13.WSC/DPR | Predict for an ambiguous pronoun in a sentence what it is referring to. |
14.Summarization | Summarize text into a shorter representation. |
15.SQuAD | Answer a question for a given context. |
16.WMT1. | Translate English to German |
17.WMT2. | Translate English to French |
18.WMT3. | Translate English to Romanian |
Refer to this notebook to see how to use every T5 task.
Question Answering
Predict an answer to a question based on input context.
This is based on SQuAD - Context based question answering
Predicted Answer | Question | Context |
---|---|---|
carbon monoxide | What does increased oxygen concentrations in the patient’s lungs displace? | Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment. |
pie | What did Joey eat for breakfast? | Once upon a time, there was a squirrel named Joey. Joey loved to go outside and play with his cousin Jimmy. Joey and Jimmy played silly games together, and were always laughing. One day, Joey and Jimmy went swimming together at their Aunt Julie’s pond. Joey woke up early in the morning to eat some food before they left. Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast. After he ate, he and Jimmy went to the pond. On their way there they saw their friend Jack Rabbit. They dove into the water and swam for several hours. The sun was out, but the breeze was cold. Joey and Jimmy got out of the water and started walking home. Their fur was wet, and the breeze chilled them. When they got home, they dried off, and Jimmy put on his favorite purple shirt. Joey put on a blue shirt with red and green dots. The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed. |
# Load T5 and set the question answering task
t5 = nlu.load('en.t5.base')
t5['t5'].setTask('question:')
# define Data, add additional tags between sentences
data = ['''
What does increased oxygen concentrations in the patient’s lungs displace?
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
''']
#Predict on text data with T5
t5.predict(data)
How to configure the T5 task parameter for SQuAD context-based question answering and pre-process data
Set the task with .setTask('question:') and prefix the context, which can be made up of multiple sentences, with context:
Example pre-processed input for T5 Squad Context based question answering:
question: What does increased oxygen concentrations in the patient’s lungs displace?
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
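The pre-processing above can also be scripted. The helper below is a hypothetical sketch, not part of the NLU API: setTask() supplies the 'question:' prefix, so the input only needs the raw question followed by the context: tag.
import nlu
# Hypothetical helper: build the SQuAD input format shown above
def format_squad_input(question, context):
    return f'{question} context: {context}'

t5 = nlu.load('en.t5.base')
t5['t5'].setTask('question:')
t5.predict([format_squad_input(
    'What does increased oxygen concentrations in the patient’s lungs displace?',
    'Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin.')])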
Text Summarization
Summarizes a paragraph into a shorter version with the same semantic meaning, based on Text summarization.
# Load the pretrained summarization pipeline
pipe = nlu.load('summarize')
# define Data, add additional tags between sentences
data = [
'''
The belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .
''',
''' Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.'''
]
#Predict on text data with T5
pipe.predict(data)
Predicted summary | Text |
---|---|
manchester united face newcastle in the premier league on wednesday . louis van gaal's side currently sit two points clear of liverpool in fourth . the belgian duo took to the dance floor on monday night with some friends . | the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . |
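predict() also works on pandas DataFrames, which is handy for summarizing a whole column of articles. A minimal sketch; the 'text' column name is our assumption for the default input column:
import nlu
import pandas as pd
# Summarize a whole DataFrame column of articles (column name 'text' assumed)
articles = pd.DataFrame({'text': [
    'manchester united face newcastle in the premier league on wednesday .',
    'Calculus is the mathematical study of continuous change.']})
nlu.load('summarize').predict(articles)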
Binary Sentence similarity/ Paraphrasing
Binary sentence similarity example
Classify whether one sentence is a re-phrasing of, or similar to, another sentence.
This is a sub-task of GLUE and is based on MRPC - Binary Paraphrasing / sentence similarity classification
t5 = nlu.load('en.t5.base')
# Set the task on T5
t5['t5'].setTask('mrpc ')
# define Data, add additional tags between sentences
data = [
''' sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said .
sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 "
'''
,
'''
sentence1: I like to eat peanutbutter for breakfast
sentence2: I like to play football.
'''
]
#Predict on text data with T5
t5.predict(data)
Sentence1 | Sentence2 | prediction |
---|---|---|
We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . | Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . | equivalent |
I like to eat peanutbutter for breakfast | I like to play football | not_equivalent |
How to configure T5 task for MRPC and pre-process text
Set the task with .setTask('mrpc sentence1:') and prefix the second sentence with sentence2:
Example pre-processed input for T5 MRPC - Binary Paraphrasing/ sentence similarity
mrpc
sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said .
sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 "
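The same formatting can be scripted. A hypothetical sketch; the helper name is our own and not part of the NLU API:
import nlu
t5 = nlu.load('en.t5.base')
t5['t5'].setTask('mrpc ')
# Hypothetical helper: wrap two sentences in the MRPC input format shown above
def mrpc_input(sentence1, sentence2):
    return f'sentence1: {sentence1} sentence2: {sentence2}'
t5.predict([mrpc_input('I like to eat peanutbutter for breakfast',
                       'I like to play football.')])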
Regressive Sentence similarity/ Paraphrasing
Measures how similar two sentences are on a scale from 0 to 5, with 21 classes representing a regressive label.
This is a sub-task of GLUE and is based on STSB - Regressive semantic sentence similarity.
t5 = nlu.load('en.t5.base')
# Set the task on T5
t5['t5'].setTask('stsb ')
# define Data, add additional tags between sentences
data = [
''' sentence1: What attributes would have made you highly desirable in ancient Rome?
sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?
'''
,
'''
sentence1: What was it like in Ancient rome?
sentence2: What was Ancient rome like?
''',
'''
sentence1: What was live like as a King in Ancient Rome??
sentence2: What was Ancient rome like?
'''
]
#Predict on text data with T5
t5.predict(data)
Question1 | Question2 | prediction |
---|---|---|
What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | 0 |
What was it like in Ancient rome? | What was Ancient rome like? | 5.0 |
What was live like as a King in Ancient Rome?? | What is it like to live in Rome? | 3.2 |
How to configure T5 task for stsb and pre-process text
Set the task with .setTask('stsb sentence1:') and prefix the second sentence with sentence2:
Example pre-processed input for T5 STSB - Regressive semantic sentence similarity
stsb
sentence1: What attributes would have made you highly desirable in ancient Rome?
sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?
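Because T5 returns the similarity score as text such as '3.2', it may need casting to float before numeric use. A minimal sketch continuing the example above; the prediction column is located positionally here, which is an assumption, so inspect the returned columns on your NLU version:
# Continues the STSB example above; t5 and data are defined there
preds = t5.predict(data)
# Cast the textual score to float (last column taken as an assumption)
scores = preds.iloc[:, -1].astype(float)
similar_pairs = scores >= 2.5  # illustrative threshold, chosen arbitrarily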
Grammar Checking
Grammar checking with T5 example
Judges if a sentence is grammatically acceptable.
Based on CoLA - Binary Grammatical Sentence acceptability classification
pipe = nlu.load('grammar_correctness')
# Set the task on T5
pipe['t5'].setTask('cola sentence: ')
# define Data
data = ['Anna and Mike is going skiing and they is liked is','Anna and Mike like to dance']
#Predict on text data with T5
pipe.predict(data)
sentence | prediction |
---|---|
Anna and Mike is going skiing and they is liked is | unacceptable |
Anna and Mike like to dance | acceptable |
Document Normalization
Document Normalizer example
The DocumentNormalizer extracts content from HTML or XML documents, applying either data cleansing with an arbitrary number of custom regular expressions or data extraction, depending on its parameters
pipe = nlu.load('norm_document')
data = '<!DOCTYPE html> <html> <head> <title>Example</title> </head> <body> <p>This is an example of a simple HTML page with one paragraph.</p> </body> </html>'
df = pipe.predict(data,output_level='document')
df
text | normalized_text |
---|---|
<!DOCTYPE html> <html> <head> <title>Example</title> </head> <body> <p>This is an example of a simple HTML page with one paragraph.</p> </body> </html> | Example This is an example of a simple HTML page with one paragraph. |
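The normalizer's cleaning behaviour is driven by configurable parameters such as its regex patterns. A sketch for finding them, assuming print_info() is available on the loaded pipe as in other NLU examples:
import nlu
pipe = nlu.load('norm_document')
# List every pipe component and its configurable parameters (assumed helper),
# e.g. to locate the normalizer's regex / action parameters for tuning
pipe.print_info()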
Word Segmenter
Word Segmenter Example
The WordSegmenter segments text in languages without rule-based tokenization, such as Chinese, Japanese, or Korean
pipe = nlu.load('ja.segment_words')
# Japanese for 'Donald Trump and Angela Merkel don't share many opinions'
ja_data = ['ドナルド・トランプとアンゲラ・メルケルは多くの意見を共有していません']
df = pipe.predict(ja_data, output_level='token')
df
token |
---|
ドナルド |
・ |
トランプ |
と |
アンゲラ |
・ |
メルケル |
は |
多く |
の |
意見 |
を |
共有 |
し |
て |
い |
ませ |
ん |
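The notebooks listed earlier suggest analogous segmentation references exist for Chinese and Korean. A sketch under that assumption, with hypothetical sample texts:
import nlu
# Chinese segmentation, assuming a reference analogous to 'ja.segment_words'
zh_pipe = nlu.load('zh.segment_words')
zh_pipe.predict(['今天我很高兴'], output_level='token')  # hypothetical sample: 'I am very happy today'
# Korean segmentation, under the same assumption
ko_pipe = nlu.load('ko.segment_words')
ko_pipe.predict(['안녕하세요'], output_level='token')  # hypothetical sample: 'hello'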
Installation
# PyPi
!pip install nlu pyspark==2.4.7
# Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu
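After installing, a one-liner can verify the setup; a minimal smoke test using the sentiment reference from the tutorials above:
import nlu
# Downloads a small pretrained model and runs one prediction
nlu.load('sentiment').predict('NLU 1.1.0 is a great release!')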