600 new models for over 75 new languages including Ancient, Dead and Extinct languages, 155 languages total covered, 400% Tokenizer Speedup, 18x USE-Embeddings GPU speedup in John Snow Labs NLU 3.4.4
We are very excited to announce that NLU 3.4.4 has been released with over 600 new models, over 75 new languages, 155 languages covered in total,
a 400% speedup for tokenizers, and an 18x speedup of UniversalSentenceEncoder on GPU.
On the general NLP side, we have transformer-based Embeddings and Token Classifiers powered by the state-of-the-art CamemBertEmbeddings and DeBertaForTokenClassification architectures, as well as various new models for Historical, Ancient, Dead, Extinct, Genetic, and Constructed languages such as Old Church Slavonic, Latin, Sanskrit, Esperanto, Volapük, Coptic, Nahuatl, Ancient Greek (to 1453), and Old Russian.
On the healthcare side, we have Portuguese De-identification models, NER models for gene detection, and an RxNorm sentence resolution model for mapping and extracting pharmaceutical actions (e.g. analgesic, hypoglycemic) as well as treatments (e.g. backache, diabetes).
For full release notes with all models, see here or here.
First-time language models covered
The languages of the models below are covered by NLU for the first time.
Number | Language Name(s) | NLU Reference | Spark NLP Reference | Task | Annotator Class | Scope | Language Type |
---|---|---|---|---|---|---|---|
0 | Sanskrit | sa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Ancient |
1 | Sanskrit | sa.lemma | lemma_vedic | Lemmatization | LemmatizerModel | Individual | Ancient |
2 | Sanskrit | sa.pos | pos_vedic | Part of Speech Tagging | PerceptronModel | Individual | Ancient |
3 | Sanskrit | sa.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Ancient |
4 | Volapük | vo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Constructed |
5 | Nahuatl languages | nah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Collective | Genetic |
6 | Aragonese | an.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
7 | Assamese | as.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
8 | Asturian, Asturleonese, Bable, Leonese | ast.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
9 | Bashkir | ba.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
10 | Bavarian | bar.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
11 | Bishnupriya | bpy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
12 | Burmese | my.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
13 | Cebuano | ceb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
14 | Central Bikol | bcl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
15 | Chechen | ce.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
16 | Chuvash | cv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
17 | Corsican | co.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
18 | Dhivehi, Divehi, Maldivian | dv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
19 | Egyptian Arabic | arz.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
20 | Emiliano-Romagnolo | eml.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
21 | Erzya | myv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
22 | Georgian | ka.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
23 | Goan Konkani | gom.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
24 | Javanese | jv.embed.distilbert | distilbert_embeddings_javanese_distilbert_small | Embeddings | DistilBertEmbeddings | Individual | Living |
25 | Javanese | jv.embed.javanese_distilbert_small_imdb | distilbert_embeddings_javanese_distilbert_small_imdb | Embeddings | DistilBertEmbeddings | Individual | Living |
26 | Javanese | jv.embed.javanese_roberta_small | roberta_embeddings_javanese_roberta_small | Embeddings | RoBertaEmbeddings | Individual | Living |
27 | Javanese | jv.embed.javanese_roberta_small_imdb | roberta_embeddings_javanese_roberta_small_imdb | Embeddings | RoBertaEmbeddings | Individual | Living |
28 | Javanese | jv.embed.javanese_bert_small_imdb | bert_embeddings_javanese_bert_small_imdb | Embeddings | BertEmbeddings | Individual | Living |
29 | Javanese | jv.embed.javanese_bert_small | bert_embeddings_javanese_bert_small | Embeddings | BertEmbeddings | Individual | Living |
30 | Kirghiz, Kyrgyz | ky.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
31 | Letzeburgesch, Luxembourgish | lb.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
32 | Letzeburgesch, Luxembourgish | lb.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
33 | Letzeburgesch, Luxembourgish | lb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
34 | Ligurian | lij.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
35 | Lombard | lmo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
36 | Low German, Low Saxon | nds.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
37 | Macedonian | mk.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
38 | Macedonian | mk.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
39 | Macedonian | mk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
40 | Maithili | mai.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
41 | Manx | gv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
42 | Mazanderani | mzn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
43 | Minangkabau | min.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
44 | Mingrelian | xmf.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
45 | Mirandese | mwl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
46 | Neapolitan | nap.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
47 | Nepal Bhasa, Newari | new.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
48 | Northern Frisian | frr.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
49 | Northern Sami | sme.lemma | lemma_giella | Lemmatization | LemmatizerModel | Individual | Living |
50 | Northern Sami | sme.pos | pos_giella | Part of Speech Tagging | PerceptronModel | Individual | Living |
51 | Northern Sotho, Pedi, Sepedi | nso.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
52 | Occitan (post 1500) | oc.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
53 | Ossetian, Ossetic | os.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
54 | Pfaelzisch | pfl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
55 | Piemontese | pms.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
56 | Romansh | rm.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
57 | Scots | sco.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
58 | Sicilian | scn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
59 | Sinhala, Sinhalese | si.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
60 | Sinhala, Sinhalese | si.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
61 | Sundanese | su.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
62 | Sundanese | su.embed.sundanese_roberta_base | roberta_embeddings_sundanese_roberta_base | Embeddings | RoBertaEmbeddings | Individual | Living |
63 | Tagalog | tl.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
64 | Tagalog | tl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
65 | Tagalog | tl.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
66 | Tagalog | tl.embed.roberta_tagalog_large | roberta_embeddings_roberta_tagalog_large | Embeddings | RoBertaEmbeddings | Individual | Living |
67 | Tagalog | tl.embed.roberta_tagalog_base | roberta_embeddings_roberta_tagalog_base | Embeddings | RoBertaEmbeddings | Individual | Living |
68 | Tajik | tg.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
69 | Tatar | tt.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
70 | Tatar | tt.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
71 | Tigrinya | ti.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
72 | Tosk Albanian | als.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
73 | Tswana | tn.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
74 | Turkmen | tk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
75 | Upper Sorbian | hsb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
76 | Venetian | vec.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
77 | Vlaams | vls.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
78 | Walloon | wa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
79 | Waray (Philippines) | war.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
80 | Western Armenian | hyw.pos | pos_armtdp | Part of Speech Tagging | PerceptronModel | Individual | Living |
81 | Western Armenian | hyw.lemma | lemma_armtdp | Lemmatization | LemmatizerModel | Individual | Living |
82 | Western Frisian | fy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
83 | Western Panjabi | pnb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
84 | Yakut | sah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
85 | Zeeuws | zea.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
86 | Albanian | sq.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Macrolanguage | Living |
87 | Albanian | sq.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
88 | Azerbaijani | az.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
89 | Azerbaijani | az.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Macrolanguage | Living |
90 | Malagasy | mg.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
91 | Malay (macrolanguage) | ms.embed.albert | albert_embeddings_albert_large_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
92 | Malay (macrolanguage) | ms.embed.distilbert | distilbert_embeddings_malaysian_distilbert_small | Embeddings | DistilBertEmbeddings | Macrolanguage | Living |
93 | Malay (macrolanguage) | ms.embed.albert_tiny_bahasa_casedl | albert_embeddings_albert_tiny_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
94 | Malay (macrolanguage) | ms.embed.albert_base_bahasa_cased | albert_embeddings_albert_base_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
95 | Malay (macrolanguage) | ms.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
96 | Mongolian | mn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
97 | Oriya (macrolanguage) | or.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
98 | Pashto, Pushto | ps.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
99 | Quechua | qu.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
100 | Sardinian | sc.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
101 | Serbo-Croatian | sh.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
102 | Uzbek | uz.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
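
Each NLU reference in the table starts with a language code, followed by the task and model name, for example `sa.lemma` or `jv.embed.javanese_bert_small`. The following is a minimal sketch of how such a reference might be used, assuming the `nlu` and `pyspark` packages are installed and the model can be downloaded on first use. The helper names `split_reference` and `predict_with` are hypothetical and not part of NLU; only `nlu.load(...).predict(...)` is the library's own entry point.

```python
def split_reference(nlu_ref: str) -> tuple:
    """Split an NLU reference into (language code, task/model part)."""
    lang, _, rest = nlu_ref.partition(".")
    return lang, rest


def predict_with(nlu_ref: str, text: str):
    """Load the model behind an NLU reference and run it on one text.

    The import is done lazily so this sketch can be read and imported
    without NLU installed (assumption: `pip install nlu pyspark`).
    """
    import nlu

    return nlu.load(nlu_ref).predict(text)


# The language prefix of the Sanskrit lemmatizer reference
print(split_reference("sa.lemma"))

# Uncomment to download and run the Sanskrit lemmatizer (needs Spark):
# result = predict_with("sa.lemma", "your Sanskrit text here")
# print(result)
```

Any other NLU reference from the table can be passed to `predict_with` in the same way.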
All Healthcare models
Powered by the amazing Spark NLP for Healthcare 3.5.2 and Spark NLP for Healthcare 3.5.1 releases.
Number | NLU Reference | Spark NLP Reference | Task | Language Name(s) | Annotator Class | Language Type | Scope |
---|---|---|---|---|---|---|---|
0 | en.med_ner.biomedical_bc2gm | ner_biomedical_bc2gm | Named Entity Recognition | English | MedicalNerModel | Living | Individual |
1 | en.resolve.rxnorm_action_treatment | sbiobertresolve_rxnorm_action_treatment | Entity Resolution | English | SentenceEntityResolverModel | Living | Individual |
2 | en.classify.token_bert.ner_ade | bert_token_classifier_ner_ade | Named Entity Recognition | English | MedicalBertForTokenClassifier | Living | Individual |
3 | pt.med_ner.deid.subentity | ner_deid_subentity | De-identification | Portuguese | MedicalNerModel | Living | Individual |
4 | pt.med_ner.deid.generic | ner_deid_generic | De-identification | Portuguese | MedicalNerModel | Living | Individual |
5 | pt.med_ner.deid | ner_deid_generic | De-identification | Portuguese | MedicalNerModel | Living | Individual |
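
In the table above, more than one NLU reference can point to the same Spark NLP model: `pt.med_ner.deid.generic` and `pt.med_ner.deid` both resolve to `ner_deid_generic`. The sketch below groups the references by model to make such aliases visible. The pairs are copied from the table; the function name `aliases_by_model` is hypothetical and is not part of NLU.

```python
from collections import defaultdict

# (NLU reference, Spark NLP reference) pairs copied from the table above
HEALTHCARE_MODELS = [
    ("en.med_ner.biomedical_bc2gm", "ner_biomedical_bc2gm"),
    ("en.resolve.rxnorm_action_treatment", "sbiobertresolve_rxnorm_action_treatment"),
    ("en.classify.token_bert.ner_ade", "bert_token_classifier_ner_ade"),
    ("pt.med_ner.deid.subentity", "ner_deid_subentity"),
    ("pt.med_ner.deid.generic", "ner_deid_generic"),
    ("pt.med_ner.deid", "ner_deid_generic"),
]


def aliases_by_model(pairs):
    """Map each Spark NLP reference to the NLU references that load it."""
    grouped = defaultdict(list)
    for nlu_ref, spark_ref in pairs:
        # Skip repeated rows so each alias is listed once
        if nlu_ref not in grouped[spark_ref]:
            grouped[spark_ref].append(nlu_ref)
    return dict(grouped)


aliases = aliases_by_model(HEALTHCARE_MODELS)
print(aliases["ner_deid_generic"])
# → ['pt.med_ner.deid.generic', 'pt.med_ner.deid']
```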
See next comment for more details