600 new models for over 75 new languages including Ancient, Dead and Extinct languages, 155 languages total covered, 400% Tokenizer Speedup, 18x USE-Embeddings GPU speedup in John Snow Labs NLU 3.4.4

@C-K-Loan C-K-Loan released this 20 May 13:01
74cf002

We are very excited to announce that NLU 3.4.4 has been released with over 600 new models, over 75 new languages, 155 languages covered in total,
a 400% speedup for tokenizers, and an 18x speedup of UniversalSentenceEncoder on GPU.

On the general NLP side, we have transformer-based Embeddings and Token Classifiers powered by state-of-the-art CamemBertEmbeddings and DeBertaForTokenClassification
architectures, as well as various new models for
Historical, Ancient, Dead, Extinct, Genetic, and Constructed languages like
Old Church Slavonic, Latin, Sanskrit, Esperanto, Volapük, Coptic, Nahuatl, Ancient Greek (to 1453), and Old Russian.
On the healthcare side, we have Portuguese De-identification models, NER models for gene detection, and an RxNorm sentence resolution model for mapping and extracting pharmaceutical actions (e.g. analgesic, hypoglycemic)
as well as treatments (e.g. backache, diabetes).

For full release notes with all models, see here or here.

First-time language models covered

The languages for these models are covered for the very first time ever by NLU.

| Number | Language Name(s) | NLU Reference | Spark NLP Reference | Task | Annotator Class | Scope | Language Type |
|---|---|---|---|---|---|---|---|
| 0 | Sanskrit | sa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Ancient |
| 1 | Sanskrit | sa.lemma | lemma_vedic | Lemmatization | LemmatizerModel | Individual | Ancient |
| 2 | Sanskrit | sa.pos | pos_vedic | Part of Speech Tagging | PerceptronModel | Individual | Ancient |
| 3 | Sanskrit | sa.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Ancient |
| 4 | Volapük | vo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Constructed |
| 5 | Nahuatl languages | nah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Collective | Genetic |
| 6 | Aragonese | an.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 7 | Assamese | as.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 8 | Asturian, Asturleonese, Bable, Leonese | ast.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 9 | Bashkir | ba.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 10 | Bavarian | bar.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 11 | Bishnupriya | bpy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 12 | Burmese | my.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 13 | Cebuano | ceb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 14 | Central Bikol | bcl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 15 | Chechen | ce.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 16 | Chuvash | cv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 17 | Corsican | co.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 18 | Dhivehi, Divehi, Maldivian | dv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 19 | Egyptian Arabic | arz.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 20 | Emiliano-Romagnolo | eml.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 21 | Erzya | myv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 22 | Georgian | ka.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 23 | Goan Konkani | gom.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 24 | Javanese | jv.embed.distilbert | distilbert_embeddings_javanese_distilbert_small | Embeddings | DistilBertEmbeddings | Individual | Living |
| 25 | Javanese | jv.embed.javanese_distilbert_small_imdb | distilbert_embeddings_javanese_distilbert_small_imdb | Embeddings | DistilBertEmbeddings | Individual | Living |
| 26 | Javanese | jv.embed.javanese_roberta_small | roberta_embeddings_javanese_roberta_small | Embeddings | RoBertaEmbeddings | Individual | Living |
| 27 | Javanese | jv.embed.javanese_roberta_small_imdb | roberta_embeddings_javanese_roberta_small_imdb | Embeddings | RoBertaEmbeddings | Individual | Living |
| 28 | Javanese | jv.embed.javanese_bert_small_imdb | bert_embeddings_javanese_bert_small_imdb | Embeddings | BertEmbeddings | Individual | Living |
| 29 | Javanese | jv.embed.javanese_bert_small | bert_embeddings_javanese_bert_small | Embeddings | BertEmbeddings | Individual | Living |
| 30 | Kirghiz, Kyrgyz | ky.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 31 | Letzeburgesch, Luxembourgish | lb.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 32 | Letzeburgesch, Luxembourgish | lb.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
| 33 | Letzeburgesch, Luxembourgish | lb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 34 | Ligurian | lij.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 35 | Lombard | lmo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 36 | Low German, Low Saxon | nds.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 37 | Macedonian | mk.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 38 | Macedonian | mk.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
| 39 | Macedonian | mk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 40 | Maithili | mai.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 41 | Manx | gv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 42 | Mazanderani | mzn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 43 | Minangkabau | min.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 44 | Mingrelian | xmf.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 45 | Mirandese | mwl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 46 | Neapolitan | nap.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 47 | Nepal Bhasa, Newari | new.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 48 | Northern Frisian | frr.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 49 | Northern Sami | sme.lemma | lemma_giella | Lemmatization | LemmatizerModel | Individual | Living |
| 50 | Northern Sami | sme.pos | pos_giella | Part of Speech Tagging | PerceptronModel | Individual | Living |
| 51 | Northern Sotho, Pedi, Sepedi | nso.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 52 | Occitan (post 1500) | oc.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 53 | Ossetian, Ossetic | os.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 54 | Pfaelzisch | pfl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 55 | Piemontese | pms.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 56 | Romansh | rm.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 57 | Scots | sco.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 58 | Sicilian | scn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 59 | Sinhala, Sinhalese | si.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 60 | Sinhala, Sinhalese | si.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 61 | Sundanese | su.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 62 | Sundanese | su.embed.sundanese_roberta_base | roberta_embeddings_sundanese_roberta_base | Embeddings | RoBertaEmbeddings | Individual | Living |
| 63 | Tagalog | tl.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | Individual | Living |
| 64 | Tagalog | tl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 65 | Tagalog | tl.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 66 | Tagalog | tl.embed.roberta_tagalog_large | roberta_embeddings_roberta_tagalog_large | Embeddings | RoBertaEmbeddings | Individual | Living |
| 67 | Tagalog | tl.embed.roberta_tagalog_base | roberta_embeddings_roberta_tagalog_base | Embeddings | RoBertaEmbeddings | Individual | Living |
| 68 | Tajik | tg.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 69 | Tatar | tt.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 70 | Tatar | tt.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 71 | Tigrinya | ti.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 72 | Tosk Albanian | als.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 73 | Tswana | tn.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Individual | Living |
| 74 | Turkmen | tk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 75 | Upper Sorbian | hsb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 76 | Venetian | vec.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 77 | Vlaams | vls.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 78 | Walloon | wa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 79 | Waray (Philippines) | war.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 80 | Western Armenian | hyw.pos | pos_armtdp | Part of Speech Tagging | PerceptronModel | Individual | Living |
| 81 | Western Armenian | hyw.lemma | lemma_armtdp | Lemmatization | LemmatizerModel | Individual | Living |
| 82 | Western Frisian | fy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 83 | Western Panjabi | pnb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 84 | Yakut | sah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 85 | Zeeuws | zea.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Individual | Living |
| 86 | Albanian | sq.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Macrolanguage | Living |
| 87 | Albanian | sq.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 88 | Azerbaijani | az.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 89 | Azerbaijani | az.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | Macrolanguage | Living |
| 90 | Malagasy | mg.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 91 | Malay (macrolanguage) | ms.embed.albert | albert_embeddings_albert_large_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
| 92 | Malay (macrolanguage) | ms.embed.distilbert | distilbert_embeddings_malaysian_distilbert_small | Embeddings | DistilBertEmbeddings | Macrolanguage | Living |
| 93 | Malay (macrolanguage) | ms.embed.albert_tiny_bahasa_casedl | albert_embeddings_albert_tiny_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
| 94 | Malay (macrolanguage) | ms.embed.albert_base_bahasa_cased | albert_embeddings_albert_base_bahasa_cased | Embeddings | AlbertEmbeddings | Macrolanguage | Living |
| 95 | Malay (macrolanguage) | ms.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 96 | Mongolian | mn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 97 | Oriya (macrolanguage) | or.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 98 | Pashto, Pushto | ps.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 99 | Quechua | qu.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 100 | Sardinian | sc.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 101 | Serbo-Croatian | sh.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
| 102 | Uzbek | uz.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | Macrolanguage | Living |
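
Each entry in the NLU Reference column above is the string passed to nlu.load(). The following is a minimal sketch of how one of these references might be used, assuming the nlu and pyspark packages are installed and the model can be downloaded on first use. The helper names language_of and run_model are hypothetical and exist only for this illustration. The language-prefix convention is read off the references listed in this release and may not hold for every NLU reference.

```python
def language_of(nlu_reference: str) -> str:
    """Return the leading language code of an NLU reference.

    Assumption: the reference starts with a language code followed by a dot,
    as every reference listed in these release notes does (e.g. 'sa.lemma').
    """
    return nlu_reference.split(".", 1)[0]


def run_model(nlu_reference: str, text: str):
    """Load a model by its NLU reference and run it on a single text.

    Hypothetical convenience wrapper. It needs `pip install nlu pyspark`
    and network access to fetch the model the first time it is loaded.
    """
    # Imported lazily so language_of() can be used without Spark installed.
    import nlu

    pipeline = nlu.load(nlu_reference)  # e.g. 'sa.lemma' or 'tl.stopwords'
    return pipeline.predict(text)       # returns a pandas DataFrame by default


# Example call (not executed here, since it downloads a model):
# df = run_model("la.lemma", "Gallia est omnis divisa in partes tres")
```

Keeping the nlu import inside run_model is a design choice for the sketch only; in a normal script, importing nlu at the top of the file is the usual approach.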

All Healthcare models

Powered by the amazing Spark NLP for Healthcare 3.5.2 and Spark NLP for Healthcare 3.5.1 releases.

| Number | NLU Reference | Spark NLP Reference | Task | Language Name(s) | Annotator Class | Language Type | Scope |
|---|---|---|---|---|---|---|---|
| 0 | en.med_ner.biomedical_bc2gm | ner_biomedical_bc2gm | Named Entity Recognition | English | MedicalNerModel | Living | Individual |
| 1 | en.med_ner.biomedical_bc2gm | ner_biomedical_bc2gm | Named Entity Recognition | English | MedicalNerModel | Living | Individual |
| 2 | en.resolve.rxnorm_action_treatment | sbiobertresolve_rxnorm_action_treatment | Entity Resolution | English | SentenceEntityResolverModel | Living | Individual |
| 3 | en.classify.token_bert.ner_ade | bert_token_classifier_ner_ade | Named Entity Recognition | English | MedicalBertForTokenClassifier | Living | Individual |
| 4 | en.classify.token_bert.ner_ade | bert_token_classifier_ner_ade | Named Entity Recognition | English | MedicalBertForTokenClassifier | Living | Individual |
| 5 | pt.med_ner.deid.subentity | ner_deid_subentity | De-identification | Portuguese | MedicalNerModel | Living | Individual |
| 6 | pt.med_ner.deid.generic | ner_deid_generic | De-identification | Portuguese | MedicalNerModel | Living | Individual |
| 7 | pt.med_ner.deid | ner_deid_generic | De-identification | Portuguese | MedicalNerModel | Living | Individual |

See next comment for more details