yeslogic/corpus
To regenerate the data from scratch:

- compile corpus and syllables (cargo build)
- download Wikipedia dumps (check wikipedia/download.sh)
- collect other data and run corpus on it to create files in words/
- run ./run.sh

SCRIPTS

bn = Bengali
hi = Hindi / Devanagari / Marathi
ta = Tamil
or = Oriya
te = Telugu
gu = Gujarati
pa = Punjabi / Gurmukhi
ml = Malayalam
kn = Kannada
si = Sinhala

SOURCES

Indian translations of "Code Swaraj" by Carl Malamud.

Dictionaries:

- http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Shabdanjali/Shabdanjali.tgz
- https://sanskritdocuments.org/hindi/dict/eng-hin_unic.html
- http://ltrc.iiit.ac.in/onlineServices/Dictionaries/eng-hin-utf/eng-hindi-dict-utf8.zip

Wikipedia:

- https://dumps.wikimedia.org/bnwiki/20190801/bnwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/hiwiki/20190801/hiwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tawiki/20190801/tawiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/orwiki/20190801/orwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tewiki/20190801/tewiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/guwiki/20190801/guwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/pawiki/20190801/pawiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/mlwiki/20190801/mlwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/knwiki/20190801/knwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/siwiki/20190801/siwiki-20190801-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/bnwiki/20181001/bnwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/hiwiki/20181001/hiwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tawiki/20181001/tawiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/orwiki/20181001/orwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/tewiki/20181001/tewiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/guwiki/20181001/guwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/pawiki/20181001/pawiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/mlwiki/20181001/mlwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/knwiki/20181001/knwiki-20181001-pages-articles-multistream.xml.bz2
- https://dumps.wikimedia.org/siwiki/20181001/siwiki-20181001-pages-articles-multistream.xml.bz2

Reddit:

- http://files.pushshift.io/reddit/comments/RC_2018-09.xz
- http://files.pushshift.io/reddit/submissions/RS_2018-09.xz

Hindi news:

- https://www.bhaskar.com/national/
- https://www.jagran.com/

Bengali news:

- https://www.anandabazar.com/

Gujarati news:

- https://www.divyabhaskar.co.in/

Punjabi news:

- https://jagbani.punjabkesari.in/

Tamil news:

- http://www.dinamalar.com/
- https://tamil.samayam.com/
- http://www.dinakaran.com/

Malayalam news:

- https://www.manoramaonline.com/news/latest-news.html

Kannada news:

- http://www.kannadaprabha.com/

Sinhala news:

- http://www.lankadeepa.lk/
- http://www.ada.lk/
- http://divaina.com/daily/
- http://www.rivira.lk/online/
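
EXAMPLE

The Wikipedia dump URLs listed under SOURCES all follow one pattern: the script code, the suffix "wiki", and a dump date in YYYYMMDD form. The sketch below rebuilds those URLs from the codes listed under SCRIPTS. It is only an illustration: the names SCRIPT_CODES, dump_url and dump_urls are hypothetical, and it is not the contents of wikipedia/download.sh, which it does not attempt to reproduce. Dumps for the dates listed above may no longer be hosted, so treat the output as a way to express the pattern rather than a guaranteed download list.

```rust
/// Script codes as listed under SCRIPTS (hypothetical constant name).
const SCRIPT_CODES: [&str; 10] = ["bn", "hi", "ta", "or", "te", "gu", "pa", "ml", "kn", "si"];

/// Build the dump URL for one script code and one dump date (YYYYMMDD),
/// following the pattern visible in the Wikipedia list above.
fn dump_url(code: &str, date: &str) -> String {
    format!(
        "https://dumps.wikimedia.org/{code}wiki/{date}/{code}wiki-{date}-pages-articles-multistream.xml.bz2",
        code = code,
        date = date
    )
}

/// Build the dump URLs for every listed script code on a given date,
/// in the same order as SCRIPT_CODES.
fn dump_urls(date: &str) -> Vec<String> {
    SCRIPT_CODES
        .iter()
        .map(|code| dump_url(code, date))
        .collect()
}
```

Calling dump_urls("20190801") yields the ten URLs of the first Wikipedia group above, and dump_urls("20181001") yields the second group.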