Acamedic – The Academic Dictionary

A small Hunspell dictionary for professional, scientific writing.

High Quality: based on the SCOWL en-US dictionary and thoroughly tested.
High Sensitivity:
- removed words such as thee, posses, fatuous, or jerk.
- fewer missed errors (but probably more false positives)
Academic Language:
- added words such as overapproximation, whitepaper, and bitmask.
- select the scientific domains you need during build
Easy Install: available as an extension for LibreOffice and Firefox/Thunderbird.

Getting Started

Download this repository and then decide for which applications you want to install the dictionary:

Install System-Wide for Hunspell

Copy en-Academic.dic and en-Academic.aff to /usr/share/hunspell.

Install LibreOffice Extension

Automatic: Download and open the acamedic-libreoffice.oxt file from the addons folder.
Manual:
- Start LibreOffice and select Tools → Extension Manager... → Add.
- Open acamedic-libreoffice.oxt from the addons folder.

Install Thunderbird Extension

Automatic: Download and open acamedic-mozilla.xpi.
Manual:
- Start Thunderbird and select Tools → Add-ons → ⚙ → Install Add-on from file.
- Open acamedic-mozilla.xpi from the addons folder.

Install for Sublime Text

Linux: Copy en-Academic.dic and en-Academic.aff to ~/.config/sublime-text-3/Packages/Language - English/
Windows: Copy en-Academic.dic and en-Academic.aff to C:\Users\YOUR_USER_NAME\AppData\Roaming\Sublime Text 3\Packages\Language - English, replacing YOUR_USER_NAME with your username.

Install for Visual Studio Code

Install the extension denisgerguri.hunspell-spellchecker (Ctrl+Shift+P, type Ext install, type hunspell)
Copy en-Academic.dic and en-Academic.aff to ~/.vscode/extensions/denisgerguri.hunspell-spellchecker-1.0.1/languages/
Follow instructions at https://marketplace.visualstudio.com/items?itemName=denisgerguri.hunspell-spellchecker#adding-new-language

Install for TeXstudio

1. Start TeXstudio and select 'Options → Configure TeXstudio ... → Language ...'.
2. Under the spell check sub-group, check the path of the spelling dictionary. On Windows, the path is 'C:\Program Files (x86)\texstudio\dictionaries'.
3. Copy en-Academic.dic and en-Academic.aff to the path from step 2.
4. In the same configuration window as in step 2, choose en-Academic from the drop-down menu for the default language.
5. Restart TeXstudio.

Install for Texmaker

Start Texmaker and select 'Options → Configure Texmaker → Editor'
For the spelling dictionary, enter the path of the downloaded repository or click the browse button.
Select en-Academic.dic.
Click OK to close.

Contributing Words

The project is in an early stage and you might find that many words from your domain are missing. Please collect them over time and create an issue with your list of words. The idea is to include only words that you use often, or rare words that are very distinct and cannot be confused with other terms.

Building and Testing

The dictionary is constructed from several individual dictionary files in the src folder.

/base/ contains common words.
- en_US_basic.dic basic and simple words, e.g. book
- en_US_extra.dic generic, formal and abstract, e.g. interchangeable
- en_US_special.dic special, more domain specific names that are hard to confuse.
/academic/ contains special mathematical, technical, chemical, etc. terms.
/names/ contains special names starting with a capital letter.
/codes/ contains keywords of programming languages.
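
As a rough illustration of how several source files could be combined into one dictionary, the sketch below merges the entries of multiple .dic files, drops exact duplicates, and writes the entry count on the first line, as the Hunspell .dic format expects. It is not the project's build_dic.py; the function name merge_dic_entries and the sample entries are assumptions made for illustration.

```python
def merge_dic_entries(sources):
    """Merge the text of several Hunspell .dic files into one .dic text.

    `sources` is a list of strings, each holding the content of one file.
    A leading count line (digits only) in a source is skipped.
    """
    merged = []
    seen = set()
    for text in sources:
        lines = text.splitlines()
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]  # drop the per-file entry count
        for line in lines:
            entry = line.strip()
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    merged.sort(key=str.lower)
    # Hunspell reads the first line as the (approximate) number of entries.
    return "\n".join([str(len(merged))] + merged) + "\n"


# Hypothetical file contents standing in for files under src/.
basic = "2\nbook/MS\nclean/SDG\n"
academic = "2\nbitmask/MS\nbook/MS\n"
print(merge_dic_entries([basic, academic]))
```

Selecting scientific domains at build time would then amount to choosing which source texts are passed to the function.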

Automated Testing

We perform several types of automated tests to ensure a high quality of the dictionary:

Correctness: check if all words in the dictionary are spelled correctly. This is done by looking up all words in a larger dictionary. Some words are open for debate, e.g. whether testcase is correct or test case should be the only correct spelling.
Duplicates: check if words are encoded twice. This is not a problem for correctness but might indicate some kind of unwanted addition.
Coverage: check which percentage of words of reference articles (e.g. Wikipedia) are included in the dictionary. We aim for high coverage but also intentionally exclude words to improve spell checking of similar words.
Consistency: check if there are inconsistencies with certain grammar rules, e.g. the ending icly should in general be ically.
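
Two of these checks are simple enough to sketch in a few lines. The code below is only an approximation of the idea and not the project's test_dic.py: the function names are hypothetical, and the coverage check looks at the base words only, without expanding Hunspell affix flags.

```python
import re


def strip_flags(entry):
    """Return the word part of a .dic entry such as 'clean/SDG'."""
    return entry.split("/", 1)[0]


def find_icly_endings(entries):
    """Consistency: list words ending in 'icly' (usually 'ically' is meant)."""
    return [w for w in map(strip_flags, entries) if w.endswith("icly")]


def coverage(text, entries):
    """Coverage: fraction of word tokens in `text` found among the base words."""
    known = {strip_flags(e).lower() for e in entries}
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in known for t in tokens) / len(tokens)


# Made-up entries; 'basicly' is a deliberate error for the consistency check.
entries = ["the", "bitmask/MS", "is", "basicly"]
print(find_icly_endings(entries))
print(coverage("The bitmask is small", entries))
```

Hits from the consistency check still need a manual look, since a few correct words (such as publicly) end in icly.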

Motivation for this Dictionary

For spell checking of academic documents, it is not useful if dictionaries include words such as thee or wee. They will most likely mask a misspelling of the or we. Since these words are archaic, probably no one is going to write them; they are more likely to be read in historic text fragments. And even if you do write them, a misspelling of them will most likely be masked by other words, because next to wee there are we, see, and weed.

Furthermore, the standard dictionaries include a lot of problematic words, such as wit, dome, or wont.

Even text analysis tools make mistakes.

Grammarly, for example, can also analyze context and grammar but still does not spot the spelling error (secrete instead of secret) in the following sentence: "Which refers to the state of keeping some resource secrete from unauthorized parties."

Creation Process

I have created this dictionary using the following process:

take a very small base dictionary: SCOWL-20
manually go through it and remove non-scientific terms
use it to check reference papers/articles
add newly found terms that are in SCOWL-60
manually check and add further unrecognized terms
use the tests to double check all words
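
The middle steps of this process boil down to set operations on word lists. The following sketch shows one way to split the unknown words of a reference paper into those confirmed by a larger word list and those that need manual review. The function name and all word sets are made up for illustration; they do not reflect the actual content of SCOWL-20 or SCOWL-60.

```python
def triage_unknown_words(paper_words, dictionary, larger_reference):
    """Split words missing from `dictionary` into two sorted lists.

    Returns (confirmed, manual): `confirmed` words appear in the larger
    reference list, `manual` words are unknown there as well.
    """
    unknown = set(paper_words) - set(dictionary)
    confirmed = sorted(unknown & set(larger_reference))
    manual = sorted(unknown - set(larger_reference))
    return confirmed, manual


# Toy data standing in for a paper, the current dictionary, and a larger list.
paper = ["the", "bitmask", "overapproximation", "whitepaper"]
current = {"the"}
larger = {"the", "bitmask"}
confirmed, manual = triage_unknown_words(paper, current, larger)
print("add after a quick check:", confirmed)
print("check manually:", manual)
```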

Terms I have removed

archaic terms such as brethren, cobbler, sod, thee, thou, unto, wive
inappropriate words such as cum, slut, gnome, sexy, slave
narrative adjectives such as cunning, fatuous, fierce, ghastly, hitherto, pompous, sheer, absurd
uncommon words with common alternatives, such as envisage (use envision), futile (use useless), or horrific (use horrible).
rare words that are very close to common terms, such as fist (too close to first)
colloquial words, such as eh, gig, hey, lad, lousy, oh, bugger
words that are very far away from technical contexts, such as horde, hail, hog, hut, lark, mummy, bigot
religious terms: heresy, sermon, sinful
words often used to insult: hypocrite, illiterate, inane, jerk, moron, snobbery
British / Scottish words that sneaked into the US dictionary, such as lorry, nay, dole, duff, dustbin
potential mistakes found in SCOWL-20: cs, alias's (should be alias'), also species's, die's, elect's, feel's, want's

Testing

I do automatic testing on the resulting dictionary file:

check that there are no duplicates or that duplicates are justified, e.g. clean can be a verb clean/SDG or adjective clean/YRT
run on some reference papers and check that not many words are missing
check that all words are spelled correctly:
- using SCOWL-60 and SCOWL-95 dictionary
- manually double-check all remaining words with
  - Google
  - Cambridge Dictionary
  - Merriam-Webster
  - Oxford Dictionary
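
The duplicate check can be sketched as follows. This is a minimal illustration under the assumption that entries have the form word/FLAGS; the function name find_duplicates and the sample entries are hypothetical and not taken from test_dic.py.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Map each word that occurs more than once to the flag strings it has."""
    flags_by_word = defaultdict(list)
    for entry in entries:
        word, _, flags = entry.partition("/")
        flags_by_word[word].append(flags)
    return {w: f for w, f in flags_by_word.items() if len(f) > 1}


entries = ["clean/SDG", "clean/YRT", "book/MS", "book/MS", "paper"]
duplicates = find_duplicates(entries)

# Same word with identical flags: most likely an unwanted double entry.
exact = sorted(w for w, f in duplicates.items() if len(set(f)) < len(f))
# Same word with different flags: possibly justified (e.g. verb and adjective).
differing = sorted(w for w, f in duplicates.items() if len(set(f)) == len(f))
print("exact duplicates:", exact)
print("duplicates with different flags:", differing)
```

Whether a duplicate with different flags is justified remains a manual decision; the sketch only separates the two cases.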

Lessons Learned

the suffixes -ic and -ical are mostly interchangeable
words with -ical or -ual suffix should (normally) have the /Y adverb extension.
words with -tive suffix often have /SYP
words should have either the -icity or the -ness suffix, and these should not have plural forms. E.g. use simplifications instead of simplicities.
words ending in -ed can very rarely be negated by prefixing in-/im-/ir-/il-

Future Plan: Part-of-Speech Encoding

The Hunspell compression could be used to indicate the POS type of a word. For example, /SDG means that either s, ed, or ing can be appended to the word, indicating a regular verb. However, the compression is only concerned with size, so the flags need to be checked for suffixes that expand to unrelated words.

Misleading expansion examples:

we/DGT expands to we, wed, wing, and west
be/DT expands to be, bed, and best
should/RZ expands to should and shoulder/S

As a result, I started to encode regular verbs with /SDG and nouns with /MS, adjectives with /RT or /Y for the adjective/adverb combination.
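
The misleading expansions listed above can be reproduced with a small sketch. The rule table below is a simplified subset written in the style of en_US suffix rules (strip, add, condition) for illustration only; it is not copied from en-Academic.aff, and the function name expand is hypothetical.

```python
import re

# flag -> list of (strip, add, condition) rules; the first matching rule wins.
# Simplified for illustration: words ending in 'y' are not handled.
RULES = {
    "D": [("", "d", r"e$"), ("", "ed", r"[^ey]$")],
    "G": [("e", "ing", r"e$"), ("", "ing", r"[^e]$")],
    "T": [("", "st", r"e$"), ("", "est", r"[^ey]$")],
    "R": [("", "r", r"e$"), ("", "er", r"[^ey]$")],
    "Z": [("", "rs", r"e$"), ("", "ers", r"[^ey]$")],
}


def expand(entry):
    """Expand an entry such as 'we/DGT' into the word forms it generates."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:
        for strip, add, condition in RULES.get(flag, []):
            if re.search(condition, word):
                stem = word[: len(word) - len(strip)] if strip else word
                forms.append(stem + add)
                break
    return forms


for entry in ("we/DGT", "be/DT", "should/RZ"):
    print(entry, "->", expand(entry))
```

Running such an expansion over the whole dictionary and reviewing the generated forms is one way to spot flags that were chosen for compression rather than for meaning.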
