poetry (see how to install it)
scikit-learn
jupyter
jupyterlab
matplotlib
pandas
python-crfsuite
$ poetry install
Training pipelines are available inside the notebooks/ folder. Each notebook can be executed and reproduced cell by cell.
- linearCRF: This setting considers all the information available. The features are listed in the first cell of each notebook.
- POSLess: This setting excludes the POS tags.
- HMMLike: This setting takes into account the minimum information, i.e. information about the current letter and the immediately preceding one. We use this name because this configuration contains information similar to that of HMMs, but uses CRFs to build the models.
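The difference between the three settings can be sketched as a per-letter feature function. This is only an illustration: the function name, the feature names and the exact context features are assumptions, and the real feature lists are the ones in the first cell of each notebook.

```python
def letter_features(word, i, pos_tag=None, setting="linearCRF"):
    """Build a feature dict for the letter at index i of `word`.

    Hypothetical sketch: feature names are illustrative, not the
    ones used in the notebooks.
    """
    # HMMLike: only the current letter and the immediately preceding one.
    feats = {"letter": word[i].lower()}
    if i > 0:
        feats["prev_letter"] = word[i - 1].lower()
    if setting == "HMMLike":
        return feats

    # POSLess and linearCRF: assumed extra context around the letter.
    if i < len(word) - 1:
        feats["next_letter"] = word[i + 1].lower()
    feats["is_first"] = i == 0
    feats["is_last"] = i == len(word) - 1

    # linearCRF: additionally uses the POS tag of the word.
    if setting == "linearCRF" and pos_tag is not None:
        feats["pos"] = pos_tag
    return feats
```

A sentence is then represented as the list of these dicts, one per letter, which is the input format python-crfsuite accepts.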
Inside the notebooks/ folder there are notebooks with the suffix _ejemplos.ipynb that serve as an experimental environment. Those notebooks are useful to see pre-trained models in action.
- L1 = 0.0
- L2 = 0.0
- Max iterations = 50
- Model name: HMMLike_baseline_k_[1-3].crfsuite
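As a rough sketch, these hyperparameters map onto python-crfsuite's `c1` (L1), `c2` (L2) and `max_iterations` parameters. The helper names `model_name` and `train_model` are hypothetical, and `k` is simply the index from the file name pattern; its exact meaning (for example, a fold number) is an assumption here.

```python
# Hyperparameters from the list above, using python-crfsuite parameter names.
PARAMS = {
    "c1": 0.0,             # L1 regularization coefficient
    "c2": 0.0,             # L2 regularization coefficient
    "max_iterations": 50,  # upper bound on training iterations
}


def model_name(k, prefix="HMMLike_baseline"):
    """Return the model file name for index k (hypothetical helper)."""
    return f"{prefix}_k_{k}.crfsuite"


def train_model(X_train, y_train, k):
    """Train one CRF model and write it to disk; returns the file path.

    X_train: list of sequences of feature dicts (one dict per letter).
    y_train: list of sequences of labels, aligned with X_train.
    """
    import pycrfsuite  # imported here so the constants above work without it

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params(PARAMS)
    path = model_name(k)
    trainer.train(path)
    return path
```

Calling `train_model(X_train, y_train, k)` for k = 1, 2, 3 would produce the three files matched by the pattern above.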
- Delete duplicated lines
$ sort -u corpus > corpus_uniq
- Show duplicated lines
$ diff --color corpus_sort corpus_uniq
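The same two checks can be done from Python, for example inside a notebook. This is a minimal sketch with hypothetical function names; note that `sorted` orders by code point, so the line order may differ from `sort -u` under some locales.

```python
from collections import Counter


def unique_sorted(lines):
    """Return the distinct lines in sorted order (similar to `sort -u`)."""
    return sorted(set(lines))


def duplicated(lines):
    """Return the lines that occur more than once, in sorted order."""
    counts = Counter(lines)
    return sorted(line for line, n in counts.items() if n > 1)
```

For a corpus file, the input would be something like `open("corpus", encoding="utf-8").read().splitlines()`.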
To solve encoding/decoding problems with python-crfsuite we substitute the following Otomí characters:
- u̱ -> μ
- a̱ -> α
- e̱ -> ε
- i̱ -> ι
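The substitution above can be sketched as follows. It assumes each Otomí vowel is stored as the base letter followed by U+0331 (COMBINING MACRON BELOW), which is why plain string replacement is used instead of a one-to-one character table. The function names are hypothetical, uppercase vowels are not handled, and the reverse mapping is an addition for restoring the original spelling in the output.

```python
MACRON_BELOW = "\u0331"  # assumed combining character used in the corpus

# Two-code-point Otomí vowels mapped to single Greek letters.
SUBSTITUTIONS = {
    "u" + MACRON_BELOW: "μ",
    "a" + MACRON_BELOW: "α",
    "e" + MACRON_BELOW: "ε",
    "i" + MACRON_BELOW: "ι",
}


def encode_otomi(text):
    """Replace each underlined vowel with its single-character stand-in."""
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(source, target)
    return text


def decode_otomi(text):
    """Reverse the substitution (assumes the text has no other Greek letters)."""
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(target, source)
    return text
```

After encoding, every letter is a single code point, so letter-level features and labels stay aligned.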
- Get the glossed corpus
- Text preprocessing
- Build the feature list for each letter in the sentences
- Split the data into test and train sets
- Train and build the models
- Generate tags and test performance
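The split and evaluation steps above could look roughly like this in plain Python. It is a sketch under assumptions: the function names, the 80/20 ratio, the fixed seed and the per-letter accuracy metric are illustrative, and the notebooks may use different tools (for example scikit-learn) and metrics.

```python
import random


def train_test_split(sentences, test_ratio=0.2, seed=42):
    """Shuffle the sentences reproducibly and split them into train/test lists."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_ratio)
    cut = len(items) - n_test
    return items[:cut], items[cut:]


def accuracy(gold, predicted):
    """Per-letter accuracy over aligned sequences of gold and predicted tags."""
    pairs = [
        (g, p)
        for gold_seq, pred_seq in zip(gold, predicted)
        for g, p in zip(gold_seq, pred_seq)
    ]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)
```

Here `gold` would hold the glossed tags of the test set and `predicted` the tags generated by a trained model for the same sentences.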