pyfasttext

Yet another Python binding for fastText.

The binding supports Python 2.6, 2.7 and Python 3. It requires Cython.

Numpy and cysignals are also dependencies, but are optional.

pyfasttext has been tested successfully on Linux and Mac OS X.

Warning: if you want to compile pyfasttext on Windows, do not compile with the cysignals module because it does not support this platform.

>>> model.nwords
500000
>>> model.numpy_normalized_vectors
array([[-0.07549749, -0.09407753, ...],
       [ 0.00635979, -0.17272158, ...],
       ...,
       [-0.01009259,  0.14604086, ...],
       [ 0.12467574, -0.0609326 , ...]], dtype=float32)
>>> model.numpy_normalized_vectors.shape
(500000, 100) # (number of words, dimension)

Misc operations with word vectors

Word similarity

>>> model.similarity('dog', 'cat')
0.75596606254577637

Most similar words

>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 75596606254577637)]

Analogies

The model.most_similar() method works similarly as the one in gensim.

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]

Text classification

Supervised learning

>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)

Get all the labels

>>> model.labels
['LABEL1', 'LABEL2', ...]

Get the number of labels

>>> model.nlabels
100

Prediction

To obtain the k most likely labels from test sentences, there are multiple model.predict_*() methods.

The default value for k is 1. If you want to obtain all the possible labels, use None for k.

Labels and probabilities

If you have a list of strings (or an iterable object), use this:

>>> model.predict_proba(['first sentence\n', 'second sentence\n'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]

If you want to test a single string, use this:

>>> model.predict_proba_single('first sentence\n', k=2)
[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)]

WARNING: In order to get the same probabilities as the fastText binary, you have to add a newline (\n) at the end of each string.

If your test data is stored inside a file, use this:

>>> model.predict_proba_file('/path/to/test.txt', k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]

Normalized probabilities

For performance reasons, fastText probabilities often do not sum up to 1.0.

If you want normalized probabilities (where the sum is closer to 1.0 than the original probabilities), you can use the normalized=True parameter in all the methods that output probabilities (model.predict_proba(), model.predict_proba_file() and model.predict_proba_single()).

>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified\n', k=None))
0.9785203068801335
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified\n', k=None, normalized=True))
0.9999999999999898

Labels only

If you have a list of strings (or an iterable object), use this:

>>> model.predict(['first sentence\n', 'second sentence\n'], k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]

If you want to test a single string, use this:

>>> model.predict_single('first sentence\n', k=2)
['LABEL1', 'LABEL3']

WARNING: In order to get the same probabilities as the fastText binary, you have to add a newline (\n) at the end of each string.

If your test data is stored inside a file, use this:

>>> model.predict_file('/path/to/test.txt', k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]

Quantization

Use keyword arguments in the model.quantize() method.

>>> model.quantize(input='/path/to/input.txt', output='/path/to/model')

You can load quantized models using the FastText constructor or the model.load_model() method.

Is a model quantized?

If you want to know if a model has been quantized before, use the model.quantized attribute.

>>> model = FastText('/path/to/model.bin')
>>> model.quantized
False
>>> model = FastText('/path/to/model.ftz')
>>> model.quantized
True

Subwords

fastText can use subwords (i.e. character ngrams) when doing unsupervised or supervised learning.

You can access the subwords, and their associated vectors, using pyfasttext.

Get the subwords

fastText's word embeddings can be augmented with subword-level information. It is possible to retrieve the subwords and their associated vectors from a model using pyfasttext.

To retrieve all the subwords for a given word, use the model.get_all_subwords(word) method.

>>> model.args.get('minn'), model.args.get('maxn')
(2, 4)
>>> model.get_all_subwords('hello') # word + subwords from 2 to 4 characters
['hello', '<h', '<he', '<hel', 'he', 'hel', 'hell', 'el', 'ell', 'ello', 'll', 'llo', 'llo>', 'lo', 'lo>', 'o>']

For fastText, < means "beginning of a word" and > means "end of a word".

As you can see, fastText includes the full word. You can omit it using the omit_word=True keyword argument.

>>> model.get_all_subwords('hello', omit_word=True)
['<h', '<he', '<hel', 'he', 'hel', 'hell', 'el', 'ell', 'ello', 'll', 'llo', 'llo>', 'lo', 'lo>', 'o>']

When a model is quantized, fastText may prune some subwords. If you want to see only the subwords that are really used when computing a word vector, you should use the model.get_subwords(word) method.

>>> model.quantized
True
>>> model.get_subwords('beautiful')
['eau', 'aut', 'ful', 'ul']
>>> model.get_subwords('hello')
['hello'] # fastText will not use any subwords when computing the word vector, only the full word

Get the subword vectors

To get the individual vectors given the subwords, use the model.get_numpy_subword_vectors(word) method.

>>> model.get_numpy_subword_vectors('beautiful') # 4 vectors, so 4 rows
array([[ 0.49022141,  0.13586822,  ..., -0.14065443,  0.89617103], # subword "eau"
       [-0.42594951,  0.06260503,  ..., -0.18182631,  0.34219387], # subword "aut"
       [ 0.49958718,  2.93831301,  ..., -1.97498322, -1.16815805], # subword "ful"
       [-0.4368791 , -1.92924356,  ...,  1.62921488, 1.90240896]], dtype=float32) # subword "ul"

In fastText, the final word vector is the average of these individual vectors.

>>> import numpy as np
>>> vec1 = model.get_numpy_vector('beautiful')
>>> vecs2 = model.get_numpy_subword_vectors('beautiful')
>>> np.allclose(vec1, np.average(vecs2, axis=0))
True

Sentence and text vectors

To compute the vector of a sequence of words (i.e. a sentence), fastText uses two different methods: * one for unsupervised models * another one for supervised models

When fastText computes a word vector, recall that it uses the average of the following vectors: the word itself and its subwords.

Unsupervised models

For unsupervised models, the representation of a sentence for fastText is the average of the normalized word vectors.

To get the resulting vector as a regular Python array, use the model.get_sentence_vector(line) method.

To get the resulting vector as a numpy ndarray, use the model.get_numpy_sentence_vector(line) method.

>>> vec = model.get_numpy_sentence_vector('beautiful cats')
>>> vec1 = model.get_numpy_vector('beautiful', normalized=True)
>>> vec2 = model.get_numpy_vector('cats', normalized=True)
>>> np.allclose(vec, np.average([vec1, vec2], axis=0)
True

Supervised models

For supervised models, fastText uses the regular word vectors, as well as vectors computed using word ngrams (i.e. shorter sequences of words from the sentence). When computing the average, these vectors are not normalized.

To get the resulting vector as a regular Python array, use the model.get_text_vector(line) method.

To get the resulting vector as a numpy ndarray, use the model.get_numpy_text_vector(line) method.

>>> model.get_numpy_sentence_vector('beautiful cats') # for an unsupervised model
array([-0.20266785,  0.3407566 ,  ...,  0.03044436,  0.39055538], dtype=float32)
>>> model.get_numpy_text_vector('beautiful cats') # for a supervised model
array([-0.20840774,  0.4289546 ,  ..., -0.00457615,  0.52417743], dtype=float32)

Misc utilities

Show the module version

>>> import pyfasttext
>>> pyfasttext.__version__
'0.4.3'

Show fastText version

As there is no version number in fastText, we use the latest fastText commit hash (from HEAD) as a substitute.

>>> import pyfasttext
>>> pyfasttext.__fasttext_version__
'431c9e2a9b5149369cc60fb9f5beba58dcf8ca17'

Show the model (hyper)parameters

>>> model.args
{'bucket': 11000000,
 'cutoff': 0,
 'dim': 100,
 'dsub': 2,
 'epoch': 100,
...
}

Show the model version number

fastText uses a versioning scheme for its generated models. You can retrieve the model version number using the model.version attribute.

version number	description
-1	for really old models with no version number
11	first version number added by fastText
12	for models generated after fastText added support for subwords in supervised learning

>>> model.version
12

Extract labels or classes from a dataset

You can use the FastText object to extract labels or classes from a dataset. The label prefix (which is __label__ by default) is set using the label parameter in the constructor.

If you load an existing model, the label prefix will be the one defined in the model.

>>> model = FastText(label='__my_prefix__')

Extract labels

There can be multiple labels per line.

>>> model.extract_labels('/path/to/dataset1.txt')
[['LABEL2', 'LABEL5'], ['LABEL1'], ...]

Extract classes

There can be only one class per line.

>>> model.extract_classes('/path/to/dataset2.txt')
['LABEL3', 'LABEL1', 'LABEL2', ...]

Exceptions

The fastText source code directly calls exit() when something wrong happens (e.g. a model file does not exist, ...).

Instead of exiting, pyfasttext raises a Python exception (RuntimeError).

>>> import pyfasttext
>>> model = pyfasttext.FastText('/path/to/non-existing_model.bin')
Model file cannot be opened for loading!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/pyfasttext.pyx", line 124, in pyfasttext.FastText.__cinit__ (src/pyfasttext.cpp:1800)
  File "src/pyfasttext.pyx", line 348, in pyfasttext.FastText.load_model (src/pyfasttext.cpp:5947)
RuntimeError: fastext tried to exit: 1

Interruptible operations

pyfasttext uses cysignals to make all the computationally intensive operations (e.g. training) interruptible.

To easily interrupt such an operation, just type Ctrl-C in your Python shell.

>>> model.skipgram(input='/path/to/input.txt', output='/path/to/mymodel')
Read 12M words
Number of words:  60237
Number of labels: 0
... # type Ctrl-C during training
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/pyfasttext.pyx", line 680, in pyfasttext.FastText.skipgram (src/pyfasttext.cpp:11125)
  File "src/pyfasttext.pyx", line 674, in pyfasttext.FastText.train (src/pyfasttext.cpp:11009)
  File "src/pyfasttext.pyx", line 668, in pyfasttext.FastText.train (src/pyfasttext.cpp:10926)
  File "src/cysignals/signals.pyx", line 94, in cysignals.signals.sig_raise_exception (build/src/cysignals/signals.c:1328)
KeyboardInterrupt
>>> # you can have your shell back!

Files

README.rst

Latest commit

History

README.rst

File metadata and controls

pyfasttext

Table of Contents

Installation

Simplest way to install pyfasttext: use pip

Possible compilation error

Cloning

Requirements for Python 2.7

Building and installing manually

Building and installing without optional dependencies

Usage

How to load the library?

How to load an existing model?

Word representation learning

Training using Skipgram

Training using CBoW

Word vectors

Word vectors access

Vector for a given word

Words for a given vector

Get the number of words in the model

Get all the word vectors in a model