Yet another Python binding for fastText.
The binding supports Python 2.6, 2.7 and Python 3. It requires Cython.
Numpy and cysignals are also dependencies, but are optional.
pyfasttext has been tested successfully on Linux and Mac OS X.
If you compile pyfasttext on Windows, do not compile with the cysignals module because it does not support this platform.
To compile pyfasttext, make sure you have one of the following compilers:
* GCC (g++) with C++11 support.
* LLVM (clang++) with (at least) partial C++17 support.
Just type these lines:
pip install cython
pip install pyfasttext
If you have a compilation error, you can try to install cysignals manually:
pip install cysignals
Then, retry installing pyfasttext with the pip command mentioned above.
pyfasttext uses git submodules, so add the --recursive option when you clone the repository.
git clone --recursive https://github.com/vrasneur/pyfasttext.git
cd pyfasttext
pyfasttext needs bytes objects, which are not available natively in Python 2. Install the future module with pip.
pip install future
First, install all the requirements:
pip install -r requirements.txt
Then, build and install with setup.py:
python setup.py install
pyfasttext can export word vectors as numpy ndarrays; however, this feature can be disabled at compile time.
To compile without numpy, pyfasttext has a USE_NUMPY environment variable. Set this variable to 0 (or empty), like this:
USE_NUMPY=0 python setup.py install
If you want to compile without cysignals, likewise, you can set the USE_CYSIGNALS environment variable to 0 (or empty).
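The "0 or empty disables the feature" convention can be pictured with a small helper. This is only a sketch of how such a flag might be read. The name flag_enabled is hypothetical and is not taken from pyfasttext's setup.py, and treating an unset variable as "enabled" is an assumption.

```python
import os

def flag_enabled(name, environ=None):
    """Return False when the variable is set to '0' or to an empty string.

    Hypothetical helper, not pyfasttext's actual build code.
    Assumption: an unset variable means the feature stays enabled.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return True
    return value not in ('', '0')

print(flag_enabled('USE_NUMPY', {'USE_NUMPY': '0'}))  # False
print(flag_enabled('USE_NUMPY', {}))                  # True
```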
>>> from pyfasttext import FastText
>>> model = FastText('/path/to/model.bin')
or
>>> model = FastText()
>>> model.load_model('/path/to/model.bin')
You can use all the options provided by the fastText binary (input, output, epoch, lr, ...) as keyword arguments in the training methods of the FastText object.
>>> model = FastText()
>>> model.skipgram(input='data.txt', output='model', epoch=100, lr=0.7)
>>> model = FastText()
>>> model.cbow(input='data.txt', output='model', epoch=100, lr=0.7)
By default, a single word vector is returned as a regular Python array of floats.
>>> model['dog']
array('f', [-1.308749794960022, -1.8326224088668823, ...])
Numpy ndarray
The model.get_numpy_vector(word) method returns the word vector as a numpy ndarray.
>>> model.get_numpy_vector('dog')
array([-1.30874979, -1.83262241, ...], dtype=float32)
If you want a normalized vector (i.e. the vector divided by its norm), there is an optional boolean parameter named normalized.
>>> model.get_numpy_vector('dog', normalized=True)
array([-0.07084749, -0.09920666, ...], dtype=float32)
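As an illustration of what "divided by its norm" means, here is a minimal pure-Python sketch. It assumes the norm is the Euclidean (L2) norm, which the text above does not state explicitly. The function name normalize is hypothetical and is not part of pyfasttext.

```python
import math

def normalize(vector):
    """Divide a vector by its Euclidean norm (assumed to be the norm in use)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]

unit = normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After normalization, the vector has length 1, so only its direction is kept.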
The inverse operation of model[word] or model.get_numpy_vector(word) is model.words_for_vector(vector, k). It returns the k words closest to the provided vector. The default value for k is 1.
>>> king = model.get_numpy_vector('king')
>>> man = model.get_numpy_vector('man')
>>> woman = model.get_numpy_vector('woman')
>>> model.words_for_vector(king + woman - man, k=1)
[('queen', 0.77121970653533936)]
>>> model.nwords
500000
>>> for word in model.words:
... print(word, model[word])
Numpy ndarray
If you want all the word vectors as a big numpy ndarray, you can use the numpy_normalized_vectors member. Note that all these vectors are normalized.
>>> model.nwords
500000
>>> model.numpy_normalized_vectors
array([[-0.07549749, -0.09407753, ...],
[ 0.00635979, -0.17272158, ...],
...,
[-0.01009259, 0.14604086, ...],
[ 0.12467574, -0.0609326 , ...]], dtype=float32)
>>> model.numpy_normalized_vectors.shape
(500000, 100) # (number of words, dimension)
>>> model.similarity('dog', 'cat')
0.75596606254577637
>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 0.75596606254577637)]
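To make the idea of similarity and nearest neighbors concrete, here is a toy sketch in pure Python. It assumes the similarity score is the cosine similarity between word vectors, which this document does not state explicitly. The two-dimensional vectors and the helper names are invented for illustration and do not come from a real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_nearest_neighbors(vectors, word, k=1):
    """Return the k (word, similarity) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 2-dimensional vectors, for illustration only.
toy = {'dog': [1.0, 0.0], 'dogs': [0.9, 0.1], 'car': [0.0, 1.0]}
print([w for w, _ in toy_nearest_neighbors(toy, 'dog', k=2)])  # ['dogs', 'car']
```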
The model.most_similar() method works similarly to the one in gensim.
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]
>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)
>>> model.labels
['LABEL1', 'LABEL2', ...]
>>> model.nlabels
100
To obtain the k most likely labels from test sentences, there are multiple model.predict_*() methods. The default value for k is 1. If you want to obtain all the possible labels, use None for k.
If you have a list of strings (or an iterable object), use this:
>>> model.predict_proba(['first sentence\n', 'second sentence\n'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If you want to test a single string, use this:
>>> model.predict_proba_single('first sentence\n', k=2)
[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)]
WARNING: In order to get the same probabilities as the fastText binary, you have to add a newline (\n) at the end of each string.
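If your sentences come from a source that strips line endings, a small helper can add the newline before prediction. This is a sketch only. The name with_trailing_newline is hypothetical and is not part of pyfasttext.

```python
def with_trailing_newline(sentences):
    """Append a newline to each sentence that does not already end with one."""
    return [s if s.endswith('\n') else s + '\n' for s in sentences]

print(with_trailing_newline(['first sentence', 'second sentence\n']))
# ['first sentence\n', 'second sentence\n']
```

The resulting list could then be passed to one of the model.predict_*() methods.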
If your test data is stored inside a file, use this:
>>> model.predict_proba_file('/path/to/test.txt', k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
Normalized probabilities
For performance reasons, fastText probabilities often do not sum up to 1.0.
If you want normalized probabilities (where the sum is closer to 1.0 than the original probabilities), you can use the normalized=True parameter in all the methods that output probabilities (model.predict_proba(), model.predict_proba_file() and model.predict_proba_single()).
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified\n', k=None))
0.9785203068801335
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified\n', k=None, normalized=True))
0.9999999999999898
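One simple way to picture this normalization is to divide every probability by their sum. The sketch below does exactly that on made-up values. Whether pyfasttext computes normalized=True this way is an assumption, and the name rescale_probabilities is hypothetical.

```python
def rescale_probabilities(predictions):
    """Rescale (label, probability) pairs so the probabilities sum to about 1.0.

    Assumption: plain division by the total; pyfasttext may do it differently.
    """
    total = sum(proba for _, proba in predictions)
    if total == 0.0:
        return list(predictions)
    return [(label, proba / total) for label, proba in predictions]

# Made-up probabilities that sum to 0.875.
raw = [('LABEL1', 0.5), ('LABEL2', 0.25), ('LABEL3', 0.125)]
rescaled = rescale_probabilities(raw)
print(abs(sum(proba for _, proba in rescaled) - 1.0) < 1e-9)  # True
```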
If you have a list of strings (or an iterable object), use this:
>>> model.predict(['first sentence\n', 'second sentence\n'], k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
If you want to test a single string, use this:
>>> model.predict_single('first sentence\n', k=2)
['LABEL1', 'LABEL3']
WARNING: In order to get the same probabilities as the fastText binary, you have to add a newline (\n) at the end of each string.
If your test data is stored inside a file, use this:
>>> model.predict_file('/path/to/test.txt', k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
Use keyword arguments in the model.quantize() method.
>>> model.quantize(input='/path/to/input.txt', output='/path/to/model')
You can load quantized models using the FastText constructor or the model.load_model() method.
If you want to know if a model has been quantized before, use the model.quantized attribute.
>>> model = FastText('/path/to/model.bin')
>>> model.quantized
False
>>> model = FastText('/path/to/model.ftz')
>>> model.quantized
True
fastText can use subwords (i.e. character ngrams) when doing unsupervised or supervised learning, which augments its word embeddings with subword-level information.
You can retrieve the subwords, and their associated vectors, from a model using pyfasttext.
To retrieve all the subwords for a given word, use the model.get_all_subwords(word) method.
>>> model.args.get('minn'), model.args.get('maxn')
(2, 4)
>>> model.get_all_subwords('hello') # word + subwords from 2 to 4 characters
['hello', '<h', '<he', '<hel', 'he', 'hel', 'hell', 'el', 'ell', 'ello', 'll', 'llo', 'llo>', 'lo', 'lo>', 'o>']
For fastText, < means "beginning of a word" and > means "end of a word".
As you can see, fastText includes the full word. You can omit it using the omit_word=True keyword argument.
>>> model.get_all_subwords('hello', omit_word=True)
['<h', '<he', '<hel', 'he', 'hel', 'hell', 'el', 'ell', 'ello', 'll', 'llo', 'llo>', 'lo', 'lo>', 'o>']
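The subword list can be reproduced with a few lines of pure Python, which may help to see where each entry comes from. This is a simplified sketch, not fastText's implementation: it works on Python characters and ignores details such as byte-level handling of non-ASCII text. The function name char_ngrams is hypothetical.

```python
def char_ngrams(word, minn, maxn):
    """List the character n-grams of '<word>' with lengths minn..maxn."""
    wrapped = '<' + word + '>'  # '<' and '>' mark the word boundaries
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(wrapped):
                break  # longer n-grams would run past the end of the word
            ngrams.append(wrapped[start:end])
    return ngrams

print(char_ngrams('hello', 2, 4))
# ['<h', '<he', '<hel', 'he', 'hel', 'hell', 'el', 'ell', 'ello',
#  'll', 'llo', 'llo>', 'lo', 'lo>', 'o>']
```

With minn=2 and maxn=4, the output matches the list returned with omit_word=True for 'hello'.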
When a model is quantized, fastText may prune some subwords. If you want to see only the subwords that are really used when computing a word vector, you should use the model.get_subwords(word) method.
>>> model.quantized
True
>>> model.get_subwords('beautiful')
['eau', 'aut', 'ful', 'ul']
>>> model.get_subwords('hello')
['hello'] # fastText will not use any subwords when computing the word vector, only the full word
To get the individual vectors given the subwords, use the model.get_numpy_subword_vectors(word) method.
>>> model.get_numpy_subword_vectors('beautiful') # 4 vectors, so 4 rows
array([[ 0.49022141, 0.13586822, ..., -0.14065443, 0.89617103], # subword "eau"
[-0.42594951, 0.06260503, ..., -0.18182631, 0.34219387], # subword "aut"
[ 0.49958718, 2.93831301, ..., -1.97498322, -1.16815805], # subword "ful"
[-0.4368791 , -1.92924356, ..., 1.62921488, 1.90240896]], dtype=float32) # subword "ul"
In fastText, the final word vector is the average of these individual vectors.
>>> import numpy as np
>>> vec1 = model.get_numpy_vector('beautiful')
>>> vecs2 = model.get_numpy_subword_vectors('beautiful')
>>> np.allclose(vec1, np.average(vecs2, axis=0))
True
To compute the vector of a sequence of words (i.e. a sentence), fastText uses two different methods:
* one for unsupervised models
* another one for supervised models
When fastText computes a word vector, recall that it uses the average of the following vectors: the word itself and its subwords.
For unsupervised models, the representation of a sentence for fastText is the average of the normalized word vectors.
To get the resulting vector as a regular Python array, use the model.get_sentence_vector(line) method. To get it as a numpy ndarray, use the model.get_numpy_sentence_vector(line) method.
>>> vec = model.get_numpy_sentence_vector('beautiful cats')
>>> vec1 = model.get_numpy_vector('beautiful', normalized=True)
>>> vec2 = model.get_numpy_vector('cats', normalized=True)
>>> np.allclose(vec, np.average([vec1, vec2], axis=0))
True
For supervised models, fastText uses the regular word vectors, as well as vectors computed using word ngrams (i.e. shorter sequences of words from the sentence). When computing the average, these vectors are not normalized.
To get the resulting vector as a regular Python array, use the model.get_text_vector(line) method. To get it as a numpy ndarray, use the model.get_numpy_text_vector(line) method.
>>> model.get_numpy_sentence_vector('beautiful cats') # for an unsupervised model
array([-0.20266785, 0.3407566 , ..., 0.03044436, 0.39055538], dtype=float32)
>>> model.get_numpy_text_vector('beautiful cats') # for a supervised model
array([-0.20840774, 0.4289546 , ..., -0.00457615, 0.52417743], dtype=float32)
>>> import pyfasttext
>>> pyfasttext.__version__
'0.4.3'
As there is no version number in fastText, we use the latest fastText commit hash (from HEAD) as a substitute.
>>> import pyfasttext
>>> pyfasttext.__fasttext_version__
'431c9e2a9b5149369cc60fb9f5beba58dcf8ca17'
>>> model.args
{'bucket': 11000000,
'cutoff': 0,
'dim': 100,
'dsub': 2,
'epoch': 100,
...
}
fastText uses a versioning scheme for its generated models. You can retrieve the model version number using the model.version attribute.
version number | description |
---|---|
-1 | for really old models with no version number |
11 | first version number added by fastText |
12 | for models generated after fastText added support for subwords in supervised learning |
>>> model.version
12
You can use the FastText object to extract labels or classes from a dataset. The label prefix (which is __label__ by default) is set using the label parameter in the constructor.
If you load an existing model, the label prefix will be the one defined in the model.
>>> model = FastText(label='__my_prefix__')
There can be multiple labels per line.
>>> model.extract_labels('/path/to/dataset1.txt')
[['LABEL2', 'LABEL5'], ['LABEL1'], ...]
There can be only one class per line.
>>> model.extract_classes('/path/to/dataset2.txt')
['LABEL3', 'LABEL1', 'LABEL2', ...]
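To show what label extraction amounts to, here is a pure-Python sketch that works on in-memory lines instead of a file path. It assumes labels are whitespace-separated tokens starting with the label prefix. The function name labels_per_line and the sample lines are hypothetical, and this is not how pyfasttext implements extract_labels().

```python
def labels_per_line(lines, prefix='__label__'):
    """Return, for each line, the tokens that start with `prefix`, prefix removed."""
    result = []
    for line in lines:
        labels = [token[len(prefix):] for token in line.split()
                  if token.startswith(prefix)]
        result.append(labels)
    return result

# Made-up dataset lines, for illustration only.
dataset = ['__label__LABEL2 __label__LABEL5 some text',
           '__label__LABEL1 other text']
print(labels_per_line(dataset))  # [['LABEL2', 'LABEL5'], ['LABEL1']]
```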
The fastText source code directly calls exit() when something goes wrong (e.g. a model file does not exist, ...).
Instead of exiting, pyfasttext raises a Python exception (RuntimeError).
>>> import pyfasttext
>>> model = pyfasttext.FastText('/path/to/non-existing_model.bin')
Model file cannot be opened for loading!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/pyfasttext.pyx", line 124, in pyfasttext.FastText.__cinit__ (src/pyfasttext.cpp:1800)
File "src/pyfasttext.pyx", line 348, in pyfasttext.FastText.load_model (src/pyfasttext.cpp:5947)
RuntimeError: fastext tried to exit: 1
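Because the failure surfaces as a regular RuntimeError, it can be handled with an ordinary try/except. The sketch below wraps any loader callable so that it runs without pyfasttext installed. In real use the loader would be pyfasttext.FastText. The names load_or_none and fake_loader are hypothetical.

```python
def load_or_none(loader, path):
    """Call loader(path); return None instead of propagating a RuntimeError."""
    try:
        return loader(path)
    except RuntimeError as exc:
        print('could not load %s: %s' % (path, exc))
        return None

def fake_loader(path):
    # Stand-in for pyfasttext.FastText; it always fails like a missing model file.
    raise RuntimeError('fastext tried to exit: 1')

model = load_or_none(fake_loader, '/path/to/non-existing_model.bin')
print(model is None)  # True
```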
pyfasttext uses cysignals to make all the computationally intensive operations (e.g. training) interruptible.
To easily interrupt such an operation, just type Ctrl-C in your Python shell.
>>> model.skipgram(input='/path/to/input.txt', output='/path/to/mymodel')
Read 12M words
Number of words: 60237
Number of labels: 0
... # type Ctrl-C during training
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/pyfasttext.pyx", line 680, in pyfasttext.FastText.skipgram (src/pyfasttext.cpp:11125)
File "src/pyfasttext.pyx", line 674, in pyfasttext.FastText.train (src/pyfasttext.cpp:11009)
File "src/pyfasttext.pyx", line 668, in pyfasttext.FastText.train (src/pyfasttext.cpp:10926)
File "src/cysignals/signals.pyx", line 94, in cysignals.signals.sig_raise_exception (build/src/cysignals/signals.c:1328)
KeyboardInterrupt
>>> # you can have your shell back!