# Leveraging modelkit and the IMDB reviews dataset

This sample project illustrates modelkit's powerful features, based on the documentation tutorial: NLP x Sentiment Analysis.

It also serves as a sandbox for any developer wanting to try out modelkit in real conditions, from the package organization to the GitHub CI, through the use of CLIs and HTTP serving.
Jump to the result hosted on Heroku at https://modelkit-imdb.herokuapp.com/docs/.
First, please source `.env` or run the following:

```shell
export MODELKIT_ASSETS_DIR=.local_storage
export MODELKIT_STORAGE_BUCKET=.
export MODELKIT_STORAGE_PREFIX=.remote_storage
export MODELKIT_STORAGE_PROVIDER=local
export MODELKIT_DEFAULT_PACKAGE=modelkit_imdb  # shortcut for CLIs
```
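As a rough, purely illustrative sketch of what these variables describe (this is not modelkit's actual settings code, just the variable names used above):

```python
import os

# Illustrative only: gather the storage settings that the environment
# variables above configure (variable names taken from this README).
def storage_settings(environ=None):
    environ = os.environ if environ is None else environ
    return {
        "assets_dir": environ.get("MODELKIT_ASSETS_DIR"),   # local cache directory
        "bucket": environ.get("MODELKIT_STORAGE_BUCKET"),   # remote bucket (here ".")
        "prefix": environ.get("MODELKIT_STORAGE_PREFIX"),   # prefix inside the bucket
        "provider": environ.get("MODELKIT_STORAGE_PROVIDER"),  # e.g. "local"
    }

storage_settings({"MODELKIT_STORAGE_PROVIDER": "local",
                  "MODELKIT_STORAGE_PREFIX": ".remote_storage"})
```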
Once done, let's create a new Python virtual environment and install the dev requirements:

```shell
pip install -r requirements-dev.txt
```
Before going further, let's review the models available in the `modelkit_imdb` package with the following CLI:

```shell
modelkit describe
```
modelkit encourages you to organize your project as a Python package to benefit from the Python packaging tooling and a clear layout.

Hence, this sample project was first arranged following this minimal Python package structure:
```
modelkit-imdb                 # project directory
├── modelkit_imdb             # the python package name
│   ├── __init__.py           # the different models implemented
│   ├── classifiers.py
│   ├── tokenizers.py
│   └── vectorizers.py
├── requirements.txt
├── setup.cfg                 # setup configurations for our package
├── setup.py
└── tests                     # tests outside the modelkit_imdb package
    ├── __init__.py
    ├── conftest.py
    └── test_auto_testing.py
```
This way, your package can easily be shared and set up across your organization using pip, via an artifactory or git:

```shell
pip install git+https://github.com/clustree/modelkit-imdb.git
```
You can then use it as a regular Python package, with modelkit's support:

```python
import modelkit_imdb
from modelkit import ModelLibrary

# use the ModelLibrary to automatically discover and load models from modelkit_imdb
library = ModelLibrary(models=modelkit_imdb)
tokenizer = library.get("imdb_tokenizer")
tokenizer.predict("I love this movie!")
```
In addition to this minimal Python package structure, several files were also added for deployment and the GitHub CI:

- `Dockerfile` & `heroku.yml`: to automatically deploy to Heroku once tests pass on the `main` branch
- `Makefile`: to remind us how to lint, test or compute the coverage rate
- `.gitignore`: to prevent unwanted files from being versioned
- `noxfile.py`: to run the tests on multiple Python environments, useful for the CI
- etc.
You are now all set to write your machine learning models following software engineering and modelkit best practices.
For this sample project, we emulated a remote storage locally at `.remote_storage`, which contains all the different artefacts created so far:

- `vocabulary.txt`, for the `imdb_vectorizer`
- `model.h5`, for the `imdb_classifier`
In production, you may be using (or want to use) AWS S3, GCS, or any other (safely) configured remote storage: modelkit has your back and provides drivers to read and write directly on those providers (you can also write your own!).

When using a model, modelkit automatically retrieves its different assets from the remote storage, and safely caches them in a local storage.
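Switching providers is mostly a matter of environment variables; for instance, a hypothetical S3 setup could look like the following (the bucket and prefix names are made up, and you should check modelkit's documentation for the exact provider values and credentials it expects):

```shell
export MODELKIT_STORAGE_PROVIDER=s3
export MODELKIT_STORAGE_BUCKET=my-company-ml-assets   # hypothetical bucket name
export MODELKIT_STORAGE_PREFIX=modelkit-assets        # hypothetical prefix
```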
To better understand how it all works (modelkit's versioning and assets management), let's restart from scratch. Make sure the `.env` file was sourced to set the environment variables needed by modelkit.
First, let's grab the different assets from the current local storage before dropping it, as well as the remote storage:

```shell
mkdir -p tmp/classifier tmp/vectorizer
cp .local_storage/imdb/classifier/0.0/model.h5 tmp/classifier
cp .local_storage/imdb/vectorizer/0.0/vocabulary.txt tmp/vectorizer
rm -rf .local_storage/imdb .local_storage/.cache .remote_storage/imdb .remote_storage/.cache
```
Now, let's use the modelkit assets management CLI to version our two assets along with their directories, under the `imdb` namespace:

```shell
# modelkit assets new [PATH] [NAMESPACE/NAME]
modelkit assets new tmp/vectorizer imdb/vectorizer
# Current assets manager:
#  - storage provider = `<LocalStorageDriver bucket=.>`
#  - bucket = `.`
#  - prefix = `.remote_storage`
# Current asset: `imdb/vectorizer`
#  - name = `imdb/vectorizer`
# Push a new asset `imdb/vectorizer` with version `0.0`?
# [y/N]: y

modelkit assets new tmp/classifier imdb/classifier
# ...
```
The newly versioned assets are now located under the `.remote_storage` directory, along with their versions and metadata. They can now be used as part of modelkit models, in the `CONFIGURATIONS` map:
```python
class Classifier(modelkit.Model[MovieReviewItem, MovieSentimentItem]):
    CONFIGURATIONS = {
        "imdb_classifier": {
            "asset": "imdb/classifier:0.0[/model.h5]",  # namespace/name:version[subfile]
            "model_dependencies": {
                "tokenizer": "imdb_tokenizer",
                "vectorizer": "imdb_vectorizer",
            },
        }
    }
```
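The asset string follows the `namespace/name:version[subfile]` convention shown in the comment above. As a purely illustrative sketch (this is not modelkit's actual parser), here is how such a spec decomposes into its parts:

```python
import re

# Illustrative parser for the "namespace/name:version[subfile]" convention;
# both the version and the subfile parts are optional.
SPEC = re.compile(
    r"^(?P<name>[^:\[\]]+)"        # namespace/name
    r"(?::(?P<version>[^\[\]]+))?"  # optional :version pin
    r"(?:\[(?P<sub>.+)\])?$"        # optional [subfile] inside the asset
)

def parse_asset_spec(spec):
    match = SPEC.match(spec)
    return match.group("name"), match.group("version"), match.group("sub")

parse_asset_spec("imdb/classifier:0.0[/model.h5]")
# → ("imdb/classifier", "0.0", "/model.h5")
parse_asset_spec("imdb/vectorizer")
# → ("imdb/vectorizer", None, None)
```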
As you can see, they are pinned to a given version so that you can freely update them without hurting production:
```shell
modelkit assets update tmp/classifier imdb/classifier
# Current assets manager:
#  - storage provider = `<LocalStorageDriver bucket=.>`
#  - bucket = `.`
#  - prefix = `.remote_storage`
# Current asset: `imdb/classifier`
#  - name = `imdb/classifier`
#  - major version = `None`
#  - minor version (ignored) = `None`
# Found a total of 1 versions (1 major versions) for `imdb/classifier`
#  - major `0` = 0.0
# Push a new asset version `0.1` for `imdb/classifier`?
# [y/N]:
```
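Pinning interacts with this `major.minor` versioning scheme: an asset reference may omit the version entirely (latest overall), pin only the major version (latest minor of that major), or pin both. A small sketch of that resolution logic, for illustration only (not modelkit's actual implementation):

```python
def resolve(spec_version, available):
    """Resolve a version pin against available "MAJOR.MINOR" versions.

    spec_version: None (no pin), "MAJOR", or "MAJOR.MINOR"
    """
    parsed = sorted(tuple(map(int, v.split("."))) for v in available)
    if spec_version is None:
        major, minor = parsed[-1]  # no pin: latest version overall
    elif "." in spec_version:
        major, minor = map(int, spec_version.split("."))  # fully pinned
    else:
        major = int(spec_version)  # major pinned: take its latest minor
        minor = max(mi for ma, mi in parsed if ma == major)
    return f"{major}.{minor}"

print(resolve(None, ["0.0", "0.1"]))   # 0.1
print(resolve("0", ["0.0", "0.1"]))    # 0.1
print(resolve("0.0", ["0.0", "0.1"]))  # 0.0
```

With this scheme, a model pinned to `imdb/classifier:0.0` keeps working unchanged after the `0.1` push above, which is why updating assets is safe.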
They are then retrieved and cached in `.local_storage` once requested by a model:

```python
from modelkit import ModelLibrary

lib = ModelLibrary(models="modelkit_imdb")  # package or path to the package
classifier = lib.get("imdb_classifier")
classifier.predict({"text": "I love this movie so much"})
# MovieSentimentItem(label='good', score=0.6999041438102722)
```
That's it! Make sure to clean the `./tmp` folder before leaving.
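For instance:

```shell
rm -rf tmp
```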
Let's make sure everything works as intended by running some tests:

```shell
pytest
```
As you can see in the `tests/` subfolder, two things were added:

- in `conftest.py`: a pytest fixture `model_library` which creates a `ModelLibrary` with all the models found in the package:

  ```python
  import modelkit_imdb
  from modelkit.testing.fixtures import modellibrary_fixture

  modellibrary_fixture(
      models=modelkit_imdb,
      fixture_name="model_library",
  )
  ```

- in `test_auto_testing.py`: a test which iterates through all the `modelkit_imdb` models to find their tests and run them, using the just-defined `model_library` fixture:

  ```python
  import modelkit_imdb
  from modelkit.testing import modellibrary_auto_test

  modellibrary_auto_test(
      models=modelkit_imdb,
      fixture_name="model_library",
  )
  ```
For more info, head over to Testing.
The following CLI will start a single worker exposing all the models found under the `modelkit_imdb` package, leveraging uvicorn and FastAPI:

```shell
modelkit serve
```

Voilà: the uvicorn worker is now running at http://localhost:8000.
modelkit also provides out-of-the-box support for gunicorn:

```shell
gunicorn --workers 4 \
    --bind 0.0.0.0:8000 \
    --preload \
    --worker-class=uvicorn.workers.UvicornWorker \
    'modelkit.api:create_modelkit_app()'
```
Check out the generated Swagger UI at http://localhost:8000/docs to see all the endpoints and try them out.
Of course, you can also `POST` your request on the endpoint of your choice:

```shell
curl -X 'POST' \
  'http://localhost:8000/predict/imdb_classifier' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "This movie sucks! It is the worst I have ever seen in my entire life"
}'
# {"label":"bad","score":0.1530771553516388}
```
To conclude this sample project, a minimal `Dockerfile` was written, as well as a `heroku.yml` file, to host our different models on Heroku at https://modelkit-imdb.herokuapp.com/docs/.

You can also run it locally using Docker and enjoy the Swagger UI at http://localhost:8000/docs:

```shell
docker build -t modelkit-imdb .
docker run -p 8000:8000 -e PORT=8000 modelkit-imdb
```