SageMaker Hugging Face Inference Toolkit is an open-source library for serving 🤗 Transformers and Diffusers models on Amazon SageMaker. This library provides default pre-processing, prediction, and post-processing for certain 🤗 Transformers and Diffusers models and tasks. It utilizes the SageMaker Inference Toolkit for starting up the model server, which is responsible for handling inference requests.
For Training, see Run training on Amazon SageMaker.
For the Dockerfiles used for building SageMaker Hugging Face Containers, see AWS Deep Learning Containers.
For information on running Hugging Face jobs on Amazon SageMaker, please refer to the 🤗 Transformers documentation.
For notebook examples: SageMaker Notebook Examples.
Install the Amazon SageMaker Python SDK:
pip install sagemaker --upgrade
Create an Amazon SageMaker endpoint with a trained model:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with permissions to create an endpoint
# (outside a SageMaker notebook, pass your role ARN instead)
role = sagemaker.get_execution_role()

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    model_data='s3://my-trained-model/artifacts/model.tar.gz',
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
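Once the endpoint is running, you can send requests through the returned predictor. A minimal sketch, assuming the trained model serves a text-based task such as text classification:

# send a request to the endpoint; the payload format depends on the task
result = predictor.predict({
    "inputs": "I love using SageMaker for inference!",
})
print(result)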
Create an Amazon SageMaker endpoint with a model from the 🤗 Hub.
Note: This is an experimental feature in which the model is loaded after the endpoint is created. Not all SageMaker features are supported, e.g. multi-model endpoints (MME).
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with permissions to create an endpoint
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-distilled-squad',
    'HF_TASK': 'question-answering',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
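You can then query the endpoint with a payload matching the configured task. For the question-answering pipeline above, a request could look like this:

result = predictor.predict({
    "inputs": {
        "question": "Where does Clara live?",
        "context": "My name is Clara and I live in Berkeley, California.",
    }
})
print(result)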
The SageMaker Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below.
The HF_TASK environment variable defines the task for the used 🤗 Transformers pipeline. A full list of tasks can be found here.
HF_TASK="question-answering"
The HF_MODEL_ID environment variable defines the model id, which will be automatically loaded from huggingface.co/models when creating your SageMaker endpoint. The 🤗 Hub provides more than 10,000 models, all available through this environment variable.
HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english"
The HF_MODEL_REVISION environment variable is an extension to HF_MODEL_ID and allows you to pin a specific revision of the model, to make sure you always load the same model on your SageMaker endpoint.
HF_MODEL_REVISION="03b4d196c19d0a73c7e0322684e97db1ec397613"
The HF_API_TOKEN environment variable defines your Hugging Face authorization token. The HF_API_TOKEN is used as an HTTP bearer authorization for remote files, like private models. You can find your token on your settings page.
HF_API_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
The HF_TRUST_REMOTE_CODE environment variable defines whether or not to allow custom models defined on the Hub in their own modeling files. Allowed values are "True" and "False".
HF_TRUST_REMOTE_CODE="True"
The HF_OPTIMUM_BATCH_SIZE environment variable defines the batch size used when compiling the model to Neuron. The default value is 1. Not required when the model is already converted.
HF_OPTIMUM_BATCH_SIZE="1"
The HF_OPTIMUM_SEQUENCE_LENGTH environment variable defines the sequence length used when compiling the model to Neuron. There is no default value. Not required when the model is already converted.
HF_OPTIMUM_SEQUENCE_LENGTH="128"
The Hugging Face Inference Toolkit allows users to override the default methods of the HuggingFaceHandlerService. To do so, you need to create a folder named code/ with an inference.py file in it. You can find an example in sagemaker/17_customer_inference_script.
For example:
model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
|- inference.py
|- requirements.txt
In this example, pytorch_model.bin is the model file saved from training, inference.py is the custom inference module, and requirements.txt is a requirements file to add additional dependencies.
The custom module can override the following methods:

- model_fn(model_dir, context=None): overrides the default method for loading the model. The return value model will be used in predict() for predictions. It receives the argument model_dir, the path to your unzipped model.tar.gz.
- transform_fn(model, data, content_type, accept_type): overrides the default transform function with a custom implementation. Customers using this would have to implement the preprocess, predict and postprocess steps in the transform_fn. NOTE: This method can't be combined with input_fn, predict_fn or output_fn mentioned below.
- input_fn(input_data, content_type): overrides the default method for preprocessing. The return value data will be used in the predict() method for predictions. The inputs are input_data, the raw body of your request, and content_type, the content type from the request header.
- predict_fn(processed_data, model): overrides the default method for predictions. The return value predictions will be used in the postprocess() method. The input is processed_data, the result of the preprocess() method.
- output_fn(prediction, accept): overrides the default method for postprocessing. The return value result will be the response of your request (e.g. JSON). The inputs are predictions, the result of the predict() method, and accept, the accept type from the HTTP request, e.g. application/json.
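As an illustration, a minimal inference.py overriding the four separate methods might look like the sketch below. The text-classification task and the JSON request format are assumptions; adapt them to the artifacts in your model.tar.gz:

import json
from transformers import pipeline

def model_fn(model_dir, context=None):
    # load the model and tokenizer from the unzipped model.tar.gz
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)

def input_fn(input_data, content_type):
    # parse the raw request body into the pipeline input
    return json.loads(input_data)["inputs"]

def predict_fn(processed_data, model):
    # run the pipeline on the preprocessed data
    return model(processed_data)

def output_fn(prediction, accept):
    # serialize the predictions into the response body
    return json.dumps(prediction)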
The SageMaker Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2 you have three options:

- Provide HF_MODEL_ID, the model repo id on huggingface.co which contains the compiled model under the .neuron format, e.g. optimum/bge-base-en-v1.5-neuronx.
- Provide the HF_OPTIMUM_BATCH_SIZE and HF_OPTIMUM_SEQUENCE_LENGTH environment variables to compile the model on the fly, e.g. HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128.
- Include a neuron dictionary in the config.json file in the model archive, e.g. "neuron": {"static_batch_size": 1, "static_sequence_length": 128}.

The currently supported tasks can be found here. If you plan to deploy an LLM, we recommend taking a look at Neuronx TGI, which is purpose-built for LLMs.
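For illustration, an on-the-fly compilation deployment via the SageMaker Python SDK could look like the sketch below. The image URI placeholder and the instance type are assumptions; pick a Neuronx-enabled Hugging Face inference image from AWS Deep Learning Containers for your region:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# compile the model for Neuron on the fly (second option above)
hub = {
    'HF_MODEL_ID': 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK': 'text-classification',
    'HF_OPTIMUM_BATCH_SIZE': '1',
    'HF_OPTIMUM_SEQUENCE_LENGTH': '128',
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    image_uri='<neuronx-enabled-huggingface-inference-image-uri>',  # hypothetical placeholder
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.inf2.xlarge',  # Inferentia2 instance
)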
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
SageMaker Hugging Face Inference Toolkit is licensed under the Apache 2.0 License.
Install all test and development packages with:
pip3 install -e ".[test,dev]"
- Manually change MMS_CONFIG_FILE:
wget -O sagemaker-mms.properties https://raw.githubusercontent.com/aws/deep-learning-containers/master/huggingface/build_artifacts/inference/config.properties
- Run the container, e.g. for text-to-image:
HF_MODEL_ID="stabilityai/stable-diffusion-xl-base-1.0" HF_TASK="text-to-image" python src/sagemaker_huggingface_inference_toolkit/serving.py
- Adjust handler_service.py and comment out if content_type in content_types.UTF8_TYPES: (that check is needed for SageMaker but cannot be used locally).
- Send a request:
curl --request POST \
  --url http://localhost:8080/invocations \
  --header 'Accept: image/png' \
  --header 'Content-Type: application/json' \
  --data '{"inputs": "Camera"}' \
  --output image.png
Note: You need to run this on an Inferentia2 instance.
- Manually change MMS_CONFIG_FILE:
wget -O sagemaker-mms.properties https://raw.githubusercontent.com/aws/deep-learning-containers/master/huggingface/build_artifacts/inference/config.properties
- Adjust handler_service.py and comment out if content_type in content_types.UTF8_TYPES: (that check is needed for SageMaker but cannot be used locally).
- Run the container:
- transformers text-classification with HF_OPTIMUM_BATCH_SIZE and HF_OPTIMUM_SEQUENCE_LENGTH:
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 python src/sagemaker_huggingface_inference_toolkit/serving.py
- sentence-transformers feature-extraction with HF_OPTIMUM_BATCH_SIZE and HF_OPTIMUM_SEQUENCE_LENGTH:
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 python src/sagemaker_huggingface_inference_toolkit/serving.py
- Send a request:
curl --request POST \
--url http://localhost:8080/invocations \
--header 'Content-Type: application/json' \
--data "{\"inputs\": \"I like you.\"}"