[Bug]: docker image openvino/model_server:latest-gpu does not serve the model correctly #27541

fedecompa · 2024-11-13T15:02:58Z

OpenVINO Version

2024.3

Operating System

Windows System

Device used for inference

Intel UHD Graphics GPU

Framework

None

Model used

meta-llama/Llama-3.2-3B-Instruct

Issue description

I deployed the Llama 3.2 3B model using the image openvino/model_server:latest-gpu, following the documentation here:

https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_demos_continuous_batching.html

and the folder structure for the openvino IR model:

https://github.com/openvinotoolkit/model_server/blob/main/docs/models_repository.md

The command in my docker-compose is:
command: --model_path /workspace/Llama-3.2-3B-Instruct --model_name meta-llama/Llama-3.2-3B-Instruct --port 9001 --rest_port 8001 --target_device GPU

From the logs in the container I see that the server loads the model and starts correctly. Indeed, if I call the API http://localhost:8001/v1/config I obtain:

{
"meta-llama/Llama-3.2-3B-Instruct" :
{
"model_version_status": [
{
"version": "1",
"state": "AVAILABLE",
"status": {
"error_code": "OK",
"error_message": "OK"
}
}
]
}
}

However, when I call the completions endpoint I get a 404:

{
"error": "Model with requested name is not found"
}
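The comparison between the name reported by /v1/config and the name sent to the completions endpoint can be scripted. This is a minimal sketch, assuming the config response has the shape shown above; the helper name `available_models` is hypothetical and not part of any OpenVINO tooling. As this report shows, a model being listed as AVAILABLE here does not by itself guarantee that the completions endpoint can serve it.

```python
import json


def available_models(config_response: str) -> list[str]:
    """Return the names that have at least one version in state AVAILABLE."""
    config = json.loads(config_response)
    names = []
    for name, info in config.items():
        statuses = info.get("model_version_status", [])
        if any(status.get("state") == "AVAILABLE" for status in statuses):
            names.append(name)
    return names


# Response body copied from the report above.
SAMPLE = """
{
  "meta-llama/Llama-3.2-3B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {"error_code": "OK", "error_message": "OK"}
      }
    ]
  }
}
"""

requested = "meta-llama/Llama-3.2-3B-Instruct"
print(requested in available_models(SAMPLE))
```

If the requested name is present here and the completions call still returns 404, a simple typo in the name is less likely and the cause probably lies in how the model was deployed.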

Step-by-step reproduction

No response

Relevant log output

No response

Issue submission checklist

I'm reporting an issue. It's not a question.
I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
There is reproducer code and related data files such as images, videos, models, etc.

Iffa-Intel · 2024-11-15T07:00:33Z

@fedecompa I also encountered several issues when attempting the steps in the guide you shared (How to serve LLM models with Continuous Batching via OpenAI API) on Windows.

Please note that this demo was officially validated on Intel® Xeon® processors Gen4 and Gen5 and on Intel dGPU Arc and Flex models, on Ubuntu 22/24 and RedHat 8/9. Other OS/hardware combinations might work, but issues are to be expected.

fedecompa · 2024-11-15T13:04:05Z

@Iffa-Intel thanks for the reply.
Actually, the GPU is detected correctly from the Docker container running on WSL2 Ubuntu 22.
The model also runs correctly with OVModelForCausalLM in Python, locally on Windows:

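# OVModelForCausalLM is provided by the optimum-intel package
from optimum.intel import OVModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.2-3B-Instruct-int4-sym-ov"
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU.0", trust_remote_code=True)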

So it is actually very strange...

Iffa-Intel · 2024-11-19T06:09:59Z

@fedecompa we'll investigate this further and get back to you. It is probably related to architectural differences between WSL2 on Windows and native Ubuntu that affect how the OpenVINO libraries behave.

avitial · 2024-11-22T21:52:26Z

@fedecompa I see you listed version 2024.3. I've just tried the 2024.5 version of the model server image for GPU and the issue does not reproduce. Would it be possible to try the latest version? I hope this resolves the issue on your side; let us know if you have any questions or if the issue persists. Note that I tried meta-llama/Meta-Llama-3-8B-Instruct; let me also check with meta-llama/Llama-3.2-3B-Instruct. Based on the error, it might be caused by a mismatch in the model's name.

$ python model_server/demos/common/export_models/export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --cache_size 2 --config_file_path models/config.json --model_repository_path models --overwrite_models

$ docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:2024.5-gpu --rest_port 8000 --config_path /workspace/config.json

$ curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens":30,
    "stream":false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }'| jq .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   733  100   439  100   294    166    111  0:00:02  0:00:02 --:--:--   278
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying deep learning (DL) models on various hardware platforms, including CPU,",
        "role": "assistant"
      }
    }
  ],
  "created": 1732311879,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 30,
    "total_tokens": 57
  }
}
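For anyone who prefers to reproduce the check from Python rather than curl, the same request could be sent with the standard library. This is a minimal sketch, assuming the server listens on http://localhost:8000 as in the docker command above; `build_chat_request` and `ask` are hypothetical helper names, and the payload mirrors the curl body.

```python
import json
import urllib.request


def build_chat_request(base_url, model, question, max_tokens=30):
    """Build a POST request equivalent to the curl call above."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v3/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, question):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, question)
    # urlopen raises urllib.error.HTTPError on a 404 such as
    # "Model with requested name is not found".
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# print(ask("http://localhost:8000", "meta-llama/Meta-Llama-3-8B-Instruct", "What is OpenVINO?"))
```

The "model" field must match the name the server was started with, since a name mismatch is one of the suspected causes of the 404.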

avitial · 2024-11-22T22:27:47Z

@fedecompa I just checked with the meta-llama/Llama-3.2-3B-Instruct model and it also works with 2024.5. Please give it a go on your end and see if it resolves the issue on your side. Hope this helps!

$ curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "max_tokens":30,
    "stream":false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }'| jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   726  100   435  100   291    285    191  0:00:01  0:00:01 --:--:--   476
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "OpenVINO is an open-source software development kit (SDK) used for optimized performance on various platforms, especially with deep learning and AI applications. It",
        "role": "assistant"
      }
    }
  ],
  "created": 1732314311,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 47,
    "completion_tokens": 30,
    "total_tokens": 77
  }
}

fedecompa · 2024-11-27T14:50:14Z

@avitial when I run the Python script from a WSL2 terminal, it works for the embedding model "BAAI/bge-m3" with the latest-gpu openvino/model_server docker image (2024.5).

But it still fails for the chat model "meta-llama/Llama-3.2-3B-Instruct" with this error:

Error during llm node initialization for models_path: /workspace/meta-llama/Llama-3.2-3B-Instruct/./ exception: Exception from src/inference/src/cpp/infer_request.cpp:245:
2024-11-27 15:26:55 Exception from src/plugins/intel_cpu/src/graph.cpp:1365:
2024-11-27 15:26:55 Node VocabDecoder_122 of type Reference
2024-11-27 15:26:55 Check 'inputs.size() == 4' failed at /openvino_tokenizers/src/vocab_decoder.cpp:33:
2024-11-27 15:26:55 Too few inputs passed to VocabDecoder, it means it is not converted properly or it is not used in the supported pattern
2024-11-27 15:26:55 [2024-11-27 14:26:55.428][1][serving][error][mediapipegraphdefinition.cpp:467] Failed to process LLM node graph meta-llama/Llama-3.2-3B-Instruct
2024-11-27 15:26:55 [2024-11-27 14:26:55.428][1][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: meta-llama/Llama-3.2-3B-Instruct state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent
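When the container log is long, the final state of each graph can be pulled out with a small script. This is a minimal sketch, assuming the log lines follow the "Mediapipe: <name> state changed to: <state> after handling: <event>" pattern seen above; the helper name `graph_states` is hypothetical.

```python
import re

# Pattern taken from the modelmanager log line in the report above.
STATE_RE = re.compile(
    r"Mediapipe: (?P<name>\S+) state changed to: (?P<state>\w+) "
    r"after handling: (?P<event>\w+)"
)


def graph_states(log_text):
    """Map each graph name to the last state reported for it in the log."""
    last_state = {}
    for match in STATE_RE.finditer(log_text):
        last_state[match.group("name")] = match.group("state")
    return last_state


# Log line copied from the report above.
SAMPLE_LOG = (
    "2024-11-27 15:26:55 [2024-11-27 14:26:55.428][1][modelmanager][info]"
    "[pipelinedefinitionstatus.hpp:59] Mediapipe: meta-llama/Llama-3.2-3B-Instruct "
    "state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent"
)

print(graph_states(SAMPLE_LOG))
```

A graph whose last state is LOADING_PRECONDITION_FAILED did not finish loading, so the lines just before that transition (here, the VocabDecoder check) are the place to look.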

fedecompa added the bug and support_request labels Nov 13, 2024

YuChern-Intel assigned Iffa-Intel Nov 13, 2024

Iffa-Intel added the PSE label Nov 19, 2024

avitial self-assigned this Nov 22, 2024

avitial added the category: GPU label and removed the bug label Nov 22, 2024
