Question about tokenizer_manager.py line 200: Regarding the preprocessing of multimodal models #1824
Replies: 4 comments
-
Question 1: According to the code here, the multimodal input_ids are generated in the preprocessing stage, and that process is asynchronous. This means inference for the vision model and inference for the LLM may happen at the same time. Will there be any conflict?
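One way to reason about why overlap need not be a conflict: within a single request, the LLM step only starts after that request's own image preprocessing has completed, while preprocessing for *other* requests may run concurrently. The sketch below illustrates this scheduling pattern with plain asyncio; all names (`preprocess_images`, `run_llm`, `handle_request`) are hypothetical stand-ins, not the actual sglang implementation.

```python
import asyncio

async def preprocess_images(req_id: str) -> str:
    # Stand-in for asynchronous image preprocessing (e.g. vision-tower features).
    await asyncio.sleep(0.01)
    return f"{req_id}-features"

async def run_llm(req_id: str, features: str) -> str:
    # Stand-in for LLM generation that consumes the visual features.
    await asyncio.sleep(0.01)
    return f"{req_id}-output({features})"

async def handle_request(req_id: str) -> str:
    # Within one request, the LLM step awaits its own preprocessing,
    # so the two stages of the same request can never race.
    features = await preprocess_images(req_id)
    return await run_llm(req_id, features)

async def main() -> list[str]:
    # Across requests, preprocessing for one request can overlap with
    # LLM inference for another; each request only touches its own data.
    return await asyncio.gather(*(handle_request(f"r{i}") for i in range(3)))

results = asyncio.run(main())
print(results)
```

The key point is that concurrency happens across independent requests, not between the two stages of the same request.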
-
Question 2: In the few multimodal models implemented so far, process_images_async() does not return an input_ids field, so the code on line 200 here confuses me.
-
My current dilemma: I want to add support for a new multimodal model, but I am not sure whether process_images_async() should return the input_ids for the image mappings or the raw pixel data.
-
You can learn from the existing multimodal examples (Llama 3.2 and LLaVA). sglang/python/sglang/srt/managers/tokenizer_manager.py Lines 206 to 210 in f407fcf
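To make the pattern from those examples concrete, here is a minimal, self-contained sketch of what a process_images_async-style step can return: expanded input_ids in which an image placeholder is replaced by repeated image-token ids, with the raw pixel data carried alongside for the vision tower. Everything here is hypothetical for illustration (`IMAGE_TOKEN_ID`, `NUM_PATCH_TOKENS`, `fake_tokenize`); check the referenced lines of tokenizer_manager.py for the actual contract.

```python
import asyncio

IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_ID = 32000   # hypothetical id reserved for image patch slots
NUM_PATCH_TOKENS = 4     # hypothetical number of placeholder slots per image

def fake_tokenize(text: str) -> list[int]:
    # Toy tokenizer: one id per whitespace-separated word (illustration only).
    return [hash(w) % 1000 for w in text.split()]

async def process_images_async(prompt: str, image_data: bytes) -> dict:
    # Tokenize the text around the <image> placeholder and expand the
    # placeholder into NUM_PATCH_TOKENS repeated image-token ids.
    before, _, after = prompt.partition(IMAGE_TOKEN)
    input_ids = (
        fake_tokenize(before)
        + [IMAGE_TOKEN_ID] * NUM_PATCH_TOKENS
        + fake_tokenize(after)
    )
    return {
        "input_ids": input_ids,      # the token sequence the LLM will consume
        "pixel_values": image_data,  # raw pixels, forwarded to the vision model
    }

out = asyncio.run(process_images_async("describe <image> briefly", b"\x00\x01"))
```

In this sketch the answer to the dilemma is "both": the returned dict carries the expanded input_ids and the pixel data, so the scheduler can hand each to the right model.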
-
sglang/python/sglang/srt/managers/tokenizer_manager.py Line 200 in 6fcd6d7