Error extracting reasoning_content
Here is how I start the server with vLLM:
docker run -d \
  --name vllm-custom \
  --gpus device=0 \
  -p 8000:8000 \
  -v /data/lab/Models/zai/GLM-4.6V-Flash:/models \
  vllm-glm46v-t5 \
  /models \
  --host 0.0.0.0 \
  --max-model-len 3840 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name GLM-4.6V-Flash
When I call the endpoint with:
predict_ret = client.chat.completions.create(
    model='GLM-4.6V-Flash',
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True
        }
    }
)
I can't get predict_ret.choices[0].message.reasoning_content; all of the thinking output is put into predict_ret.choices[0].message.content instead.
Example:
Prompt: what is bug ?
print(f"cot_Content: {predict_ret.choices[0].message.reasoning_content}\n")
print(f"Content: {predict_ret.choices[0].message.content}\n")
cot_Content: None
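Content: (translated from Chinese) The user is asking "what is bug?", that is, "what is a bug?".
First, I need to understand what the user needs. The user may be asking for the definition of a bug in software or a program, or may be asking about bugs in a broader sense, such as insects in biology, or errors in general. However, given the context, the user is probably in a programming or technical field, so I should mainly explain bugs in software or programs.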
...
The vLLM version is 0.13.0, with transformers 5.0.0rc0.
I have the same problem. There seems to be something wrong with the reasoning parser.
When I switch to the vLLM nightly build, everything ends up in reasoning_content and content is empty.
It seems that either the end-of-thinking token is not generated properly, or the chat template has a problem.
Funnily enough, when I run it through open-webui (which uses the streaming endpoint), thinking and content are split properly in the UI.
I don't know enough about how the chat template and reasoning parsers work to say more, though.
Any help would be appreciated.
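Until the parser issue is sorted out, a client-side fallback is one option for the non-streaming case. The sketch below splits the raw content on a closing think tag. It assumes the model wraps its reasoning in `<think>` ... `</think>`, which this thread does not confirm, and `split_reasoning` is a hypothetical helper, not part of vLLM or the OpenAI client. If the closing tag is never generated (one of the suspicions above), it returns the text unchanged.

```python
from typing import Optional, Tuple

# Assumed tag names; not confirmed anywhere in this thread.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning, answer).

    Returns (None, text) when no closing tag is present, because in that
    case there is no reliable way to tell reasoning from answer.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        return None, text
    reasoning = head.strip()
    # The opening tag may be missing if the chat template already put it
    # in the prompt, so strip it only when it is actually there.
    if reasoning.startswith(THINK_OPEN):
        reasoning = reasoning[len(THINK_OPEN):].strip()
    return reasoning, tail.strip()
```

It could be applied as `reasoning, answer = split_reasoning(predict_ret.choices[0].message.content)`; a `None` for `reasoning` would suggest the end tag is missing from the output.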
For now I'm using it with streaming, and that seems to work properly. If you want to test it, try it with streaming.
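For anyone who wants to try the streaming route from Python, here is a rough sketch of how the streamed deltas might be gathered. It assumes each chunk carries the reasoning in `delta.reasoning_content`, mirroring the field name used in the first post; `collect_stream` is a hypothetical helper, and the fake chunks at the bottom only stand in for a real server response.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, str]:
    """Accumulate streamed deltas into (reasoning, content) strings."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        delta = chunk.choices[0].delta
        # Assumption: reasoning arrives in `reasoning_content`.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def _fake_chunk(**fields):
    """Build a stand-in chunk for a local dry run (no server needed)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**fields))])


demo = [
    _fake_chunk(reasoning_content="The user asks "),
    _fake_chunk(reasoning_content="about bugs."),
    _fake_chunk(content="A bug is "),
    _fake_chunk(content="a defect."),
]
demo_reasoning, demo_content = collect_stream(demo)
```

Against a real server, the same helper would take the iterator returned by `client.chat.completions.create(..., stream=True)` with the same `model`, `messages`, and `extra_body` as in the first post.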