What is the JSON/dictionary input format for the VLM-R1 model converted here?

#1
by briliantnugraha - opened

Hi, could you share the JSON format used here?
I tried passing in JSON similar to the example in the original VLM-R1's HF repo, but the output just repeats the same thing over and over.
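For reference, here is roughly what I am feeding in. This is only a minimal sketch following the Qwen2.5-VL-style message list from the original VLM-R1 examples; the model path, image path, and question text are placeholders, not taken from this repo.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "path/to/vlm-r1-checkpoint"  # placeholder, not this repo's files
IMAGE_PATH = "test.jpg"                 # placeholder image

question = (
    "Please carefully check the image and detect the following objects: "
    "['find all nespresso capsules in the image']. "
    "Output the bbox coordinates of detected objects in <answer></answer>."
)

# One user turn, with an image part followed by a text part.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": IMAGE_PATH},
            {"type": "text", "text": question},
        ],
    }
]

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Render the chat template, collect the vision inputs, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

With that input, the run below shows how the output keeps repeating.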

VLM_R1/$ python3 test_vlm.py 
input image finish
===

messages:  First thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Please carefully check the image and detect the following objects: ['find all nespresso capsules in the image']. Output the bbox coordinates of detected objects in <answer></answer>. The bbox coordinates in Markdown format should be: 
json
[{"bbox_2d": [x1, y1, x2, y2], "label": "object name"}]

 If no targets are detected in the image, simply respond with "None".
===

[VLM] get response =<think>Let's analyze this step by step. The question asks specifically for nespresso capsules. These are small, cylindrical objects typically used to make coffee or espresso. They often have a distinctive shape and color that can help identify them.</think><answer>
<think>
The task is asking for the detection of Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> First thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Please carefully check the image and detect all<points x1="70" y1="25" alt="all">all</points><think> 
Let's analyze this step by step. The question asks specifically for Nespresso capsules, which are small cylindrical objects typically used to make coffee or espresso.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> First thinks about the reasoning process in the mind and then provides the user with the answer.<think>
Let's analyze this step by step. The question asks specifically for Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> Firstly, let's think about the reasoning process needed to detect nespresso capsules in the image.
<think>
To identify Nespresso capsules, we need to look for small cylindrical objects that are typically used for making coffee or espresso. These capsules usually have a distinctive shape and color.        
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer>
thinking_process:  Let's analyze this step by step. The question asks specifically for nespresso capsules. These are small, cylindrical objects typically used to make coffee or espresso. They often have a distinctive shape and color that can help identify them.
answer_output:  <think>
The task is asking for the detection of Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]

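In case it is relevant, this is how I extract the JSON from the `<answer>` block. It is a rough sketch based only on the tag and fence format that the prompt template above asks for; the helper name is mine.

```python
import json
import re

def parse_answer(response: str):
    """Take the first <answer>...</answer> block and parse the JSON list inside it."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not match:
        return None
    body = match.group(1)
    # The model sometimes wraps the JSON in a markdown fence; strip it if present.
    body = re.sub(r"`{3}(?:json)?", "", body).strip()
    if body == "None":  # the prompt says the model answers "None" when nothing is found
        return None
    return json.loads(body)

# Example using the answer block from the log above:
sample = '<answer>\n[\n {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}\n]\n</answer>'
print(parse_answer(sample))
# -> [{'bbox_2d': [105, 64, 139, 87], 'label': 'nespresso capsules'}]
```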
You'd have to ask the model creator; it's the first link in the model card. We only quantize the model.
