What is the JSON/dictionary input format for the VLM-R1 model converted here?

#1
by briliantnugraha - opened

Hi, could you share the JSON format used here?
I tried passing in JSON similar to the example in the original VLM-R1's HF repo, but the output just repeats the same thing over and over.
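For reference, here is roughly what I am feeding in. This is only a minimal sketch following the Qwen2.5-VL-style message list from the original VLM-R1 examples; the model path, image path, and question text are placeholders, not taken from this repo.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "path/to/vlm-r1-checkpoint"  # placeholder, not this repo's files
IMAGE_PATH = "test.jpg"                 # placeholder image

question = (
    "Please carefully check the image and detect the following objects: "
    "['find all nespresso capsules in the image']. "
    "Output the bbox coordinates of detected objects in <answer></answer>."
)

# One user turn, with an image part followed by a text part.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": IMAGE_PATH},
            {"type": "text", "text": question},
        ],
    }
]

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Render the chat template, collect the vision inputs, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

With that input, the run below shows how the output keeps repeating.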

VLM_R1/$ python3 test_vlm.py 
input image finish
===

messages:  First thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Please carefully check the image and detect the following objects: ['find all nespresso capsules in the image']. Output the bbox coordinates of detected objects in <answer></answer>. The bbox coordinates in Markdown format should be: 
json
[{"bbox_2d": [x1, y1, x2, y2], "label": "object name"}]

 If no targets are detected in the image, simply respond with "None".
===

[VLM] get response =<think>Let's analyze this step by step. The question asks specifically for nespresso capsules. These are small, cylindrical objects typically used to make coffee or espresso. They often have a distinctive shape and color that can help identify them.</think><answer>
<think>
The task is asking for the detection of Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> First thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Please carefully check the image and detect all<points x1="70" y1="25" alt="all">all</points><think> 
Let's analyze this step by step. The question asks specifically for Nespresso capsules, which are small cylindrical objects typically used to make coffee or espresso.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> First thinks about the reasoning process in the mind and then provides the user with the answer.<think>
Let's analyze this step by step. The question asks specifically for Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer> Firstly, let's think about the reasoning process needed to detect nespresso capsules in the image.
<think>
To identify Nespresso capsules, we need to look for small cylindrical objects that are typically used for making coffee or espresso. These capsules usually have a distinctive shape and color.        
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]


</answer>
thinking_process:  Let's analyze this step by step. The question asks specifically for nespresso capsules. These are small, cylindrical objects typically used to make coffee or espresso. They often have a distinctive shape and color that can help identify them.
answer_output:  <think>
The task is asking for the detection of Nespresso capsules.
</think>
<answer>
json
[
 {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}
]

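In case it is relevant, this is how I extract the JSON from the `<answer>` block. It is a rough sketch based only on the tag and fence format that the prompt template above asks for; the helper name is mine.

```python
import json
import re

def parse_answer(response: str):
    """Take the first <answer>...</answer> block and parse the JSON list inside it."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not match:
        return None
    body = match.group(1)
    # The model sometimes wraps the JSON in a markdown fence; strip it if present.
    body = re.sub(r"`{3}(?:json)?", "", body).strip()
    if body == "None":  # the prompt says the model answers "None" when nothing is found
        return None
    return json.loads(body)

# Example using the answer block from the log above:
sample = '<answer>\n[\n {"bbox_2d": [105, 64, 139, 87], "label": "nespresso capsules"}\n]\n</answer>'
print(parse_answer(sample))
# -> [{'bbox_2d': [105, 64, 139, 87], 'label': 'nespresso capsules'}]
```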
You'd have to ask the model creator; it's the first link in the model card. We only quantize the model.
