jianwenzh committed on
Commit
42bd386
·
verified ·
1 Parent(s): 263bf32

Upload folder using huggingface_hub

Browse files
LICENSE.md ADDED
@@ -0,0 +1,67 @@
1
+ # Model License, Data Attribution & Disclaimer
2
+
3
+ ## Model License
4
+ This model and its associated weights are released under the
5
+ **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**.
6
+
7
+ You are free to:
8
+ - **Share** — copy and redistribute the model in any medium or format
9
+ - **Adapt** — remix, transform, and build upon the model
10
+
11
+ Under the following terms:
12
+ - **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
13
+ - **NonCommercial** — You may not use the model for commercial purposes.
14
+ - **ShareAlike** — If you modify, fine-tune, or build upon the model, you must distribute your contributions under the same license.
15
+
16
+ Full license text: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
17
+
18
+ ---
19
+
20
+ ## Data Sources & Attribution
21
+ This model was trained on **derived data** based on the following publicly available datasets. **No original dataset content is included in this release.**
22
+
23
+ - Datasets under `CC BY-NC-SA 4.0`
24
+
25
+ - [UGroundV1](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
26
+ - [UGround-V1-Data-Box](https://huggingface.co/datasets/osunlp/UGround-V1-Data-Box)
27
+ - [GTA1 grounding dataset](https://huggingface.co/datasets/HelloKKMe/grounding_dataset)
28
+
29
+ - Datasets under `MIT License`
30
+ - [AgentNet](https://huggingface.co/datasets/xlangai/AgentNet)
31
+ - [GUI-R1](https://huggingface.co/datasets/ritzzai/GUI-R1): only used in evaluation.
32
+
33
+ - Datasets under `Apache License 2.0`
34
+ - [Jedi](https://huggingface.co/datasets/xlangai/Jedi)
35
+ - [GUI-Net-Mini](https://huggingface.co/datasets/Bofeee5675/GUI-Net-Mini)
36
+ - [GUI-Net-1M](https://huggingface.co/datasets/Bofeee5675/GUI-Net-1M)
37
+ - [Aguvis-stage1](https://huggingface.co/datasets/xlangai/aguvis-stage1)
38
+ - [Aguvis-stage2](https://huggingface.co/datasets/xlangai/aguvis-stage2)
39
+ - [OS-Atlas-data](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data)
40
+
41
+ - Unlicensed datasets
42
+ - [DocVQA](https://www.docvqa.org/): Publicly available, no license restrictions.
43
+
44
+
45
+ All rights for these datasets remain with their respective authors and licensors.
46
+
47
+ ---
48
+
49
+ ## Combined Licensing Context
50
+ Because several training datasets are licensed under **CC BY-NC-SA 4.0**, this model must also be distributed under that license.
51
+ Datasets under the MIT and Apache 2.0 licenses are license-compatible and impose no additional share-alike obligations.
52
+
53
+ ---
54
+
55
+ ## Disclaimer
56
+ This model and documentation are provided **“as is”**, without warranty of any kind, express or implied, including but not limited to merchantability, fitness for a particular purpose, and non-infringement.
57
+
58
+ The authors and contributors assume no responsibility for how this model or any derivative works are used.
59
+ Users are solely responsible for ensuring compliance with all applicable dataset licenses, laws, and regulations.
60
+ Commercial use of this model is **not permitted** under the CC BY-NC-SA 4.0 license.
61
+
62
+ ---
63
+
64
+ ## © Copyright
65
+ © Vocaela AI, 2025
66
+
67
+ All rights reserved except as granted under the licenses above.
README.md CHANGED
@@ -1,3 +1,413 @@
1
- ---
2
- license: cc-by-nc-sa-4.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-sa-4.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - HuggingFaceTB/SmolVLM2-500M-Video-Instruct
7
+ ---
8
+
9
+ # Vocaela-500M: A Tiny Mighty GUI Agent Model
10
+
11
+ **TL;DR:**
12
+ A compact 500M-parameter Vision-Language Model (VLM) designed for GUI agents. Given a screenshot and a simple instruction (e.g., “click the submit button”), it outputs structured JSON actions with precise pixel coordinates. Despite its small size, it delivers surprisingly strong performance—often surpassing much larger models—while running smoothly on laptops and even mobile devices.
13
+
14
+ ## Model description
15
+
16
+ A growing number of models can now operate computer and mobile GUIs on behalf of users. However, most are massive and impractical for everyday devices like laptops or phones. While many GUI agent models chase higher autonomy, Vocaela-500M explores a different path: a smaller, efficient model focused on precise low-level control.
17
+
18
+ Given a screenshot and an explicit instruction such as “click the submit button,” it produces structured JSON actions with pixel coordinates. By narrowing the scope, we maximize efficiency, achieving smooth performance on laptops and even mobile devices.
19
+
20
+ Despite its compact 500M parameters, Vocaela-500M performs surprisingly well on grounding and GUI control tasks—often matching or surpassing larger models. This marks a new step in scaling GUI agent models downward toward lightweight, practical deployment.
21
+
22
+ - Type: Vision-Language Model (VLM) for Computer GUI Agents
23
+ - Size: 500M parameters
24
+ - Input: Screenshot + natural language instruction (specific GUI action)
25
+ - Output: Structured JSON describing GUI action(s), including pixel coordinates
26
+ - Recommended image resolution: longer edge < 2048
27
+ - Fine-tuned from: [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) (referred to as `SmolVLM2-500M` below)
28
+ - License: [CC BY-NC-SA 4.0](./LICENSE.md)
29
+ - Developed by: [Vocaela AI](https://vocaela.ai/)
30
+
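To respect the recommended resolution above, screenshots can be downscaled before preprocessing. Below is a minimal sketch of the aspect-preserving resize arithmetic; the helper name is ours, and actual resizing would use an image library:

```python
def vocaela_target_size(w: int, h: int, max_edge: int = 2047):
    """Return (w, h) downscaled so the longer edge is at most max_edge,
    preserving aspect ratio; images already small enough are untouched."""
    longest = max(w, h)
    if longest <= max_edge:
        return w, h
    scale = max_edge / longest
    return max(1, round(w * scale)), max(1, round(h * scale))

print(vocaela_target_size(4096, 1024))  # longer edge capped at 2047
```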
31
+ ## Action space
32
+
33
+ The following table lists the default action schema used during training. Users may extend or redefine it via system prompts.
34
+
35
+ | | Action | Parameters | Parameters' Values | Example | Meaning |
36
+ |:-------------|:--------------|:------------------------|:----------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------|:-|
37
+ | Common | type | text | string, the text to type in | {"action": "type", "text": "example"} | Typing in specified text |
38
+ | | click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "click", "coordinate": [0.1,0.5]} | Click using mouse or tap using finger at specified position |
39
+ | Desktop Only | mouse_move | coordinate | [x,y], scaled [0, 1), position to move to | {"action": "mouse_move", "coordinate": [0.1,0.5]} | Move mouse to specified position |
40
+ | | drag | coordinate, coordinate2 | [x,y], scaled [0, 1), start (`coordinate`) and end (`coordinate2`) position to drag | {"action": "drag", "coordinate": [0.1,0.5], "coordinate2": [0.2,0.6]} | Drag mouse (click left button and hold) from specified start position to end position |
41
+ | | right_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "right_click", "coordinate": [0.1,0.5]} | Click right mouse button at specified position |
42
+ | | middle_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "middle_click", "coordinate": [0.1,0.5]} | Click middle mouse button at specified position |
43
+ | | double_click | coordinate | [x,y], scaled [0, 1), position to click on | {"action": "double_click", "coordinate": [0.1,0.5]} | Double click left mouse button at specified position |
44
+ | | scroll | scroll_direction | enum: {'up', 'down'} | {"action": "scroll", "scroll_direction": "up"} | Scroll mouse wheel with specified direction |
45
+ | | press_key | key, presses | `key`: string, the single key to press; `presses`: integer, number of times to press | {"action": "press_key", "key": "enter"} | Press a single key |
46
+ | | hotkey | hotkeys | list of strings, combination of keys to press, e.g., ["ctrl", "c"] | {"action": "hotkey", "hotkeys": ["ctrl", "c"]} | Press a hotkey combination, e.g., Ctrl+C |
47
+ | Mobile Only | long_press | coordinate, time | `coordinate`: [x,y], scaled [0, 1), position to press on; `time`: seconds to hold | {"action": "long_press", "coordinate": [0.1,0.5], "time": 5} | Press at specified position and hold for specified time (s) |
48
+ | | swipe | swipe_direction, swipe_from, coordinate | `swipe_direction`: direction to swipe towards, enum {'up', 'down', 'left', 'right'}; `swipe_from`: general area to swipe from, enum {'top', 'bottom', 'left', 'right', 'center', 'top_left', 'top_right', 'bottom_left', 'bottom_right'}; `coordinate`: [x,y], scaled [0, 1), exact position to swipe from. `swipe_from` and `coordinate` are optional. | {"action": "swipe", "coordinate": [0.1,0.5], "swipe_direction": "up"} | Swipe from the specified start position towards the specified direction |
49
+ | | system_button | button | string, system button to press, enum: {'Back', 'Home', 'Menu', 'Enter'} | {"action": "system_button", "button": "Home"} | Press a specified system button |
50
+ | | open | text | string, name of app to open | {"action": "open", "text": "Google Chrome"} | Open a specified app |
51
+
52
+ See Section [System messages](#system-messages) below for examples of how to specify the action space in the prompt.
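Downstream, the model's `<Action>`-wrapped JSON output must be parsed and its normalized coordinates mapped to pixels. A minimal sketch, assuming a known screen size; the helper name and error handling are ours:

```python
import json
import re

def parse_actions(model_output: str, screen_w: int, screen_h: int):
    """Extract the JSON action array from an <Action>...</Action> response
    and convert normalized [0, 1) coordinates to pixel positions."""
    match = re.search(r"<Action>(.*?)</Action>", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no <Action> block found")
    actions = json.loads(match.group(1))
    for act in actions:
        for key in ("coordinate", "coordinate2"):
            if key in act:
                x, y = act[key]
                act[key] = [round(x * screen_w), round(y * screen_h)]
    return actions

out = '<Action>[{"action": "click", "coordinate": [0.1, 0.5]}]</Action>'
print(parse_actions(out, 1920, 1080))  # [{'action': 'click', 'coordinate': [192, 540]}]
```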
53
+
54
+ ## How to use
55
+
56
+ The model is used in the same way as SmolVLM2-500M. The example below shows how to load the model and processor, construct multimodal messages, and perform inference. For system messages, please refer to Section [System messages](#system-messages).
57
+ For a complete runnable example of loading the model, creating messages, preprocessing, and running inference, see the simple demo [vocaela-500m-demo](https://github.com/vocaela/vocaela-500m-demo/blob/main/readme.md).
58
+
59
+
60
+ ```python
61
+ from transformers import AutoProcessor, AutoModelForImageTextToText
62
+ import torch
63
+
64
+ model_path = "Vocaela/Vocaela-500M"
65
+ processor = AutoProcessor.from_pretrained(model_path)
66
+ torch_dtype = torch.float16 # use torch.bfloat16 if your device supports it
67
+ device = 'cuda' # use 'cpu' to run inference on CPU
68
+ _attn_implementation = 'sdpa' # use "flash_attention_2" if it is available in your environment
69
+ model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch_dtype, _attn_implementation=_attn_implementation).to(device)
70
+
71
+ # Ensure the 'content' field uses a list format for every message, even for single items; otherwise, apply_chat_template's result will be wrong without raising any exception.
72
+ messages = [
73
+ {
74
+ "role": "system",
75
+ "content": [
76
+ { "type": "text", "text": "<SYSTEM_MESSAGE>"}, # see Section [System messages](#system-messages) for the computer-use or mobile-use message
77
+ ]
78
+ },
79
+ {
80
+ "role": "user",
81
+ "content": [
82
+ {"type": "image", "url": "<image full path>"},
83
+ {"type": "text", "text": "Click the ..."},
84
+ ]
85
+ },
86
+ ]
87
+
88
+ inputs = processor.apply_chat_template(
89
+ messages,
90
+ add_generation_prompt=True,
91
+ tokenize=True,
92
+ return_dict=True,
93
+ return_tensors="pt",
94
+ ).to(model.device, dtype=torch_dtype)
95
+
96
+ generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
97
+ generated_texts = processor.batch_decode(
98
+ generated_ids,
99
+ skip_special_tokens=True,
100
+ )
101
+ print(generated_texts[0])
102
+ ```
103
+
104
+ ## Evaluation results
105
+
106
+ We evaluated the model on two levels of tasks:
107
+ - Grounding: the model is asked to directly output the screen coordinate (x,y) of a target GUI element. Related work has mostly improved this low-level capability by scaling up model size, which makes it all the more impressive how well a tiny model can still perform on grounding.
108
+ - Low-level GUI agent task: the model is asked to execute low-level GUI instructions such as "click the submit button", "type 'diet food' in the search box", "scroll up the page", or "open chrome". Although these are not the popular highly autonomous agentic tasks, the model remains a self-contained "agent" model rather than a "grounding"-only model.
109
+
110
+ ### Grounding evaluation
111
+
112
+ #### Screenspot
113
+
114
+ The following table compares Vocaela-500M with other small (<=4B) specialized GUI models on the ScreenSpot benchmark. Numbers are from original papers/pages unless otherwise noted.
115
+
116
+ | Model | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Overall |
117
+ |:---------------------------------:|:------------:|:------------:|:------------:|:------------:|:--------:|:--------:|:---------:|
118
+ | *General purpose models* |
119
+ | QWen2.5-VL-3B [[1]](#ref1) | - | - | - | - | - | - | 55.5 (from [[10]](#ref10)) |
120
+ | InternVL3.5-4B [[2]](#ref2) | - | - | - | - | - | - | 83.6 |
121
+ | QWen3-VL-4B Instruct [[3]](#ref3) | - | - | - | - | - | - | 94.0 |
122
+ | *Specialized GUI models* |
123
+ | OS-Atlas-4B [[4]](#ref4) | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 70.1 |
124
+ | ShowUI-2B [[5]](#ref5) | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
125
+ | UGround-V1-2B [[6]](#ref6) | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
126
+ | UI-Tars-2B [[7]](#ref7) | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
127
+ | <u>**Vocaela-500M**</u> | 92.7 | 70.3 | 90.7 | 75.0 | 87.4 | 78.2 | 83.1 |
128
+ | TongUI-3B [[8]](#ref8) | 92.6 | 77.7 | 92.3 | 77.1 | 87.8 | 74.8 | 83.6 |
129
+ | GUI-Actor-2B [[9]](#ref9) | 93.0 | 79.9 | 88.1 | 78.6 | 90.9 | 84.0 | 86.5 |
130
+ | InfiGUI-R1-3B [[10]](#ref10) | 97.1 | 81.2 | 94.3 | 77.1 | 91.7 | 77.6 | 87.5 |
131
+ | UI-R1-E-3B | 97.1 | 83.0 | 95.4 | 77.9 | 91.7 | 85.0 | 89.2 |
132
+
133
+ #### ScreenspotV2
134
+
135
+ The following table compares Vocaela-500M with other small (<=4B) specialized GUI models on the ScreenSpotV2 benchmark. Numbers are from the [ScreenSpotV2/ScreenSpotPro leaderboard page](https://gui-agent.github.io/grounding-leaderboard/screenspot.html) except Phi-Ground-4B and GUI-Actor-2B from the original papers.
136
+
137
+ | Model | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Overall |
138
+ |:-----------------------------------:|:------------:|:------------:|:------------:|:------------:|:--------:|:--------:|:---------:|
139
+ | *General purpose models* |
140
+ | QWen2.5-VL-3B [[1]](#ref1) | 62.1 | 46.4 | 54.1 | 30.0 | 31.2 | 48.3 | 46.9 |
141
+ | *Specialized GUI models* |
142
+ | OS-Atlas-4B [[4]](#ref4) | 82.8 | 64.0 | 64.4 | 46.4 | 78.6 | 60.1 | 68.5 |
143
+ | ShowUI-2B [[5]](#ref5) | 92.1 | 75.4 | 78.9 | 59.3 | 84.2 | 61.1 | 77.3 |
144
+ | Phi-Ground-4B [[11]](#ref11) | 94.1 | 62.0 | 91.7 | 77.1 | 94.4 | 78.3 | 84.1 |
145
+ | TongUI-3B [[8]](#ref8) | 94.4 | 79.6 | 92.8 | 75.0 | 87.6 | 77.8 | 85.5 |
146
+ | <u>**Vocaela-500M**</u> | 95.9 | 73.93 | 95.4 | 75.7 | 91.0 | 75.4 | 85.8 |
147
+ | ZonUI-3B [[14]](#ref14) | 98.6 | 82.9 | 92.3 | 74.3 | 88.0 | 74.4 | 86.6 |
148
+ | GUI-Actor-2B [[9]](#ref9) | 95.0 | 82.2 | 92.2 | 81.8 | 92.9 | 82.7 | 88.6 |
149
+ | UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
150
+ | Holo1.5-3B [[12]](#ref12) | 99.2 | 88.0 | 95.0 | 89.7 | 91.8 | 84.8 | 91.7 |
151
+
152
+
153
+ #### Showdown
154
+
155
+ The following table compares Vocaela-500M with other small specialized GUI models on the Showdown benchmark. Numbers are from the Phi-Ground-4B [[11]](#ref11) paper except Holo1.5-3B/Holo1.5-7B from their model pages. Compared with ScreenSpot and ScreenSpotV2, the Showdown dataset includes many examples requiring app-specific knowledge rather than pure visual grounding—posing a greater challenge for compact models like Vocaela-500M. Despite this, it still outperforms several 4B–7B models.
156
+
157
+ | Model | Acc |
158
+ |:------------------------------:|:-----------:|
159
+ | OS-Atlas-4B [[4]](#ref4) | 15.8 |
160
+ | SeeClick-9.6B [[15]](#ref15) | 24.6 |
161
+ | OS-Atlas-7B [[4]](#ref4) | 41.1 |
162
+ | UGround-7B [[6]](#ref6) | 46.5 |
163
+ | <u>**Vocaela-500M**</u> | 52.1 |
164
+ | UGround-v1-7B [[6]](#ref6) | 57.8 |
165
+ | Phi-Ground-4B [[11]](#ref11) | 58.2 |
166
+ | UI-TARS-2B [[7]](#ref7) | 59.8 |
167
+ | Phi-Ground-7B [[11]](#ref11) | 62.5 |
168
+ | UI-TARS-7B [[7]](#ref7) | 66.1 |
169
+ | UI-TARS-1.5-7B [[7]](#ref7) | 67.2 |
170
+ | Holo1.5-3B [[12]](#ref12) | 67.5 |
171
+ | Holo1.5-7B [[13]](#ref13) | 72.2 |
172
+
173
+
174
+ ### Low-level agent evaluation
175
+
176
+ Following the convention of related work, we report three metrics:
177
+ - `Type`: Accuracy of predicting the action type, e.g., 'click', 'type', etc.
178
+ - `Grounding`: Coordinate accuracy for actions that require outputting a screen coordinate, such as 'click'.
179
+ - `SR`: The step success rate.
180
+
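As a rough illustration only (the fixed coordinate tolerance and field names below are our assumptions; the benchmarks generally check grounding against the target element's bounding box), per-step scoring might look like:

```python
def step_metrics(pred: dict, gold: dict, tol: float = 0.14):
    """Toy per-step scoring: action-type match, coordinate match within a
    fixed tolerance (a stand-in for bounding-box checks), and step success."""
    type_ok = pred.get("action") == gold.get("action")
    ground_ok = True
    if "coordinate" in gold:
        # A missing predicted coordinate can never fall within tolerance.
        (px, py), (gx, gy) = pred.get("coordinate", (2.0, 2.0)), gold["coordinate"]
        ground_ok = abs(px - gx) <= tol and abs(py - gy) <= tol
    return {"type": type_ok, "grounding": ground_ok, "sr": type_ok and ground_ok}

print(step_metrics({"action": "click", "coordinate": [0.10, 0.50]},
                   {"action": "click", "coordinate": [0.12, 0.50]}))
```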
181
+ #### AndroidControl-Low
182
+
183
+
184
+ | Model | Type | Grounding | SR |
185
+ |:----------------------------:|:------:|:---------:|:------:|
186
+ | OS-Atlas-4B [[4]](#ref4) | 64.58 | 71.19 | 40.62 |
187
+ | OS-Atlas-7B [[4]](#ref4) | 73.00 | 73.37 | 50.94 |
188
+ | GUI-R1-3B [[16]](#ref16) | 83.68 | 81.59 | 64.41 |
189
+ | GUI-R1-7B [[16]](#ref16) | 85.17 | 84.02 | 66.52 |
190
+ | Aria-UI (3.5B act. of 24.9B) [[17]](#ref17) | - | 87.7 | 67.3 |
191
+ | <u>**Vocaela-500M**</u> | 83.98 | 81.52 | 69.68 |
192
+ | Aguvis-7B [[18]](#ref18) | – | – | 80.5 |
193
+ | UI-R1-3B [[19]](#ref19) | 94.3 | 82.6 | 88.5 |
194
+ | UI-TARS-2B [[7]](#ref7) | 98.1 | 87.3 | 89.3 (from [[10]](#ref10)) |
195
+ | InfiGUI-R1-3B [[10]](#ref10) | 96.0 | 93.2 | 92.1 |
196
+
197
+ Numbers other than Vocaela-500M's are taken from their respective references, except UI-TARS-2B, which is from InfiGUI-R1-3B [[10]](#ref10).
198
+
199
+ #### GUI-Act-Web
200
+
201
+ | Model | Type | Grounding | SR |
202
+ |:----------------------------:|:------:|:---------:|:------:|
203
+ | OS-Atlas-4B [[4]](#ref4) | 79.22 | 58.57 | 42.62 |
204
+ | OS-Atlas-7B [[4]](#ref4) | 86.95 | 75.61 | 57.02 |
205
+ | UI-R1-3B [[19]](#ref19) | 75.89 | 79.43 | 67.31 (from [[10]](#ref10)) |
206
+ | GUI-R1-3B [[16]](#ref16) | 89.86 | 87.42 | 76.31 |
207
+ | GUI-R1-7B [[16]](#ref16) | 90.85 | 88.06 | 80.31 |
208
+ | <u>**Vocaela-500M**</u> | 90.28 | 79.71 | 80.43 |
209
+
210
+
211
+ #### OmniAct-Web
212
+ | Model | Type | Grounding | SR |
213
+ |:----------------------------:|:------:|:---------:|:------:|
214
+ | OS-Atlas-4B [[4]](#ref4) | 46.74 | 49.24 | 22.99 |
215
+ | OS-Atlas-7B [[4]](#ref4) | 85.63 | 69.35 | 59.15 |
216
+ | UI-R1-3B [[19]](#ref19) | 75.42 | 61.35 | 61.33 (from [[16]](#ref16)) |
217
+ | <u>**Vocaela-500M**</u> | 88.16 | 72.42 | 67.13 |
218
+ | GUI-R1-3B [[16]](#ref16) | 88.58 | 75.10 | 75.08 |
219
+ | GUI-R1-7B [[16]](#ref16) | 91.16 | 77.29 | 77.35 |
220
+
221
+ #### OmniAct-Desktop
222
+ | Model | Type | Grounding | SR |
223
+ |:----------------------------:|:------:|:---------:|:------:|
224
+ | OS-Atlas-4B [[4]](#ref4) | 63.30 | 42.55 | 26.94 |
225
+ | OS-Atlas-7B [[4]](#ref4) | 90.24 | 62.87 | 56.73 |
226
+ | UI-R1-3B [[19]](#ref19) | 73.41 | 64.12 | 63.98 (from [[16]](#ref16)) |
227
+ | GUI-R1-3B [[16]](#ref16) | 91.86 | 78.37 | 78.31 |
228
+ | <u>**Vocaela-500M**</u> | 89.23 | 83.05 | 79.12 |
229
+ | GUI-R1-7B [[16]](#ref16) | 92.20 | 83.36 | 83.33 |
230
+
231
+
232
+ ## Training strategy
233
+
234
+ The model architecture and configuration remain identical to the base model, except for a slightly customized chat template (see Section [Special tokens & chat template](#special-tokens--chat-template)). Training proceeded in three stages: two stages of Supervised Fine-Tuning (SFT), followed by Reinforcement Fine-Tuning (RFT) using GRPO.
235
+
236
+ SFT Stage 1: ~7M examples from public datasets, after extensive preprocessing, action-space unification, synthesis, and balancing.
237
+
238
+ SFT Stage 2: ~256K examples, after filtering, re-sampling to balance the action distribution, and synthesis to enrich rare actions.
239
+
240
+ RFT: ~40K examples
241
+
242
+ ## Limitations
243
+
244
+ - **Not suitable for high-resolution images**
245
+
246
+ Two factors in the base model [SmolVLM2-500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) limit its ability to handle high-resolution images:
247
+ 1. Aggressive pixel shuffle (r=4), compressing 64 pixels into one token.
248
+ 2. Fixed scaling to 2048px on the longest side.
249
+
250
+ Together, they severely impact grounding on high-resolution screens. We evaluated Vocaela-500M on ScreenSpotPro, where its average score is 15.1. Although still better than several 2B/3B/7B models, the relative advantage is much smaller than on ScreenSpot/ScreenSpotV2, which confirms this limitation.
251
+
252
+ - **Not suitable for high-level agentic tasks / lacks reasoning capability**
253
+
254
+ The base model SmolVLM2-500M does not exhibit reasoning capabilities. Given this model's focus on low-level command execution and its very compact size, it was not trained for reasoning either. We evaluated Vocaela-500M on the high-level AndroidControl-High task; as shown below, the result confirms this limitation.
255
+
256
+ | Task | Model | Type | Grounding | SR |
257
+ |:-------------------:|:-------------:|:------:|:---------:|:-----:|
258
+ | AndroidControl-High | Vocaela-500M | 25.9 | 13.1 | 13.1 |
259
+
260
+ - **Loss of general-purpose capabilities**
261
+
262
+ The model is heavily tuned for this specific scenario and hence loses general-purpose capabilities such as chat, QA, and instruction following.
263
+
264
+ - **No video support**
265
+
266
+ The model was not trained with any video data in SFT/RFT.
267
+
268
+ ## System messages
269
+ The system messages below were used in training and are therefore recommended for inference.
270
+
271
+ ### System message for computer use
272
+ ```python
273
+ Vocaela_Computer_Use_System_Message = """
274
+ You are an assistant trained to navigate the computer screen.
275
+ Given a task instruction, a screen observation, and an action history sequence,
276
+ output the next actions and wait for the next observation.
277
+
278
+ ## Allowed ACTION_TYPEs and parameters:
279
+ 1. `PRESS_KEY`: Press one specified key. Two parameters: `key`, string, the single key to press; `presses`, integer, the number of times to press the key (default is 1).
280
+ 2. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
281
+ 3. `MOUSE_MOVE`: Move the mouse cursor to a specified position. Parameter: `coordinate`, formatted as [x,y], the position to move the cursor to.
282
+ 4. `CLICK`: Click left mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to click on.
283
+ 5. `DRAG`: Drag the cursor with the left mouse button pressed, start and end positions are specified. Two parameters: `coordinate`, formatted as [x,y], the start position to drag from; `coordinate2`, formatted as [x2,y2], the end position to drag to.
284
+ 6. `RIGHT_CLICK`: Click right mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to right click on.
285
+ 7. `MIDDLE_CLICK`: Click middle mouse button once on an element. Parameter: `coordinate`, formatted as [x,y], the position to middle click on.
286
+ 8. `DOUBLE_CLICK`: Click left mouse button twice on an element. Parameter: `coordinate`, formatted as [x,y], the position to double click on.
287
+ 9. `SCROLL`: Scroll the screen (via mouse wheel). Parameter: `scroll_direction`, the direction (`up`/`down`/`left`/`right`) to scroll.
288
+ 10. `WAIT`: Wait for several seconds. Parameter: `time`, duration in seconds to wait.
289
+ 11. `TERMINATE`: Terminate the task. Parameter: `status`, the status of the task, `success`/`failure`.
290
+ 12. `REFUSE`: Refuse to perform the task if not feasible. No parameter.
291
+ 13. `HOTKEY`: Press a combination of keys simultaneously. Parameter: `hotkeys`, list of strings, the keys to press together.
292
+
293
+ * NOTE *: The `coordinate` and `coordinate2` parameters (formatted as [x,y]) are the relative coordinates on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.
294
+
295
+ ## Format your response as
296
+ <Action>the next actions</Action>
297
+
298
+ `The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
299
+ [{"action": "<ACTION_TYPE>", "key": "<key>", "presses": <presses>, "hotkeys": ["<hotkeys>"], "text": "<text>", "coordinate": [x,y], "coordinate2": [x2,y2], "time": <time>, "status": "<status>", "scroll_direction": "<scroll_direction>"}]
300
+
301
+ If a parameter is not applicable, don't include it in the JSON object.
302
+ """
303
+ ```
304
+
305
+ ### System message for mobile phone use
306
+ ```python
307
+ Vocaela_Mobile_Use_System_Message = """
308
+ You are an assistant trained to navigate the mobile phone.
309
+ Given a task instruction, a screen observation, and an action history sequence,
310
+ output the next actions and wait for the next observation.
311
+
312
+ ## Allowed ACTION_TYPEs and parameters:
313
+ 1. `CLICK`: Click/tap on the screen. Parameter: `coordinate`, formatted as [x,y], the position to click on.
314
+ 2. `LONG_PRESS`: Long press on the screen. Two parameters: `coordinate`, formatted as [x,y], the position to long press on; `time`, duration in seconds to long press.
315
+ 3. `SWIPE`: Swipe on the screen. Two parameters: `swipe_from`, the start area to swipe from, only allowed value in {'top', 'bottom', 'left', 'right', 'center', `top_left`, `top_right`, `bottom_left`, `bottom_right`}; `swipe_direction`, the direction (`up`/`down`/`left`/`right`) to swipe towards.
316
+ 4. `TYPE`: Type a string into an element. Parameter: `text`, string, the text to type.
317
+ 5. `SYSTEM_BUTTON`: Press a system button. Parameter: `button`, the system button to press, allowed button values: 'Back', 'Home', 'Menu', 'Enter'.
318
+ 6. `OPEN`: Open an app. Parameter: `text`, string, the app name to open.
319
+
320
+ * NOTE *: The `coordinate` parameter (formatted as [x,y]) is the relative coordinates on the screenshot scaled to range of 0-1, [0,0] is the top-left corner and [1,1] is the bottom-right corner.
321
+
322
+ ## Format your response as
323
+ <Action>the next actions</Action>
324
+
325
+ `The next actions` can be one or multiple actions. Format `the next actions` as a JSON array of objects as below, each object is an action:
326
+ [{"action": "<ACTION_TYPE>", "text": "<text>", "coordinate": [x,y], "swipe_from": "<swipe_from>", "swipe_direction": "<swipe_direction>", "button": "<button>"}]
327
+
328
+ If a parameter is not applicable, don't include it in the JSON object.
329
+ """
330
+ ```
331
+
332
+ ## Special tokens & chat template
333
+
334
+ The base model SmolVLM2-500M does not provide special tokens to identify the user or assistant role. To accurately mask user-turn messages in SFT, two existing special tokens were repurposed to mark the beginning and end of an assistant message: `<|reserved_special_token_50|>` for the beginning and `<|reserved_special_token_51|>` for the end. Consequently, if you look into the `chat_template.jinja` file in the model folder, you will find that the chat template adds the prefix token `<|reserved_special_token_50|>` for inference:
335
+ ```
336
+ <|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
337
+ {% endfor %}{% if add_generation_prompt %}{{ 'Assistant:<|reserved_special_token_50|>' }}{% endif %}
338
+ ```
339
+
340
+ For a normal generation, if you configure tokenizer decoding to keep special tokens, a completed sequence ends with two consecutive special tokens `<|reserved_special_token_51|><end_of_utterance>`, where `<end_of_utterance>` is the base model's default end token and `<|reserved_special_token_51|>` is introduced by our training process.
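If you decode with `skip_special_tokens=False`, these markers can be stripped before parsing the response; a small sketch (the helper name is ours):

```python
ASSISTANT_BEGIN = "<|reserved_special_token_50|>"
ASSISTANT_END = "<|reserved_special_token_51|>"
END_OF_UTTERANCE = "<end_of_utterance>"

def strip_markers(decoded: str) -> str:
    """Remove the assistant-turn marker tokens and the base model's end
    token from a decoded sequence, leaving only the response payload."""
    for tok in (ASSISTANT_BEGIN, ASSISTANT_END, END_OF_UTTERANCE):
        decoded = decoded.replace(tok, "")
    return decoded.strip()

raw = ASSISTANT_BEGIN + '<Action>[{"action": "click"}]</Action>' + ASSISTANT_END + END_OF_UTTERANCE
print(strip_markers(raw))  # <Action>[{"action": "click"}]</Action>
```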
341
+
342
+
343
+ ## License
344
+
345
+ This model is made available under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/). To comply with the license, you may use, modify, and share the model or derivative works for non-commercial purposes only. **Any derivative works must be shared under the same license.**
346
+
347
+ Reason for adopting the `CC BY-NC-SA 4.0` license: model training used datasets released under `CC BY-NC-SA 4.0`.
348
+
349
+ Please see the full license [here](./LICENSE.md).
350
+
351
+ ## Acknowledgements
352
+
353
+ - Thanks to the Microsoft [*Azure startup credit offer*](https://learn.microsoft.com/en-us/azure/signups/overview) for partially funding the compute
354
+ - Thanks to related projects [Jedi](https://osworld-grounding.github.io/), [TongUI](https://tongui-agent.github.io/), [UGround](https://osu-nlp-group.github.io/UGround/), [Aguvis](https://aguvis-project.github.io/), [OS-ATLAS](https://osatlas.github.io/), [GTA-1](https://github.com/Yan98/GTA1), [OpenCUA](https://github.com/xlang-ai/OpenCUA), [GUI-R1](https://github.com/ritzz-ai/GUI-R1), and others. We leveraged the datasets, code, and insights they shared.
355
+
356
+ ## References
357
+
358
+ <a id="ref1">[1]</a>
359
+ Bai, Shuai, et al. "Qwen2.5-VL technical report." arXiv preprint arXiv:2502.13923 (2025).
360
+
361
+ <a id="ref2">[2]</a>
362
+ Wang, Weiyun, et al. "InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency." arXiv preprint arXiv:2508.18265 (2025).
363
+
364
+ <a id="ref3">[3]</a>
365
+ Qwen3-VL. https://github.com/QwenLM/Qwen3-VL
366
+
367
+ <a id="ref4">[4]</a>
368
+ Wu, Zhiyong, et al. "Os-atlas: A foundation action model for generalist gui agents." arXiv preprint arXiv:2410.23218 (2024).
369
+
370
+ <a id="ref5">[5]</a>
371
+ Lin, Kevin Qinghong, et al. "Showui: One vision-language-action model for gui visual agent." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
372
+
373
+ <a id="ref6">[6]</a>
374
+ Gou, Boyu, et al. "Navigating the digital world as humans do: Universal visual grounding for gui agents." arXiv preprint arXiv:2410.05243 (2024).
375
+
376
+ <a id="ref7">[7]</a>
377
+ Qin, Yujia, et al. "Ui-tars: Pioneering automated gui interaction with native agents." arXiv preprint arXiv:2501.12326 (2025).
378
+
379
+ <a id="ref8">[8]</a>
380
+ Zhang, Bofei, et al. "TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials." arXiv preprint arXiv:2504.12679 (2025).
381
+
382
+ <a id="ref9">[9]</a>
383
+ Wu, Qianhui, et al. "GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents." arXiv preprint arXiv:2506.03143 (2025).
384
+
385
+ <a id="ref10">[10]</a>
386
+ Liu, Yuhang, et al. "Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners." arXiv preprint arXiv:2504.14239 (2025).
387
+
388
+ <a id="ref11">[11]</a>
389
+ Zhang, Miaosen, et al. "Phi-ground tech report: Advancing perception in gui grounding." arXiv preprint arXiv:2507.23779 (2025).
390
+
391
+ <a id="ref12">[12]</a>
392
+ Holo1.5-3B. https://huggingface.co/Hcompany/Holo1.5-3B
393
+
394
+ <a id="ref13">[13]</a>
395
+ Holo1.5-7B. https://huggingface.co/Hcompany/Holo1.5-7B
396
+
397
+ <a id="ref14">[14]</a>
398
+ Hsieh, ZongHan, and Tzer-Jen Wei. "ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding" arXiv preprint arXiv:2506.23491 (2025).
399
+
400
+ <a id="ref15">[15]</a>
401
+ Cheng, Kanzhi, et al. "Seeclick: Harnessing gui grounding for advanced visual gui agents." arXiv preprint arXiv:2401.10935 (2024).
402
+
403
+ <a id="ref16">[16]</a>
404
+ Luo, Run, et al. "Gui-r1: A generalist r1-style vision-language action model for gui agents." arXiv preprint arXiv:2504.10458 (2025).
405
+
406
+ <a id="ref17">[17]</a>
407
+ Yang, Yuhao, et al. "Aria-ui: Visual grounding for gui instructions." arXiv preprint arXiv:2412.16256 (2024).
408
+
409
+ <a id="ref18">[18]</a>
410
+ Xu, Yiheng, et al. "Aguvis: Unified pure vision agents for autonomous gui interaction." arXiv preprint arXiv:2412.04454 (2024).
411
+
412
+ <a id="ref19">[19]</a>
413
+ Lu, Zhengxi, et al. "UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning." arXiv preprint arXiv:2503.21620 (2025).
added_tokens.json ADDED
@@ -0,0 +1,130 @@
+ {
+ "<end_of_utterance>": 49279,
+ "<fake_token_around_image>": 49189,
+ "<global-img>": 49152,
+ "<image>": 49190,
+ "<row_1_col_1>": 49153,
+ "<row_1_col_2>": 49154,
+ "<row_1_col_3>": 49155,
+ "<row_1_col_4>": 49156,
+ "<row_1_col_5>": 49157,
+ "<row_1_col_6>": 49158,
+ "<row_2_col_1>": 49159,
+ "<row_2_col_2>": 49160,
+ "<row_2_col_3>": 49161,
+ "<row_2_col_4>": 49162,
+ "<row_2_col_5>": 49163,
+ "<row_2_col_6>": 49164,
+ "<row_3_col_1>": 49165,
+ "<row_3_col_2>": 49166,
+ "<row_3_col_3>": 49167,
+ "<row_3_col_4>": 49168,
+ "<row_3_col_5>": 49169,
+ "<row_3_col_6>": 49170,
+ "<row_4_col_1>": 49171,
+ "<row_4_col_2>": 49172,
+ "<row_4_col_3>": 49173,
+ "<row_4_col_4>": 49174,
+ "<row_4_col_5>": 49175,
+ "<row_4_col_6>": 49176,
+ "<row_5_col_1>": 49177,
+ "<row_5_col_2>": 49178,
+ "<row_5_col_3>": 49179,
+ "<row_5_col_4>": 49180,
+ "<row_5_col_5>": 49181,
+ "<row_5_col_6>": 49182,
+ "<row_6_col_1>": 49183,
+ "<row_6_col_2>": 49184,
+ "<row_6_col_3>": 49185,
+ "<row_6_col_4>": 49186,
+ "<row_6_col_5>": 49187,
+ "<row_6_col_6>": 49188,
+ "<|reserved_special_token_0|>": 49191,
+ "<|reserved_special_token_10|>": 49201,
+ "<|reserved_special_token_11|>": 49202,
+ "<|reserved_special_token_12|>": 49203,
+ "<|reserved_special_token_13|>": 49204,
+ "<|reserved_special_token_14|>": 49205,
+ "<|reserved_special_token_15|>": 49206,
+ "<|reserved_special_token_16|>": 49207,
+ "<|reserved_special_token_17|>": 49208,
+ "<|reserved_special_token_18|>": 49209,
+ "<|reserved_special_token_19|>": 49210,
+ "<|reserved_special_token_1|>": 49192,
+ "<|reserved_special_token_20|>": 49211,
+ "<|reserved_special_token_21|>": 49212,
+ "<|reserved_special_token_22|>": 49213,
+ "<|reserved_special_token_23|>": 49214,
+ "<|reserved_special_token_24|>": 49215,
+ "<|reserved_special_token_25|>": 49216,
+ "<|reserved_special_token_26|>": 49217,
+ "<|reserved_special_token_27|>": 49218,
+ "<|reserved_special_token_28|>": 49219,
+ "<|reserved_special_token_29|>": 49220,
+ "<|reserved_special_token_2|>": 49193,
+ "<|reserved_special_token_30|>": 49221,
+ "<|reserved_special_token_31|>": 49222,
+ "<|reserved_special_token_32|>": 49223,
+ "<|reserved_special_token_33|>": 49224,
+ "<|reserved_special_token_34|>": 49225,
+ "<|reserved_special_token_35|>": 49226,
+ "<|reserved_special_token_36|>": 49227,
+ "<|reserved_special_token_37|>": 49228,
+ "<|reserved_special_token_38|>": 49229,
+ "<|reserved_special_token_39|>": 49230,
+ "<|reserved_special_token_3|>": 49194,
+ "<|reserved_special_token_40|>": 49231,
+ "<|reserved_special_token_41|>": 49232,
+ "<|reserved_special_token_42|>": 49233,
+ "<|reserved_special_token_43|>": 49234,
+ "<|reserved_special_token_44|>": 49235,
+ "<|reserved_special_token_45|>": 49236,
+ "<|reserved_special_token_46|>": 49237,
+ "<|reserved_special_token_47|>": 49238,
+ "<|reserved_special_token_48|>": 49239,
+ "<|reserved_special_token_49|>": 49240,
+ "<|reserved_special_token_4|>": 49195,
+ "<|reserved_special_token_50|>": 49241,
+ "<|reserved_special_token_51|>": 49242,
+ "<|reserved_special_token_52|>": 49243,
+ "<|reserved_special_token_53|>": 49244,
+ "<|reserved_special_token_54|>": 49245,
+ "<|reserved_special_token_55|>": 49246,
+ "<|reserved_special_token_56|>": 49247,
+ "<|reserved_special_token_57|>": 49248,
+ "<|reserved_special_token_58|>": 49249,
+ "<|reserved_special_token_59|>": 49250,
+ "<|reserved_special_token_5|>": 49196,
+ "<|reserved_special_token_60|>": 49251,
+ "<|reserved_special_token_61|>": 49252,
+ "<|reserved_special_token_62|>": 49253,
+ "<|reserved_special_token_63|>": 49254,
+ "<|reserved_special_token_64|>": 49255,
+ "<|reserved_special_token_65|>": 49256,
+ "<|reserved_special_token_66|>": 49257,
+ "<|reserved_special_token_67|>": 49258,
+ "<|reserved_special_token_68|>": 49259,
+ "<|reserved_special_token_69|>": 49260,
+ "<|reserved_special_token_6|>": 49197,
+ "<|reserved_special_token_70|>": 49261,
+ "<|reserved_special_token_71|>": 49262,
+ "<|reserved_special_token_72|>": 49263,
+ "<|reserved_special_token_73|>": 49264,
+ "<|reserved_special_token_74|>": 49265,
+ "<|reserved_special_token_75|>": 49266,
+ "<|reserved_special_token_76|>": 49267,
+ "<|reserved_special_token_77|>": 49268,
+ "<|reserved_special_token_78|>": 49269,
+ "<|reserved_special_token_79|>": 49270,
+ "<|reserved_special_token_7|>": 49198,
+ "<|reserved_special_token_80|>": 49271,
+ "<|reserved_special_token_81|>": 49272,
+ "<|reserved_special_token_82|>": 49273,
+ "<|reserved_special_token_83|>": 49274,
+ "<|reserved_special_token_84|>": 49275,
+ "<|reserved_special_token_85|>": 49276,
+ "<|reserved_special_token_86|>": 49277,
+ "<|reserved_special_token_87|>": 49278,
+ "<|reserved_special_token_8|>": 49199,
+ "<|reserved_special_token_9|>": 49200
+ }
chat_template.jinja ADDED
@@ -0,0 +1,2 @@
+ <|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
+ {% endfor %}{% if add_generation_prompt %}{{ 'Assistant:<|reserved_special_token_50|>' }}{% endif %}
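For illustration, the template above can be mirrored by a small pure-Python renderer. This is a sketch of the template's logic only, not the actual code path (in practice the processor's `apply_chat_template` renders the Jinja file itself):

```python
def render_chat(messages, add_generation_prompt=True):
    # Minimal re-implementation of chat_template.jinja for illustration.
    out = "<|im_start|>"
    for msg in messages:
        role = msg["role"].capitalize()
        # The template omits the space after ':' when the turn starts with an image.
        sep = ":" if msg["content"][0]["type"] == "image" else ": "
        body = "".join(
            c["text"] if c["type"] == "text" else "<image>"
            for c in msg["content"]
        )
        out += f"{role}{sep}{body}<end_of_utterance>\n"
    if add_generation_prompt:
        out += "Assistant:<|reserved_special_token_50|>"
    return out

msgs = [{"role": "user",
         "content": [{"type": "image"},
                     {"type": "text", "text": "Click the OK button."}]}]
print(render_chat(msgs))
# <|im_start|>User:<image>Click the OK button.<end_of_utterance>
# Assistant:<|reserved_special_token_50|>
```

Note that generation is primed with `<|reserved_special_token_50|>` rather than a plain `Assistant:` prefix.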
config.json ADDED
@@ -0,0 +1,158 @@
+ {
+ "architectures": [
+ "SmolVLMForConditionalGeneration"
+ ],
+ "bos_token_id": 1,
+ "dtype": "bfloat16",
+ "eos_token_id": 49279,
+ "image_token_id": 49190,
+ "model_type": "smolvlm",
+ "pad_token_id": 2,
+ "scale_factor": 4,
+ "text_config": {
+ "_flash_attn_2_enabled": true,
+ "_name_or_path": "None",
+ "architectures": [
+ "VLlama3ForCausalLM"
+ ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
+ "dtype": "bfloat16",
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 960,
+ "initializer_range": 0.02,
+ "intermediate_size": 2560,
+ "is_llama_config": true,
+ "max_position_embeddings": 8192,
+ "mlp_bias": false,
+ "model_type": "llama",
+ "neftune_noise_alpha": 0.0,
+ "num_attention_heads": 15,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 5,
+ "pad_token_id": 2,
+ "perceiver_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "begin_suppress_tokens": null,
+ "bos_token_id": null,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": null,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "silu",
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "min_length": 0,
+ "model_type": "vllama3",
+ "no_repeat_ngram_size": 0,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_key_value_heads": 1,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": null,
+ "prefix": null,
+ "problem_type": null,
+ "pruned_heads": {},
+ "qk_layer_norms_perceiver": false,
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "resampler_depth": 6,
+ "resampler_head_dim": 96,
+ "resampler_n_heads": 16,
+ "resampler_n_latents": 64,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "suppress_tokens": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.46.0",
+ "typical_p": 1.0,
+ "use_bfloat16": false
+ },
+ "pixel_shuffle_factor": 4,
+ "pretraining_tp": 1,
+ "qk_layer_norms": false,
+ "rms_norm_eps": 1e-05,
+ "rope_interleaved": false,
+ "rope_scaling": null,
+ "rope_theta": 100000,
+ "transformers.js_config": {
+ "kv_cache_dtype": {
+ "fp16": "float16",
+ "q4f16": "float16"
+ }
+ },
+ "use_cache": true,
+ "use_resampler": false,
+ "vocab_size": 49280
+ },
+ "tie_word_embeddings": false,
+ "transformers.js_config": {
+ "kv_cache_dtype": {
+ "fp16": "float16",
+ "q4f16": "float16"
+ }
+ },
+ "transformers_version": "4.56.1",
+ "use_cache": false,
+ "use_reentrant_checkpointing": false,
+ "vision_config": {
+ "attention_dropout": 0.0,
+ "dtype": "bfloat16",
+ "hidden_act": "gelu_pytorch_tanh",
+ "hidden_size": 768,
+ "image_size": 512,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-06,
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "model_type": "smolvlm_vision",
+ "num_attention_heads": 12,
+ "num_channels": 3,
+ "num_hidden_layers": 12,
+ "patch_size": 16,
+ "size": {
+ "longest_edge": 2048
+ },
+ "tie_word_embeddings": false,
+ "use_base_siglip": false
+ },
+ "vocab_size": 49280
+ }
generation_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": [
+ 49279,
+ 49279
+ ],
+ "pad_token_id": 2,
+ "transformers_version": "4.56.1"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b56a3303e3d4bbe2cc619d2a59b5f059fc2675adcdf8cfbd1a537e63d2668dca
+ size 1015025832
preprocessor_config.json ADDED
@@ -0,0 +1,35 @@
+ {
+ "do_convert_rgb": true,
+ "do_image_splitting": true,
+ "do_normalize": true,
+ "do_pad": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "SmolVLMImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "processor_class": "SmolVLMProcessor",
+ "resample": 1,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "longest_edge": 2048
+ },
+ "video_sampling": {
+ "fps": 1,
+ "max_frames": 64,
+ "video_size": {
+ "longest_edge": 512
+ }
+ }
+ }
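The rescale and normalize settings above map raw uint8 pixel values into roughly [-1, 1]: `rescale_factor` is 1/255, and with mean and std both 0.5 the standardization doubles and recenters the [0, 1] range. A minimal sketch of that arithmetic (the real work happens inside `SmolVLMImageProcessor`):

```python
MEAN = STD = 0.5
RESCALE = 1 / 255  # matches "rescale_factor": 0.00392156862745098

def normalize_pixel(p):
    # Rescale from [0, 255] to [0, 1], then standardize per preprocessor_config.json.
    return (p * RESCALE - MEAN) / STD

print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # ~1.0 (up to float rounding)
```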
processor_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "image_seq_len": 64,
+ "processor_class": "SmolVLMProcessor"
+ }
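The `image_seq_len` of 64 is consistent with the vision settings in config.json: a 512×512 tile at patch size 16 yields 32×32 = 1024 visual patches, and the pixel-shuffle factor of 4 merges each 4×4 patch group into one language-model token. This is my reading of the config values, not an official derivation:

```python
# Values taken from config.json in this commit.
image_size = 512            # vision_config.image_size
patch_size = 16             # vision_config.patch_size
pixel_shuffle_factor = 4    # pixel_shuffle_factor / scale_factor

patches_per_side = image_size // patch_size            # 32
visual_patches = patches_per_side ** 2                 # 1024
tokens_per_tile = visual_patches // pixel_shuffle_factor ** 2
print(tokens_per_tile)  # 64 == "image_seq_len" in processor_config.json
```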
special_tokens_map.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "additional_special_tokens": [
+ "<fake_token_around_image>",
+ "<image>",
+ "<end_of_utterance>"
+ ],
+ "bos_token": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "end_of_utterance_token": "<end_of_utterance>",
+ "eos_token": {
+ "content": "<end_of_utterance>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "fake_image_token": "<fake_token_around_image>",
+ "global_image_token": "<global-img>",
+ "image_token": "<image>",
+ "pad_token": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,1191 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<repo_name>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "<reponame>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<file_sep>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<filename>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<gh_stars>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "8": {
+ "content": "<issue_start>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "9": {
+ "content": "<issue_comment>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "10": {
+ "content": "<issue_closed>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "11": {
+ "content": "<jupyter_start>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "12": {
+ "content": "<jupyter_text>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "13": {
+ "content": "<jupyter_code>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "14": {
+ "content": "<jupyter_output>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "15": {
+ "content": "<jupyter_script>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "16": {
+ "content": "<empty_output>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49152": {
+ "content": "<global-img>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49153": {
+ "content": "<row_1_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49154": {
+ "content": "<row_1_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49155": {
+ "content": "<row_1_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49156": {
+ "content": "<row_1_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49157": {
+ "content": "<row_1_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49158": {
+ "content": "<row_1_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49159": {
+ "content": "<row_2_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49160": {
+ "content": "<row_2_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49161": {
+ "content": "<row_2_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49162": {
+ "content": "<row_2_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49163": {
+ "content": "<row_2_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49164": {
+ "content": "<row_2_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49165": {
+ "content": "<row_3_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49166": {
+ "content": "<row_3_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49167": {
+ "content": "<row_3_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49168": {
+ "content": "<row_3_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49169": {
+ "content": "<row_3_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49170": {
+ "content": "<row_3_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49171": {
+ "content": "<row_4_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49172": {
+ "content": "<row_4_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49173": {
+ "content": "<row_4_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49174": {
+ "content": "<row_4_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49175": {
+ "content": "<row_4_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49176": {
+ "content": "<row_4_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49177": {
+ "content": "<row_5_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49178": {
+ "content": "<row_5_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49179": {
+ "content": "<row_5_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49180": {
+ "content": "<row_5_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49181": {
+ "content": "<row_5_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49182": {
+ "content": "<row_5_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49183": {
+ "content": "<row_6_col_1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49184": {
+ "content": "<row_6_col_2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49185": {
+ "content": "<row_6_col_3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49186": {
+ "content": "<row_6_col_4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49187": {
+ "content": "<row_6_col_5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49188": {
+ "content": "<row_6_col_6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49189": {
+ "content": "<fake_token_around_image>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49190": {
+ "content": "<image>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49191": {
+ "content": "<|reserved_special_token_0|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49192": {
+ "content": "<|reserved_special_token_1|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49193": {
+ "content": "<|reserved_special_token_2|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49194": {
+ "content": "<|reserved_special_token_3|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49195": {
+ "content": "<|reserved_special_token_4|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49196": {
+ "content": "<|reserved_special_token_5|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49197": {
+ "content": "<|reserved_special_token_6|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49198": {
+ "content": "<|reserved_special_token_7|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49199": {
+ "content": "<|reserved_special_token_8|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49200": {
+ "content": "<|reserved_special_token_9|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49201": {
+ "content": "<|reserved_special_token_10|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49202": {
+ "content": "<|reserved_special_token_11|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49203": {
+ "content": "<|reserved_special_token_12|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49204": {
+ "content": "<|reserved_special_token_13|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49205": {
+ "content": "<|reserved_special_token_14|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49206": {
+ "content": "<|reserved_special_token_15|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49207": {
+ "content": "<|reserved_special_token_16|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49208": {
+ "content": "<|reserved_special_token_17|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49209": {
+ "content": "<|reserved_special_token_18|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49210": {
+ "content": "<|reserved_special_token_19|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49211": {
+ "content": "<|reserved_special_token_20|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49212": {
+ "content": "<|reserved_special_token_21|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49213": {
+ "content": "<|reserved_special_token_22|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49214": {
+ "content": "<|reserved_special_token_23|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49215": {
+ "content": "<|reserved_special_token_24|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49216": {
+ "content": "<|reserved_special_token_25|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49217": {
+ "content": "<|reserved_special_token_26|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49218": {
+ "content": "<|reserved_special_token_27|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49219": {
+ "content": "<|reserved_special_token_28|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49220": {
+ "content": "<|reserved_special_token_29|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49221": {
+ "content": "<|reserved_special_token_30|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49222": {
+ "content": "<|reserved_special_token_31|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49223": {
+ "content": "<|reserved_special_token_32|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49224": {
+ "content": "<|reserved_special_token_33|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49225": {
+ "content": "<|reserved_special_token_34|>",
+ "lstrip": false,
727
+ "normalized": false,
728
+ "rstrip": false,
729
+ "single_word": false,
730
+ "special": true
731
+ },
732
+ "49226": {
733
+ "content": "<|reserved_special_token_35|>",
734
+ "lstrip": false,
735
+ "normalized": false,
736
+ "rstrip": false,
737
+ "single_word": false,
738
+ "special": true
739
+ },
740
+ "49227": {
741
+ "content": "<|reserved_special_token_36|>",
742
+ "lstrip": false,
743
+ "normalized": false,
744
+ "rstrip": false,
745
+ "single_word": false,
746
+ "special": true
747
+ },
748
+ "49228": {
749
+ "content": "<|reserved_special_token_37|>",
750
+ "lstrip": false,
751
+ "normalized": false,
752
+ "rstrip": false,
753
+ "single_word": false,
754
+ "special": true
755
+ },
756
+ "49229": {
757
+ "content": "<|reserved_special_token_38|>",
758
+ "lstrip": false,
759
+ "normalized": false,
760
+ "rstrip": false,
761
+ "single_word": false,
762
+ "special": true
763
+ },
764
+ "49230": {
765
+ "content": "<|reserved_special_token_39|>",
766
+ "lstrip": false,
767
+ "normalized": false,
768
+ "rstrip": false,
769
+ "single_word": false,
770
+ "special": true
771
+ },
772
+ "49231": {
773
+ "content": "<|reserved_special_token_40|>",
774
+ "lstrip": false,
775
+ "normalized": false,
776
+ "rstrip": false,
777
+ "single_word": false,
778
+ "special": true
779
+ },
780
+ "49232": {
781
+ "content": "<|reserved_special_token_41|>",
782
+ "lstrip": false,
783
+ "normalized": false,
784
+ "rstrip": false,
785
+ "single_word": false,
786
+ "special": true
787
+ },
788
+ "49233": {
789
+ "content": "<|reserved_special_token_42|>",
790
+ "lstrip": false,
791
+ "normalized": false,
792
+ "rstrip": false,
793
+ "single_word": false,
794
+ "special": true
795
+ },
796
+ "49234": {
797
+ "content": "<|reserved_special_token_43|>",
798
+ "lstrip": false,
799
+ "normalized": false,
800
+ "rstrip": false,
801
+ "single_word": false,
802
+ "special": true
803
+ },
804
+ "49235": {
805
+ "content": "<|reserved_special_token_44|>",
806
+ "lstrip": false,
807
+ "normalized": false,
808
+ "rstrip": false,
809
+ "single_word": false,
810
+ "special": true
811
+ },
812
+ "49236": {
813
+ "content": "<|reserved_special_token_45|>",
814
+ "lstrip": false,
815
+ "normalized": false,
816
+ "rstrip": false,
817
+ "single_word": false,
818
+ "special": true
819
+ },
820
+ "49237": {
821
+ "content": "<|reserved_special_token_46|>",
822
+ "lstrip": false,
823
+ "normalized": false,
824
+ "rstrip": false,
825
+ "single_word": false,
826
+ "special": true
827
+ },
828
+ "49238": {
829
+ "content": "<|reserved_special_token_47|>",
830
+ "lstrip": false,
831
+ "normalized": false,
832
+ "rstrip": false,
833
+ "single_word": false,
834
+ "special": true
835
+ },
836
+ "49239": {
837
+ "content": "<|reserved_special_token_48|>",
838
+ "lstrip": false,
839
+ "normalized": false,
840
+ "rstrip": false,
841
+ "single_word": false,
842
+ "special": true
843
+ },
844
+ "49240": {
845
+ "content": "<|reserved_special_token_49|>",
846
+ "lstrip": false,
847
+ "normalized": false,
848
+ "rstrip": false,
849
+ "single_word": false,
850
+ "special": true
851
+ },
852
+ "49241": {
853
+ "content": "<|reserved_special_token_50|>",
854
+ "lstrip": false,
855
+ "normalized": false,
856
+ "rstrip": false,
857
+ "single_word": false,
858
+ "special": true
859
+ },
860
+ "49242": {
861
+ "content": "<|reserved_special_token_51|>",
862
+ "lstrip": false,
863
+ "normalized": false,
864
+ "rstrip": false,
865
+ "single_word": false,
866
+ "special": true
867
+ },
868
+ "49243": {
869
+ "content": "<|reserved_special_token_52|>",
870
+ "lstrip": false,
871
+ "normalized": false,
872
+ "rstrip": false,
873
+ "single_word": false,
874
+ "special": true
875
+ },
876
+ "49244": {
877
+ "content": "<|reserved_special_token_53|>",
878
+ "lstrip": false,
879
+ "normalized": false,
880
+ "rstrip": false,
881
+ "single_word": false,
882
+ "special": true
883
+ },
884
+ "49245": {
885
+ "content": "<|reserved_special_token_54|>",
886
+ "lstrip": false,
887
+ "normalized": false,
888
+ "rstrip": false,
889
+ "single_word": false,
890
+ "special": true
891
+ },
892
+ "49246": {
893
+ "content": "<|reserved_special_token_55|>",
894
+ "lstrip": false,
895
+ "normalized": false,
896
+ "rstrip": false,
897
+ "single_word": false,
898
+ "special": true
899
+ },
900
+ "49247": {
901
+ "content": "<|reserved_special_token_56|>",
902
+ "lstrip": false,
903
+ "normalized": false,
904
+ "rstrip": false,
905
+ "single_word": false,
906
+ "special": true
907
+ },
908
+ "49248": {
909
+ "content": "<|reserved_special_token_57|>",
910
+ "lstrip": false,
911
+ "normalized": false,
912
+ "rstrip": false,
913
+ "single_word": false,
914
+ "special": true
915
+ },
916
+ "49249": {
917
+ "content": "<|reserved_special_token_58|>",
918
+ "lstrip": false,
919
+ "normalized": false,
920
+ "rstrip": false,
921
+ "single_word": false,
922
+ "special": true
923
+ },
924
+ "49250": {
925
+ "content": "<|reserved_special_token_59|>",
926
+ "lstrip": false,
927
+ "normalized": false,
928
+ "rstrip": false,
929
+ "single_word": false,
930
+ "special": true
931
+ },
932
+ "49251": {
933
+ "content": "<|reserved_special_token_60|>",
934
+ "lstrip": false,
935
+ "normalized": false,
936
+ "rstrip": false,
937
+ "single_word": false,
938
+ "special": true
939
+ },
940
+ "49252": {
941
+ "content": "<|reserved_special_token_61|>",
942
+ "lstrip": false,
943
+ "normalized": false,
944
+ "rstrip": false,
945
+ "single_word": false,
946
+ "special": true
947
+ },
948
+ "49253": {
949
+ "content": "<|reserved_special_token_62|>",
950
+ "lstrip": false,
951
+ "normalized": false,
952
+ "rstrip": false,
953
+ "single_word": false,
954
+ "special": true
955
+ },
956
+ "49254": {
957
+ "content": "<|reserved_special_token_63|>",
958
+ "lstrip": false,
959
+ "normalized": false,
960
+ "rstrip": false,
961
+ "single_word": false,
962
+ "special": true
963
+ },
964
+ "49255": {
965
+ "content": "<|reserved_special_token_64|>",
966
+ "lstrip": false,
967
+ "normalized": false,
968
+ "rstrip": false,
969
+ "single_word": false,
970
+ "special": true
971
+ },
972
+ "49256": {
973
+ "content": "<|reserved_special_token_65|>",
974
+ "lstrip": false,
975
+ "normalized": false,
976
+ "rstrip": false,
977
+ "single_word": false,
978
+ "special": true
979
+ },
980
+ "49257": {
981
+ "content": "<|reserved_special_token_66|>",
982
+ "lstrip": false,
983
+ "normalized": false,
984
+ "rstrip": false,
985
+ "single_word": false,
986
+ "special": true
987
+ },
988
+ "49258": {
989
+ "content": "<|reserved_special_token_67|>",
990
+ "lstrip": false,
991
+ "normalized": false,
992
+ "rstrip": false,
993
+ "single_word": false,
994
+ "special": true
995
+ },
996
+ "49259": {
997
+ "content": "<|reserved_special_token_68|>",
998
+ "lstrip": false,
999
+ "normalized": false,
1000
+ "rstrip": false,
1001
+ "single_word": false,
1002
+ "special": true
1003
+ },
1004
+ "49260": {
1005
+ "content": "<|reserved_special_token_69|>",
1006
+ "lstrip": false,
1007
+ "normalized": false,
1008
+ "rstrip": false,
1009
+ "single_word": false,
1010
+ "special": true
1011
+ },
1012
+ "49261": {
1013
+ "content": "<|reserved_special_token_70|>",
1014
+ "lstrip": false,
1015
+ "normalized": false,
1016
+ "rstrip": false,
1017
+ "single_word": false,
1018
+ "special": true
1019
+ },
1020
+ "49262": {
1021
+ "content": "<|reserved_special_token_71|>",
1022
+ "lstrip": false,
1023
+ "normalized": false,
1024
+ "rstrip": false,
1025
+ "single_word": false,
1026
+ "special": true
1027
+ },
1028
+ "49263": {
1029
+ "content": "<|reserved_special_token_72|>",
1030
+ "lstrip": false,
1031
+ "normalized": false,
1032
+ "rstrip": false,
1033
+ "single_word": false,
1034
+ "special": true
1035
+ },
1036
+ "49264": {
1037
+ "content": "<|reserved_special_token_73|>",
1038
+ "lstrip": false,
1039
+ "normalized": false,
1040
+ "rstrip": false,
1041
+ "single_word": false,
1042
+ "special": true
1043
+ },
1044
+ "49265": {
1045
+ "content": "<|reserved_special_token_74|>",
1046
+ "lstrip": false,
1047
+ "normalized": false,
1048
+ "rstrip": false,
1049
+ "single_word": false,
1050
+ "special": true
1051
+ },
1052
+ "49266": {
1053
+ "content": "<|reserved_special_token_75|>",
1054
+ "lstrip": false,
1055
+ "normalized": false,
1056
+ "rstrip": false,
1057
+ "single_word": false,
1058
+ "special": true
1059
+ },
1060
+ "49267": {
1061
+ "content": "<|reserved_special_token_76|>",
1062
+ "lstrip": false,
1063
+ "normalized": false,
1064
+ "rstrip": false,
1065
+ "single_word": false,
1066
+ "special": true
1067
+ },
1068
+ "49268": {
1069
+ "content": "<|reserved_special_token_77|>",
1070
+ "lstrip": false,
1071
+ "normalized": false,
1072
+ "rstrip": false,
1073
+ "single_word": false,
1074
+ "special": true
1075
+ },
1076
+ "49269": {
1077
+ "content": "<|reserved_special_token_78|>",
1078
+ "lstrip": false,
1079
+ "normalized": false,
1080
+ "rstrip": false,
1081
+ "single_word": false,
1082
+ "special": true
1083
+ },
1084
+ "49270": {
1085
+ "content": "<|reserved_special_token_79|>",
1086
+ "lstrip": false,
1087
+ "normalized": false,
1088
+ "rstrip": false,
1089
+ "single_word": false,
1090
+ "special": true
1091
+ },
1092
+ "49271": {
1093
+ "content": "<|reserved_special_token_80|>",
1094
+ "lstrip": false,
1095
+ "normalized": false,
1096
+ "rstrip": false,
1097
+ "single_word": false,
1098
+ "special": true
1099
+ },
1100
+ "49272": {
1101
+ "content": "<|reserved_special_token_81|>",
1102
+ "lstrip": false,
1103
+ "normalized": false,
1104
+ "rstrip": false,
1105
+ "single_word": false,
1106
+ "special": true
1107
+ },
1108
+ "49273": {
1109
+ "content": "<|reserved_special_token_82|>",
1110
+ "lstrip": false,
1111
+ "normalized": false,
1112
+ "rstrip": false,
1113
+ "single_word": false,
1114
+ "special": true
1115
+ },
1116
+ "49274": {
1117
+ "content": "<|reserved_special_token_83|>",
1118
+ "lstrip": false,
1119
+ "normalized": false,
1120
+ "rstrip": false,
1121
+ "single_word": false,
1122
+ "special": true
1123
+ },
1124
+ "49275": {
1125
+ "content": "<|reserved_special_token_84|>",
1126
+ "lstrip": false,
1127
+ "normalized": false,
1128
+ "rstrip": false,
1129
+ "single_word": false,
1130
+ "special": true
1131
+ },
1132
+ "49276": {
1133
+ "content": "<|reserved_special_token_85|>",
1134
+ "lstrip": false,
1135
+ "normalized": false,
1136
+ "rstrip": false,
1137
+ "single_word": false,
1138
+ "special": true
1139
+ },
1140
+ "49277": {
1141
+ "content": "<|reserved_special_token_86|>",
1142
+ "lstrip": false,
1143
+ "normalized": false,
1144
+ "rstrip": false,
1145
+ "single_word": false,
1146
+ "special": true
1147
+ },
1148
+ "49278": {
1149
+ "content": "<|reserved_special_token_87|>",
1150
+ "lstrip": false,
1151
+ "normalized": false,
1152
+ "rstrip": false,
1153
+ "single_word": false,
1154
+ "special": true
1155
+ },
1156
+ "49279": {
1157
+ "content": "<end_of_utterance>",
1158
+ "lstrip": false,
1159
+ "normalized": false,
1160
+ "rstrip": false,
1161
+ "single_word": false,
1162
+ "special": true
1163
+ }
1164
+ },
1165
+ "additional_special_tokens": [
1166
+ "<fake_token_around_image>",
1167
+ "<image>",
1168
+ "<end_of_utterance>"
1169
+ ],
1170
+ "bos_token": "<|im_start|>",
1171
+ "clean_up_tokenization_spaces": false,
1172
+ "end_of_utterance_token": "<end_of_utterance>",
1173
+ "eos_token": "<end_of_utterance>",
1174
+ "extra_special_tokens": {
1175
+ "end_of_utterance_token": "<end_of_utterance>",
1176
+ "fake_image_token": "<fake_token_around_image>",
1177
+ "global_image_token": "<global-img>",
1178
+ "image_token": "<image>"
1179
+ },
1180
+ "fake_image_token": "<fake_token_around_image>",
1181
+ "global_image_token": "<global-img>",
1182
+ "image_token": "<image>",
1183
+ "legacy": false,
1184
+ "model_max_length": 8192,
1185
+ "pad_token": "<|im_end|>",
1186
+ "processor_class": "SmolVLMProcessor",
1187
+ "tokenizer_class": "GPT2Tokenizer",
1188
+ "truncation_side": "left",
1189
+ "unk_token": "<|endoftext|>",
1190
+ "vocab_size": 49152
1191
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,48 @@
+ {
+ "crop_size": null,
+ "data_format": "channels_first",
+ "default_to_square": true,
+ "device": null,
+ "do_center_crop": null,
+ "do_convert_rgb": true,
+ "do_image_splitting": true,
+ "do_normalize": true,
+ "do_pad": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "do_sample_frames": false,
+ "fps": 1,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "SmolVLMImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "input_data_format": null,
+ "max_image_size": {
+ "longest_edge": 512
+ },
+ "num_frames": 64,
+ "processor_class": "SmolVLMProcessor",
+ "resample": 1,
+ "rescale_factor": 0.00392156862745098,
+ "return_metadata": false,
+ "size": {
+ "longest_edge": 2048
+ },
+ "size_divisor": null,
+ "video_metadata": null,
+ "video_processor_type": "SmolVLMVideoProcessor",
+ "video_sampling": {
+ "fps": 1,
+ "max_frames": 64,
+ "video_size": {
+ "longest_edge": 2048
+ }
+ }
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff
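A few of the preprocessor values added above can be sanity-checked offline. The sketch below assumes the values from `video_preprocessor_config.json` in this commit (`rescale_factor`, `size.longest_edge`); the `resize_longest_edge` helper is a hypothetical illustration of longest-edge resizing, not the SmolVLM image processor's actual implementation:

```python
# Sanity-check values from video_preprocessor_config.json (this commit).
# resize_longest_edge is a hypothetical helper, not the SmolVLM code path.

def resize_longest_edge(width: int, height: int, longest_edge: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals longest_edge,
    preserving aspect ratio (rounded to whole pixels)."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

# rescale_factor in the config is 1/255, i.e. uint8 pixels -> [0, 1].
rescale_factor = 0.00392156862745098
assert abs(rescale_factor - 1 / 255) < 1e-15

# A 4096x2160 frame resized to the config's size.longest_edge of 2048.
print(resize_longest_edge(4096, 2160, 2048))
```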