## vocab_file
- ice_text.model
- a binary file (SentencePiece model)
- num_image_tokens = 20000
- text vocab size = 150528 - 20000 = 130528
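The split can be checked with simple arithmetic — a minimal sketch; the layout (image tokens occupying the low id range, text token ids starting at 20000) is an inference from the example ids in these notes:

```python
# Vocab split noted above: the full vocabulary has 150528 entries,
# of which the first num_image_tokens ids are reserved for image tokens.
total_vocab = 150528
num_image_tokens = 20000

text_vocab = total_vocab - num_image_tokens
print(text_vocab)  # 130528

# Consistent with the examples: all text token ids are >= 20000.
print(20315 >= num_image_tokens)  # True
```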
```
tokens: ['▁good', '▁morning'] ; id: [20315, 21774] ; text: good morning
tokens: ['▁good', '<|blank_2|>', 'morning'] ; id: [20315, 150009, 60813] ; text: good morning
tokens: ['▁', 'goog', '▁morning', 'abc'] ; id: [20005, 46456, 21774, 27415] ; text: goog morningabc
tokens: ['▁', '你是谁'] ; id: [20005, 128293] ; text: 你是谁
```
What is `▁`? It is U+2581, SentencePiece's word-boundary meta symbol, standing for a leading space; note that it is not the ASCII underscore `_` (U+005F).
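The distinction is easy to verify from the code points alone — a quick check, independent of any tokenizer:

```python
# SentencePiece's meta symbol "▁" (LOWER ONE EIGHTH BLOCK) only looks like
# an underscore; it is a different Unicode character.
meta = "▁"        # U+2581, marks a word boundary / leading space
underscore = "_"  # U+005F, plain ASCII underscore

print(hex(ord(meta)))        # 0x2581
print(hex(ord(underscore)))  # 0x5f
print(meta == underscore)    # False
```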
## TextTokenizer
```
tokenizer = TextTokenizer(self.vocab_file)
```