Update README.md
README.md
CHANGED
@@ -5,137 +5,107 @@ datasets:
language:
- zh
pipeline_tag: text-classification
-library_name: transformers
tags:
- chinse
- tokenization
- bpe
---

-### **Dataset Card for Chinese BBPE Vocabulary (BAAI CCI3-HQ Based)**

-#### **1. Dataset Summary**

---

-##### **Tokenization Method**
-- **Algorithm**: Byte-Level Byte Pair Encoding (BBPE), which extends traditional BPE by operating on UTF-8 byte sequences.
-- **Key Features**:
-  - Handles rare and unseen characters via byte-level fallback.
-  - Optimized for Chinese text, capturing common character combinations and subwords.
-  - Includes special tokens (`<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`).
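
A minimal sketch of the byte-level fallback described above, assuming the repository's `tokenizer.json` has been downloaded locally: characters that are rare or unseen in training should decompose into byte-level pieces instead of collapsing to `<unk>`.

```python
from tokenizers import Tokenizer

# Load the BBPE tokenizer shipped in this repository (local path assumed).
tokenizer = Tokenizer.from_file("tokenizer.json")

# "𠀋" is a rare CJK Extension B character; byte-level BPE should fall back to
# its raw UTF-8 bytes rather than emit the unknown token.
encoded = tokenizer.encode("中文𠀋")
print(encoded.tokens)  # exact pieces depend on the learned merges
print(encoded.ids)
```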

-##### **Vocabulary Composition**
-- **Size**: 30,000 subword tokens.
-- **Language Focus**: Chinese (simplified and traditional characters, pinyin, etc.).
-- **Coverage**: Designed to cover **95%+ of common Chinese text** in web, literature, and communication domains.
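
For orientation, a vocabulary of roughly this shape could be trained with the `tokenizers` library along the following lines; the corpus file name and trainer settings are illustrative assumptions, not the exact recipe used for this release.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE: the model works on a GPT-2 style byte alphabet.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30000,  # matches the size stated above
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # keep all 256 byte symbols
)

# "cci3_hq_sample.txt" is a placeholder for the CCI3-HQ text actually used.
tokenizer.train(["cci3_hq_sample.txt"], trainer)
tokenizer.save("tokenizer.json")
```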

---

-```
-{
-  ...
-  "èĥĮ": 1432,
-  "çļĩ": 1433,
-  "çĶļèĩ³": 1434,
-  "åħ¶ä¸Ń": 1435,
-  ...
-}
-```
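
The strings above are the byte-to-unicode rendering that byte-level BPE vocabularies conventionally use: every UTF-8 byte is displayed as a single printable character. Assuming this standard GPT-2-style mapping, a small helper recovers the human-readable form of an entry:

```python
def bytes_to_unicode():
    # Standard GPT-2 byte<->unicode table: printable bytes map to themselves,
    # the remaining bytes are shifted into unused code points above 255.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def readable(token: str) -> str:
    """Map a displayed vocab entry back to the underlying UTF-8 string."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token).decode("utf-8", errors="replace")

print(readable("èĥĮ"))     # expected: 背
print(readable("åħ¶ä¸Ń"))  # expected: 其中
```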

---

-- Tokenize Chinese text for NLP tasks like machine translation, text summarization, or question-answering.
-- Compatible with Hugging Face's `tokenizers` library for integration into transformer models.

```python
-from tokenizers import Tokenizer
-# Load the tokenizer file shipped with this repository
-tokenizer = Tokenizer.from_file("tokenizer.json")
-# Tokenize a sample sentence
-tokens = tokenizer.encode("自然语言处理").tokens
-print(tokens)
```

-- Reduce OOV rates in downstream tasks by leveraging subword segmentation.

---

-- **v1.0.0** (2025-06-01): Initial release based on 5GB of BAAI CCI3-HQ data.

---

-```
-@misc{bbpe_chinese_vocab_2025,
-  author       = {Xia Ziye},
-  title        = {Chinese BBPE Vocabulary Trained on BAAI CCI3-HQ},
-  year         = {2025},
-  publisher    = {Hugging Face},
-  journal      = {Dataset},
-  howpublished = {\url{https://huggingface.co/datasets/Hengzongshu/Chinese_BBPE_Vocab}}
-}
-```

---

+## 📄 Model Card: Chinese BBPE Tokenizer

+### 🧠 Introduction
+This repository provides a Chinese tokenizer based on **Byte Pair Encoding (BPE)**, designed specifically for Chinese text. Using subword segmentation, it splits Chinese text into finer-grained tokens and is intended for the preprocessing stage of large language models (LLMs).

---

+### 🔧 Intended Use
+- **Goal**:
+  Convert Chinese text into the token sequences (ID lists) a model can process; an essential tool for both training and inference.
+- **Typical scenarios** (see the sketch after this list):
+  - Chinese natural language processing (NLP) tasks such as text classification, question answering, and machine translation.
+  - Use together with BPE-based language models (e.g., GPT, RoBERTa).
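
As a rough illustration of that preprocessing role (a sketch only: the pad-token spelling below is an assumption and should be checked against `tokenizer.json`):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Pad a small batch to equal length so it can be fed to a classifier.
# "<pad>" is assumed here; verify the actual special-token spelling in tokenizer.json.
pad_id = tokenizer.token_to_id("<pad>") or 0
tokenizer.enable_padding(pad_id=pad_id, pad_token="<pad>")

batch = tokenizer.encode_batch(["今天天气很好", "自然语言处理很有趣"])
for enc in batch:
    print(enc.ids)
```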

---

+### 🗂️ File Layout
+This repository contains only the following files:
+```
+Hengzongshu/chinese-bbpe-vocab/
+├── tokenizer.json   # tokenizer definition file (the core file)
+└── README.md        # this model card
+```

---

+### 🛠️ Usage
+#### ✅ Recommended loading method
+Because this repository is a **standalone tokenizer repository**, load the `tokenizer.json` file directly with the `tokenizers` library (download it locally first):

```python
+from tokenizers import Tokenizer

+# Load the tokenizer
+tokenizer = Tokenizer.from_file("tokenizer.json")  # path to your tokenizer.json file

+# Tokenization example
+encoded = tokenizer.encode("自然语言处理")
+print(encoded.tokens)
+print(encoded.ids)
```

+#### ❌ Incorrect loading method (not recommended)
+**Do not use `transformers.AutoTokenizer`** to load this repository: it requires a model configuration file (`config.json`), which this repository does not provide:

+```python
+# ❌ Fails (config.json is missing)
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("Hengzongshu/chinese-bbpe-vocab")
+```
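
If a `transformers`-compatible object is nevertheless needed, one possible workaround (an illustrative sketch, not something this repository documents) is to wrap the file in `PreTrainedTokenizerFast`; the special-token spellings passed here are assumptions:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer file directly; this path does not require config.json.
# Verify the unk/pad spellings against tokenizer.json before relying on them.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[unk]",
    pad_token="[pad]",
)

print(hf_tokenizer("自然语言处理")["input_ids"])
```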

---

+### ⚠️ Notes
+1. **Tokenizer-only repository**:
+   This repository contains only the tokenizer file (`tokenizer.json`) and **no model weights**. Do not confuse it with a full model repository.
+2. **Dependencies**:
+   - Load the tokenizer with Hugging Face's official `tokenizers` library.
+   - Install it with:
+```bash
+pip install tokenizers
+```
+3. **Path check**:
+   Make sure `tokenizer.json` actually exists at the specified path; otherwise a `FileNotFoundError` is raised.
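
A minimal pre-flight check along those lines (the path is simply wherever you saved the file):

```python
from pathlib import Path

tokenizer_path = Path("tokenizer.json")  # adjust to your local download location
if not tokenizer_path.exists():
    raise FileNotFoundError(f"tokenizer.json not found at {tokenizer_path.resolve()}")
```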

---

+### 📚 Technical Details
+- **Tokenization algorithm**:
+  Byte Pair Encoding (BPE) and its byte-level variant **BBPE** (Byte-level BPE); subword units are derived from high-frequency character-combination statistics.
+- **Vocabulary size**:
+  The vocabulary covers common Chinese characters and subword units; the exact size can be read with `tokenizer.get_vocab_size()` (see the sketch below this list).
+- **Special tokens**:
+  Includes common special tokens such as `[unk]`, `[s]`, and `[pad]` (edit `tokenizer.json` if you need to customize them).
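
A short inspection sketch for the two points above; the special-token spellings queried here follow the list in this card and may differ from what is actually stored in `tokenizer.json`:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Vocabulary size, including any added special tokens.
print(tokenizer.get_vocab_size(with_added_tokens=True))

# Look up the IDs of the special tokens named above; token_to_id returns None
# if a given spelling is not present in this particular vocabulary.
for tok in ["[unk]", "[s]", "[pad]"]:
    print(tok, tokenizer.token_to_id(tok))
```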

---

+### 🧾 License
+This repository is released under the **MIT License**: free use, modification, and distribution are permitted, provided the original copyright notice is retained. See the [LICENSE](LICENSE) file for details.

---

+### 🤝 Contributing & Feedback
+- **Issues**:
+  If you find a problem with the tokenizer or have suggestions for improvement, please open a GitHub issue.
+- **Pull requests**:
+  Pull requests that improve the tokenizer configuration or extend its functionality are welcome.

---

+### 📌 Links
+- **Hugging Face repository**:
+  [https://huggingface.co/Hengzongshu/chinese-bbpe-vocab](https://huggingface.co/Hengzongshu/chinese-bbpe-vocab)
+- **Technical blog/docs**:
+  [Tokenizer walkthrough (CSDN)](https://blog.csdn.net/xxx) (replace with your own blog link)

+---