Hengzongshu committed (verified)
Commit 50785b5 · 1 Parent(s): 5db4899

Update README.md

Files changed (1):
  1. README.md +64 -94
README.md CHANGED
@@ -5,137 +5,107 @@ datasets:
  language:
  - zh
  pipeline_tag: text-classification
- library_name: transformers
  tags:
  - chinese
  - tokenization
  - bpe
  ---
 
- ### **Dataset Card for Chinese BBPE Vocabulary (BAAI CCI3-HQ Based)**

- ---
-
- #### **1. Dataset Summary**

- This dataset provides a **BBPE (Byte-Level Byte Pair Encoding) vocabulary** trained on the **BAAI CCI3-HQ corpus**, containing **30,000 subword tokens** for Chinese text processing. The vocabulary is stored in a single JSON file and is designed to tokenize Chinese text efficiently while minimizing out-of-vocabulary (OOV) issues.
- This vocabulary was trained on the BAAI CCI3-HQ corpus; it contains 30,000 Chinese subword entries for Chinese text processing and is stored in **tokenizer.json**.

  ---

- #### **2. Dataset Description**
-
- ##### **Source Data**
- - **Corpus**: BAAI CCI3-HQ (High-Quality Corpus for Chinese Pretraining)
- - **Subset Used**: 5 GB of filtered and cleaned Chinese text from the CCI3-HQ corpus.
- - **Purpose**: The subset was selected to provide high-quality, general-purpose Chinese text for BBPE training.
-
- ##### **Tokenization Method**
- - **Algorithm**: Byte-Level Byte Pair Encoding (BBPE), which extends traditional BPE by operating on UTF-8 byte sequences (a training sketch follows this section).
- - **Key Features**:
-   - Handles rare and unseen characters via byte-level fallback.
-   - Optimized for Chinese text, capturing common character combinations and subwords.
-   - Includes the special tokens `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>`.
-
- ##### **Vocabulary Composition**
- - **Size**: 30,000 subword tokens.
- - **Language Focus**: Chinese (simplified and traditional characters, pinyin, etc.).
- - **Coverage**: Designed to cover **95%+ of common Chinese text** in web, literature, and communication domains.
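
As a hedged illustration of the setup described above (the actual training script and configuration are not part of this repository), a byte-level BPE vocabulary of this kind can be produced with the Hugging Face `tokenizers` library; `corpus.txt` below is a hypothetical local text file standing in for the CCI3-HQ subset:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE: the pre-tokenizer maps every UTF-8 byte to a printable
# character, so any input can be segmented (no true OOV).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # keeps the byte fallback
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")
```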
 
  ---

- #### **3. Dataset Structure**
-
- The dataset consists of a single file:
- - `vocab.json`: A JSON dictionary mapping subword tokens to unique integer IDs.
- Example structure:
- ```json
- {
-   "èĥĮ": 1432,
-   "çļĩ": 1433,
-   "çĶļèĩ³": 1434,
-   "åħ¶ä¸Ń": 1435,
-   ...
- }
- ```
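
Note that entries such as "èĥĮ" are not mojibake: byte-level BPE stores each UTF-8 byte as a printable character, so multi-byte Chinese characters appear as these odd-looking strings. A small sketch of the round trip (assuming the published tokenizer file ships a ByteLevel decoder):

```python
from tokenizers import Tokenizer

# "背" is stored as "èĥĮ" because its three UTF-8 bytes (e8 83 8c) are each
# mapped to a printable character by the byte-level scheme.
print("背".encode("utf-8").hex())  # -> e8838c

tokenizer = Tokenizer.from_file("tokenizer.json")  # path to your local copy
ids = tokenizer.encode("背").ids
print(tokenizer.decode(ids))  # decodes back to readable text: 背
```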
 
  ---

- #### **4. Use Cases**
-
- 1. **Chinese Text Tokenization**
-    - Tokenize Chinese text for NLP tasks such as machine translation, text summarization, or question answering.
-    - Compatible with Hugging Face's `tokenizers` library for integration into transformer models.

  ```python
- from transformers import AutoTokenizer

- # Load directly from Hugging Face
- tokenizer = AutoTokenizer.from_pretrained("your-username/chinese-bbpe-tokenizer")

- # Test Chinese tokenization
- text = "自然语言处理是人工智能的重要领域"
- tokens = tokenizer.tokenize(text)
- print(tokens)  # print the tokenized output
  ```

- 2. **Model Training**
-    - Use as a pre-trained BBPE vocabulary when fine-tuning Chinese language models (e.g., BERT, RoBERTa).
-    - Reduce OOV rates in downstream tasks by leveraging subword segmentation.

- 3. **Cross-Domain Adaptation**
-    - Extend the vocabulary for domain-specific applications (e.g., medical or legal Chinese) by adding specialized terms (see the sketch after this list).
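
A minimal sketch of such an extension with the `tokenizers` library (the domain terms below are hypothetical examples, not part of this vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # path to your local copy

# Add whole-word domain terms so they are no longer split into subwords.
added = tokenizer.add_tokens(["糖尿病", "心肌梗死", "处方笺"])
print(f"added {added} tokens, new vocab size: {tokenizer.get_vocab_size()}")

tokenizer.save("tokenizer_medical.json")  # hypothetical output file
```

Note that any model fine-tuned with the extended vocabulary needs its embedding matrix resized to the new vocabulary size.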
 
 
 
 
  ---

-
- #### **5. References**
-
- 1. **BAAI CCI3-HQ Corpus**
-    - Original source: [BAAI CCI3-HQ Documentation](https://huggingface.co/datasets/BAAI/CCI3-HQ)
-    - License: [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
-
- 2. **BBPE Algorithm**
-    - Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. *ACL*.
-    - Google's subword tokenization toolkit: [SentencePiece](https://github.com/google/sentencepiece).

  ---

- #### **6. License**
-
- - **Vocabulary File (`vocab.json`)**: [MIT License](https://opensource.org/licenses/MIT)
- - **Usage Restrictions**: Free for research and commercial use. Redistribution must include this license.

  ---

- #### **7. Versioning**
-
- - **v1.0.0** (2025-06-01): Initial release based on 5 GB of BAAI CCI3-HQ data.

  ---

- #### **8. Contributors**
-
- - **Primary Author**: [Xia Ziye]
- - **Data Curators**: BAAI Team (for providing the CCI3-HQ corpus)

  ---

- #### **9. Citation**
-
- If you use this dataset in your research, please cite it as:
-
- ```bibtex
- @misc{bbpe_chinese_vocab_2025,
-   author       = {Xia Ziye},
-   title        = {Chinese BBPE Vocabulary Trained on BAAI CCI3-HQ},
-   year         = {2025},
-   publisher    = {Hugging Face},
-   howpublished = {\url{https://huggingface.co/datasets/Hengzongshu/Chinese_BBPE_Vocab}}
- }
- ```
-
- ---

- Let me know if you'd like to adjust the structure or add specific details!
 
  language:
  - zh
  pipeline_tag: text-classification
  tags:
  - chinese
  - tokenization
  - bpe
  ---

+ ## 📄 Model Card: Chinese BBPE Tokenizer

+ ### 🧠 Introduction
+ This repository provides a Chinese tokenizer based on **Byte Pair Encoding (BPE)**, designed specifically for Chinese text. It uses subword segmentation to split Chinese text into finer-grained tokens and is intended for preprocessing in large language model (LLM) pipelines.

  ---

+ ### 🔧 Intended Use
+ - **Goal**:
+   This tokenizer converts Chinese text into token sequences (lists of IDs) that a model can process, and is a key tool for both training and inference.
+ - **Typical scenarios**:
+   - Chinese natural language processing (NLP) tasks such as text classification, question answering, and machine translation.
+   - Use alongside BPE-based language models (e.g., GPT, RoBERTa).

  ---

+ ### 🗂️ File Structure
+ This repository contains only the following files:
+ ```
+ Hengzongshu/chinese-bbpe-vocab/
+ ├── tokenizer.json   # tokenizer configuration (the core file)
+ └── README.md        # this model card
+ ```

  ---

+ ### 🛠️ Usage
+ #### ✅ Correct way to load (recommended)
+ Because this is a **standalone tokenizer repository**, load the `tokenizer.json` file directly with the `tokenizers` library (download it locally first):

  ```python
+ from tokenizers import Tokenizer

+ # Load the tokenizer
+ tokenizer = Tokenizer.from_file("tokenizer.json")  # path to your tokenizer.json

+ # Tokenization example
+ encoded = tokenizer.encode("自然语言处理")
+ print(encoded.tokens)
+ print(encoded.ids)
  ```
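
If you prefer not to download the file by hand, one option (an illustrative sketch, not an official instruction from this repository; requires `pip install huggingface_hub`) is to fetch it programmatically:

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download tokenizer.json from the Hub, then load it locally.
path = hf_hub_download(repo_id="Hengzongshu/chinese-bbpe-vocab", filename="tokenizer.json")
tokenizer = Tokenizer.from_file(path)
print(tokenizer.encode("自然语言处理").tokens)
```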
 
+ #### ❌ Incorrect way to load (not recommended)
+ **Do not use `transformers.AutoTokenizer`** to load this repository: it expects a model configuration file (`config.json`), which this repository does not provide:

+ ```python
+ # This fails (config.json is missing)
+ from transformers import AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Hengzongshu/chinese-bbpe-vocab")
+ ```
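
If you do need a `transformers`-compatible object, a common workaround (a sketch, not something documented by this repository) is to wrap the raw file in `PreTrainedTokenizerFast`; the special-token names below are assumptions and should match what `tokenizer.json` actually defines:

```python
from transformers import PreTrainedTokenizerFast

# Wrap tokenizer.json so it exposes the usual transformers tokenizer API.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # path to your local copy
    unk_token="[unk]",                # assumed special-token names
    pad_token="[pad]",
)
print(hf_tokenizer("自然语言处理")["input_ids"])
```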
 
  ---

+ ### ⚠️ Notes
+ 1. **Tokenizer-only repository**:
+    This repository contains only the tokenizer file (`tokenizer.json`) and **no model weights**. Do not confuse it with a full model repository.
+ 2. **Dependencies**:
+    - Load the tokenizer with the `tokenizers` library (an official Hugging Face library).
+    - Installation:
+      ```bash
+      pip install tokenizers
+      ```
+ 3. **Path check**:
+    Make sure `tokenizer.json` actually exists at the given path; otherwise a `FileNotFoundError` is raised (see the check below).
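
A tiny sketch of such a check (standard-library Python plus `tokenizers`):

```python
from pathlib import Path
from tokenizers import Tokenizer

path = Path("tokenizer.json")  # adjust to where you saved the file
if path.is_file():
    tokenizer = Tokenizer.from_file(str(path))
else:
    raise FileNotFoundError(f"tokenizer.json not found at {path.resolve()}")
```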
 
  ---

+ ### 📚 Technical Details
+ - **Tokenization algorithm**:
+   Based on **Byte Pair Encoding (BPE)** and its byte-level variant **BBPE** (Byte-level BPE), which builds subwords from statistically frequent character combinations (see the inspection sketch after this list).
+ - **Vocabulary size**:
+   The vocabulary contains common Chinese characters and subword units; the exact size can be checked with `tokenizer.get_vocab_size()`.
+ - **Special tokens**:
+   Includes common special tokens such as `[unk]`, `[s]`, and `[pad]` (edit `tokenizer.json` if you need to customize them).
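
A brief inspection sketch with the `tokenizers` library (the token names checked below simply follow the list above and may differ from what the file actually defines):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # path to your local copy
print(tokenizer.get_vocab_size())  # total number of vocabulary entries

# token_to_id returns None if a token is not in the vocabulary.
for tok in ["[unk]", "[s]", "[pad]"]:
    print(tok, tokenizer.token_to_id(tok))
```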
 
  ---

+ ### 🧾 License
+ This repository is released under the **MIT License**, which permits free use, modification, and distribution, provided the original copyright notice is retained. See the [LICENSE](LICENSE) file for details.

  ---

+ ### 🤝 Contributing & Feedback
+ - **Submit issues**:
+   If you find a problem with the tokenizer or have suggestions for improvement, please submit them via GitHub Issues.
+ - **Contribute code**:
+   Pull requests that improve the tokenizer configuration or extend its functionality are welcome.

  ---

+ ### 📌 Related Links
+ - **Hugging Face repository**:
+   [https://huggingface.co/Hengzongshu/chinese-bbpe-vocab](https://huggingface.co/Hengzongshu/chinese-bbpe-vocab)
+ - **Technical blog/documentation**:
+   [Tokenizer deep dive (CSDN)](https://blog.csdn.net/xxx) (replace with your own technical blog link)

+ ---