Speech Semantic Tokenizer
As illustrated below, this tokenizer is trained with a supervised objective: the phoneme sequences corresponding to the text serve as labels, and the grapheme-to-phoneme (G2P) conversion module is located in thirdparty/G2P. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages. The speech encoder is a HuBERT-large model trained on about 450K hours of unlabeled speech using the recipe provided by fairseq. The decoder, in contrast, is deliberately simple, consisting of only four CNN layers; we believe that keeping the decoder simple and weak is key to training the tokenizer.
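The sketch below shows one way such a tokenizer could be wired up in PyTorch. It is only illustrative: the codebook size, feature dimension, phoneme inventory size, and the nearest-neighbor quantization step are assumptions rather than details given in this repository; only the supervised phoneme objective, the HuBERT-style encoder features, and the four-layer CNN decoder come from the description above.

import torch
import torch.nn as nn

class WeakCNNDecoder(nn.Module):
    """Four 1-D CNN layers mapping quantized features to phoneme logits."""
    def __init__(self, dim: int, num_phonemes: int):
        super().__init__()
        blocks = []
        for _ in range(4):
            blocks += [nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*blocks)
        self.proj = nn.Linear(dim, num_phonemes)

    def forward(self, x):                         # x: (batch, frames, dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h)                       # (batch, frames, num_phonemes)

class SemanticTokenizer(nn.Module):
    """Quantizes encoder features into discrete ids; a weak decoder predicts phonemes."""
    def __init__(self, dim: int = 1024, codebook_size: int = 1024, num_phonemes: int = 200):
        super().__init__()                        # sizes are illustrative, not from this repo
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = WeakCNNDecoder(dim, num_phonemes)

    def tokenize(self, feats):                    # feats: (batch, frames, dim) from the speech encoder
        # nearest-codebook-entry lookup -> discrete semantic token ids
        dists = (feats.pow(2).sum(-1, keepdim=True)
                 - 2 * feats @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(-1))
        return dists.argmin(dim=-1)               # (batch, frames)

    def forward(self, feats):
        ids = self.tokenize(feats)
        phoneme_logits = self.decoder(self.codebook(ids))
        return ids, phoneme_logits                # logits trained against G2P phoneme labels

The weak decoder only has to map each quantized frame to a phoneme class, which pushes the discriminative work onto the encoder and codebook, matching the design rationale stated above.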

To run this semantic tokenizer on its own, first install the required packages:
# install requirements for this semantic tokenizer on Ascend 910B
# for NVIDIA GPUs, remove the torch-npu==2.5.1 entry from the requirements file
pip install -r requirements_npu.txt
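Once the requirements are installed, token extraction could look roughly like the hypothetical snippet below. It reuses the SemanticTokenizer sketch from above, substitutes the public torchaudio HuBERT-large bundle for this repository's own 450K-hour encoder checkpoint, and uses a randomly initialized codebook, so the printed ids are meaningless; it only illustrates the intended data flow, not this repository's actual API.

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_LARGE           # public stand-in for the repo's encoder
encoder = bundle.get_model().eval()

wav, sr = torchaudio.load("speech.wav")              # placeholder input file
wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, bundle.sample_rate)

tokenizer = SemanticTokenizer(dim=1024).eval()       # untrained placeholder codebook and decoder
with torch.no_grad():
    feats, _ = encoder(wav)                          # (1, frames, 1024) frame-level features
    token_ids, _ = tokenizer(feats)                  # (1, frames) discrete semantic token ids
print(token_ids.shape)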