
We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard ...

Discrete Speech Tokenization Toolkit

Discrete Speech Tokenization Toolkit (DSTK) is an open-source speech processing toolkit that provides a complete solution for speech discretization. It supports converting continuous speech signals into discrete speech tokens, reconstructing speech waveforms from discrete speech tokens, and converting text into speech tokens. DSTK offers efficient, flexible, and modular building blocks for speech understanding, speech synthesis, multimodal learning, and other tasks.

Release Notes:

V1.0

This release of DSTK includes three modules:

  1. Semantic Tokenizer
    • Encodes the semantic information of speech into discrete speech tokens
    • Frame rate: 25 Hz; codebook size: 4096; supports Chinese and English
  2. Semantic Detokenizer
    • Decodes discrete speech tokens back into an audible waveform, reconstructing the speech
    • Supports Chinese and English
  3. Text2Token
    • Converts text into speech tokens
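To put the tokenizer's numbers in perspective (25 tokens per second, a 4096-entry codebook), here is a small back-of-the-envelope sketch; the function name and rounding convention are illustrative, not part of the toolkit:

```python
import math

FRAME_RATE_HZ = 25      # tokens emitted per second of audio
CODEBOOK_SIZE = 4096    # each token is an index into this codebook

def token_budget(duration_s: float) -> tuple[int, float]:
    """Return (token count, bitrate in bits/s) for a clip of given length."""
    n_tokens = math.ceil(duration_s * FRAME_RATE_HZ)
    bits_per_token = math.log2(CODEBOOK_SIZE)   # 12 bits for 4096 entries
    return n_tokens, FRAME_RATE_HZ * bits_per_token

n, bps = token_budget(10.0)
print(n, bps)   # a 10 s clip -> 250 tokens at 300.0 bits/s
```

So a typical utterance compresses to a few hundred tokens, which is what makes LLM-style modeling over these sequences practical.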

TTS pipeline

Chain the three models above to perform text-to-speech.

Non-parallel Speech Reconstruction Pipeline

Chain the tokenizer and detokenizer to reconstruct speech.

The pipelines above achieve first-rate results on the TTS and speech reconstruction tasks of the seed-tts-eval benchmark, while using far fewer model parameters and much less supervised training data than the baseline models.

We evaluated the ASR accuracy of this speech tokenizer with an LLM backend; it performs on par with models that use continuous speech representations.

More information about the three modules:

Installation

Hardware: Ascend 910B with CANN 8.1 RC1 or GPU

Create a separate environment if needed

# Create a conda env with python_version>=3.10  (you could also use virtualenv)
conda create -n dstk python=3.10
conda activate dstk

# run install_requirements.sh to set up the environment for DSTK inference on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from requirements_npu.txt
sh install_requirements.sh

# patch for G2P
# modify the first line in thirdparty/G2P/patch_for_deps.sh:
# SITE_PATH=/path/to/your/own/site-packages
# run thirdparty/G2P/patch_for_deps.sh to fix problems in LangSegment 0.2.0, pypinyin and tn
sh thirdparty/G2P/patch_for_deps.sh

Download the vocos vocoder from vocos-mel-24khz

Usage:

Pipelines

import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from reconstuction_example import ReconstructionPipeline
from tts_example import TTSPipeline

ref_wav_path = dstk_path + "/00004557-00000030.wav"
input_wav_path = dstk_path + "/004892.wav"
vocoder_path = "/path/to/vocos-mel-24khz"

reconstructor = ReconstructionPipeline(
    detok_vocoder=vocoder_path,
)

tts = TTSPipeline(
    detok_vocoder=vocoder_path,
    max_seg_len=30,
)

# for non-parallel speech reconstruction
generated_wave, target_sample_rate = reconstructor.reconstruct(
    ref_wav_path, input_wav_path
)

out_path = "./recon.wav"
sf.write(out_path, generated_wave, target_sample_rate)
print(f"write output to: {out_path}")

# for tts
ref_wav_path = input_wav_path
generated_wave, target_sample_rate = tts.synthesize(
    ref_wav_path,
    "荷花未全谢,又到中秋节。家家户户把月饼切,庆中秋。美酒多欢乐,整杯盘,猜拳行令,同赏月。",
)
out_path = "./tts.wav"
sf.write(out_path, generated_wave, target_sample_rate)
print(f"write output to: {out_path}")

print("Finished")

Tokenization

import sys
import librosa

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

input_wav_path = dstk_path + "/004892.wav"

from semantic_tokenizer.f40ms.simple_tokenizer_infer import SpeechTokenizer

tokenizer = SpeechTokenizer()

raw_wav, sr = librosa.load(input_wav_path, sr=16000)
token_list, token_info_list = tokenizer.extract([raw_wav])  # pass in waveform data
for token_info in token_info_list:
    print(token_info["unit_sequence"] + "\n")
    print(token_info["reduced_unit_sequence"] + "\n")
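The tokenizer reports both a full `unit_sequence` and a `reduced_unit_sequence`. Assuming the reduction simply merges runs of identical consecutive tokens (a common convention for semantic units; the toolkit's exact rule may differ), a minimal sketch of the idea:

```python
from itertools import groupby

def reduce_units(unit_sequence: str) -> str:
    """Collapse runs of identical consecutive tokens into a single token."""
    units = unit_sequence.split()
    return " ".join(key for key, _ in groupby(units))

print(reduce_units("52 52 731 731 731 18 52"))  # -> "52 731 18 52"
```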

Text2Token

import sys
import librosa

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from text2token.simple_infer import Text2TokenGenerator

input_text = "从离散语音token重建语音波形"
MAX_SEG_LEN = 30

t2u = Text2TokenGenerator()

phones = t2u.text2phone(input_text.strip())
print(f"phonemes of input text '{input_text}' are [{phones}]")

speech_tokens_info = t2u.generate_for_long_input_text(
    [phones], max_segment_len=MAX_SEG_LEN
)

for info in speech_tokens_info[0]:
    print(" ".join(info) + "\n")

Detokenization

import sys
import soundfile as sf

dstk_path = "/path/to/DSTK"
sys.path.append(dstk_path)

from semantic_detokenizer.chunk_infer import SpeechDetokenizer

# reconstruct a speech waveform from discrete speech tokens
input_tokens = "3953 3890 3489 456 2693 3239 3692 3810 3874 3882 2749 548 3202 4012 3490 3939 3988 411 722 826 2812 3883 3874 3810 3983 4086 3946 3747 3469 2537 3689 3434 1816 1242 2415 3942 3363 3865 2841 1700 1652 3241 3362 3363 3874 3882 2792 933 2253 2799 3692 3746 3882 2809 1001 2449 1016 3762 3882 3874 3810 3809 3983 4086 4018 3747 3461 2537 3624 3882 3382 581 1837 2413 3435 4005 2003 2890 3884 3690 3746 3938 3874 3873 3856"
vocoder_path = "/path/to/vocos-mel-24khz"
ref_wav_path = dstk_path + "/004892.wav"
# output of tokenizer given ref_wav as input
ref_tokens = "3936 3872 3809 3873 3817 3639 2591 539 1021 3641 3890 4069 2002 3537 2303 3773 3827 3875 3969 4072 2425 97 2537 3633 3690 3865 3920 3069 3582 3883 3818 3997 4031 4029 3946 3874 3733 3727 3214 506 3892 3787 3457 3552 3490 4014 991 1991 3885 3947 4069 1488 1016 3258 3710 52 2362 3961 2680 1569 1851 3897 3825 3752 3808 3800 3873 3808 3792"

token_chunk_len = 75
chunk_cond_proportion = 0.3
chunk_look_ahead = 10
max_ref_duration = 4.5
ref_audio_cut_from_head = False

detoker = SpeechDetokenizer(
    vocoder_path=vocoder_path,
)

generated_wave, target_sample_rate = detoker.chunk_generate(
    ref_wav_path,
    ref_tokens.split(),
    input_tokens.split(),
    token_chunk_len,
    chunk_cond_proportion,
    chunk_look_ahead,
    max_ref_duration,
    ref_audio_cut_from_head,
)

out_path = "./detok.wav"
sf.write(out_path, generated_wave, target_sample_rate)
print(f"write output to: {out_path}")
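The chunked decoding parameters above (token_chunk_len, chunk_look_ahead) suggest a sliding-window schedule over the token sequence. The following is a hypothetical sketch of how such windows might be laid out; the detokenizer's actual scheduling (including how chunk_cond_proportion conditions each chunk on the previous one) may differ:

```python
def chunk_windows(n_tokens: int, chunk_len: int, look_ahead: int):
    """Split n_tokens into consecutive chunks of chunk_len, each extended by
    up to look_ahead future tokens as decoding context."""
    windows = []
    for start in range(0, n_tokens, chunk_len):
        end = min(start + chunk_len, n_tokens)
        ctx_end = min(end + look_ahead, n_tokens)  # look-ahead context
        windows.append((start, end, ctx_end))
    return windows

# e.g. 85 input tokens with the chunk settings from the example above
print(chunk_windows(85, 75, 10))  # -> [(0, 75, 85), (75, 85, 85)]
```

Chunked decoding like this bounds the detokenizer's per-step memory and latency, at the cost of needing some cross-chunk conditioning to keep the waveform seams smooth.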

Core Developers:

Daxin Tan, Dehua Tao, Yusen Sun and Xiao Chen

Contributors:

Hanlin Zhang

Former Contributors:

Jingcheng Tian, Xinshan Zeng, Liangyou Li, Jing Xu, Mingyu Cui, Dingdong Wang