---
license: apache-2.0
language:
- en
- zh
pipeline_tag: audio-to-audio
tags:
- pytorch
- codec
---

<div align="center">
  <h1>
  SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
  </h1>

<p>
    <a href="https://github.com/Soul-AILab/SAC">
        <img src="https://img.shields.io/badge/SAC-GitHub-black?logo=github&logoColor=white" alt="GitHub Repo">
    </a>
    <a href="https://sac-codec.github.io/">
        <img src="https://img.shields.io/badge/🌐%20Demo-Page-brightgreen" alt="Demo Page">
    </a>
    <a href="https://arxiv.org/abs/2510.16841">
      <img src="https://img.shields.io/badge/arXiv-2510.16841-blueviolet?logo=arxiv&logoColor=white" alt="arXiv">
    </a>
    <a href="https://huggingface.co/collections/Soul-AILab/sac-68f1df9572a6314d1dc1f91e">
      <img src="https://img.shields.io/badge/🤗%20SAC-Models-yellow" alt="Hugging Face">
    </a>
</p>

  <p align="center">
    <i>A semantic–acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.</i>
  </p>
</div>


## 🛠️ Environment Setup
```bash
conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # tested with pip 24.0
```


## 🧩 Model Checkpoints

To use SAC, first prepare the pretrained dependencies: the [GLM-4-Voice-Tokenizer](https://huggingface.co/zai-org/glm-4-voice-tokenizer) for semantic tokenization and the [ERes2Net](https://modelscope.cn/models/iic/speech_eres2net_sv_en_voxceleb_16k) speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are set correctly in your configuration file (e.g., `configs/xxx.yaml`).
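
Both dependencies can be fetched with the standard hub clients. A minimal sketch (the local directory layout is an arbitrary choice, not a repository convention):

```python
# Fetch the pretrained dependencies referenced above.
from huggingface_hub import snapshot_download as hf_download
from modelscope import snapshot_download as ms_download

# Semantic tokenizer, from the Hugging Face Hub.
glm_dir = hf_download(
    repo_id="zai-org/glm-4-voice-tokenizer",
    local_dir="pretrained/glm-4-voice-tokenizer",
)

# ERes2Net speaker encoder, from ModelScope (returns its cache path).
eres2net_dir = ms_download("iic/speech_eres2net_sv_en_voxceleb_16k")

print(glm_dir, eres2net_dir)
```

Then point the corresponding paths in your `configs/xxx.yaml` at these directories.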

The following table lists the available SAC checkpoints:

| Model Name | Hugging Face | Sample Rate | Token Rate | Bitrate (bps) |
|:-----------:|:------------:|:------------:|:-----------:|:---:|
| SAC | [🤗 Soul-AILab/SAC-16k-37_5Hz](https://huggingface.co/Soul-AILab/SAC-16k-37_5Hz) | 16 kHz | 37.5 Hz | 525 |
| SAC | [🤗 Soul-AILab/SAC-16k-62_5Hz](https://huggingface.co/Soul-AILab/SAC-16k-62_5Hz) | 16 kHz | 62.5 Hz | 875 |
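
Both checkpoints allocate 14 bits per token step (525 / 37.5 = 875 / 62.5 = 14); they differ only in frame rate and hence bitrate. The checkpoints can be fetched the same way as the dependencies above, e.g. `snapshot_download(repo_id="Soul-AILab/SAC-16k-37_5Hz", local_dir="checkpoints/SAC-16k-37_5Hz")`.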


## 🎧 Inference

To perform audio reconstruction, you can use the following command:

```bash
python -m bins.infer
```
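
If you prefer driving the codec from Python rather than the CLI, the flow is load, encode, decode. The names below (`SACCodec`, `load_from_checkpoint`, `encode`, `decode`) are hypothetical placeholders for illustration; see `bins/infer.py` for the actual entry point.

```python
# Hypothetical round trip; bins/infer.py is the real entry point.
# SACCodec and its methods are assumed names, not the repository's API.
import torch
import torchaudio

from sac import SACCodec  # hypothetical import path

codec = SACCodec.load_from_checkpoint("checkpoints/SAC-16k-37_5Hz").eval()

wav, sr = torchaudio.load("input.wav")  # (channels, time)
assert sr == 16_000, "these checkpoints operate on 16 kHz audio"

with torch.no_grad():
    codes = codec.encode(wav)    # semantic + acoustic token streams
    recon = codec.decode(codes)  # reconstructed waveform, (channels, time)

torchaudio.save("output.wav", recon.cpu(), sr)
```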

We also provide batch scripts for [audio reconstruction](./scripts/batch/reconstruct.sh), [encoding](./scripts/batch/encode.sh), [decoding](./scripts/batch/decode.sh), and [embedding extraction](./scripts/batch/extract_embeddings.sh) in the `scripts/batch` directory; see the [batch scripts guide](./docs/batch_scripts_guide.md) for details.


## 🧪 Evaluation

You can run the following command to perform evaluation:

```bash
bash scripts/eval.sh
```

Before running, see the [evaluation guide](./docs/evaluation_guide.md) for dataset preparation and evaluation setup.


## 🚀 Training
### Step 1: Prepare training data
Before training, organize your dataset in **JSONL** format (see `example/training_data.jsonl` and the illustrative entry below). Each entry should include:
- **utt** – unique utterance ID (customizable)
- **wav_path** – path to the raw audio
- **ssl_path** – path to offline-extracted Whisper features (for semantic supervision)
- **semantic_token_path** – path to offline-extracted semantic tokens
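
A minimal illustrative entry (field names from the list above; paths and file extensions are placeholders, so follow `example/training_data.jsonl` for the authoritative format):

```json
{"utt": "spk001_utt0001", "wav_path": "data/wav/spk001_utt0001.wav", "ssl_path": "data/ssl/spk001_utt0001.pt", "semantic_token_path": "data/sem/spk001_utt0001.npy"}
```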

To accelerate training, **extract semantic tokens and Whisper features offline** before starting; refer to the [feature extraction guide](./docs/feature_extraction_guide.md) for detailed instructions.
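
The authoritative recipe (checkpoint, layer, output format) is in the guide above. Purely as an illustration of what dumping Whisper encoder features offline looks like, a generic `transformers` sketch might be:

```python
# Illustrative only: dump Whisper encoder features for one utterance.
# The checkpoint, layer, and format SAC expects are defined in
# docs/feature_extraction_guide.md, not here.
import torch
import torchaudio
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
model = WhisperModel.from_pretrained("openai/whisper-large-v3").eval()

wav, sr = torchaudio.load("data/wav/spk001_utt0001.wav")
wav = torchaudio.functional.resample(wav.mean(0), sr, 16_000)  # mono, 16 kHz

inputs = fe(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    feats = model.encoder(inputs.input_features).last_hidden_state  # (1, T, D)
torch.save(feats.squeeze(0), "data/ssl/spk001_utt0001.pt")
```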

### Step 2: Modify configuration files
You can adjust the training and DeepSpeed configurations by editing:
- [`configs/xxx.yaml`](./configs) – main training configuration
- [`configs/ds_stage2.json`](./configs/ds_stage2.json) – DeepSpeed configuration (an illustrative example follows)
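
For orientation, a ZeRO stage-2 DeepSpeed configuration typically contains entries like the following; the values are illustrative, not the contents of the repository's `ds_stage2.json`:

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true }
}
```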

### Step 3: Start training
Run the following script to start SAC training:

```bash
bash scripts/train.sh
```


## 🙏 Acknowledgement
Our codebase builds upon the awesome [SparkVox](https://github.com/SparkAudio/SparkVox) and [DAC](https://github.com/descriptinc/descript-audio-codec). We thank the authors for their excellent work.

## 🔖 Citation
If you find this work useful in your research, please consider citing:
```bibtex
@article{chen2025sac,
  title={SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization},
  author={Chen, Wenxi and Wang, Xinsheng and Yan, Ruiqi and Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Li, Xiquan and Liang, Yuzhe and Wen, Hanlin and Yin, Shunshun and others},
  journal={arXiv preprint arXiv:2510.16841},
  year={2025}
}
```

## 📜 License
This project is licensed under the Apache 2.0 License.