mrfakename committed
Commit 8a52737 · verified · 1 Parent(s): 91bf5f9

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+diarization.gif filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,227 @@
---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
- automatic-speech-recognition
license: cc-by-4.0
extra_gated_prompt: "Your input helps us strengthen the pyannote community and improve our open-source offerings. This pipeline is released under the CC-BY-4.0 license and will always remain freely accessible. By providing your details, you agree that we may email you occasionally with important news about pyannote models, invitations to try premium pipelines, and information about specific services designed for researchers and professionals like you."
extra_gated_fields:
  Company/university: text
  Use case:
    type: select
    options:
      - label: Meeting note taker (automated meeting transcription, action item extraction, and speaker identification in recordings)
        value: meeting
      - label: Conversation AI (chatbots, voice assistants, multi-turn dialogue systems with speaker awareness)
        value: conversation
      - label: CCaaS and customer experience (call center analytics, customer service optimization, and interaction quality monitoring)
        value: ccaas
      - label: Voice agents (AI-powered phone systems, automated customer service, voice-based interactions)
        value: agent
      - label: Media and automated dubbing (content creation, podcast processing, video production, and multilingual media)
        value: dubbing
      - label: Training and development (educational content analysis, corporate training evaluation, and learning assessment tools)
        value: training
      - label: Other
        value: other
---

# `community-1` speaker diarization

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.

- stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- audio files sampled at a different rate are automatically resampled to 16kHz upon loading.

The [main improvements brought by `Community-1`](https://www.pyannote.ai/blog/community-1) are:

- [improved](#benchmark) speaker assignment and counting
- simpler reconciliation with transcription timestamps, thanks to [*exclusive*](#exclusive-speaker-diarization) speaker diarization
- easy [offline use](#offline-use) (i.e. without an internet connection)
- (optionally) [hosted](https://hf.co/pyannote/speaker-diarization-community-1-cloud) on the pyannoteAI cloud

## Setup

1. `pip install pyannote.audio`
2. Accept the user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens) (or log in once, as sketched below)
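
If you prefer not to pass `token=...` explicitly in every script, a minimal sketch using the `huggingface_hub` login helper (installed alongside `pyannote.audio`) is shown below; in most setups, `Pipeline.from_pretrained` then picks up the cached token automatically.

```python
# optional: cache your Hugging Face access token locally once,
# so that Pipeline.from_pretrained() can usually be called without token=...
from huggingface_hub import login

login()  # prompts for the access token created in step 3 and stores it locally
```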

## Quick start

```python
# download the pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
```
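
The prediction can also be saved to disk in the standard RTTM format. Below is a minimal sketch that reuses the `(turn, speaker)` iteration from the quick start; the `audio` URI and the `audio.rttm` output path are placeholders.

```python
# dump the predicted speaker diarization to an RTTM file
# (one "SPEAKER <uri> <channel> <start> <duration> ... <speaker>" line per turn)
with open("audio.rttm", "w") as rttm:
    for turn, speaker in output.speaker_diarization:
        duration = turn.end - turn.start
        rttm.write(f"SPEAKER audio 1 {turn.start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>\n")
```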

## Benchmark

Out of the box, `Community-1` is significantly more accurate than `speaker-diarization-3.1`.

We report [diarization error rates](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %) on a large collection of academic benchmarks (fully automatic processing, with no forgiveness collar and no exclusion of overlapping speech).

| Benchmark (last updated in 2025-09) | <a href="https://hf.co/pyannote/speaker-diarization-3.1">`legacy` (3.1)</a> | <a href="https://www.pyannote.ai/blog/community-1">`community-1`</a> | <a href="https://www.pyannote.ai/blog/precision-2">`precision-2`</a> |
| --- | --- | --- | --- |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 12.2 | 11.7 | 11.4 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 24.5 | 20.3 | 15.2 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.8 | 17.0 | 12.9 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 22.7 | 19.9 | 15.6 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.7 | 44.6 | 37.1 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 28.5 | 26.7 | 16.6 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 21.4 | 20.2 | 14.7 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 51.2 | 46.8 | 39.0 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 25.4 | 22.8 | 17.3 |
| [RAMC](https://www.openslr.org/123/) | 22.2 | 20.8 | 10.5 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase 2) | 7.9 | 8.9 | 7.4 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.2 | 8.5 |
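
The same metric can be computed on your own annotated data with [`pyannote.metrics`](http://pyannote.github.io/pyannote-metrics/). A minimal sketch, assuming `reference` is a ground-truth `pyannote.core.Annotation` you have loaded elsewhere:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# same evaluation setup as the table above:
# no forgiveness collar, overlapping speech included
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, output.speaker_diarization)
print(f"DER = {100 * der:.1f}%")
```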

The `Precision-2` model is even more accurate and can be tested as follows:

1. Create an API key on the [pyannoteAI dashboard](https://dashboard.pyannote.ai) (free credits included)
2. Change one line of code

```diff
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-   'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+   'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
diarization = pipeline("audio.wav")  # runs on pyannoteAI servers
```

## Processing on GPU

`pyannote.audio` pipelines run on CPU by default.
You can send them to a GPU with the following lines:

```python
import torch
pipeline.to(torch.device("cuda"))
```
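
If you are not sure a GPU is available on the machine running the pipeline, a common pattern is to fall back to CPU:

```python
import torch

# use CUDA if available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
```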

## Processing from memory

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

## Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```

## Controlling the number of speakers

If the number of speakers is known in advance, use the `num_speakers` option:

```python
output = pipeline("audio.wav", num_speakers=2)
```

You can also provide lower and/or upper bounds on the number of speakers with the `min_speakers` and `max_speakers` options:

```python
output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```

## Exclusive speaker diarization

On top of the regular speaker diarization, the `Community-1` pretrained pipeline returns a new *exclusive* speaker diarization, available as `output.exclusive_speaker_diarization`.

This feature is [backported from our latest commercial model](https://www.pyannote.ai/blog/precision-2) and simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes less precise) transcription timestamps.
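
For illustration, here is a short sketch printing both outputs side by side. It assumes the exclusive diarization can be iterated over exactly like the regular one shown in the quick start.

```python
# regular speaker diarization (speaker turns may overlap)
for turn, speaker in output.speaker_diarization:
    print(f"[regular]   {speaker}: {turn.start:.3f}s - {turn.end:.3f}s")

# exclusive speaker diarization (at most one speaker at any given time,
# easier to reconcile with word-level transcription timestamps)
for turn, speaker in output.exclusive_speaker_diarization:
    print(f"[exclusive] {speaker}: {turn.start:.3f}s - {turn.end:.3f}s")
```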

## Offline use

1. In the terminal, copy the pipeline to disk:

```bash
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# create a directory on disk
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
```

2. In Python, use the pipeline without an internet connection:

```python
# load the pipeline from disk (works without an internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')

# run the pipeline locally on your computer
output = pipeline("audio.wav")
```

## Citations

1. Speaker segmentation model

```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

2. Speaker embedding model

```bibtex
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```

3. Speaker clustering

```bibtex
@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  year={2022},
  journal={Computer Speech \& Language},
}
```

## Acknowledgment

Training and tuning were made possible thanks to [GENCI](https://www.genci.fr/) on the [**Jean Zay**](http://www.idris.fr/eng/jean-zay/) supercomputer.

config.yaml ADDED
@@ -0,0 +1,21 @@
dependencies:
  pyannote.audio: 4.0.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: VBxClustering
    segmentation: $model/segmentation
    segmentation_batch_size: 32
    embedding: $model/embedding
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    plda: $model/plda

params:
  clustering:
    threshold: 0.6
    Fa: 0.07
    Fb: 0.8
  segmentation:
    min_duration_off: 0.0
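
The `params` block above holds the hyper-parameters the pipeline was tuned with. If you want to experiment with them, a minimal sketch assuming the standard `Pipeline.instantiate` API is shown below; the values are copied from this config and only meant as an illustration, since they were optimized jointly.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1", token="{huggingface-token}")

# re-instantiate the pipeline with (possibly modified) hyper-parameters
pipeline.instantiate({
    "clustering": {"threshold": 0.6, "Fa": 0.07, "Fb": 0.8},
    "segmentation": {"min_duration_off": 0.0},
})
```
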
diarization.gif ADDED

Git LFS Details

  • SHA256: 0d925ad38995d89009260e493b0ae2e684c3e1397f495265ed841c45c4f73a35
  • Pointer size: 131 Bytes
  • Size of remote file: 861 kB
embedding/README.md ADDED
@@ -0,0 +1,20 @@
Copied from https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM

## License

According to [this page](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md):

> The pretrained model in WeNet follows the license of it's corresponding dataset. For example, the pretrained model on VoxCeleb follows Creative Commons Attribution 4.0 International License., since it is used as license of the VoxCeleb dataset, see https://mm.kaist.ac.kr/datasets/voxceleb/.

## Citation

```bibtex
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```
embedding/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6f10ff60898a1d185fa22e1d11e0bfa8a92efec811f11bca48cb8cafebefd929
size 26646242
plda/README.md ADDED
@@ -0,0 +1,3 @@
PLDA model trained by the [BUT Speech@FIT](https://speech.fit.vut.cz/) group.

Thanks to [Jiangyu Han](https://github.com/jyhan03) and [Petr Pálka](https://github.com/Selesnyan) for integrating VBx into pyannote.audio.
plda/plda.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9b77bcd840692710dd3496f62ecfeed8d8e5f002fd991b785079b244eab7d255
size 133852
plda/xvec_transform.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:325f1ce8e48f7e55e9c8aa47e05d2766b7c48c4b25b8de8dd751e7a4cc5fbe8f
size 134376
segmentation/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ad24338d844fb95985486eb1a464e32d229f6d7a03c9abe60f978bacf3f816e
size 5906507