---
language:
- pt
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:25863649
- loss:Contrastive
base_model: ibm-granite/granite-embedding-107m-multilingual
datasets:
- cnmoro/AllTripletsMsMarco-PTBR
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on ibm-granite/granite-embedding-107m-multilingual
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.8173714280128479
      name: Accuracy
---

# PyLate model based on ibm-granite/granite-embedding-107m-multilingual

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [ibm-granite/granite-embedding-107m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-107m-multilingual) on the [all_triplets_ms_marco-ptbr](https://huggingface.co/datasets/cnmoro/AllTripletsMsMarco-PTBR) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity via the MaxSim operator.

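For intuition, MaxSim (late interaction) scores a query against a document by matching each query token embedding to its most similar document token embedding and summing those maxima. The snippet below is a minimal, illustrative PyTorch sketch of that operator, not the PyLate implementation itself (which additionally handles batching and padding masks):

```python
import torch

def maxsim_score(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    """MaxSim: for each query token, keep its best-matching document token, then sum.

    query_embeddings:    (num_query_tokens, 128) L2-normalized token vectors
    document_embeddings: (num_doc_tokens, 128)   L2-normalized token vectors
    """
    # Token-to-token cosine similarities (dot products of normalized vectors)
    similarities = query_embeddings @ document_embeddings.T  # (num_query_tokens, num_doc_tokens)
    # Best document token per query token, summed into a single relevance score
    return similarities.max(dim=1).values.sum()

# Toy example with random 128-dimensional token embeddings
query = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
document = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(query, document))
```
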
## Model Details

### Model Description
- **Model Type:** PyLate model
- **Base model:** [ibm-granite/granite-embedding-107m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-107m-multilingual) <!-- at revision 5c793ec061753b0d0816865e1af7db3f675d65af -->
- **Document Length:** 180 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions per token
- **Similarity Function:** MaxSim
- **Training Dataset:**
    - [all_triplets_ms_marco-ptbr](https://huggingface.co/datasets/cnmoro/AllTripletsMsMarco-PTBR)
- **Language:** pt

### Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```

## Usage
First install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model (set pylate_model_id to this model's Hugging Face repository ID)
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can reuse the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

### Reranking
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank` function and pass the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Evaluation

### Metrics

#### Col BERTTriplet

* Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.8174** |

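The accuracy above is triplet accuracy: the fraction of (anchor, positive, negative) triplets for which the MaxSim score between the anchor and the positive exceeds the score between the anchor and the negative. Below is a hedged sketch of running such an evaluation with PyLate; it assumes the evaluator accepts `anchors`, `positives`, and `negatives` lists, mirroring the sentence-transformers `TripletEvaluator` interface, and the example triplets are purely illustrative.

```python
from pylate import evaluation, models

# pylate_model_id is a placeholder for this model's Hugging Face repository ID
model = models.ColBERT(model_name_or_path=pylate_model_id)

# A few held-out Portuguese triplets (illustrative placeholders, not the actual eval set)
anchors = ["qual a capital do brasil"]
positives = ["Brasília é a capital federal do Brasil."]
negatives = ["O Rio de Janeiro foi a capital do Brasil até 1960."]

# Assumed constructor signature, mirroring sentence-transformers' TripletEvaluator
evaluator = evaluation.ColBERTTripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
)
results = evaluator(model)  # expected to report the triplet accuracy
print(results)
```
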
## Training Details

### Training Dataset

#### all_triplets_ms_marco-ptbr

* Dataset: [all_triplets_ms_marco-ptbr](https://huggingface.co/datasets/cnmoro/AllTripletsMsMarco-PTBR) at [f934503](https://huggingface.co/datasets/cnmoro/AllTripletsMsMarco-PTBR/tree/f934503cfbb69901217f12c87f28767354e597ea)
* Size: 25,863,649 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Loss: <code>pylate.losses.contrastive.Contrastive</code>

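A minimal sketch of inspecting the training triplets with the `datasets` library, streaming to avoid downloading all 25.8M rows; it assumes the default `train` split, and the column names follow the list above:

```python
from datasets import load_dataset

# Stream the Portuguese MS MARCO triplets rather than downloading the full dataset
dataset = load_dataset("cnmoro/AllTripletsMsMarco-PTBR", split="train", streaming=True)

# Each example is an (anchor, positive, negative) triplet
example = next(iter(dataset))
print(example["anchor"])
print(example["positive"])
print(example["negative"])
```
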
### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 3e-06
- `num_train_epochs`: 1
- `max_steps`: 1562500
- `fp16`: True

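For orientation, here is a hedged sketch of how a contrastive PyLate run with these settings could be wired up, following the pattern in the PyLate documentation. The collator name (`utils.ColBERTCollator`) and the trainer wiring are assumptions taken from that documentation, not a verbatim copy of this model's training script:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from pylate import losses, models, utils

# Base model and training data (the full 25.8M-triplet set; streaming may be preferable in practice)
model = models.ColBERT(model_name_or_path="ibm-granite/granite-embedding-107m-multilingual")
train_dataset = load_dataset("cnmoro/AllTripletsMsMarco-PTBR", split="train")

args = SentenceTransformerTrainingArguments(
    output_dir="colbert-granite-ptbr",  # hypothetical output directory
    num_train_epochs=1,
    max_steps=1_562_500,
    per_device_train_batch_size=16,
    learning_rate=3e-6,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.Contrastive(model=model),               # pylate.losses.contrastive.Contrastive
    data_collator=utils.ColBERTCollator(model.tokenize),  # assumed collator, per the PyLate docs
)
trainer.train()
```
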
#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 3e-06
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: 1562500
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

</details>

### Framework Versions
- Python: 3.10.18
- Sentence Transformers: 4.0.2
- PyLate: 1.2.0
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

#### PyLate
```bibtex
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}
```