Commit 0375fa5 by 99eren99 · verified · 1 Parent(s): 051dcc5

Create README.md

Files changed (1)
  1. README.md +276 -0
README.md ADDED
@@ -0,0 +1,276 @@
1
+ ---
+ tags:
+ - ColBERT
+ - PyLate
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - loss:Distillation
+ - turkish
+ pipeline_tag: sentence-similarity
+ library_name: PyLate
+ ---
14
+
15
+ # PyLate
16
+
17
+ This is a [PyLate](https://github.com/lightonai/pylate) model. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
18
+
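For readers unfamiliar with MaxSim, the following is a minimal pure-Python sketch of the scoring rule: for each query token embedding, take its highest dot product against all document token embeddings, then sum those maxima. The names `dot` and `maxsim` and the toy 2-dimensional vectors are illustrative assumptions, not part of PyLate, which computes this internally on 128-dimensional embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_embeddings: Sequence[Vector], document_embeddings: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity against any document token."""
    return sum(
        max(dot(q, d) for d in document_embeddings)
        for q in query_embeddings
    )


# Toy unit-length token embeddings (illustrative only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]

score_a = maxsim(query, doc_a)  # each query token finds an exact match
score_b = maxsim(query, doc_b)  # both query tokens match the same document token
print(score_a, score_b)
```

Because every query token is matched independently, a document scores higher when it contains a close match for each query token, which is why `doc_a` outranks `doc_b` here.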
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+ - **Model Type:** PyLate model
23
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
24
+ - **Document Length:** 8192 tokens
25
+ - **Query Length:** 32 tokens
26
+ - **Output Dimensionality:** 128 dimensions
27
+ - **Similarity Function:** MaxSim
28
+ <!-- - **Training Dataset:** Unknown -->
29
+ <!-- - **Language:** Unknown -->
30
+ <!-- - **License:** Unknown -->
31
+
32
+ ### Model Sources
33
+
34
+ - **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
35
+ - **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
36
+ - **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
37
+
38
+ ### Full Model Architecture
39
+
40
+ ```
+ ColBERT(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
+   (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
+ )
+ ```
46
+ ## Evaluation
47
+ nDCG and Recall scores of this model (out-of-domain predictions) and of other multilingual late-interaction retrieval models on [Tr-NanoBEIR](https://huggingface.co/datasets/99eren99/Tr-NanoBEIR).
48
+ <img src="https://huggingface.co/99eren99/TrColBERT-Long/resolve/main/assets/scores.png"
49
+ alt="drawing"/>
50
+
51
+ ## Usage
52
+ First, install the required libraries. A GPU that supports Flash Attention 2 is required for consistent results; otherwise, you need to mask the query expansion tokens in the output layer manually:
53
+
54
+ ```bash
+ pip install -U einops flash_attn
+ pip install -U pylate
+ ```
58
+
59
+ Then normalize your text before encoding it: `lambda x: x.replace("İ", "i").replace("I", "ı").lower()`
60
+
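The explicit replacements matter because Python's built-in `str.lower()` follows the default Unicode case mapping rather than Turkish casing: it turns `I` into `i` instead of `ı`, and `İ` into `i` followed by a combining dot. A minimal sketch of the normalization step as a named function is given here; `normalize_turkish` is a hypothetical name, and applying it to both documents and queries is an assumption based on the instruction above.

```python
def normalize_turkish(text: str) -> str:
    """Lowercase text using Turkish casing for the dotted and dotless I."""
    # Handle the two Turkish-specific letters first, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Default lowercasing is not Turkish-aware:
print("ILIK".lower())            # gives "ilik", not the Turkish "ılık"
print(len("İ".lower()))          # 2: "i" plus a combining dot above

# With the explicit replacements the result matches Turkish casing:
normalized = normalize_turkish("İSTANBUL ILIK")
print(normalized)
```

The same function can then be mapped over a list of texts, for example `[normalize_turkish(t) for t in documents]`, before passing them to the model.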
61
+ ### Retrieval
62
+
63
+ PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
64
+
65
+ #### Indexing documents
66
+
67
+ First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
68
+
69
+ ```python
+ from pylate import indexes, models, retrieve
+
+ # Step 1: Load the ColBERT model
+ document_length = 8192  # Documents are truncated to this many tokens; valid range is [1, 8192]
+ model = models.ColBERT(
+     model_name_or_path="99eren99/TrColbert-Long",
+     document_length=document_length,
+ )
+ # Drop "token_type_ids" from the tokenizer's input names if it is listed
+ try:
+     model.tokenizer.model_input_names.remove("token_type_ids")
+ except ValueError:
+     pass
+ model.eval()
+ model.to("cuda")
+
+ # Step 2: Initialize the Voyager index
+ index = indexes.Voyager(
+     index_folder="pylate-index",
+     index_name="index",
+     override=True,  # This overwrites the existing index if any
+ )
+
+ # Step 3: Encode the documents
+ documents_ids = ["1", "2", "3"]
+ documents = ["document 1 text", "document 2 text", "document 3 text"]
+
+ documents_embeddings = model.encode(
+     documents,
+     batch_size=32,
+     is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
+     show_progress_bar=True,
+ )
+
+ # Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
+ index.add_documents(
+     documents_ids=documents_ids,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
108
+
109
+ Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
110
+
111
+ ```python
+ # To load an index, simply instantiate it with the correct folder/name and without overriding it
+ index = indexes.Voyager(
+     index_folder="pylate-index",
+     index_name="index",
+ )
+ ```
118
+
119
+ #### Retrieving top-k documents for queries
120
+
121
+ Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
122
+ To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the best matches:
123
+
124
+ ```python
+ # Step 1: Initialize the ColBERT retriever
+ retriever = retrieve.ColBERT(index=index)
+
+ # Step 2: Encode the queries
+ queries_embeddings = model.encode(
+     ["query for document 3", "query for document 1"],
+     batch_size=32,
+     is_query=True,  # Ensure that it is set to True to indicate that these are queries
+     show_progress_bar=True,
+ )
+
+ # Step 3: Retrieve top-k documents
+ scores = retriever.retrieve(
+     queries_embeddings=queries_embeddings,
+     k=10,  # Retrieve the top 10 matches for each query
+ )
+ ```
142
+
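As a rough sketch of how the results might be consumed: the PyLate documentation shows `retrieve` returning one list per query, each holding dictionaries with an `id` and a `score`. The snippet below assumes that structure and uses hard-coded, made-up values (`sample_scores`) in place of real model output; `best_match_ids` is a hypothetical helper, not a PyLate function.

```python
from typing import Optional

# Made-up values standing in for the `scores` returned by retriever.retrieve(...).
# Assumed structure: one list per query, ordered by decreasing score.
sample_scores = [
    [{"id": "3", "score": 11.3}, {"id": "1", "score": 10.3}],
    [{"id": "1", "score": 10.9}, {"id": "3", "score": 9.2}],
]


def best_match_ids(results: list[list[dict]]) -> list[Optional[str]]:
    """Return the id of the top-ranked document for each query (None if no hits)."""
    return [hits[0]["id"] if hits else None for hits in results]


# Print every hit with its rank, then collect only the best match per query.
for query_index, hits in enumerate(sample_scores):
    for rank, hit in enumerate(hits, start=1):
        print(f"query {query_index} rank {rank}: document {hit['id']} (score {hit['score']:.2f})")

top_ids = best_match_ids(sample_scores)
```

If the structure returned by your PyLate version differs, inspect `scores[0]` first and adapt the key names accordingly.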
143
+ ### Reranking
144
+ If you only want to use the ColBERT model to rerank the output of a first-stage retrieval pipeline, without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
145
+
146
+ ```python
+ from pylate import rank, models
+
+ queries = [
+     "query A",
+     "query B",
+ ]
+
+ # One list of candidate documents per query
+ documents = [
+     ["document A", "document B"],
+     ["document A", "document C", "document B"],
+ ]
+
+ # IDs aligned position by position with `documents`
+ documents_ids = [
+     [1, 2],
+     [1, 3, 2],
+ ]
+
+ # Same model as in the retrieval example above
+ pylate_model_id = "99eren99/TrColbert-Long"
+ model = models.ColBERT(
+     model_name_or_path=pylate_model_id,
+ )
+
+ queries_embeddings = model.encode(
+     queries,
+     is_query=True,
+ )
+
+ documents_embeddings = model.encode(
+     documents,
+     is_query=False,
+ )
+
+ reranked_documents = rank.rerank(
+     documents_ids=documents_ids,
+     queries_embeddings=queries_embeddings,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
184
+
185
+ <!--
186
+ ### Direct Usage (Transformers)
187
+
188
+ <details><summary>Click to see the direct usage in Transformers</summary>
189
+
190
+ </details>
191
+ -->
192
+
193
+ <!--
194
+ ### Downstream Usage (Sentence Transformers)
195
+
196
+ You can finetune this model on your own dataset.
197
+
198
+ <details><summary>Click to expand</summary>
199
+
200
+ </details>
201
+ -->
202
+
203
+ <!--
204
+ ### Out-of-Scope Use
205
+
206
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
207
+ -->
208
+
209
+ <!--
210
+ ## Bias, Risks and Limitations
211
+
212
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
213
+ -->
214
+
215
+ <!--
216
+ ### Recommendations
217
+
218
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
219
+ -->
220
+
221
+
222
+ ### Framework Versions
223
+ - Python: 3.10.16
224
+ - Sentence Transformers: 4.0.2
225
+ - PyLate: 1.1.7
226
+ - Transformers: 4.48.2
227
+ - PyTorch: 2.5.1+cu124
228
+ - Accelerate: 1.2.1
229
+ - Datasets: 2.21.0
230
+ - Tokenizers: 0.21.0
231
+
232
+
233
+ ## Citation
234
+
235
+ ### BibTeX
236
+
237
+ #### Sentence Transformers
238
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084"
+ }
+ ```
249
+
250
+ #### PyLate
251
+ ```bibtex
+ @misc{PyLate,
+     title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
+     author={Chaffin, Antoine and Sourty, Raphaël},
+     url={https://github.com/lightonai/pylate},
+     year={2024}
+ }
+ ```
259
+
260
+ <!--
261
+ ## Glossary
262
+
263
+ *Clearly define terms in order to be accessible across audiences.*
264
+ -->
265
+
266
+ <!--
267
+ ## Model Card Authors
268
+
269
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
270
+ -->
271
+
272
+ <!--
273
+ ## Model Card Contact
274
+
275
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
276
+ -->