dragonkue committed
Commit 181ccb9 · verified · 1 Parent(s): fa785ab

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 384,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
README.md CHANGED
@@ -1,3 +1,463 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ base_model: intfloat/multilingual-e5-small
8
+ pipeline_tag: sentence-similarity
9
+ library_name: sentence-transformers
10
+ license: apache-2.0
11
+ language:
12
+ - ko
13
+ - en
14
+ ---
15
+
16
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/IxcqY5qbGNuGpqDciIcOI.webp" width="600">
17
+
18
+ # SentenceTransformer based on intfloat/multilingual-e5-small
19
+
20
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) on datasets that include Korean query-passage pairs for improved performance on Korean retrieval tasks. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
21
+
22
+ This model is a lightweight Korean retriever, designed for ease of use and strong performance in practical retrieval tasks.
23
+ It is ideal for running demos or lightweight applications, offering a good balance between speed and accuracy.
24
+
25
+ For even higher retrieval performance, we recommend combining it with a reranker.
26
+ Suggested reranker models:
27
+
28
+ - dragonkue/bge-reranker-v2-m3-ko
29
+
30
+ - BAAI/bge-reranker-v2-m3
31
+
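For example, a minimal retrieve-then-rerank sketch, assuming the `sentence-transformers` CrossEncoder interface and one of the rerankers listed above (the query and passages are illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

retriever = SentenceTransformer("dragonkue/multilingual-e5-small-ko")
reranker = CrossEncoder("dragonkue/bge-reranker-v2-m3-ko")

query = "환경마크 제도의 법적 근거는?"
passages = [
    "「환경기술 및 환경산업 지원법」 제17조(환경표지의 인증)",
    "북한 가족법은 1990년에 제정되어 4차례 개정되었다.",
]

# Stage 1: dense retrieval with the E5-style prefixes this model expects.
query_emb = retriever.encode(["query: " + query])
passage_embs = retriever.encode(["passage: " + p for p in passages])
dense_scores = retriever.similarity(query_emb, passage_embs)[0]

# Stage 2: rerank the retrieved candidates with the cross-encoder (no prefixes needed).
reranked = reranker.rank(query, passages)
print(dense_scores.tolist(), reranked)
```

In a real pipeline you would retrieve the top-k passages from a large corpus first and only pass those candidates to the reranker.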
32
+ ## Model Details
33
+
34
+ ### Model Description
35
+ - **Model Type:** Sentence Transformer
36
+ - **Base model:** [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) <!-- at revision c007d7ef6fd86656326059b28395a7a03a7c5846 -->
37
+ - **Maximum Sequence Length:** 512 tokens
38
+ - **Output Dimensionality:** 384 dimensions
39
+ - **Similarity Function:** Cosine Similarity
40
+ - **Training Datasets:**
41
+
42
+ <!-- - **Language:** Unknown -->
43
+ <!-- - **License:** Unknown -->
44
+
45
+ ### Model Sources
46
+
47
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
48
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
49
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
50
+
51
+ ### Full Model Architecture
52
+
53
+ ```
54
+ SentenceTransformer(
55
+ (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
56
+ (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
57
+ (2): Normalize()
58
+ )
59
+ ```
60
+
61
+ ## Usage
62
+
63
+ **🪶 Lightweight Version Available**
64
+
65
+ We also introduce a lightweight variant of this model:
66
+ [`exp-models/dragonkue-KoEn-E5-Tiny`](https://huggingface.co/exp-models/dragonkue-KoEn-E5-Tiny),
67
+ which removes all tokens **except Korean and English** to reduce model size while maintaining performance.
68
+
69
+ The repository also includes a **GGUF-quantized version**, making it suitable for efficient local or on-device embedding model serving.
70
+
71
+ > 🔧 For practical deployment, we highly recommend using this **lightweight retriever** in combination with a **reranker** model — it forms a powerful and resource-efficient retrieval setup.
72
+
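As a rough sketch of local serving with `llama-cpp-python` (the GGUF filename below is a placeholder; check the `exp-models/dragonkue-KoEn-E5-Tiny` repository for the actual file):

```python
from llama_cpp import Llama

# Placeholder path: download the actual GGUF file from the
# exp-models/dragonkue-KoEn-E5-Tiny repository first.
embedder = Llama(model_path="./dragonkue-KoEn-E5-Tiny.Q8_0.gguf", embedding=True)

vector = embedder.embed("query: 북한 가족법은 몇 차례 개정되었는가?")
print(len(vector))
```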
73
+
74
+ ### Direct Usage (Sentence Transformers)
75
+
76
+ First install the Sentence Transformers library:
77
+
78
+ ```bash
79
+ pip install -U sentence-transformers
80
+ ```
81
+
82
+ Then you can load this model and run inference.
83
+ ```python
84
+ from sentence_transformers import SentenceTransformer
85
+
86
+ # Download from the 🤗 Hub
87
+ model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")
88
+ # Run inference
89
+ sentences = [
90
+ 'query: 북한가족법 몇 차 개정에서 이혼판결 확정 후 3개월 내에 등록시에만 유효하다는 조항을 확실히 했을까?',
91
+ 'passage: 1990년에 제정된 북한 가족법은 지금까지 4차례 개정되어 현재에 이르고 있다. 1993년에 이루어진 제1차 개정은 주로 규정의 정확성을 기하기 위하여 몇몇 조문을 수정한 것이며, 실체적인 내용을 보완한 것은 상속의 승인과 포기기간을 설정한 제52조 정도라고 할 수 있다. 2004년에 이루어진 제2차에 개정에서는 제20조제3항을 신설하여 재판상 확정된 이혼판결을 3개월 내에 등록해야 이혼의 효력이 발생한다는 것을 명확하게 하였다. 2007년에 이루어진 제3차 개정에서는 부모와 자녀 관계 또한 신분등록기관에 등록한 때부터 법적 효력이 발생한다는 것을 신설(제25조제2항)하였다. 또한 미성년자, 노동능력 없는 자의 부양과 관련(제37조제2항)하여 기존에는 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀, 조부모나 손자녀, 형제자매가 부양한다”고 규정하고 있었던 것을 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀가 부양하며 그들이 없을 경우에는 조부모나 손자녀, 형제자매가 부양한다”로 개정하였다.',
92
+ 'passage: 환경마크 제도, 인증기준 변경으로 기업부담 줄인다\n환경마크 제도 소개\n□ 개요\n○ 동일 용도의 다른 제품에 비해 ‘제품의 환경성*’을 개선한 제품에 로고와 설명을 표시할 수 있도록하는 인증 제도\n※ 제품의 환경성 : 재료와 제품을 제조․소비 폐기하는 전과정에서 오염물질이나 온실가스 등을 배출하는 정도 및 자원과 에너지를 소비하는 정도 등 환경에 미치는 영향력의 정도(「환경기술 및 환경산업 지원법」제2조제5호)\n□ 법적근거\n○ 「환경기술 및 환경산업 지원법」제17조(환경표지의 인증)\n□ 관련 국제표준\n○ ISO 14024(제1유형 환경라벨링)\n□ 적용대상\n○ 사무기기, 가전제품, 생활용품, 건축자재 등 156개 대상제품군\n□ 인증현황\n○ 2,737개 기업의 16,647개 제품(2015.12월말 기준)',
93
+ ]
94
+ embeddings = model.encode(sentences)
95
+ print(embeddings.shape)
96
+ # [3, 384]
97
+
98
+ # Get the similarity scores for the embeddings
99
+ similarities = model.similarity(embeddings, embeddings)
100
+ print(similarities.shape)
101
+ # [3, 3]
102
+ ```
103
+
104
+ ### Direct Usage (Transformers)
105
+
106
+ ```python
107
+ import torch.nn.functional as F
108
+
109
+ from torch import Tensor
110
+ from transformers import AutoTokenizer, AutoModel
111
+
112
+
113
+ def average_pool(last_hidden_states: Tensor,
114
+ attention_mask: Tensor) -> Tensor:
115
+ last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
116
+ return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
117
+
118
+
119
+ # Each input text should start with "query: " or "passage: ", even for non-English texts.
120
+ # For tasks other than retrieval, you can simply use the "query: " prefix.
121
+ input_texts = ["query: 북한가족법 몇 차 개정에서 이혼판결 확정 후 3개월 내에 등록시에만 유효하다는 조항을 확실히 했을까?",
122
+ "passage: 1990년에 제정된 북한 가족법은 지금까지 4차례 개정되어 현재에 이르고 있다. 1993년에 이루어진 제1차 개정은 주로 규정의 정확성을 기하기 위하여 몇몇 조문을 수정한 것이며, 실체적인 내용을 보완한 것은 상속의 승인과 포기기간을 설정한 제52조 정도라고 할 수 있다. 2004년에 이루어진 제2차에 개정에서는 제20조제3항을 신설하여 재판상 확정된 이혼판결을 3개월 내에 등록해야 이혼의 효력이 발생한다는 것을 명확하게 하였다. 2007년에 이루어진 제3차 개정에서는 부모와 자녀 관계 또한 신분등록기관에 등록한 때부터 법적 효력이 발생한다는 것을 신설(제25조제2항)하였다. 또한 미성년자, 노동능력 없는 자의 부양과 관련(제37조제2항)하여 기존에는 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀, 조부모나 손자녀, 형제자매가 부양한다”고 규정하고 있었던 것을 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀가 부양하며 그들이 없을 경우에는 조부모나 손자녀, 형제자매가 부양한다”로 개정하였다.",
123
+ "passage: 환경마크 제도, 인증기준 변경으로 기업부담 줄인다\n환경마크 제도 소개\n□ 개요\n○ 동일 용도의 다른 제품에 비해 ‘제품의 환경성*’을 개선한 제품에 로고와 설명을 표시할 수 있도록하는 인증 제도\n※ 제품의 환경성 : 재료와 제품을 제조․소비 폐기하는 전과정에서 오염물질이나 온실가스 등을 배출하는 정도 및 자원과 에너지를 소비하는 정도 등 환경에 미치는 영향력의 정도(「환경기술 및 환경산업 지원법」제2조제5호)\n□ 법적근거\n○ 「환경기술 및 환경산업 지원법」제17조(환경표지의 인증)\n□ 관련 국제표준\n○ ISO 14024(제1유형 환경라벨링)\n□ 적용대상\n○ 사무기기, 가전제품, 생활용품, 건축자재 등 156개 대상제품군\n□ 인증현황\n○ 2,737개 기업의 16,647개 제품(2015.12월말 기준)"]
124
+
125
+ tokenizer = AutoTokenizer.from_pretrained('dragonkue/multilingual-e5-small-ko')
126
+ model = AutoModel.from_pretrained('dragonkue/multilingual-e5-small-ko')
127
+
128
+ # Tokenize the input texts
129
+ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
130
+
131
+ outputs = model(**batch_dict)
132
+ embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
133
+
134
+ # normalize embeddings
135
+ embeddings = F.normalize(embeddings, p=2, dim=1)
136
+ scores = (embeddings[:1] @ embeddings[1:].T)
137
+ print(scores.tolist())
138
+ ```
139
+
140
+
141
+ <!--
142
+ ### Downstream Usage (Sentence Transformers)
143
+
144
+ You can finetune this model on your own dataset.
145
+
146
+ <details><summary>Click to expand</summary>
147
+
148
+ </details>
149
+ -->
150
+
151
+ <!--
152
+ ### Out-of-Scope Use
153
+
154
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
155
+ -->
156
+
157
+ ## Evaluation
158
+
159
+ - This evaluation follows the setup of the KURE GitHub repository (https://github.com/nlpai-lab/KURE).
160
+ - We conducted an evaluation on all **Korean Retrieval Benchmarks** registered in [MTEB](https://github.com/embeddings-benchmark/mteb).
161
+
162
+ ### Korean Retrieval Benchmark
163
+ - [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
164
+ - [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
165
+ - [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): A **Korean document retrieval dataset** based on Wikipedia.
166
+ - [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on **medical and public health domains** in Korean.
167
+ - [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): A **Korean document retrieval dataset** based on FLORES-200.
168
+ - [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.
169
+ - [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): A **cross-domain Korean document retrieval dataset**.
170
+
171
+ ### Metrics
172
+
173
+ * Standard metric: NDCG@10
174
+
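A minimal sketch of how such an evaluation can be run with the `mteb` package (the task identifiers below mirror the benchmark names in the table and should be checked against your installed `mteb` version; note that E5-style models also expect the `query: `/`passage: ` prefixes):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Assumed task names; adjust to the identifiers available in your mteb release.
tasks = mteb.get_tasks(tasks=["Ko-StrategyQA", "PublicHealthQA", "BelebeleRetrieval"], languages=["kor"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-small-ko")
```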
175
+ #### Information Retrieval
176
+ | Model | Size (M params) | Average | XPQARetrieval | PublicHealthQA | MIRACLRetrieval | Ko-StrategyQA | BelebeleRetrieval | AutoRAGRetrieval | MrTidyRetrieval |
177
+ |:------------------------------------------------------------|----------:|----------:|----------------:|-----------------:|------------------:|----------------:|--------------------:|-------------------:|------------------:|
178
+ | BAAI/bge-m3 | 560 | 0.724169 | 0.36075 | 0.80412 | 0.70146 | 0.79405 | 0.93164 | 0.83008 | 0.64708 |
179
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 560 | 0.724104 | 0.43018 | 0.81679 | 0.66077 | 0.80455 | 0.9271 | 0.83863 | 0.59071 |
180
+ | intfloat/multilingual-e5-large | 560 | 0.721607 | 0.3571 | 0.82534 | 0.66486 | 0.80348 | 0.94499 | 0.81337 | 0.64211 |
181
+ | intfloat/multilingual-e5-base | 278 | 0.689429 | 0.3607 | 0.77203 | 0.6227 | 0.76355 | 0.92868 | 0.79752 | 0.58082 |
182
+ | **dragonkue/multilingual-e5-small-ko** | 118 | 0.688819 | 0.34871 | 0.79729 | 0.61113 | 0.76173 | 0.9297 | 0.86184 | 0.51133 |
183
+ | **exp-models/dragonkue-KoEn-E5-Tiny** | 37 | 0.687496 | 0.34735 | 0.7925 | 0.6143 | 0.75978 | 0.93018 | 0.86503 | 0.50333 |
184
+ | intfloat/multilingual-e5-small | 118 | 0.670906 | 0.33003 | 0.73668 | 0.61238 | 0.75157 | 0.90531 | 0.80068 | 0.55969 |
185
+ | ibm-granite/granite-embedding-278m-multilingual | 278 | 0.616466 | 0.23058 | 0.77668 | 0.59216 | 0.71762 | 0.83231 | 0.70226 | 0.46365 |
186
+ | ibm-granite/granite-embedding-107m-multilingual | 107 | 0.599759 | 0.23058 | 0.73209 | 0.58413 | 0.70531 | 0.82063 | 0.68243 | 0.44314 |
187
+ | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 118 | 0.409766 | 0.21345 | 0.67409 | 0.25676 | 0.45903 | 0.71491 | 0.42296 | 0.12716 |
188
+
189
+ #### Performance Comparison by Model Size (Based on Average NDCG@10)
190
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/Utunk7FbZsTDEVsOVUms1.png" width="1000"/>
191
+
192
+ <!--
193
+ ### Recommendations
194
+
195
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
196
+ -->
197
+
198
+ ## Training Details
199
+
200
+ ### Training Datasets
201
+ This model was fine-tuned on the same dataset used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, which consists of Korean query-passage pairs.
202
+ The training objective was to improve retrieval performance specifically for Korean-language tasks.
203
+
204
+ ### Training Methods
205
+
206
+ Following the training approach used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, this model constructs in-batch negatives based on clustered passages. In addition, we introduce GISTEmbedLoss with a configurable margin.
207
+
208
+ **📈 Margin-based Training Results**
209
+ - Using the standard MNR (Multiple Negatives Ranking) loss alone resulted in decreased performance.
210
+
211
+ - The original GISTEmbedLoss (without margin) yielded modest improvements of around +0.8 NDCG@10.
212
+
213
+ - Applying a margin led to performance gains of up to +1.5 NDCG@10.
214
+
215
+ - In other words, tuning the margin alone nearly doubled the gain (+0.8 → +1.5 NDCG@10), showing how sensitive and effective margin scaling is.
216
+
217
+ This margin-based approach extends the idea proposed in the NV-Retriever paper, which originally filtered false negatives during hard negative sampling.
218
+ We adapt this to in-batch negatives, treating false negatives as dynamic samples guided by margin-based filtering.
219
+
220
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/IpDDTshuZ5noxPOdm6gVk.png" width="800"/>
221
+
222
+ The sentence-transformers library now supports GISTEmbedLoss with margin configuration, making it easy to integrate into any training pipeline.
223
+
224
+ You can install the latest version with:
225
+
226
+ ```bash
227
+ pip install -U sentence-transformers
228
+ ```
229
+
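A minimal training sketch under these assumptions: the guide model, the toy pair, and the `margin_strategy`/`margin` values are illustrative, and the margin arguments require a recent `sentence-transformers` release that exposes them on `GISTEmbedLoss`.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("intfloat/multilingual-e5-small")
guide = SentenceTransformer("intfloat/multilingual-e5-small")  # illustrative guide choice

# Toy anchor/positive pair; real training uses Korean query-passage pairs.
train_dataset = Dataset.from_dict({
    "anchor": ["query: 북한 가족법은 몇 차례 개정되었는가?"],
    "positive": ["passage: 1990년에 제정된 북한 가족법은 지금까지 4차례 개정되었다."],
})

# GISTEmbedLoss uses the guide model to filter in-batch false negatives;
# the margin controls how aggressively near-positives are excluded.
loss = losses.GISTEmbedLoss(
    model, guide, temperature=0.01, margin_strategy="absolute", margin=0.1
)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```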
230
+
231
+ ### Training Hyperparameters
232
+ #### Non-Default Hyperparameters
233
+
234
+ - `eval_strategy`: steps
235
+ - `per_device_train_batch_size`: 20000
236
+ - `per_device_eval_batch_size`: 4096
237
+ - `learning_rate`: 0.00025
238
+ - `num_train_epochs`: 3
239
+ - `warmup_ratio`: 0.05
240
+ - `fp16`: True
241
+ - `dataloader_drop_last`: True
242
+ - `batch_sampler`: no_duplicates
243
+
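A sketch of how these non-default values map onto `SentenceTransformerTrainingArguments` (the `output_dir` is a placeholder):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Placeholder output directory; the other values mirror the list above.
args = SentenceTransformerTrainingArguments(
    output_dir="outputs/multilingual-e5-small-ko",
    eval_strategy="steps",
    per_device_train_batch_size=20000,
    per_device_eval_batch_size=4096,
    learning_rate=2.5e-4,
    num_train_epochs=3,
    warmup_ratio=0.05,
    fp16=True,
    dataloader_drop_last=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```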
244
+ #### All Hyperparameters
245
+ <details><summary>Click to expand</summary>
246
+
247
+ - `overwrite_output_dir`: False
248
+ - `do_predict`: False
249
+ - `eval_strategy`: steps
250
+ - `prediction_loss_only`: True
251
+ - `per_device_train_batch_size`: 20000
252
+ - `per_device_eval_batch_size`: 4096
253
+ - `per_gpu_train_batch_size`: None
254
+ - `per_gpu_eval_batch_size`: None
255
+ - `gradient_accumulation_steps`: 1
256
+ - `eval_accumulation_steps`: None
257
+ - `torch_empty_cache_steps`: None
258
+ - `learning_rate`: 0.00025
259
+ - `weight_decay`: 0.0
260
+ - `adam_beta1`: 0.9
261
+ - `adam_beta2`: 0.999
262
+ - `adam_epsilon`: 1e-08
263
+ - `max_grad_norm`: 1.0
264
+ - `num_train_epochs`: 2
265
+ - `max_steps`: -1
266
+ - `lr_scheduler_type`: linear
267
+ - `lr_scheduler_kwargs`: {}
268
+ - `warmup_ratio`: 0.05
269
+ - `warmup_steps`: 0
270
+ - `log_level`: passive
271
+ - `log_level_replica`: warning
272
+ - `log_on_each_node`: True
273
+ - `logging_nan_inf_filter`: True
274
+ - `save_safetensors`: True
275
+ - `save_on_each_node`: False
276
+ - `save_only_model`: False
277
+ - `restore_callback_states_from_checkpoint`: False
278
+ - `no_cuda`: False
279
+ - `use_cpu`: False
280
+ - `use_mps_device`: False
281
+ - `seed`: 42
282
+ - `data_seed`: None
283
+ - `jit_mode_eval`: False
284
+ - `use_ipex`: False
285
+ - `bf16`: False
286
+ - `fp16`: True
287
+ - `fp16_opt_level`: O1
288
+ - `half_precision_backend`: auto
289
+ - `bf16_full_eval`: False
290
+ - `fp16_full_eval`: False
291
+ - `tf32`: None
292
+ - `local_rank`: 0
293
+ - `ddp_backend`: None
294
+ - `tpu_num_cores`: None
295
+ - `tpu_metrics_debug`: False
296
+ - `debug`: []
297
+ - `dataloader_drop_last`: True
298
+ - `dataloader_num_workers`: 0
299
+ - `dataloader_prefetch_factor`: None
300
+ - `past_index`: -1
301
+ - `disable_tqdm`: False
302
+ - `remove_unused_columns`: True
303
+ - `label_names`: None
304
+ - `load_best_model_at_end`: False
305
+ - `ignore_data_skip`: False
306
+ - `fsdp`: []
307
+ - `fsdp_min_num_params`: 0
308
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
309
+ - `tp_size`: 0
310
+ - `fsdp_transformer_layer_cls_to_wrap`: None
311
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
312
+ - `deepspeed`: None
313
+ - `label_smoothing_factor`: 0.0
314
+ - `optim`: adamw_torch
315
+ - `optim_args`: None
316
+ - `adafactor`: False
317
+ - `group_by_length`: False
318
+ - `length_column_name`: length
319
+ - `ddp_find_unused_parameters`: None
320
+ - `ddp_bucket_cap_mb`: None
321
+ - `ddp_broadcast_buffers`: False
322
+ - `dataloader_pin_memory`: True
323
+ - `dataloader_persistent_workers`: False
324
+ - `skip_memory_metrics`: True
325
+ - `use_legacy_prediction_loop`: False
326
+ - `push_to_hub`: False
327
+ - `resume_from_checkpoint`: None
328
+ - `hub_model_id`: None
329
+ - `hub_strategy`: every_save
330
+ - `hub_private_repo`: None
331
+ - `hub_always_push`: False
332
+ - `gradient_checkpointing`: False
333
+ - `gradient_checkpointing_kwargs`: None
334
+ - `include_inputs_for_metrics`: False
335
+ - `include_for_metrics`: []
336
+ - `eval_do_concat_batches`: True
337
+ - `fp16_backend`: auto
338
+ - `push_to_hub_model_id`: None
339
+ - `push_to_hub_organization`: None
340
+ - `mp_parameters`:
341
+ - `auto_find_batch_size`: False
342
+ - `full_determinism`: False
343
+ - `torchdynamo`: None
344
+ - `ray_scope`: last
345
+ - `ddp_timeout`: 1800
346
+ - `torch_compile`: False
347
+ - `torch_compile_backend`: None
348
+ - `torch_compile_mode`: None
349
+ - `include_tokens_per_second`: False
350
+ - `include_num_input_tokens_seen`: False
351
+ - `neftune_noise_alpha`: None
352
+ - `optim_target_modules`: None
353
+ - `batch_eval_metrics`: False
354
+ - `eval_on_start`: False
355
+ - `use_liger_kernel`: False
356
+ - `eval_use_gather_object`: False
357
+ - `average_tokens_across_devices`: False
358
+ - `prompts`: None
359
+ - `batch_sampler`: no_duplicates
360
+ - `multi_dataset_batch_sampler`: proportional
361
+
362
+ </details>
363
+
364
+
365
+ ### Framework Versions
366
+ - Python: 3.11.10
367
+ - Sentence Transformers: 4.1.0
368
+ - Transformers: 4.51.3
369
+ - PyTorch: 2.7.0+cu126
370
+ - Accelerate: 1.6.0
371
+ - Datasets: 3.5.1
372
+ - Tokenizers: 0.21.1
373
+
374
+ ## FAQ
375
+ **1. Do I need to add the prefix "query: " and "passage: " to input texts?**
376
+
377
+ Yes. The model was trained with these prefixes, and omitting them will degrade performance.
378
+
379
+ Here are some rules of thumb:
380
+
381
+ - Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
382
+
383
+ - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
384
+
385
+ - Use the "query: " prefix if you want to use embeddings as features, e.g. for linear-probing classification or clustering.
386
+
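A small illustration of these rules (the example sentences are arbitrary):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Asymmetric retrieval: "query: " for the question, "passage: " for the documents.
query_emb = model.encode(["query: 환경마크 제도의 법적 근거는?"])
passage_emb = model.encode(["passage: 「환경기술 및 환경산업 지원법」 제17조(환경표지의 인증)"])
print(model.similarity(query_emb, passage_emb))

# Symmetric tasks (similarity, clustering, linear probing): "query: " on both sides.
sent_embs = model.encode(["query: 오늘 날씨가 맑다", "query: 오늘은 하늘이 맑고 화창하다"])
print(model.similarity(sent_embs, sent_embs))
```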
387
+ **2. Why does the cosine similarity scores distribute around 0.7 to 1.0?**
388
+
389
+ This is known and expected behavior, since we use a low temperature (0.01) for the InfoNCE contrastive loss.
390
+
391
+ For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.
392
+
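For reference, a standard InfoNCE formulation with temperature τ, where q is the query embedding, p⁺ the positive passage, and the sum runs over the in-batch candidates:

$$
\mathcal{L} = -\log \frac{\exp(\cos(q, p^{+})/\tau)}{\sum_{i} \exp(\cos(q, p_i)/\tau)}, \qquad \tau = 0.01
$$

With τ = 0.01 the softmax only needs the positive to outrank the negatives, so the absolute cosine values tend to cluster in a narrow high range while their relative order stays meaningful.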
393
+ ## Citation
394
+
395
+ ### BibTeX
396
+
397
+ #### Sentence Transformers
398
+ ```bibtex
399
+ @inproceedings{reimers-2019-sentence-bert,
400
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
401
+ author = "Reimers, Nils and Gurevych, Iryna",
402
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
403
+ month = "11",
404
+ year = "2019",
405
+ publisher = "Association for Computational Linguistics",
406
+ url = "https://arxiv.org/abs/1908.10084",
407
+ }
408
+ ```
409
+
410
+ #### Base Model
411
+ ```bibtex
412
+ @article{wang2024multilingual,
413
+ title={Multilingual E5 Text Embeddings: A Technical Report},
414
+ author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
415
+ journal={arXiv preprint arXiv:2402.05672},
416
+ year={2024}
417
+ }
418
+ ```
419
+ #### NV-Retriever: Improving text embedding models with effective hard-negative mining
420
+ ```bibtex
421
+ @article{moreira2024nvretriever,
422
+ title = {NV-Retriever: Improving text embedding models with effective hard-negative mining},
423
+ author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
424
+ journal = {arXiv preprint arXiv:2407.15831},
425
+ year = {2024},
426
+ url = {https://arxiv.org/abs/2407.15831},
427
+ doi = {10.48550/arXiv.2407.15831}
428
+ }
429
+ ```
430
+
431
+ #### KURE
432
+ ```bibtex
433
+ @misc{KURE,
434
+ publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
435
+ year = {2024},
436
+ url = {https://github.com/nlpai-lab/KURE}
437
+ }
438
+ ```
439
+
440
+ ## Limitations
441
+
442
+ Long texts will be truncated to at most 512 tokens.
443
+
444
+ ## Acknowledgements
445
+ Special thanks to lemon-mint for their valuable contributions to optimizing and compressing this model.
446
+
447
+ <!--
448
+ ## Glossary
449
+
450
+ *Clearly define terms in order to be accessible across audiences.*
451
+ -->
452
+
453
+ <!--
454
+ ## Model Card Authors
455
+
456
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
457
+ -->
458
+
459
+ <!--
460
+ ## Model Card Contact
461
+
462
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
463
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "architectures": [
3
+ "BertModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "classifier_dropout": null,
7
+ "hidden_act": "gelu",
8
+ "hidden_dropout_prob": 0.1,
9
+ "hidden_size": 384,
10
+ "initializer_range": 0.02,
11
+ "intermediate_size": 1536,
12
+ "layer_norm_eps": 1e-12,
13
+ "max_position_embeddings": 512,
14
+ "model_type": "bert",
15
+ "num_attention_heads": 12,
16
+ "num_hidden_layers": 12,
17
+ "pad_token_id": 0,
18
+ "position_embedding_type": "absolute",
19
+ "tokenizer_class": "XLMRobertaTokenizer",
20
+ "torch_dtype": "float32",
21
+ "transformers_version": "4.52.4",
22
+ "type_vocab_size": 2,
23
+ "use_cache": true,
24
+ "vocab_size": 250037
25
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "4.1.0",
4
+ "transformers": "4.52.4",
5
+ "pytorch": "2.8.0.dev20250319+cu128"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "cosine"
10
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9794f247caf80caf54b5a04d391c932ba6260cf20bd16fba3892ce2d8f5784ec
3
+ size 470637416
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 512,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cd98e5698b201ba914efb8c18b6709fa8735ab71dcad8d2b431e52e8bf68d932
3
+ size 17082800
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "extra_special_tokens": {},
49
+ "mask_token": "<mask>",
50
+ "model_max_length": 512,
51
+ "pad_token": "<pad>",
52
+ "sep_token": "</s>",
53
+ "sp_model_kwargs": {},
54
+ "tokenizer_class": "XLMRobertaTokenizerFast",
55
+ "unk_token": "<unk>"
56
+ }