Update README.md
<h1 align="center">GIST Embedding v0</h1>

*GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*

The model is fine-tuned on top of [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction for generating embeddings, so queries for retrieval tasks can be encoded directly without crafting instructions.

Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)
# Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled data, pinned to the exact revision used to train the model, is available:

- Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`).
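For example, the compiled triplets can be loaded at the pinned revision with the `datasets` library and filtered down to the mined MTEB classification examples. A minimal sketch; the `train` split name is an assumption, not taken from the card:

```Python
from datasets import load_dataset

# Load the compiled MEDI + MTEB classification triplets at the pinned revision.
ds = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",  # assumed split name
)

# Keep only the mined MTEB classification triplets (task_type prefixed with "mteb_").
mteb_ds = ds.filter(lambda example: example["task_type"].startswith("mteb_"))
print(len(ds), len(mteb_ds))
```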
The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).

The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, resulting in significant improvements on certain tasks while degrading performance on others.

The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that the thematic coverage of the fine-tuning data can affect downstream performance.
# Usage

The model can be easily loaded using the Sentence Transformers library.

```Python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
```
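Continuing from the loaded `model` above, a minimal sketch of encoding texts and comparing them with cosine similarity; the example texts are illustrative, not from the original card:

```Python
texts = [
    "Illustrative example of a query",
    "Illustrative example of a matching passage",
    "Illustrative example of an unrelated passage",
]

# Compute embeddings as a torch tensor.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarities between all encoded texts.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
```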
# Training Parameters

```
Checkpoint step = 103500
Contrastive loss temperature = 0.01
```
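The temperature above scales the similarity scores inside the contrastive objective: the smaller the temperature, the more the loss concentrates on the hardest negatives. For reference, a generic temperature-scaled InfoNCE loss takes the form below; this is the standard formulation, not necessarily the paper's exact loss:

```latex
% Generic temperature-scaled contrastive (InfoNCE) objective.
% sim(q, d) is typically the cosine similarity between embeddings;
% q is a query, p^+ its positive, and N a set of negatives.
\mathcal{L}(q) = -\log
  \frac{\exp\bigl(\mathrm{sim}(q, p^{+}) / \tau\bigr)}
       {\exp\bigl(\mathrm{sim}(q, p^{+}) / \tau\bigr)
        + \sum_{n \in \mathcal{N}} \exp\bigl(\mathrm{sim}(q, n) / \tau\bigr)},
\qquad \tau = 0.01
```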
# Evaluation

The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
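A minimal sketch of running a single task with the `mteb` package; the task selection and output folder are illustrative:

```Python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

# Run one benchmark task; the full suite spans many tasks and task types.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-Embedding-v0")
print(results)
```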
# Citation

Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗

```
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
# Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.