soyuj/llama2-doc2query

information retrieval

document expansion


soyuj committed on May 30, 2024

Commit d104f07 · verified · 1 Parent(s): b73d7a4

Update README.md

Files changed (1)

README.md +46 -3

README.md CHANGED

@@ -1,3 +1,46 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+tags:
+- information retrieval
+- llama2
+- document expansion
+- LoRA
+---
+This repository contains the LoRA weights obtained by fine-tuning the pre-trained Llama 2 7B model for document expansion, for use with [DeeperImpact](https://arxiv.org/abs/2405.17093).
+We fine-tune on the same dataset as DocT5Query, i.e., 532k document-query pairs from the MSMARCO Passage Qrels Train Dataset.
+Please refer to the following notebook in the GitHub repository to learn how to use the weights for document expansion: [inference_deeper_impact.ipynb](https://github.com/basnetsoyuj/improving-learned-index/blob/master/inference_deeper_impact.ipynb)
+You can also clone the [DeeperImpact repo](https://github.com/basnetsoyuj/improving-learned-index/blob/master) and run expansions on a collection of documents using the following command:
+```
+python -m src.llama2.generate \
+    --llama_path <path | HuggingFaceHub link> \
+    --collection_path <path> \
+    --collection_type [msmarco | beir] \
+    --output_path <path> \
+    --batch_size <batch_size> \
+    --max_tokens 512 \
+    --num_return_sequences 80 \
+    --max_new_tokens 50 \
+    --top_k 50 \
+    --top_p 0.95 \
+    --peft_path soyuj/llama2-doc2query
+```
+This generates a JSONL file with expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:
+```
+python -m src.llama2.merge \
+  --collection_path <path> \
+  --collection_type [msmarco | beir] \
+  --queries_path <jsonl file generated above> \
+  --output_path <path>
+```
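
The README does not spell out what "appending the unique expansion terms" involves, so the following is only a minimal sketch of one plausible reading, not the actual `src.llama2.merge` implementation. It assumes each JSONL line holds a hypothetical `doc_id` field and a hypothetical `queries` list, and that terms are lowercased word tokens; the real field names and tokenization may differ.

```python
import json
import re

# Assumption: a "term" is a run of word characters, compared case-insensitively.
TOKEN_RE = re.compile(r"\w+")


def unique_expansion_terms(document, queries):
    """Return terms from the generated queries that the document lacks.

    Terms are returned lowercased, in first-seen order, without duplicates.
    """
    seen = {t.lower() for t in TOKEN_RE.findall(document)}
    new_terms = []
    for query in queries:
        for term in TOKEN_RE.findall(query):
            key = term.lower()
            if key not in seen:
                seen.add(key)
                new_terms.append(key)
    return new_terms


def expand_document(document, queries):
    """Append the unique expansion terms to the end of the original text."""
    terms = unique_expansion_terms(document, queries)
    if not terms:
        return document
    return document + " " + " ".join(terms)


def merge_jsonl_line(line, collection):
    """Expand one document from one JSONL line.

    `collection` maps document ids to original text. The `doc_id` and
    `queries` field names are hypothetical, chosen for illustration only.
    """
    record = json.loads(line)
    doc_id = record["doc_id"]
    return doc_id, expand_document(collection[doc_id], record["queries"])
```

For example, `expand_document("The cat sat.", ["where did the cat sit"])` keeps the original passage and appends only `where did sit`, since `the` and `cat` already occur in the text.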