SentenceTransformer based on thenlper/gte-large

This is a sentence-transformers model finetuned from thenlper/gte-large. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: thenlper/gte-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
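
The three modules correspond to contextual token embeddings from the BERT encoder (0), mean pooling over non-padding tokens (1), and L2 normalization of the pooled vector (2). Below is a minimal sketch of the same pipeline using 🤗 Transformers directly; the repository id is taken from the usage snippet in the next section, and the Sentence Transformers API shown there remains the recommended way to run the model.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

repo_id = "JFernandoGRE/gtelarge-colombian-elitenames2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

encoded = tokenizer(
    ["JOSE ALBERTO SOTELO PAZ"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 1024)

# Mean pooling over non-padding tokens, then L2 normalization (modules (1) and (2))
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)  # (batch, 1024)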

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames2")
# Run inference
sentences = [
    'JOSE ALBERTO SOTELO PAZ',
    'JOLBERTOSOTELO PAZ',
    'CESAR AUGUSTO ARANGUREN LONDONO',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
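
Because the model is tuned on pairs of name strings labeled as referring to the same person or not (see Training Details below), one natural use is matching a possibly misspelled query name against a registry of canonical names and accepting the best hit above a cosine-similarity threshold. A minimal sketch, where the registry entries and the 0.9 threshold are purely illustrative and not values reported in this card:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames2")

# Illustrative registry of canonical names
registry = [
    "JOSE ALBERTO SOTELO PAZ",
    "CESAR AUGUSTO ARANGUREN LONDONO",
    "GLADIS MARINA GIRALDO GOMEZ",
]
registry_embeddings = model.encode(registry)

# A possibly misspelled query name
query_embedding = model.encode(["JOLBERTOSOTELO PAZ"])

# Cosine similarities between the query and every registry entry
scores = model.similarity(query_embedding, registry_embeddings)  # shape [1, 3]

best_idx = int(scores.argmax())
if float(scores[0, best_idx]) >= 0.9:  # illustrative threshold
    print("Matched:", registry[best_idx])
else:
    print("No confident match")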

Training Details

Training Dataset

Unnamed Dataset

  • Size: 21,959 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1: string; min: 4 tokens, mean: 8.22 tokens, max: 17 tokens
    • sentence2: string; min: 4 tokens, mean: 8.58 tokens, max: 16 tokens
    • label: int; 0: ~78.10%, 1: ~21.90%
  • Samples (sentence1 | sentence2 | label):
    • ABDON EDUARDO ESPINOSA GUTIERREZ | LUISPEREZ GUTIERREZ | 0
    • JOSE GUSTAVO BARBOSA COBO | JOSE GUSTAVO BARBOZA COBO | 1
    • LUZ MILA MORELLI SOCARRAS | LUZMILA MORELLI SOCARRAS | 1
  • Loss: OnlineContrastiveLoss

Evaluation Dataset

Unnamed Dataset

  • Size: 5,490 evaluation samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1: string; min: 4 tokens, mean: 8.17 tokens, max: 18 tokens
    • sentence2: string; min: 4 tokens, mean: 8.63 tokens, max: 14 tokens
    • label: int; 0: ~78.10%, 1: ~21.90%
  • Samples (sentence1 | sentence2 | label):
    • GLADIS MARINA GIRALDO GOMEZ | GLADIS GIRALDO GOMEZ | 1
    • CARLOS ANDRES PEREZ FERRER | CARLOS ANDRES PEREZ GUEERERO | 0
    • ALEXANDER PINEDA BONILLA EN REORGANIZACION | LUIS FELIPE MENDOSA BONILLA | 0
  • Loss: OnlineContrastiveLoss
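
Both splits pair a raw name string with a candidate match and a binary label (1 = same person, 0 = different). A minimal sketch of how pairs in this format can be wired to OnlineContrastiveLoss with the Sentence Transformers training API; the rows are copied from the sample tables above, and building the full 21,959-pair training and 5,490-pair evaluation splits is omitted:

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import OnlineContrastiveLoss

model = SentenceTransformer("thenlper/gte-large")

# Columns must be named sentence1, sentence2, label as in the dataset description
train_dataset = Dataset.from_dict({
    "sentence1": ["JOSE GUSTAVO BARBOSA COBO", "ABDON EDUARDO ESPINOSA GUTIERREZ"],
    "sentence2": ["JOSE GUSTAVO BARBOZA COBO", "LUISPEREZ GUTIERREZ"],
    "label": [1, 0],
})
eval_dataset = Dataset.from_dict({
    "sentence1": ["GLADIS MARINA GIRALDO GOMEZ", "CARLOS ANDRES PEREZ FERRER"],
    "sentence2": ["GLADIS GIRALDO GOMEZ", "CARLOS ANDRES PEREZ GUEERERO"],
    "label": [1, 0],
})

# OnlineContrastiveLoss keeps only the hard positive and hard negative pairs in each batch
loss = OnlineContrastiveLoss(model)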

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 1e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.182
  • fp16: True
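
A minimal sketch mapping these values onto SentenceTransformerTrainingArguments and SentenceTransformerTrainer, continuing the dataset and loss sketch above (output_dir is an illustrative placeholder):

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # illustrative placeholder
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    warmup_ratio=0.182,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,                  # from the dataset/loss sketch above
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()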

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.182
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.0728 100 0.2313 0.2908
0.1457 200 0.2094 0.2922
0.2185 300 0.1951 0.3055
0.2913 400 0.1721 0.3221
0.3642 500 0.1275 0.2690
0.4370 600 0.1477 0.2556
0.5098 700 0.1415 0.2106
0.5827 800 0.1074 0.1935
0.6555 900 0.1195 0.2059
0.7283 1000 0.1259 0.1856
0.8012 1100 0.1129 0.1640
0.8740 1200 0.1094 0.1834
0.9468 1300 0.1119 0.1672
1.0197 1400 0.1247 0.1946
1.0925 1500 0.0735 0.1717
1.1653 1600 0.0836 0.1589
1.2382 1700 0.0871 0.1595
1.3110 1800 0.089 0.1609
1.3838 1900 0.0926 0.1723
1.4567 2000 0.086 0.1553
1.5295 2100 0.087 0.1591
1.6023 2200 0.0935 0.1617
1.6752 2300 0.0969 0.1510
1.7480 2400 0.1021 0.1436
1.8208 2500 0.0729 0.1431
1.8937 2600 0.0951 0.1398
1.9665 2700 0.0996 0.1357
2.0393 2800 0.0596 0.1454
2.1122 2900 0.0594 0.1365
2.1850 3000 0.0747 0.1325
2.2578 3100 0.0547 0.1378
2.3307 3200 0.0511 0.1326
2.4035 3300 0.0467 0.1307
2.4763 3400 0.0478 0.1327
2.5492 3500 0.0497 0.1281
2.6220 3600 0.0712 0.1264
2.6948 3700 0.0652 0.1375
2.7677 3800 0.0582 0.1308
2.8405 3900 0.0571 0.1322
2.9133 4000 0.0609 0.1282
2.9862 4100 0.0516 0.1156
3.0590 4200 0.0528 0.1229
3.1318 4300 0.0359 0.1125
3.2047 4400 0.0313 0.1206
3.2775 4500 0.0418 0.1225
3.3503 4600 0.0552 0.1218
3.4232 4700 0.0445 0.1244
3.4960 4800 0.048 0.1261
3.5688 4900 0.0425 0.1278
3.6417 5000 0.0365 0.1289
3.7145 5100 0.0587 0.1291
3.7873 5200 0.0536 0.1269
3.8602 5300 0.0384 0.1272
3.9330 5400 0.0448 0.1211
4.0058 5500 0.0466 0.1214
4.0787 5600 0.0329 0.1193
4.1515 5700 0.0306 0.1169
4.2243 5800 0.0463 0.1186
4.2972 5900 0.0322 0.1210
4.3700 6000 0.0298 0.1204
4.4428 6100 0.034 0.1192
4.5157 6200 0.0261 0.1182
4.5885 6300 0.033 0.1168
4.6613 6400 0.0394 0.1162
4.7342 6500 0.0342 0.1169
4.8070 6600 0.0295 0.1161
4.8798 6700 0.0272 0.1164
4.9527 6800 0.0333 0.1161

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.6.0
  • Datasets: 2.14.4
  • Tokenizers: 0.21.1
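
To approximate this environment, the library versions listed above can be pinned explicitly. This is a sketch; the exact PyTorch wheel (2.7.0+cu126 here) depends on the local CUDA installation.

pip install "sentence-transformers==4.1.0" "transformers==4.51.3" "accelerate==1.6.0" "datasets==2.14.4" "tokenizers==0.21.1" "torch==2.7.0"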

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}