ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer

This is a Chem-MRL (sentence-transformers) model finetuned from Derify/ModChemBERT-IR-BASE on the pubchem_10m_genmol_similarity dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity search, semantic search, database indexing, molecular classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModChemBertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
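
The Pooling module mean-pools the token embeddings and the Normalize module L2-normalizes the result. Below is a minimal sketch of the equivalent computation with plain transformers, assuming the checkpoint loads through AutoModel/AutoTokenizer with trust_remote_code and exposes a standard last_hidden_state:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemMRL", trust_remote_code=True)
model = AutoModel.from_pretrained("Derify/ChemMRL", trust_remote_code=True)

smiles = ["OCCCc1cc(F)cc(F)c1", "Fc1cc(F)cc(-n2cc[o+]n2)c1"]
batch = tokenizer(smiles, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 1024)

# Pooling: mean over tokens, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Normalize: L2-normalize so dot products equal cosine similarities.
embeddings = F.normalize(pooled, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])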

Usage

Direct Usage (Chem-MRL)

First install the Chem-MRL library:

pip install -U "chem-mrl>=0.7.3"

Then you can load this model and run inference.

from chem_mrl import ChemMRL

# Download from the 🤗 Hub
model = ChemMRL(
    "Derify/ChemMRL",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3876, 0.0078],
#         [0.3876, 1.0000, 0.0028],
#         [0.0078, 0.0028, 1.0000]])
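
Because the model was trained with Matryoshka representation learning (dimensions 1024 down to 8, see Training Details), the leading dimensions of an embedding form a usable lower-dimensional representation. A minimal sketch of truncating and re-normalizing the embeddings from the example above for cheaper storage or search, assuming encode returned a NumPy array (the default):

import numpy as np

# Keep only the first 256 Matryoshka dimensions and re-normalize,
# so dot products remain cosine similarities.
dim = 256
truncated = embeddings[:, :dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)
# (3, 256)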

Direct Usage (Sentence Transformers)


First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer(
    "Derify/ChemMRL",
    # SentenceTransformer doesn't natively support Tanimoto similarity, so we set a different similarity function here
    similarity_fn_name="cosine",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5587, 0.0155],
#         [0.5587, 1.0000, 0.0055],
#         [0.0155, 0.0055, 1.0000]])
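
The Chem-MRL wrapper scores pairs with a Tanimoto similarity on the embeddings, which SentenceTransformer does not expose natively. A minimal sketch of a continuous Tanimoto score, dot(a, b) / (|a|² + |b|² − dot(a, b)), computed pairwise over the embeddings from the example above; applied to the cosine scores shown here it reproduces the values printed in the Chem-MRL example (e.g. 0.5587 / (2 − 0.5587) ≈ 0.3876):

import numpy as np

def tanimoto_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise continuous Tanimoto: <a, b> / (|a|^2 + |b|^2 - <a, b>)."""
    dot = a @ b.T
    a_sq = (a * a).sum(axis=1, keepdims=True)    # column vector of |a|^2
    b_sq = (b * b).sum(axis=1, keepdims=True).T  # row vector of |b|^2
    return dot / (a_sq + b_sq - dot)

print(tanimoto_similarity(embeddings, embeddings))
# Diagonal entries are 1.0 because the embeddings are L2-normalized.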

Evaluation

Metrics

Semantic Similarity

  • Dataset: pubchem_10m_genmol_similarity
  • Evaluated with chem_mrl.evaluation.EmbeddingSimilarityEvaluator with these parameters:
    {
        "precision": "float32"
    }
    
Split        Metric    Value
validation   spearman  0.98914
test         spearman  0.98916
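
The reported metric is the Spearman rank correlation between the model's pairwise similarity scores and the dataset's Tanimoto labels. A minimal sketch of that computation on a few pairs taken from the evaluation split (the chem_mrl evaluator additionally handles batching, precision, and Tanimoto scoring, so details may differ):

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Derify/ChemMRL", trust_remote_code=True)

# A few illustrative SMILES pairs with their Tanimoto labels.
smiles_a = ["N#CCCN(Cc1cnc(N)cn1)C1CC1"] * 3
smiles_b = [
    "N#CCCN(Cc1cnc(N)cn1)C1CCCC1",
    "N#CCCN(Cc1cnc(N)cn1)C1CCOCC1",
    "N#CCCN(Cc1cnc(N)cn1)CC(F)F",
]
labels = [0.86, 0.80, 0.55]

emb_a = model.encode(smiles_a)
emb_b = model.encode(smiles_b)
# Row-wise cosine similarity; the Normalize module already L2-normalizes the embeddings.
scores = (emb_a * emb_b).sum(axis=1)

correlation, _ = spearmanr(scores, labels)
print(correlation)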

Training Details

Training Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 19,381,001 training samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
              smiles_a        smiles_b        label
    type      string          string          float
    min       17 tokens       11 tokens       0.02
    mean      42.36 tokens    40.93 tokens    0.56
    max       122 tokens      122 tokens      1.0
  • Samples:
    smiles_a: COc1ccc(NC(=O)C2CC[NH+](C(C)C(=O)Nc3ccc(C(=O)Nc4ccc(F)c(F)c4)cc3C)CC2)cc1NC(=O)C1CCCCC1
      smiles_b: Cc1cc(C(=O)Nc2ccc(F)c(F)c2)ccc1NC(=O)C(C)[NH+]1CCC(C(=O)Nc2cccc(NC(=O)C3CCCCC3)c2)CC1
      label: 0.8495575189590454
    smiles_a: OCCN1CC[NH+](Cc2ccccc2OC2CC2)CC1
      smiles_b: OCCN1CC[NH+](Cc2ccccc2On2cccn2)CC1
      label: 0.6615384817123413
    smiles_a: CC1CN(C(=O)C2CC[NH+](Cc3cccc(C(N)=O)c3)CC2)CC(C)O1
      smiles_b: CC1CN(C(=O)C2CC[NH+](Cc3ccccc3)CC2)CC(C)O1
      label: 0.7123287916183472
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.0,
        "kl_temperature": 0.0,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
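
The same loss configuration can be expressed with the sentence-transformers loss API. The minimal sketch below uses the built-in CoSENTLoss as a stand-in for TanimotoSentLoss (which is provided by the chem-mrl package); the keyword arguments mirror the parameters listed above:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, Matryoshka2dLoss

model = SentenceTransformer("Derify/ModChemBERT-IR-BASE", trust_remote_code=True)

# TanimotoSentLoss from chem-mrl was the actual inner loss during training;
# CoSENTLoss is used here only as an illustrative stand-in.
inner_loss = CoSENTLoss(model)

loss = Matryoshka2dLoss(
    model,
    inner_loss,
    matryoshka_dims=[1024, 512, 256, 128, 64, 32, 16, 8],
    matryoshka_weights=[1, 1, 1, 1, 1, 1, 1, 1],
    n_layers_per_step=-1,   # -1 = use all transformer layers each step
    n_dims_per_step=-1,     # -1 = use all Matryoshka dimensions each step
    last_layer_weight=2.0,
    prior_layers_weight=1.0,
    kl_div_weight=0.0,
    kl_temperature=0.0,
)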
    

Evaluation Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 1,080,394 evaluation samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
              smiles_a        smiles_b        label
    type      string          string          float
    min       16 tokens       11 tokens       0.0
    mean      42.05 tokens    40.23 tokens    0.57
    max       101 tokens      104 tokens      1.0
  • Samples:
    smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)C1CCCC1
      label: 0.8600000143051147
    smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)C1CCOCC1
      label: 0.7962962985038757
    smiles_a: N#CCCN(Cc1cnc(N)cn1)C1CC1
      smiles_b: N#CCCN(Cc1cnc(N)cn1)CC(F)F
      label: 0.5517241358757019
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.0,
        "kl_temperature": 0.0,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • max_grad_norm: None
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_steps: 100943
  • data_seed: 42
  • bf16: True
  • bf16_full_eval: True
  • tf32: True
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • gradient_checkpointing: True
  • eval_on_start: True
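
A minimal sketch of wiring these values into SentenceTransformerTrainingArguments; the output_dir is illustrative, all other values come from the list above:

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/chem-mrl",  # illustrative path
    eval_strategy="steps",
    per_device_train_batch_size=192,
    per_device_eval_batch_size=512,
    learning_rate=8e-6,
    weight_decay=1e-5,
    max_grad_norm=None,  # disables gradient clipping, as listed above
    lr_scheduler_type="warmup_stable_decay",
    lr_scheduler_kwargs={
        "num_decay_steps": 100943,
        "warmup_type": "linear",
        "decay_type": "1-sqrt",
    },
    warmup_steps=100943,
    data_seed=42,
    bf16=True,
    bf16_full_eval=True,
    tf32=True,
    optim="stable_adamw",
    optim_args="decouple_lr=True,max_lr=8.0e-6",
    gradient_checkpointing=True,
    eval_on_start=True,
)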

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: None
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_ratio: 0.0
  • warmup_steps: 100943
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: 42
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: True
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch     Step      Training Loss    Validation Loss (pubchem_10m_genmol_similarity)    Validation Spearman
0 0 - 297.6136 0.7261
0.0000 1 244.6862 - -
0.2477 25000 161.5037 - -
0.2500 25235 - 195.4624 0.9067
0.4978 50250 155.7822 - -
0.5000 50470 - 189.4068 0.9655
0.7479 75500 152.7915 - -
0.7500 75705 - 186.3661 0.9780
0.9981 100750 151.0411 - -
1.0000 100940 - 184.6362 0.9829
1.2482 126000 149.8544 - -
1.2500 126175 - 183.5648 0.9855
1.4984 151250 149.2916 - -
1.5000 151410 - 182.8947 0.9868
1.7485 176500 148.7942 - -
1.7499 176645 - 182.3662 0.9879
1.9987 201750 148.3459 - -
1.9999 201880 - 181.9855 0.9885
2.2488 227000 148.0316 - -
2.2499 227115 - 181.7683 0.9889
2.4989 252250 147.8658 - -
2.4999 252350 - 181.6711 0.9890
2.7491 277500 147.9642 - -
2.7499 277585 - 181.6077 0.9891
2.9992 302750 147.8874 - -
2.9999 302820 - 181.6066 0.9891
3.0000 302829 - - 0.98914

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 30.936 kWh
  • Carbon Emitted: 6.350 kg of CO2
  • Hours Used: 116.388 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: AMD Ryzen 7 3700X 8-Core Processor
  • RAM Size: 62.70 GB

Framework Versions

  • Python: 3.13.7
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 4.3.0
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Matryoshka2dLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings},
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}

TanimotoSentLoss

@online{cortes-2025-tanimotosentloss,
    title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
    author={Emmanuel Cortes},
    year={2025},
    month={Jan},
    url={https://github.com/emapco/chem-mrl},
}

Model Card Authors

@eacortes

Model Card Contact

Manny Cortes ([email protected])
