CliSciBERT πŸŒΏπŸ“š

CliSciBERT is a domain-adapted version of SciBERT, further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.

πŸ” Overview

  • Base Model: SciBERT (BERT-base architecture, scientific vocab)
  • Pretraining Method: Continued pretraining (domain adaptation) using Masked Language Modeling (MLM); see the sketch after this list
  • Corpus: Scientific papers focused on climate change and environmental science
  • Tokenizer: SciBERT tokenizer (unchanged)
  • Language: English
  • Domain: Climate change research
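
The continued pretraining uses the standard masked language modeling objective on the climate corpus. A minimal sketch of how such domain adaptation can be run with the Transformers Trainer is shown below; the corpus file, hyperparameters, and training setup are illustrative assumptions, not the authors' actual configuration.

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

# Start from SciBERT and keep its tokenizer unchanged, as noted above
base = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical plain-text corpus of climate research papers, one document per line
dataset = load_dataset("text", data_files={"train": "climate_papers.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# The MLM collator randomly masks 15% of tokens for the denoising objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Illustrative hyperparameters only
args = TrainingArguments(output_dir="cliscibert", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()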

πŸ“Š Performance

Evaluated on ClimaBench, a benchmark for climate-focused NLP tasks:

Metric            Value
Macro F1 (avg)    60.50
Tasks won         0/7
Avg. Std Dev      0.01772

Note: CliSciBERT builds on SciBERT’s scientific grounding, and the continued pretraining specializes it for climate-related scientific text; on ClimaBench, however, it did not achieve the best score on any individual task.
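
For context, the Macro F1 (avg) and Avg. Std Dev figures are aggregates of per-task macro F1 scores obtained with multiple random seeds. The sketch below shows one plausible way to compute such aggregates; the task names, predictions, and number of seeds are made up for illustration.

import numpy as np
from sklearn.metrics import f1_score

# Hypothetical results: {task: [(y_true, y_pred) for each seed]}
results = {
    "task_a": [([0, 1, 1, 0], [0, 1, 0, 0]), ([0, 1, 1, 0], [0, 1, 1, 0])],
    "task_b": [([2, 0, 1, 2], [2, 0, 1, 1]), ([2, 0, 1, 2], [2, 1, 1, 2])],
}

per_task_means = []
for task, runs in results.items():
    seed_scores = [f1_score(y, y_hat, average="macro") for y, y_hat in runs]
    per_task_means.append(np.mean(seed_scores))
    print(f"{task}: {np.mean(seed_scores):.4f} ± {np.std(seed_scores):.4f}")

# Average the per-task means to get a single benchmark-level figure
print(f"Macro F1 (avg): {100 * np.mean(per_task_means):.2f}")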

Climate performance model card:

                                   CliSciBERT
1. Model publicly available?       Yes
2. Time to train final model       463 h
3. Time for all experiments        1,226 h (~51 days)
4. Power of GPU and CPU            0.250 kW + 0.013 kW
5. Location for computations       Croatia
6. Energy mix at location          224.71 gCO2eq/kWh
7. CO2eq for final model           28 kg CO2
8. CO2eq for all experiments       74 kg CO2
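
The CO2eq figures above roughly follow from energy use multiplied by the local carbon intensity (power × runtime × gCO2eq/kWh). The sketch below reproduces that back-of-the-envelope estimate; it is illustrative and may not match the exact methodology behind the reported numbers.

# Back-of-the-envelope check of the CO2eq figures in the card above
power_kw = 0.250 + 0.013      # GPU + CPU power draw
carbon_intensity = 224.71     # gCO2eq/kWh, Croatian energy mix

for label, hours in [("final model", 463), ("all experiments", 1226)]:
    energy_kwh = power_kw * hours
    co2_kg = energy_kwh * carbon_intensity / 1000
    print(f"{label}: {energy_kwh:.0f} kWh -> ~{co2_kg:.0f} kg CO2eq")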

πŸ§ͺ Intended Uses

Use for:

  • Scientific text classification and relation extraction in climate change literature (see the fine-tuning sketch after the example below)
  • Domain-specific document tagging or summarization
  • Supporting knowledge graph population for climate research

Not recommended for:

  • Non-climate or general news content
  • Non-English corpora
  • Highly informal or colloquial text

Example:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/cliscibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Select the GPU for the pipeline if one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10*">")
for p in predictions:
    print(f"{p['sequence']} β€” {p['score']:.4f}")

Output:

The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth. β€” 0.3911
the increase in greenhouse gas ... affected the radiative balance of the earth. β€” 0.2640
the increase in greenhouse gas ... affected the radiation balance of the earth. β€” 0.1233
the increase in greenhouse gas ... affected the carbon balance of the earth. β€” 0.0589
the increase in greenhouse gas ... affected the ecological balance of the earth. β€” 0.0332
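
Beyond masked-token prediction, the downstream uses listed under Intended Uses (e.g., climate text classification) require fine-tuning the encoder with a task-specific head. A minimal sketch follows, assuming a hypothetical binary climate-relevance task; the texts and labels are illustrative and not taken from ClimaBench.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the encoder with a freshly initialized classification head
model_name = "P0L3/cliscibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Illustrative inputs for a hypothetical climate-relevance task
texts = [
    "Rising sea surface temperatures intensify tropical cyclone activity.",
    "The quarterly earnings report exceeded analyst expectations.",
]
labels = torch.tensor([1, 0])

# A single training step; in practice use a Trainer or a full training loop
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(f"loss: {outputs.loss.item():.4f}")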

⚠️ Limitations

  • Retains SciBERT’s limitations outside the scientific domain
  • May inherit biases from climate change literature
  • No tokenizer retraining β€” tokenization optimized for general science, not climate-specific vocabulary
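
The last point is easy to observe directly: climate-specific terms missing from SciBERT’s general scientific vocabulary are split into several word pieces. A minimal check is shown below; the exact splits may differ from the comment.

from transformers import AutoTokenizer

# The unchanged SciBERT vocabulary splits some climate-specific terms into subwords
tokenizer = AutoTokenizer.from_pretrained("P0L3/cliscibert_scivocab_uncased")

for term in ["permafrost", "decarbonization", "thermohaline circulation"]:
    print(term, "->", tokenizer.tokenize(term))
# Out-of-vocabulary terms come back as multiple word pieces,
# e.g. something like ['perma', '##fr', '##ost'] (illustrative split).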

🧾 Citation

If you use this model, please cite:

@Article{Poleksić2025,
  author={Poleksi{\'{c}}, Andrija
  and Martin{\v{c}}i{\'{c}}-Ip{\v{s}}i{\'{c}}, Sanda},
  title={Pretraining and evaluation of BERT models for climate research},
  journal={Discover Applied Sciences},
  year={2025},
  month={Oct},
  day={24},
  volume={7},
  number={11},
  pages={1278},
  abstract={Motivated by the pressing issue of climate change and the growing volume of data, we pretrain three new language models using climate change research papers published in top-tier journals. Adaptation of existing domain-specific models based on Bidirectional Encoder Representations from Transformers (BERT) architecture is utilized for CliSciBERT (domain adaptation of SciBERT) and SciClimateBERT (domain adaptation of ClimateBERT) and pretraining from scratch resulted in CliReBERT (Climate Research BERT). The performance assessment is performed on the climate change NLP text classification benchmark ClimaBench. We evaluate SciBERT, ClimateBERT, BERT, RoBERTa and DistilRoBERTa - along with our new models - CliReBERT, CliSciBERT and SciClimateBERT - using five different random seeds on all seven ClimaBench datasets. CliReBERT achieves the highest overall performance with a macro-averaged F1 score of 65.45{\%}, and performs better than other models on three out of seven tasks. Additionally, CliReBERT demonstrates the most stable fine-tuning behavior, yielding the lowest average standard deviation across seeds (0.0118). The 5-fold stratified cross-validation on the SciDCC dataset showed that CliReBERT achieved the highest overall macro-average F1 score (53.75{\%}), performing slightly better than RoBERTa and DistilRoBERTa, while the domain-adapted models underperformed their base counterparts. The results show the usefulness of the new pretrained models for text classification in the climate change domain and underline the positive influence of domain-specific vocabulary.},
  issn={3004-9261},
  doi={10.1007/s42452-025-07740-5},
  url={https://doi.org/10.1007/s42452-025-07740-5}
}
