EstLLM Prototype 0825 Instruct
llama-estllm-prototype-0825 is the first artifact produced by the EstLLM project. The intention of this release is to evaluate the first prototype in a conversational, Chatbot Arena-style setting on baromeeter.ai and thus establish a baseline for future improvements.
The model underwent continued pre-training from meta-llama/Llama-3.1-8B on approximately 35B tokens, resulting in tartuNLP/Llama-3.1-EstLLM-8B-0525, followed by supervised fine-tuning and direct preference optimization.
Use with transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/llama-estllm-prototype-0825"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)
# to use on Apple Silicon, load the model the following way:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids include the input tokens as well, so we decode only the new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
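The same model can also be served through the higher-level pipeline API. The sketch below assumes a recent transformers version that accepts chat-formatted messages directly in the text-generation pipeline; generation parameters mirror the example above.

```python
from transformers import pipeline

# Alternative loading via the text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="tartuNLP/llama-estllm-prototype-0825",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]
outputs = generator(messages, max_new_tokens=128, do_sample=True, temperature=0.4)
# With chat-formatted input, "generated_text" contains the full conversation,
# including the newly generated assistant turn.
print(outputs[0]["generated_text"][-1]["content"])
```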
Model Details
Model Description
- Developed by: TartuNLP and TalTechNLP research groups
- Funded by: Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- Model type: Causal Language Model, Instruction-following
- Language(s) (NLP): Estonian, English
- License: Llama 3.1 Community License Agreement
- Finetuned from model: tartuNLP/Llama-3.1-EstLLM-8B-0525
Continued Pre-Training
Continued Pre-Training was performed for a single epoch on the corpora below; an illustrative data-mixing sketch follows the list.
- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)
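The exact pre-training data pipeline is not reproduced here. As an illustrative sketch only, a token-weighted mixture of corpora like the ones above could be assembled with the datasets library; the file paths and sampling probabilities below are placeholders rather than the actual configuration.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical local files standing in for the corpora listed above;
# the real data sources and preprocessing are not reproduced here.
estonian = load_dataset("json", data_files="data/estonian_national_corpus.jsonl", split="train", streaming=True)
code = load_dataset("json", data_files="data/python_edu.jsonl", split="train", streaming=True)
math = load_dataset("json", data_files="data/finemath_4plus.jsonl", split="train", streaming=True)

# Sampling probabilities roughly proportional to the token counts above (illustrative only).
mixture = interleave_datasets([estonian, code, math], probabilities=[0.40, 0.16, 0.44], seed=42)
```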
Supervised Fine-Tuning
Approximately 764k examples were used for Supervised Fine-Tuning. The examples come mainly from the Tulu 3 SFT mixture and EuroBlocks, with additional data provided by the Institute of the Estonian Language (EKI). In total, about 80% of the examples are in English. More details TBA.
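For illustration, a step like this is commonly run with TRL's SFTTrainer. The sketch below is not the actual training setup: the dataset stands in for the full ~764k-example blend, the hyperparameters are defaults, and it assumes a recent trl version that handles chat-formatted ("messages") datasets.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative only: the Tulu 3 SFT mixture stands in for the full data blend,
# and the configuration below is not the actual training recipe.
dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

trainer = SFTTrainer(
    model="tartuNLP/Llama-3.1-EstLLM-8B-0525",  # continued pre-training checkpoint
    args=SFTConfig(output_dir="estllm-sft"),
    train_dataset=dataset,  # chat-formatted "messages" column
)
trainer.train()
```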
Direct Preference Optimization
The English-only HelpSteer3 dataset was used as-is in the Direct Preference Optimization step, since previous research on the Poro 2 models showed no observable benefit from translating preference pairs.
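A minimal sketch of such a preference-tuning step with TRL's DPOTrainer is shown below. It assumes a recent trl version, a hypothetical SFT checkpoint path, and HelpSteer3 already converted into prompt/chosen/rejected records; it does not reflect the actual training configuration.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Hypothetical SFT checkpoint path and pre-converted preference file;
# the actual training configuration is not published here.
model = AutoModelForCausalLM.from_pretrained("estllm-sft")
tokenizer = AutoTokenizer.from_pretrained("estllm-sft")

# HelpSteer3 preference data converted into "prompt"/"chosen"/"rejected" records.
preference_data = load_dataset("json", data_files="data/helpsteer3_dpo.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="estllm-dpo", beta=0.1),
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()
```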
Evaluation
Logits-based
Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.
Generative
Every benchmark in this category is treated as a generative problem: evaluation is performed on model responses generated with temperature 0 (greedy decoding), not on logits. The top score for each benchmark is highlighted in bold and the second-best in bold italic. Rows are sorted in descending order by the number of model parameters (not by score). The test set of each dataset is used for evaluation unless noted otherwise.
Note that all models are evaluated with the same prompt template for comparability, meaning that the scores do not necessarily represent each model's best possible
performance. This is especially the case for deepseek-ai/DeepSeek-V3-0324 on some of the benchmarks.
Only models of comparable size are evaluated on benchmarks in English.
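For illustration only, responses of this kind could be collected with greedy decoding roughly as follows; the prompts and scoring are placeholders, not the actual evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tartuNLP/llama-estllm-prototype-0825"
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

def greedy_response(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a deterministic (temperature-0) response for evaluation."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Hypothetical benchmark prompt; real evaluation uses each benchmark's own data and scorer.
predictions = [greedy_response(p) for p in ["Mis on Eesti pealinn?"]]
```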
Instruction-following
Estonian
Instruction-level strict accuracy is reported for IFEval-et.
| Model (# parameters ↓) | IFEval-et |
|---|---|
| moonshotai/Kimi-K2-Instruct | 0.7891 |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | 0.7705 |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
English
Instruction-level strict accuracy is reported for IFEval-en.
| Model (# parameters ↓) | IFEval-en |
|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| meta-llama/Llama-3.1-8B-Instruct | 0.8106 |
| tartuNLP/llama-estllm-prototype-0825 | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | 0.7954 |
Multiple Choice
All datasets except Winogrande-et are evaluated in 0-shot mode. Winogrande-et is evaluated in 3-shot mode. Exact match accuracy is reported for every dataset.
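As a small illustration of the metric (not the actual evaluation code), exact match accuracy is simply the fraction of examples where the model's answer equals the gold answer; the normalization below is a hypothetical choice.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of examples where the (case-normalized) prediction equals the reference."""
    matches = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return matches / len(references)

# exact_match_accuracy(["B", "a", "C"], ["B", "A", "D"]) -> 2/3
```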
Estonian Language Competence
| Model (# parameters ↓) | Grammar-et | Inflection-et | Word-Meanings-et |
|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | 0.916 | 0.6458 | 0.9689 |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | 0.8355 | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | 0.818 | 0.9089 | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/llama-estllm-prototype-0825 | 0.692 | 0.5188 | 0.9569 |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |
Knowledge and Reasoning (Estonian)
| Model (# parameters ↓) | Winogrande-et | Trivia-et | Exam-et | GlobalPIQA-et | TruthfulQA-et |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | 0.8138 | 0.4225 | 0.8414 | 0.79 | 0.7136 |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | 0.8042 | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | 0.4275 | 0.7931 | 0.73 | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | 0.4713 | 0.8309 | 0.58 | 0.7001 |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |
Knowledge and Reasoning (English)
| Model (# parameters ↓) | Winogrande | GlobalPIQA-en | TruthfulQA | MMLU-Redux | GSM8K |
|---|---|---|---|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | 0.5239 | 0.6959 | 0.7710 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6503 | 0.77 | 0.519 | 0.7418 | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| tartuNLP/llama-estllm-prototype-0825 | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | 0.6627 | 0.83 | 0.5875 | 0.7555 | 0.7862 |
Translation
English to Estonian
| Model | wmt24pp (BLEU ↑) |
|---|---|
| BSC-LT/salamandraTA-7b-instruct | 0.2713 |
| tartuNLP/llama-estllm-prototype-0825 | 0.264 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
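For reference, corpus-level BLEU of this kind can be computed with sacrebleu; the hypothesis/reference lists below are placeholders, and the exact evaluation setup (prompting, normalization, score scaling) is not shown.

```python
import sacrebleu

# Placeholder system outputs and references; wmt24pp provides the real test pairs.
hypotheses = ["Tere, maailm!"]
references = ["Tere, maailm!"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score / 100)  # sacrebleu reports BLEU on a 0-100 scale; the table above uses 0-1
```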
Limitations
This is an early prototype version. Accordingly, it has limitations in addition to the base Llama limitations:
- Relatively short context of 4096 tokens; the model is not expected to perform well on inputs beyond that (see the sketch after this list).
- Multi-turn conversations are not supported in this version.
- Trained with the original Llama 3.1 system prompt, which has a hard-coded knowledge cut-off date.
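Given the 4096-token context window and single-turn support, one simple guard (an illustrative sketch, not part of the released code) is to check that a prompt plus the generation budget fits before calling generate.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/llama-estllm-prototype-0825")

def fits_context(prompt: str, max_new_tokens: int = 128, context_window: int = 4096) -> bool:
    """Check that a single-turn prompt plus the generation budget fits in the context window."""
    messages = [{"role": "user", "content": prompt}]
    prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return len(prompt_ids) + max_new_tokens <= context_window
```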
Citation
TBA