Bilingual ELECTRA (Czech-Slovak)

Bilingual ELECTRA (Czech-Slovak) is an ELECTRA-Small model pretrained on a mixed Czech and Slovak corpus. The model was trained to cover both languages equally and can be fine-tuned for various NLP tasks, including text classification, named entity recognition, and punctuation restoration. It is released under the CC BY 4.0 license, which permits commercial use.
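
The usage examples below load the pretrained discriminator; task-specific heads are initialized from scratch when you fine-tune. As a minimal sketch of a fine-tuning setup for text classification (the num_labels value and the example sentence are placeholders, not part of the released model):

from transformers import AutoTokenizer, ElectraForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")
model = ElectraForSequenceClassification.from_pretrained(
    "AILabTUL/BiELECTRA-czech-slovak",
    num_labels=2,  # placeholder: set to the number of classes in your task
)

# The classification head is randomly initialized and must be trained
# on your own labeled data before its predictions are meaningful.
inputs = tokenizer("Toto je testovací věta.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()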

Tokenization

The model uses a SentencePiece tokenizer and requires a SentencePiece model file (m.model) for proper tokenization. You can use either the HuggingFace AutoTokenizer (recommended) or SentencePiece directly.

Using HuggingFace AutoTokenizer (Recommended)

from transformers import AutoTokenizer, ElectraForPreTraining

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./CZSK")

# Load the pretrained model
model = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text
sentence = "Toto je testovací věta v češtině a slovenčine."
inputs = tokenizer(sentence, return_tensors="pt")

# Run inference
outputs = model(**inputs)
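
The discriminator head returns one logit per input token for ELECTRA's replaced-token-detection pretraining objective; applying a sigmoid turns them into probabilities. A small sketch for inspecting the per-token scores, reusing the variables from the snippet above:

import torch

scores = torch.sigmoid(outputs.logits)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, scores):
    print(f"{token}\t{score.item():.3f}")  # scores near 1.0 flag a token as likely replaced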

Using SentencePiece Directly

from transformers import ElectraForPreTraining
import sentencepiece as spm
import torch

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Load the pretrained model
discriminator = ElectraForPreTraining.from_pretrained("AILabTUL/BiELECTRA-czech-slovak")

# Tokenize input text (note: input should be lowercase)
sentence = "toto je testovací věta v češtině a slovenčine."
tokens = sp.encode(sentence, out_type=str)
token_ids = sp.encode(sentence)

# Convert to tensor
input_tensor = torch.tensor([token_ids])

# Run inference
outputs = discriminator(input_tensor)
predictions = torch.sigmoid(outputs.logits).detach().cpu().numpy()
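
As a quick sanity check, you can pair each SentencePiece piece with its replaced-token score (a sketch reusing the variables above). Note that this direct encoding does not add any special tokens, unlike the AutoTokenizer path:

for piece, score in zip(tokens, predictions[0]):
    print(f"{piece}\t{score:.3f}")  # scores near 1.0 flag a piece as likely replaced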

Citation

This model was published as part of the research paper:

"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"

@InProceedings{polacek:2025:RANLPStud,
  author    = {Polacek, Martin},
  title     = {Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream},
  booktitle = {Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing},
  month     = {September},
  year      = {2025},
  address   = {Varna, Bulgaria},
  publisher = {INCOMA Ltd., Shoumen, Bulgaria},
  pages     = {37--43},
  abstract  = {In this study, we employ various ELECTRA-Small models that are pre-trained and fine-tuned on specific sets of languages for automatic punctuation restoration (APR) in automatically transcribed TV and radio shows, which contain conversations in two closely related languages. Our evaluation data specifically concerns bilingual interviews in Czech and Slovak and data containing speeches in Swedish and Norwegian. We train and evaluate three types of models: the multilingual (mELECTRA) model, which is pre-trained for 13 European languages; two bilingual models, each pre-trained for one language pair; and four monolingual models, each pre-trained for a single language. Our experimental results show that a) fine-tuning, which must be performed using data belonging to both target languages, is the key step in developing a bilingual APR system and b) the mELECTRA model yields competitive results, making it a viable option for bilingual APR and other multilingual applications. Thus, we publicly release our pre-trained bilingual and, in particular, multilingual ELECTRA-small models on HuggingFace, fostering further research in various multilingual tasks.},
  url       = {https://aclanthology.org/2025.ranlpstud-1.5}
}
