Model Card for impresso-project/ner-stacked-bert-multilingual

The Impresso NER model is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and is designed to identify coarse- and fine-grained entity types in digitized historical texts, including persons, organizations, locations, time expressions, and products.

Model Details

Model Description

  • Developed by: the Impresso team at EPFL. Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
  • Model type: Stacked BERT-based token classification for named entity recognition
  • Languages: French, German, English (with support for multilingual historical texts)
  • License: AGPL v3+
  • Finetuned from: dbmdz/bert-medium-historic-multilingual-cased

Model Architecture

The model architecture consists of the following components:

  • A pre-trained BERT encoder (multilingual historic BERT) as the base.
  • One or two Transformer encoder layers stacked on top of the BERT encoder.
  • A Conditional Random Field (CRF) decoder layer to model label dependencies.
  • Learned absolute positional embeddings for improved handling of noisy inputs.

These additional Transformer layers help in mitigating the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
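For orientation, the following is a minimal PyTorch sketch of this stacking pattern, using the pytorch-crf package for the CRF layer. The module names, tag-set size, and attention-head count are illustrative assumptions, not the model's actual internals.

import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class StackedBertCRF(nn.Module):
    def __init__(self, base="dbmdz/bert-medium-historic-multilingual-cased",
                 num_tags=21,  # hypothetical tag-set size
                 n_extra_layers=2, max_len=512):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        hidden = self.bert.config.hidden_size
        # Learned absolute positional embeddings, re-added before the extra stack.
        self.pos = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.extra = nn.TransformerEncoder(layer, num_layers=n_extra_layers)
        self.proj = nn.Linear(hidden, num_tags)  # token-level emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.extra(h + self.pos(positions),
                       src_key_padding_mask=~attention_mask.bool())
        emissions = self.proj(h)
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        # Inference: Viterbi decoding of the best tag sequence per sentence.
        return self.crf.decode(emissions, mask=attention_mask.bool())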

Training and Evaluation Results

This evaluation corresponds to the HIPE-2020 dataset (v2.1), using the combined French and German training sets, the German dev set (dev-de) for validation, and the French test set (test-fr) for testing.

Training Hyperparameters

  • Training regime: Mixed precision (fp16)
  • Epochs: 5
  • Max sequence length: 512
  • Base model: dbmdz/bert-medium-historic-multilingual-cased
  • Stacked Transformer layers: 2
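
As a rough sketch, these settings map onto a Hugging Face TrainingArguments configuration like the one below; the batch size and learning rate are placeholders, since the card does not report them, and the maximum sequence length is applied at tokenization time rather than here.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ner-stacked-bert",   # illustrative output path
    num_train_epochs=5,              # Epochs: 5
    fp16=True,                       # mixed-precision (fp16) training regime
    per_device_train_batch_size=16,  # assumption: not reported in the card
    learning_rate=5e-5,              # assumption: not reported in the card
)

# Max sequence length (512) is enforced when tokenizing the training data:
# tokenizer(texts, truncation=True, max_length=512)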

Results

The results below show performance on the French test set across multiple evaluation settings.

| Evaluation | Label | P | R | F1 |
|------------|-------|-------|-------|-------|
| NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL | ALL | 0.797 | 0.833 | 0.814 |
| | LOC | 0.859 | 0.862 | 0.860 |
| | ORG | 0.432 | 0.485 | 0.457 |
| | PERS | 0.816 | 0.908 | 0.860 |
| | PROD | 0.531 | 0.426 | 0.473 |
| | TIME | 0.836 | 0.962 | 0.895 |
| NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL | ALL | 0.689 | 0.720 | 0.704 |
| | LOC | 0.797 | 0.800 | 0.798 |
| | ORG | 0.349 | 0.392 | 0.370 |
| | PERS | 0.640 | 0.713 | 0.675 |
| | PROD | 0.429 | 0.344 | 0.382 |
| | TIME | 0.639 | 0.736 | 0.684 |
| NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL | ALL | 0.824 | 0.849 | 0.828 |
| | LOC | 0.834 | 0.877 | 0.845 |
| | ORG | 0.393 | 0.513 | 0.477 |
| | PERS | 0.813 | 0.889 | 0.858 |
| | PROD | 0.483 | 0.354 | 0.450 |
| | TIME | 0.861 | 0.917 | 0.956 |
| NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL | ALL | 0.720 | 0.740 | 0.723 |
| | LOC | 0.771 | 0.815 | 0.783 |
| | ORG | 0.278 | 0.359 | 0.336 |
| | PERS | 0.670 | 0.727 | 0.705 |
| | PROD | 0.358 | 0.281 | 0.348 |
| | TIME | 0.727 | 0.746 | 0.796 |

Entity Types Supported

The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/2022 guidelines. It outputs structured predictions that include contextual and semantic details; each prediction is a dictionary with the following fields:

{
  'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
  'confidence_ner': float,              # Confidence score (percentage, 0-100)
  'surface': str,                       # Surface form in text
  'lOffset': int,                       # Start character offset
  'rOffset': int,                       # End character offset
  'name': str,                          # Optional: full name (for persons)
  'title': str,                         # Optional: title (for persons)
  'function': str                       # Optional: function (if detected)
}
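
The offsets index into the original input string, so the surface form can always be recovered by slicing. A small illustration with hand-built values:

# Illustrative only: a hand-built prediction in the format above.
text = "Le chancelier Guillaume de Nogaret"
entity = {
    "type": "pers",
    "surface": "chancelier Guillaume de Nogaret",
    "lOffset": 3,
    "rOffset": 34,
}

# Slicing the input with the offsets reproduces the surface form.
assert text[entity["lOffset"]:entity["rOffset"]] == entity["surface"]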

Coarse-Grained Entity Types:

  • pers: Person entities (individuals, collectives, authors)
  • org: Organizations (administrative, enterprise, press agencies)
  • prod: Products (media)
  • time: Time expressions (absolute dates)
  • loc: Locations (towns, regions, countries, physical, facilities)

If present in the text surrounding an entity, the model also returns person-specific attributes such as:

  • name: canonical full name
  • title: honorific or title (e.g., "king", "chancellor")
  • function: role or function in context (if available)
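
For example, person mentions carrying these attributes could be collected as follows (a sketch over the prediction format above; the helper name is ours):

def persons_with_titles(entities):
    """Return (surface, title, name) triples for person entities with a title."""
    return [
        (e["surface"], e["title"], e.get("name"))
        for e in entities
        if e["type"] == "pers" and "title" in e
    ]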

Uses

Direct Use

The model is intended to be used directly through the Hugging Face pipeline API with the custom generic-ner task on historical texts (see the example below).

Downstream Use

The model can be used for downstream tasks such as:

  • Historical information extraction
  • Biographical reconstruction
  • Place and person mention detection across historical archives

Out-of-Scope Use

  • Not suitable for contemporary named entity recognition in domains such as social media or modern news.
  • Not optimized for clean, born-digital (OCR-free) modern corpora.

Bias, Risks, and Limitations

Due to training on historical documents, the model may reflect historical biases and inaccuracies. It may underperform on contemporary or non-European languages.

Recommendations

  • Users should be cautious of historical and typographical biases.
  • Consider post-processing to filter false positives from OCR noise, as sketched below.
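
One simple post-processing pass, assuming the prediction format described above (the threshold value is illustrative):

def filter_by_confidence(entities, min_confidence=50.0):
    """Drop predictions whose confidence falls below a chosen threshold."""
    return [e for e in entities if e["confidence_ner"] >= min_confidence]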

How to Get Started with the Model

from transformers import AutoTokenizer, pipeline

MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer that ships with the model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# The model registers a custom "generic-ner" pipeline task, so
# trust_remote_code must be enabled; set device to a GPU index if available.
ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer,
                        trust_remote_code=True, device="cpu")

sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
entities = ner_pipeline(sentence)
print(entities)

Example Output

[
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': "Europe", 'lOffset': 69, 'rOffset': 75},
  {'type': 'loc', 'confidence_ner': 75.45, 'surface': "Royaume de France", 'lOffset': 80, 'rOffset': 97},
  {'type': 'pers', 'confidence_ner': 85.27, 'surface': "roi Philippe VI", 'lOffset': 181, 'rOffset': 196, 'title': "roi", 'name': "roi Philippe VI"},
  {'type': 'loc', 'confidence_ner': 30.59, 'surface': "Louvre", 'lOffset': 210, 'rOffset': 216},
  {'type': 'loc', 'confidence_ner': 94.46, 'surface': "Paris", 'lOffset': 266, 'rOffset': 271},
  {'type': 'pers', 'confidence_ner': 96.1, 'surface': "chancelier Guillaume de Nogaret", 'lOffset': 350, 'rOffset': 381, 'title': "chancelier", 'name': "Guillaume de Nogaret"},
  {'type': 'loc', 'confidence_ner': 49.35, 'surface': "Royaume", 'lOffset': 80, 'rOffset': 87},
  {'type': 'loc', 'confidence_ner': 24.18, 'surface': "France", 'lOffset': 91, 'rOffset': 97}
]
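
Note that the output can contain nested spans, as above, where "Royaume" and "France" fall inside "Royaume de France". If only the longest mention is wanted, a pass like the following (our own sketch, not part of the model's API) drops spans strictly contained in a longer one:

def drop_nested(entities):
    """Keep only entities not strictly contained inside a longer entity span."""
    def contained(inner, outer):
        return (outer["lOffset"] <= inner["lOffset"]
                and inner["rOffset"] <= outer["rOffset"]
                and (outer["rOffset"] - outer["lOffset"])
                    > (inner["rOffset"] - inner["lOffset"]))
    return [e for e in entities if not any(contained(e, o) for o in entities)]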

Citation

BibTeX:

@inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
  booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
  pages={431--441},
  year={2020}
}

Contact

For questions about the model or the Impresso project, see https://impresso-project.ch/.
