Model Card for bk-arts-omikuji
An Annif model, trained on bibliographic metadata for automatic subject indexing tasks. It classifies a given text such as a title or description into one or multiple subjects from the classification system Nederlandse basisclassificatie / Basisklassifikation (BK). The model was developed in the research project Human.Machine.Culture at Staatsbibliothek zu Berlin – Berlin State Library (SBB).
Questions and comments about the model can be directed to Sophie Schneider at [email protected].
Table of Contents
- Model Card for bk-arts-omikuji
- Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Model Examination
- Environmental Impact
- Technical Specifications
- Model Card Authors
- Model Card Contact
- How to Get Started with the Model
Model Details
Model Description
An Annif model, trained on title/catalogue metadata for automatic subject indexing tasks. Subject indexing is a classical library task that aims at describing the content of a resource. The model is intended to automatically classify titles or texts that have not yet been indexed manually. For each input, the model outputs one or multiple subjects from the BK classification system. It is part of a collection of three models, created with the help of the Annif toolkit, which address the task of automated subject indexing.
- Developed by: Sophie Schneider
- Shared by [Optional]: Staatsbibliothek zu Berlin / Berlin State Library
- Model type: tree-based
- Language(s) (NLP): multilingual
- License: apache-2.0
Uses
Direct Use
This model can be used directly to automatically classify texts with the BK classification scheme. It is intended to be used together with the Annif automated subject indexing toolkit, version 1.1.0.
Downstream Use
Other downstream uses outside the Annif setting described above are not intended, but they are also not excluded.
Out-of-Scope Use
The primary task of the model is the subject indexing of bibliographic metadata. In principle, the model can be applied to any kind of text describing publications. However, it might not be suitable for colloquial texts from the internet.
Bias, Risks, and Limitations
The BK was introduced in the 1980s, and its structure therefore partly reflects outdated ways of thinking. The BK is biased in itself; for some examples, see https://verbundkonferenz.gbv.de/wp-content/uploads/2024/09/2024-08-20_VK_beckmann_Kunst-oder-Krempel-Potenziale-der-Basisklassifikation.pdf, slide 17. Typical biases include a binary understanding of gender roles and outmoded designations for geographical regions or social movements. However, the classification system is under constant revision in order to keep it up to date. As of now, the classes suggested for an input text might not match today’s understanding and might not conform to contemporary values.
Recommendations
This BK model has been trained solely on the basis of title data. We recommend being aware of this limitation and, if available, using additional training data such as full texts or tables of contents to enhance the existing model (e.g. by running annif learn, or by combining models trained on different kinds of training data into an ensemble model).
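An ensemble as recommended above can be declared in the Annif projects.cfg. The following is a minimal sketch; the section name, the second source project bk-arts-fulltext and the equal weights are illustrative assumptions, not existing projects:

```ini
# Hypothetical ensemble combining this model with a (assumed) full-text model.
# Weights after the colon control each source's influence on the combined score.
[bk-arts-ensemble]
name=BK (Arts) Ensemble
language=de
backend=ensemble
sources=bk-arts-omikuji:1,bk-arts-fulltext:1
vocab=bk
```

With such a configuration, annif suggest bk-arts-ensemble merges the suggestions of both source projects instead of relying on title data alone.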
Training Details
Training Data
The vocabulary file for the BK was downloaded directly from https://api.dante.gbv.de/export/download/bk/default/ (bk__default.turtle.ttl as of 05/2024).
For the subject data, several versions were created for each model with slight adaptations.
A dataset was generated following these steps:
- downloading the file kxp-subjects.tsv.gz of Voß, J., & Verbundzentrale des GBV. (2024). Normalized subject indexing data of K10plus library union catalog (2024-02-26) [Data set]. VZG. https://doi.org/10.5281/zenodo.10933926
- cleaning this file so that it only contains PPNs together with their BK notations
- querying the corresponding titles (Pica3 field 4000, subfields a and d) for all PPNs from the previous step via unapi
- merging the subject and title data based on PPN and filtering out duplicates (identical titles)
- this led to a dataset of 6,114,950 entries overall, split into 80% for training and 10% each for the test and validation subsets; this training dataset has been published on Zenodo
- the dataset was then filtered for all titles classified with specific BK classes (all subclasses of 20 – Kunstwissenschaften, 21 – Einzelne Kunstformen and 24 – Theater, Film, Musik), resulting in 542,300 entries overall (98%/1%/1% split due to the smaller size)
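The merge, deduplication and split steps above can be sketched in Python. The function below is illustrative only: the in-memory data shapes (PPN-keyed dictionaries) and the function name are assumptions, not the actual scripts used to build the published dataset.

```python
import random

def merge_and_split(subjects, titles, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Sketch of the dataset creation steps (assumed shapes, not the real pipeline).

    subjects: dict mapping PPN -> list of BK notations
    titles:   dict mapping PPN -> title string
    Joins both on PPN, drops entries with identical titles, shuffles, and
    returns (train, test, validation) lists of (title, notations) pairs.
    """
    merged, seen_titles = [], set()
    for ppn, notations in subjects.items():
        title = titles.get(ppn)
        if title is None or title in seen_titles:
            continue  # skip PPNs without a title and duplicate titles
        seen_titles.add(title)
        merged.append((title, notations))

    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(merged)

    n = len(merged)
    n_train = int(ratios[0] * n)
    n_test = int(ratios[1] * n)
    return (merged[:n_train],
            merged[n_train:n_train + n_test],
            merged[n_train + n_test:])
```

For the arts subset, the same function would be called with ratios=(0.98, 0.01, 0.01) on the pre-filtered entries.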
Training Procedure
The training procedure consists of loading the BK vocabulary into Annif and training the Omikuji backend with the respective training data. Further technical details can be found in the section Training hyperparameters.
Preprocessing
Besides merging and transforming the data as described under Training Data, no further preprocessing of natural language or similar has been performed in the dataset creation. For training the model, we applied the simple analyzer provided by Annif.
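The effect of a simple analyzer can be approximated as lowercased word tokens with very short tokens dropped. The sketch below is only an approximation for illustration; the exact tokenization rules and the minimum token length are defined by Annif, and the cut-off used here is an assumption:

```python
import re

TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def simple_analyze(text, min_len=3):
    # Rough approximation of a "simple" analyzer: split into word tokens,
    # lowercase them, and drop tokens shorter than min_len characters
    # (min_len=3 is an assumption, not Annif's documented default).
    return [t.lower() for t in TOKEN_RE.findall(text) if len(t) >= min_len]
```

For example, simple_analyze("Die Geschichte der Kunst im 20. Jahrhundert") keeps the lowercased content words and drops the short tokens "im" and "20".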
Speeds, Sizes, Times
Training takes from several minutes to an hour on a 2.8 GHz CPU with 32 cores, depending on the choice of dataset and algorithm as well as on the hyperparameter settings.
Training hyperparameters
```ini
name=BK (Arts) Omikuji Project
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
```
Training results
documents evaluated: 5000
- Precision (--limit 2, --threshold 0.05): 0.4858
- Recall (--limit 2, --threshold 0.05): 0.5529
- F1 (--limit 2, --threshold 0.05): 0.4872
- NDCG (--limit 2, --threshold 0.05): 0.5472
- F1@5: 0.3339
- NDCG@5: 0.6390
Evaluation
Testing Data and Metrics
Testing Data
The dataset is described under Training Data. It was split into smaller subsets used for training, testing and validation (80%/10%/10% split).
Metrics
Model performance has been evaluated based on the following metrics: Precision, Recall, F1 and NDCG. These are standard metrics for machine learning in general and for automatic subject indexing tasks in particular, and are provided directly in Annif via the annif eval command. The evaluation parameters (limit = maximum number of results to return; threshold = minimum confidence for a suggestion to be considered) were optimized before evaluation on the test subset and chosen based on the best F1 score reached; they affect the final results accordingly. We also state the F1@5 and NDCG@5 scores reached without any evaluation parameters.
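To illustrate how the limit and threshold parameters interact with the metrics, the following sketch computes document-level precision, recall and F1 for a single document. This is illustrative only; annif eval averages these metrics over all test documents and the sample scores and notations below are invented:

```python
def suggestions_at(scores, limit=2, threshold=0.05):
    """Keep at most `limit` top-scoring subjects whose score is >= threshold,
    mirroring the --limit and --threshold evaluation parameters."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {subj for subj, score in ranked[:limit] if score >= threshold}

def prf1(suggested, gold):
    """Precision, recall and F1 for one document, given suggested and
    gold-standard subject sets."""
    tp = len(suggested & gold)
    p = tp / len(suggested) if suggested else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A suggestion below the threshold is discarded even if it falls within the limit, which is why raising the threshold trades recall for precision.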
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A100.
- Hours used: 0.5 to 1 hour.
- Cloud Provider: No cloud.
- Compute Region: Germany.
- Carbon Emitted: More information needed.
Technical Specifications
Model Architecture and Objective
See the Annif Wiki on the Omikuji algorithm.
Software
To run this model, Annif version 1.1.0 must be installed.
Model Card Authors
Sophie Schneider and Jörg Lehmann
Model Card Contact
Questions and comments about the model can be directed to Sophie Schneider at [email protected]. Questions and comments about the model card can be directed to Jörg Lehmann at [email protected].
How to Get Started with the Model
Follow the Annif Getting Started page to set up and run Annif. Create a projects.cfg file (see the section Training hyperparameters for details on the specific project configuration), load the BK vocabulary via the annif load-vocab command, and copy the model folder to data/projects.
Model Card as of October 20th, 2025