Model Card for bk-arts-omikuji

An Annif model, trained on bibliographic metadata for automatic subject indexing tasks. It classifies a given text such as a title or description into one or multiple subjects from the classification system Nederlandse basisclassificatie / Basisklassifikation (BK). The model was developed in the research project Human.Machine.Culture at Staatsbibliothek zu Berlin – Berlin State Library (SBB).

Questions and comments about the model can be directed to Sophie Schneider at [email protected].

Model Details

Model Description

An Annif model, trained on title/catalogue metadata for automatic subject indexing tasks. Subject indexing is a classical library task that aims to describe the content of a resource. The model is intended to automatically classify titles or texts that have not yet been indexed manually. For each input, the model outputs one or more subjects from the BK classification system. It is part of a collection of three models, created with the help of the Annif toolkit, that address the task of automated subject indexing.

Uses

Direct Use

This model can be used directly to automatically classify texts with the BK classification scheme. It is intended to be used together with the Annif automated subject indexing toolkit, version 1.1.0.

Downstream Use

Other/downstream uses outside of the Annif setting described above are not intended but also not excluded.

Out-of-Scope Use

The primary task of the model is the subject indexing of bibliographic metadata. In principle, the model can be applied to any kind of text describing publications; however, it might not be suitable for colloquial texts from the internet.

Bias, Risks, and Limitations

The BK was introduced in the 1980s, and its structure therefore partly reflects outdated ways of thinking. The BK is biased in itself; for some examples, see https://verbundkonferenz.gbv.de/wp-content/uploads/2024/09/2024-08-20_VK_beckmann_Kunst-oder-Krempel-Potenziale-der-Basisklassifikation.pdf, slide 17. Typical biases include, for example, a binary understanding of gender roles and outmoded designations for geographical regions or social movements. However, the classification system is under constant revision in order to keep it up to date. As of now, the classes suggested for an input text might not suit today's understanding and might not conform to contemporary values.

Recommendations

This BK model has been trained solely on title data. We recommend being aware of this limitation and, where available, using additional training data such as full texts or tables of contents to enhance the existing model (e.g. by running annif learn, or by combining models trained on different kinds of training data into an ensemble model).
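One way to combine several models in Annif is the ensemble backend. A hedged sketch of what such a projects.cfg entry could look like (the section header, the second source project id and the weights are placeholders, not shipped models):

```ini
[bk-arts-ensemble]
name=BK (Arts) Ensemble (hypothetical)
language=de
backend=ensemble
; source project ids with optional weights; bk-arts-fulltext is a placeholder
sources=bk-arts-omikuji:1,bk-arts-fulltext:2
vocab=bk
```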

Training Details

Training Data

The vocabulary file for the BK was downloaded directly from https://api.dante.gbv.de/export/download/bk/default/ (bk__default.turtle.ttl as of 05/2024).

For the subject data, several versions were created for each model with slight adaptations.

A dataset was generated following these steps:

  • downloading the kxp-subjects.tsv.gz of Voß, J., & Verbundzentrale des GBV. (2024). Normalized subject indexing data of K10plus library union catalog (2024-02-26) [Data set]. VZG. https://doi.org/10.5281/zenodo.10933926
  • cleaning this file so that it only contains PPNs together with their BK notations
  • querying the corresponding titles (Pica3 field 4000, subfields a and d) for all PPNs from the previous step via unAPI
  • merging the subject and title data based on PPN and filtering out duplicates (identical titles)
  • this led to a dataset of 6,114,950 entries overall, split into 80% for training and 10% each for the test and validation subsets; this training dataset has been published on Zenodo
  • the dataset was then filtered for all titles classified with specific BK classes (all subclasses of 20 – Kunstwissenschaften, 21 – Einzelne Kunstformen and 24 – Theater, Film, Musik), resulting in 542,300 entries overall (98%/1%/1% split due to the smaller size)
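The merge, deduplication and split steps above can be sketched as follows. This is an illustrative reconstruction, not the actual pipeline; the data structures and the fixed seed are assumptions:

```python
import random

def merge_and_split(subjects, titles, seed=42):
    """Join BK notations to titles by PPN, drop duplicate titles,
    and split the result into train/validation/test subsets (80/10/10).

    subjects: {ppn: [bk_notation, ...]}, titles: {ppn: title}
    """
    seen_titles = set()
    records = []
    for ppn, title in titles.items():
        # skip entries without a BK notation and duplicate titles
        if ppn not in subjects or title in seen_titles:
            continue
        seen_titles.add(title)
        records.append((title, subjects[ppn]))
    random.Random(seed).shuffle(records)
    n = len(records)
    train_end = int(n * 0.8)
    val_end = int(n * 0.9)
    return records[:train_end], records[train_end:val_end], records[val_end:]
```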

Training Procedure

The training procedure consists of loading the BK vocabulary into Annif and training the Omikuji backend on the respective training data. Further technical details can be found in the section Training hyperparameters.

Preprocessing

Besides merging and transforming the data as described under Training Data, no further natural-language preprocessing has been performed during dataset creation. For training the model, we applied the simple analyzer provided by Annif.
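Roughly speaking, a "simple" analyzer lowercases the input without stemming and keeps only word-like tokens of a minimum length. The following is a rough stand-in to illustrate the idea, not Annif's actual implementation; the minimum token length of 3 is an assumption:

```python
import re

def simple_analyze(text, token_min_length=3):
    """Rough stand-in for a 'simple' analyzer: lowercase the text and
    keep alphabetic tokens of a minimum length (no stemming)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= token_min_length and t.isalpha()]
```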

Speeds, Sizes, Times

Training takes from several minutes to an hour on a 2.8 GHz CPU with 32 cores, depending on the choice of dataset and algorithm as well as on the hyperparameter settings.

Training hyperparameters

name=BK (Arts) Omikuji Project
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
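Laid out as a projects.cfg entry, these settings correspond to something like the following sketch (the section header, i.e. the project id, is an assumption):

```ini
[bk-arts-omikuji]
name=BK (Arts) Omikuji Project
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
```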

Training results

documents evaluated: 5000

  • Precision (--limit 2, --threshold 0.05): 0.4858
  • Recall (--limit 2, --threshold 0.05): 0.5529
  • F1 (--limit 2, --threshold 0.05): 0.4872
  • NDCG (--limit 2, --threshold 0.05): 0.5472
  • F1@5: 0.3339
  • NDCG@5: 0.6390

Evaluation

Testing Data and Metrics

Testing Data

The dataset is described under Training Data. It was split into smaller subsets used for training, testing and validation (80%/10%/10% split).

Metrics

Model performance has been evaluated based on the following metrics: Precision, Recall, F1 and NDCG. These are standard metrics for machine learning, and more specifically for automatic subject indexing tasks, and are directly provided in Annif via the annif eval command. The evaluation parameters (limit = maximum number of results to return; threshold = minimum confidence for a suggestion to be considered) were optimized beforehand using the test subset and chosen based on the best F1 score reached; they affect the final results accordingly. We also state the F1@5 and NDCG@5 scores reached without any evaluation parameters.
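To illustrate how the limit and threshold parameters interact with the set-based metrics, here is a self-contained sketch for a single document. It is not Annif's implementation (which averages results over all evaluated documents); the scores and class notations are made up:

```python
def filter_suggestions(scored, limit=2, threshold=0.05):
    """Keep at most `limit` highest-scoring subjects whose score
    is at least `threshold`."""
    kept = [s for s, score in sorted(scored, key=lambda x: -x[1])
            if score >= threshold]
    return set(kept[:limit])

def precision_recall_f1(suggested, gold):
    """Set-based precision, recall and F1 for one document."""
    tp = len(suggested & gold)
    precision = tp / len(suggested) if suggested else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1
```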

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: A100.
  • Hours used: 0.5-1 hour.
  • Cloud Provider: No cloud.
  • Compute Region: Germany.
  • Carbon Emitted: More information needed.

Technical Specifications

Model Architecture and Objective

See the Annif Wiki on the Omikuji algorithm.

Software

To run this model, Annif version 1.1.0 must be installed.

Model Card Authors

Sophie Schneider and Jörg Lehmann

Model Card Contact

Questions and comments about the model can be directed to Sophie Schneider at [email protected]; questions and comments about the model card can be directed to Jörg Lehmann at [email protected].

How to Get Started with the Model

Follow the Annif Getting Started page to set up and run Annif. Create a projects.cfg file (see the section Training hyperparameters for details on the specific project configuration), load the BK vocabulary via the annif load-vocab command and copy the model folder over to data/projects.
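A hedged sketch of these setup steps on the command line. The local file and folder names are illustrative, and the project id is assumed to match the projects.cfg entry:

```shell
# Install the required Annif version
pip install annif==1.1.0

# Load the BK vocabulary (bk__default.turtle.ttl as downloaded from DANTE,
# see Training Data)
annif load-vocab bk bk__default.turtle.ttl

# Copy the downloaded model folder into Annif's data directory
cp -r bk-arts-omikuji data/projects/

# Suggest subjects for a title read from stdin
echo "Die Malerei des Impressionismus" | annif suggest bk-arts-omikuji
```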

Model Card as of October 20th, 2025

