MoTIF — Concepts in Motion
Abstract
Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human‑interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher‑level components (e.g., "bow", "mount", "shoot") that reoccur across time—forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept‑based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance.
Key Features
- Concept Bottlenecks for Video: map frames/clips to a shared image–text space and obtain concept activations by cosine similarity (see the sketch after this list).
- Per‑Channel Temporal Self‑Attention: concept channels stay independent; attention happens over time within each concept.
- Three Explanation Views: global concept relevance, local window concepts, and attention‑based temporal maps.
- Plug‑and‑Play Backbones: designed to work with CLIP and related vision–language models.
- Multiple Datasets: examples provided for UCF‑101, HMDB‑51, Something‑Something v2, and Breakfast Actions.
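For a rough picture of the first two features, here is a minimal PyTorch sketch. Shapes and function names are illustrative, not taken from the repository: it shows concept activations obtained by cosine similarity and a simplified temporal self-attention applied independently per concept channel.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: T frames, D embedding dims, C concepts.
# frame_emb:   (T, D) frame embeddings from a vision-language backbone.
# concept_emb: (C, D) text embeddings of the concept vocabulary.
def concept_activations(frame_emb: torch.Tensor, concept_emb: torch.Tensor) -> torch.Tensor:
    frame_emb = F.normalize(frame_emb, dim=-1)
    concept_emb = F.normalize(concept_emb, dim=-1)
    # Cosine similarity -> (T, C): one activation per frame and per concept.
    return frame_emb @ concept_emb.T

# Per-channel temporal self-attention: each concept attends only over time,
# never across concepts (illustrative, single head, no learned projections).
def per_concept_temporal_attention(acts: torch.Tensor) -> torch.Tensor:
    # acts: (T, C) -> treat each concept as an independent length-T sequence.
    x = acts.T.unsqueeze(-1)                     # (C, T, 1)
    scores = torch.matmul(x, x.transpose(1, 2))  # (C, T, T) similarities over time
    attn = scores.softmax(dim=-1)                # temporal attention map per concept
    return torch.matmul(attn, x).squeeze(-1).T   # (T, C) re-weighted activations
```

The real model uses learned projections and windowing; the sketch only illustrates the key constraint that attention runs over time within each concept channel, never across concepts.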
Getting Started
1) Environment
- Python 3.10+ (tested with 3.13.5)
- CUDA‑enabled GPU recommended (checkpoints and scripts assume a GPU environment)
Create and activate an environment, then install requirements:
pip install -r requirements.txt
2) Data
Place your datasets under Datasets/ (see the folder structure below). If you want to generate small demo clips or frames, you can use:
python save_videos.py
3) Create Embeddings
Compute (or recompute) the video/frame embeddings used by MoTIF:
python embedding.py
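For reference, a stand-in for this step might look like the sketch below. It is not the repository's embedding.py; it assumes the Hugging Face transformers CLIP API and a list of PIL frames per clip, with file paths that are purely illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one embedding per frame, shape (T, D)."""
    inputs = processor(images=frames, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb.cpu()

# Example (hypothetical layout): embed one clip's frames and store them for training.
# frames = [Image.open(p) for p in sorted(clip_dir.glob("*.jpg"))]
# torch.save(embed_frames(frames), "Embeddings/clip_0001.pt")
```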
4) Train MoTIF
MoTIF’s training entry point is:
python train_MoTIF.py
Adjust hyperparameters in the script or via CLI flags (if exposed).
5) Explore and Visualize
- Open MoTIF.ipynb to visualize concept activations, attention over time, and example predictions.
- Place model checkpoints in Models/ (see the notebook and code comments for expected paths).
Pretrained Checkpoints
Pretrained MoTIF checkpoints for all model variants are available on Hugging Face. The checkpoints include models trained on Breakfast, HMDB-51, and UCF-101 with the PE-L/14 backbone. Additional checkpoints will be uploaded soon.
To use a pre-trained checkpoint, download it from the Hugging Face repository and place it in the Models/ directory. The notebook MoTIF.ipynb will automatically load the appropriate checkpoint based on the dataset and backbone you specify.
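A minimal loading sketch, assuming the checkpoints are standard PyTorch state dicts; the filename below is hypothetical, so check the notebook for the actual naming and model construction.

```python
import torch

# Hypothetical checkpoint name; the real naming convention is shown in MoTIF.ipynb.
state = torch.load("Models/MoTIF_UCF101_PE-L14.pt", map_location="cpu")
# model = build_motif(...)        # construct the model as in MoTIF.ipynb / train_MoTIF.py
# model.load_state_dict(state)
# model.eval()
```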
Backbones and Datasets
Vision–Language Backbones
- CLIP ViT‑B/32 — Hugging Face: openai/clip-vit-base-patch32
- CLIP ViT‑B/16 — Hugging Face: openai/clip-vit-base-patch16
- CLIP ViT‑L/14 — Hugging Face: openai/clip-vit-large-patch14
- (Optional) SigLIP L/14 — Hugging Face: google/siglip-so400m-patch14-384
- Perception Encoder (PE‑L/14) — Official Repo on GitHub
Datasets
- UCF‑101 — Project page
- HMDB‑51 — Project page
- Something‑Something v2 — 20BN dataset page
- Breakfast Actions — Dataset page
Please follow each dataset’s license and terms of use.
Note: If you use other datasets, you will need to adapt the dataset logic in the code (e.g., train/val/test splits, preprocessing, and loaders). Relevant places include utils/core/data/ (e.g., data.py, preprocessor.py, dataloader.py) and any dataset‑specific handling in embedding.py and train_MoTIF.py.
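As a starting point, a custom dataset serving precomputed embedding sequences could look like the sketch below. The CSV split format and one-.pt-per-clip layout are assumptions for illustration, not the repository's actual format.

```python
from pathlib import Path
import torch
from torch.utils.data import Dataset

class EmbeddingClipDataset(Dataset):
    """Serves (embedding_sequence, label) pairs from precomputed .pt files."""

    def __init__(self, root: str, split_file: str):
        # split_file: hypothetical CSV with one "relative/path.pt,label" per line.
        self.root = Path(root)
        self.items = []
        for line in Path(split_file).read_text().splitlines():
            rel_path, label = line.rsplit(",", 1)
            self.items.append((rel_path, int(label)))

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        rel_path, label = self.items[idx]
        emb = torch.load(self.root / rel_path)  # (T, D) frame embeddings
        return emb, label
```

Because MoTIF handles sequences of arbitrary length, you will typically also need a collate_fn that pads or groups clips of different lengths when batching.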
Folder Structure
- Datasets/ — dataset placeholders
- Embeddings/ — generated embeddings (created by scripts)
- Models/ — trained model checkpoints
- Videos/ — example videos used in the paper/one‑pager
- utils/ — library code (vision encoder, projector, dataloaders, transforms, etc.)
- index.html — minimal one‑pager describing MoTIF (open locally in a browser)
- embedding.py, save_videos.py, train_MoTIF.py — main scripts
- MoTIF.ipynb — notebook for inspection and visualization
Quick Tips
- If you change the dataset or backbone, regenerate embeddings before training.
- The attention visualizations are concept‑wise and time‑wise; they should not mix information across concepts.
- GPU memory usage depends on the number of concepts and the temporal window length.
Citation
If you use MoTIF in your research, please consider citing:
@misc{knab2025conceptsmotiontemporalbottlenecks,
title={Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification},
author={Patrick Knab and Sascha Marton and Philipp J. Schubert and Drago Guggiana and Christian Bartelt},
year={2025},
eprint={2509.20899},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.20899},
}
Acknowledgements
- Parts of the utils/core codebase are adapted from the Perception Encoder framework.
- Thanks to the CORE research group at TU Clausthal and Ramblr.ai Research for support.
Contact
For questions and discussion, please open an issue or contact the authors.