This repository contains the models described in the following paper:

Orhan AE (2024) HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data. arXiv:2407.18067.

These models were pretrained with the spatiotemporal MAE algorithm on ~5k hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings) and then, optionally, finetuned on various downstream tasks with few-shot supervised training. Please note that this repository only stores the pretrained model checkpoints. To load and use the models, see the associated GitHub repository, which contains detailed instructions.
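For readers unfamiliar with spatiotemporal MAE pretraining, the core idea is to split each video clip into spacetime patches, hide a large random subset of them, and train the model to reconstruct the hidden patches from the visible ones. The sketch below illustrates only the random masking step in plain Python. It is not code from the associated GitHub repository: the function name `random_spacetime_mask`, the clip size, the patch size, and the 0.9 masking ratio are all illustrative assumptions and may differ from the settings used to train these models.

```python
import random


def random_spacetime_mask(num_frames, height, width,
                          patch=(2, 16, 16), mask_ratio=0.9, seed=0):
    """Randomly split the spacetime patch indices of a clip into visible and masked sets.

    All defaults are illustrative assumptions, not the settings of the released models.
    """
    pt, ph, pw = patch
    if num_frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")

    # Number of patches along time, height, and width.
    grid = (num_frames // pt, height // ph, width // pw)
    num_tokens = grid[0] * grid[1] * grid[2]
    num_masked = round(num_tokens * mask_ratio)

    # Shuffle all patch indices and hide the first `num_masked` of them.
    ids = list(range(num_tokens))
    random.Random(seed).shuffle(ids)
    masked_ids = sorted(ids[:num_masked])
    visible_ids = sorted(ids[num_masked:])
    return visible_ids, masked_ids, grid


if __name__ == "__main__":
    visible, masked, grid = random_spacetime_mask(16, 224, 224)
    print("patch grid (t, h, w):", grid)
    print("visible patches:", len(visible), "masked patches:", len(masked))
```

In an MAE-style setup, only the visible patches would be passed to the encoder, which is what makes a high masking ratio computationally cheap.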
