# Patch-ioner_talk2dino_meacap_COCO_Captions

This repository contains a pre-trained MeaCap model from the Patch-ioner framework for dense image captioning and controllable visual description.

## Paper Information

- Title: "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework"
- Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
- ArXiv: https://arxiv.org/abs/2510.02898
- Project Page: https://paciosoft.com/Patch-ioner/
- Code: https://github.com/Ruggero1912/Patch-ioner

## Model Overview

- Model Type: MeaCap
- Configuration: mlp.meacap.k.yaml
- Vision Backbone: dinov2_vitb14_reg
- Language Model: gpt2
- Input Resolution: 518x518
- Prefix Size: 768
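The backbone above is the publicly released DINOv2 ViT-B/14 with registers. As a minimal sketch (not part of this checkpoint's API), it can be loaded via the official `facebookresearch/dinov2` torch.hub entry point to see the patch-token geometry implied by the 518x518 input resolution:

```python
# Minimal sketch, assuming only the public DINOv2 torch.hub entry point;
# Patch-ioner wraps this backbone internally, so this is purely illustrative.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg")
backbone.eval()

# A 518x518 input with patch size 14 gives a 37x37 grid of 1369 patch tokens,
# each of dimension 768 (matching the Prefix Size listed above).
dummy = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    feats = backbone.forward_features(dummy)
print(feats["x_norm_patchtokens"].shape)  # torch.Size([1, 1369, 768])
```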
### MeaCap Configuration
- Project Length: 10
- Temperature: 0.01
- Top-K: 3
- Memory Caption Num: 5
- VL Model: openai/clip-vit-base-patch16
- WTE Model: sentence-transformers/all-MiniLM-L6-v2
- Parser Checkpoint: lizhuang144/flan-t5-base-VG-factual-sg
- Memory ID: coco_B16_t2d
- Entity Retrieval: coco_entities
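The VL, WTE, and parser checkpoints listed above are public Hugging Face models. As a minimal sketch (how Patch-ioner actually wires them together is defined in the `patchioner` library), each identifier resolves to a standard model class:

```python
# Minimal sketch: loading the auxiliary MeaCap models by their public
# Hugging Face identifiers. This only shows that the checkpoints resolve;
# it is not the Patch-ioner initialization code.
from transformers import (
    AutoTokenizer,
    CLIPModel,
    CLIPProcessor,
    T5ForConditionalGeneration,
)
from sentence_transformers import SentenceTransformer

# Vision-language model used for retrieval/scoring
vl_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
vl_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Sentence-embedding (WTE) model for the caption memory
wte_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Factual scene-graph parser (a flan-t5-base fine-tune)
parser = T5ForConditionalGeneration.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")
parser_tokenizer = AutoTokenizer.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")
```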
## Performance

| Task | METEOR | CIDEr | SPICE |
|---|---|---|---|
| Image Captioning (COCO) | 0.207 | 0.717 | 0.157 |
| Narratives (VIST) | 10.000 | 27.400 | 12.700 |

Note that the two rows use different scales: Image Captioning scores are on a 0-1 scale, while Narratives scores are reported ×100, matching the detailed results below.
## Detailed Results

### Image Captioning Results (COCO Captions)
- METEOR: 0.2075
- CIDEr: 0.7175
- SPICE: 0.1573
- BLEU_4: 0.1968
- ROUGE_L: 0.4200
- CLIP-S: 0.7278
### Narratives Results (Visual Storytelling Dataset, VIST)
- METEOR: 10.0000
- CIDEr: 27.4000
- SPICE: 12.7000
- BLEU_4: 2.4000
- ROUGE_L: 20.2000
- CLIP-S: 67.4000
## Quick Start

```python
from transformers import AutoModel
from PIL import Image

MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_meacap_COCO_Captions"

# Load the model with AutoModel from the transformers library
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

# Example image (replace with your actual image loading logic),
# e.g. image = Image.open("path/to/your/image.jpg")
image = Image.new("RGB", (224, 224), color="red")  # placeholder image

# The specific `forward` signature depends on the model's implementation in
# the `patchioner` library: you may need to preprocess the image and provide
# additional inputs (e.g., text prompts for controllable captioning). Refer
# to the official GitHub repository for detailed inference examples. If the
# model exposes a simplified call for basic captioning, it might look like:
#   results = model(image)
#   print(results)

print(f"Model {MODEL_ID} loaded successfully via transformers.AutoModel. "
      "Refer to the Patch-ioner GitHub repository for full usage details "
      "and example inference.")
```
## Repository Contents

- `config.yaml`: Model configuration file
- `model.pt`: Pre-trained model weights
- `memory_captions.json`: MeaCap memory captions database
- `memory_clip_embeddings.pt`: MeaCap CLIP embeddings for the memory
- `memory_wte_embeddings.pt`: MeaCap WTE embeddings for the memory
- `README.md`: This file
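These files can also be fetched individually with the standard `huggingface_hub` client; a minimal sketch, with file names taken from the list above:

```python
# Minimal sketch: downloading individual files from this repository with
# huggingface_hub. File names match the repository contents listed above.
from huggingface_hub import hf_hub_download

repo_id = "Ruggero1912/Patch-ioner_talk2dino_meacap_COCO_Captions"

config_path = hf_hub_download(repo_id=repo_id, filename="config.yaml")
weights_path = hf_hub_download(repo_id=repo_id, filename="model.pt")
memory_path = hf_hub_download(repo_id=repo_id, filename="memory_captions.json")

print(config_path, weights_path, memory_path)
```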
## Installation

```bash
pip install git+https://github.com/Ruggero1912/Patch-ioner
```
## Usage Examples
Refer to the Patch-ioner repository for updated usage examples.
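In the meantime, a minimal preprocessing sketch may be useful. The resize target follows the 518x518 input resolution listed above; the normalization statistics are the standard ImageNet values commonly used with DINOv2 backbones, an assumption here rather than a documented requirement of this checkpoint:

```python
# Minimal sketch, assuming standard ImageNet normalization; check the
# Patch-ioner repository for the preprocessing this checkpoint expects.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),  # matches the model's input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 518, 518)
```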
## Model Configuration
- Prefix Size: 768
- Memory Bank Size: 0
- Normalization: False
## Training Details
- Training Dataset: COCO Captions
- Training Epochs: TBD
- Batch Size: TBD
- Learning Rate: TBD
- Optimizer: AdamW
## Citation

If you use this model in your research, please cite our paper; refer to the Project Page for an up-to-date citation template.
## Contributing
We welcome contributions to improve the Patch-ioner framework. Please see the main repository for contribution guidelines.
## License
See the main repository for detailed license information.
## Issues and Support
For issues related to this model or the Patch-ioner framework, please:
- Check the main repository for existing issues
- Open a new issue with detailed information about your problem
- Contact the authors.
## Related Models
Explore other Patch-ioner model configurations:
- Patch-ioner_mlp - MLP-based DeCap model
- Patch-ioner_viecap - VieCap controllable captioning
- Patch-ioner_clipcap - ClipCap integration
More models are available under the Ruggero1912 profile on Hugging Face.
This model is part of the Patch-ioner framework for dense image captioning and controllable visual description.