# Patch-ioner_talk2dino_meacap_COCO_Captions

This repository contains a pre-trained MeaCap model from the Patch-ioner framework for dense image captioning and controllable visual description.

## Paper Information

- Title: "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework"
- Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
- ArXiv: https://arxiv.org/abs/2510.02898
- Project Page: https://paciosoft.com/Patch-ioner/
- Code: https://github.com/Ruggero1912/Patch-ioner

## Model Overview

- Model Type: MeaCap
- Configuration: mlp.meacap.k.yaml
- Vision Backbone: dinov2_vitb14_reg
- Language Model: gpt2
- Input Resolution: 518x518
- Prefix Size: 768
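The backbone above is the publicly released DINOv2 ViT-B/14 with registers. As a minimal sketch (not part of this checkpoint's API), it can be loaded via the official `facebookresearch/dinov2` torch.hub entry point to see the patch-token geometry implied by the 518x518 input resolution:

```python
# Minimal sketch, assuming only the public DINOv2 torch.hub entry point;
# Patch-ioner wraps this backbone internally, so this is purely illustrative.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg")
backbone.eval()

# A 518x518 input with patch size 14 gives a 37x37 grid of 1369 patch tokens,
# each of dimension 768 (matching the Prefix Size listed above).
dummy = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    feats = backbone.forward_features(dummy)
print(feats["x_norm_patchtokens"].shape)  # torch.Size([1, 1369, 768])
```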
### MeaCap Configuration
- Project Length: 10
- Temperature: 0.01
- Top-K: 3
- Memory Caption Num: 5
- VL Model: openai/clip-vit-base-patch16
- WTE Model: sentence-transformers/all-MiniLM-L6-v2
- Parser Checkpoint: lizhuang144/flan-t5-base-VG-factual-sg
- Memory ID: coco_B16_t2d
- Entity Retrieval: coco_entities
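The VL, WTE, and parser checkpoints listed above are public Hugging Face models. As a minimal sketch (how Patch-ioner actually wires them together is defined in the `patchioner` library), each identifier resolves to a standard model class:

```python
# Minimal sketch: loading the auxiliary MeaCap models by their public
# Hugging Face identifiers. This only shows that the checkpoints resolve;
# it is not the Patch-ioner initialization code.
from transformers import (
    AutoTokenizer,
    CLIPModel,
    CLIPProcessor,
    T5ForConditionalGeneration,
)
from sentence_transformers import SentenceTransformer

# Vision-language model used for retrieval/scoring
vl_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
vl_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Sentence-embedding (WTE) model for the caption memory
wte_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Factual scene-graph parser (a flan-t5-base fine-tune)
parser = T5ForConditionalGeneration.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")
parser_tokenizer = AutoTokenizer.from_pretrained("lizhuang144/flan-t5-base-VG-factual-sg")
```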
## Performance

| Task | METEOR | CIDEr | SPICE |
|---|---|---|---|
| Image Captioning (COCO) | 0.207 | 0.717 | 0.157 |
| Narratives (VIST) | 10.000 | 27.400 | 12.700 |

Note that the two rows use different scales: Image Captioning scores are on a 0-1 scale, while Narratives scores are reported ×100, matching the detailed results below.
## Detailed Results

### Image Captioning Results (COCO Captions)
- METEOR: 0.2075
- CIDEr: 0.7175
- SPICE: 0.1573
- BLEU_4: 0.1968
- ROUGE_L: 0.4200
- CLIP-S: 0.7278
### Narratives Results (Visual Storytelling Dataset, VIST)
- METEOR: 10.0000
- CIDEr: 27.4000
- SPICE: 12.7000
- BLEU_4: 2.4000
- ROUGE_L: 20.2000
- CLIP-S: 67.4000
## Quick Start

```python
from transformers import AutoModel
from PIL import Image

MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_meacap_COCO_Captions"

# Load the model with AutoModel from the transformers library
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

# Example image (replace with your actual image loading logic),
# e.g. image = Image.open("path/to/your/image.jpg")
image = Image.new("RGB", (224, 224), color="red")  # placeholder image

# The specific `forward` signature depends on the model's implementation in
# the `patchioner` library: you may need to preprocess the image and provide
# additional inputs (e.g., text prompts for controllable captioning). Refer
# to the official GitHub repository for detailed inference examples. If the
# model exposes a simplified call for basic captioning, it might look like:
#   results = model(image)
#   print(results)

print(f"Model {MODEL_ID} loaded successfully via transformers.AutoModel. "
      "Refer to the Patch-ioner GitHub repository for full usage details "
      "and example inference.")
```
## Repository Contents

- `config.yaml`: Model configuration file
- `model.pt`: Pre-trained model weights
- `memory_captions.json`: MeaCap memory captions database
- `memory_clip_embeddings.pt`: MeaCap CLIP embeddings for the memory
- `memory_wte_embeddings.pt`: MeaCap WTE embeddings for the memory
- `README.md`: This file
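These files can also be fetched individually with the standard `huggingface_hub` client; a minimal sketch, with file names taken from the list above:

```python
# Minimal sketch: downloading individual files from this repository with
# huggingface_hub. File names match the repository contents listed above.
from huggingface_hub import hf_hub_download

repo_id = "Ruggero1912/Patch-ioner_talk2dino_meacap_COCO_Captions"

config_path = hf_hub_download(repo_id=repo_id, filename="config.yaml")
weights_path = hf_hub_download(repo_id=repo_id, filename="model.pt")
memory_path = hf_hub_download(repo_id=repo_id, filename="memory_captions.json")

print(config_path, weights_path, memory_path)
```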
## Installation

```bash
pip install git+https://github.com/Ruggero1912/Patch-ioner
```
## Usage Examples
Refer to the Patch-ioner repository for updated usage examples.
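In the meantime, a minimal preprocessing sketch may be useful. The resize target follows the 518x518 input resolution listed above; the normalization statistics are the standard ImageNet values commonly used with DINOv2 backbones, an assumption here rather than a documented requirement of this checkpoint:

```python
# Minimal sketch, assuming standard ImageNet normalization; check the
# Patch-ioner repository for the preprocessing this checkpoint expects.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),  # matches the model's input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 518, 518)
```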
## Model Configuration
- Prefix Size: 768
- Memory Bank Size: 0
- Normalization: False
## Training Details
- Training Dataset: COCO Captions
- Training Epochs: TBD
- Batch Size: TBD
- Learning Rate: TBD
- Optimizer: AdamW
## Citation

If you use this model in your research, please cite our paper; refer to the Project Page for an up-to-date citation template.
## Contributing
We welcome contributions to improve the Patch-ioner framework. Please see the main repository for contribution guidelines.
## License
See the main repository for detailed license information.
## Issues and Support
For issues related to this model or the Patch-ioner framework, please:
- Check the main repository for existing issues
- Open a new issue with detailed information about your problem
- Contact the authors.
## Related Models
Explore other Patch-ioner model configurations:
- Patch-ioner_mlp - MLP-based DeCap model
- Patch-ioner_viecap - VieCap controllable captioning
- Patch-ioner_clipcap - ClipCap integration
More models are available under the Ruggero1912 profile on Hugging Face.
This model is part of the Patch-ioner framework for dense image captioning and controllable visual description.