Patch-ioner_talk2dino_meacap_COCO_Captions - Patch-ioner Configuration

This repository contains a pre-trained MEACAP model from the Patch-ioner framework for dense image captioning and controllable visual description.

πŸ“ Paper Information

Title: "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework"
Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
arXiv: https://arxiv.org/abs/2510.02898
Project Page: https://paciosoft.com/Patch-ioner/
Code: https://github.com/Ruggero1912/Patch-ioner

🎯 Model Overview

  • Model Type: MEACAP
  • Configuration: mlp.meacap.k.yaml
  • Vision Backbone: dinov2_vitb14_reg
  • Language Model: gpt2
  • Input Resolution: 518x518
  • Prefix Size: 768
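The backbone and prefix settings above are mutually consistent: ViT-B/14 has a 768-dimensional embedding width, and a 518×518 input divides exactly into a 37×37 grid of 14-pixel patches. A quick sanity check:

```python
# Patch-grid arithmetic for dinov2_vitb14_reg at the listed 518x518 resolution
patch_size = 14
resolution = 518
embed_dim = 768  # ViT-B embedding width; matches the "Prefix Size" above

assert resolution % patch_size == 0  # 518 = 37 * 14, no cropping needed
grid = resolution // patch_size      # 37 patches per side
num_patches = grid * grid            # 1369 patch tokens (plus CLS/register tokens)
print(grid, num_patches)  # 37 1369
```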

MeaCap Configuration

  • Project Length: 10
  • Temperature: 0.01
  • Top-K: 3
  • Memory Caption Num: 5
  • VL Model: openai/clip-vit-base-patch16
  • WTE Model: sentence-transformers/all-MiniLM-L6-v2
  • Parser Checkpoint: lizhuang144/flan-t5-base-VG-factual-sg
  • Memory ID: coco_B16_t2d
  • Entity Retrieval: coco_entities
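For orientation, the settings above could be laid out in the YAML configuration roughly as follows. This is an illustrative sketch only: the field names are assumptions, and the authoritative layout is the `config.yaml` shipped in this repository.

```yaml
# Illustrative sketch -- field names are assumed; values are from this card
meacap:
  project_length: 10
  temperature: 0.01
  top_k: 3
  memory_caption_num: 5
  vl_model: openai/clip-vit-base-patch16
  wte_model: sentence-transformers/all-MiniLM-L6-v2
  parser_checkpoint: lizhuang144/flan-t5-base-VG-factual-sg
  memory_id: coco_B16_t2d
  entity_retrieval: coco_entities
```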

📊 Performance

| Task | METEOR | CIDEr | SPICE |
|------|--------|-------|-------|
| Image Captioning | 0.207 | 0.717 | 0.157 |
| Narratives | 10.000 | 27.400 | 12.700 |

Note: the Image Captioning metrics are on a 0–1 scale, while the Narratives metrics are reported on a 0–100 scale (see the detailed results below).

📈 Detailed Results

Image Captioning Results

  • METEOR: 0.2075
  • CIDEr: 0.7175
  • SPICE: 0.1573
  • BLEU_4: 0.1968
  • ROUGE_L: 0.4200
  • CLIP-S: 0.7278

Narratives Results

  • METEOR: 10.0000
  • CIDEr: 27.4000
  • SPICE: 12.7000
  • BLEU_4: 2.4000
  • ROUGE_L: 20.2000
  • CLIP-S: 67.4000
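The Narratives numbers are reported on a 0–100 scale, unlike the 0–1 Image Captioning numbers above; dividing by 100 makes the two sets directly comparable:

```python
# Narratives results as listed above (0-100 scale)
narratives = {"METEOR": 10.0, "CIDEr": 27.4, "SPICE": 12.7,
              "BLEU_4": 2.4, "ROUGE_L": 20.2, "CLIP-S": 67.4}

# Rescale to the 0-1 convention used for the Image Captioning results
rescaled = {metric: score / 100 for metric, score in narratives.items()}
print(round(rescaled["CIDEr"], 3))  # 0.274
```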

🚀 Quick Start

```python
from transformers import AutoModel
import torch
from PIL import Image

MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_meacap_COCO_Captions"

# Load the model with AutoModel from the transformers library
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

# Example image (replace with your actual image-loading logic),
# e.g. image = Image.open("path/to/your/image.jpg")
image = Image.new('RGB', (224, 224), color='red')  # placeholder image

# The exact `forward` signature depends on the model's implementation in the
# `patchioner` library. You may need to preprocess the image and provide
# additional inputs (e.g., text prompts for controllable captioning); see the
# official GitHub repository for detailed inference examples.
# If the model exposes a simplified call for basic captioning, it might look like:
# results = model(image)
# print(results)
print(f"Model {MODEL_ID} loaded successfully using `transformers.AutoModel`. "
      "Refer to the original Patch-ioner GitHub for full usage details and example inference.")
```

πŸ“ Repository Contents

  • config.yaml: Model configuration file

  • model.pt: Pre-trained model weights

  • memory_captions.json: MeaCap memory captions database

  • memory_clip_embeddings.pt: MeaCap CLIP embeddings for memory

  • memory_wte_embeddings.pt: MeaCap WTE embeddings for memory

  • README.md: This file
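The memory files above fit MeaCap's retrieve-then-filter design: an image embedding is scored against precomputed memory embeddings, and the best-matching captions seed generation (the card lists Memory Caption Num: 5). Below is a toy sketch with made-up data; the shapes, file layout, and `top_k_captions` helper are all illustrative, not the library's API:

```python
import numpy as np

# Toy stand-ins for memory_captions.json / memory_clip_embeddings.pt
rng = np.random.default_rng(0)
memory_embeddings = rng.normal(size=(100, 512)).astype(np.float32)
memory_captions = [f"caption {i}" for i in range(100)]
query = rng.normal(size=(512,)).astype(np.float32)  # image embedding

def top_k_captions(query, embeddings, captions, k=5):
    # Cosine similarity = dot product of L2-normalized vectors
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = np.argsort(e @ q)[::-1][:k]  # indices of the k highest scores
    return [captions[i] for i in idx]

print(top_k_captions(query, memory_embeddings, memory_captions, k=5))
```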

🔧 Installation

```bash
pip install git+https://github.com/Ruggero1912/Patch-ioner
```

💡 Usage Examples

Refer to the Patch-ioner repository for updated usage examples.

πŸŽ›οΈ Model Configuration

  • Prefix Size: 768
  • Memory Bank Size: 0
  • Normalization: False

📈 Training Details

  • Training Dataset: COCO Captions
  • Training Epochs: TBD
  • Batch Size: TBD
  • Learning Rate: TBD
  • Optimizer: AdamW

📚 Citation

If you use this model in your research, please cite our paper; see the Project Page for an up-to-date citation template.

🤝 Contributing

We welcome contributions to improve the Patch-ioner framework. Please see the main repository for contribution guidelines.

📄 License

See the main repository for detailed license information.

πŸ› Issues and Support

For issues related to this model or the Patch-ioner framework, please:

  1. Check the main repository for existing issues
  2. Open a new issue with detailed information about your problem
  3. Contact the authors.

🔗 Related Models

Explore other Patch-ioner model configurations:

More models are available on the Ruggero1912 Hugging Face profile.


This model is part of the Patch-ioner framework for dense image captioning and controllable visual description.
