---
language: ja
license: apache-2.0
tags:
- clip
- japanese
- multimodal
- image-text
- computer-vision
- natural-language-processing
datasets:
- stair-captions
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# japanese-clip-stair

A CLIP model specialized for Japanese, trained on the STAIR Captions dataset.

## Model overview

This is a multimodal model that computes the similarity between an image and a piece of text.
- Image encoder: ResNet50
- Text encoder: cl-tohoku/bert-base-japanese-v3
- Training data: STAIR Captions
- Embedding dimension: 512
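
To make "similarity between an image and text" concrete, here is a minimal, dependency-free sketch of cosine similarity in a 512-dimensional embedding space. The vectors are random stand-ins, not real model outputs, and the helper names (`l2_normalize`, `cosine_similarity`) are illustrative rather than part of this model's API. It assumes the model compares embeddings CLIP-style, which this card does not spell out.

```python
import math
import random

EMBED_DIM = 512  # embedding dimension listed in the model overview


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Stand-in embeddings (random numbers, NOT real model outputs).
rng = random.Random(0)
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
# A "matching" caption: the image embedding plus a little noise.
matching_text = [x + rng.gauss(0.0, 0.1) for x in image_embedding]
# An unrelated caption: an independent random vector.
unrelated_text = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

print(cosine_similarity(image_embedding, matching_text))   # close to 1
print(cosine_similarity(image_embedding, unrelated_text))  # close to 0
```

In practice the same operations run on tensors; the pure-Python version is only meant to show the arithmetic.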

## Required libraries

```bash
pip install torch torchvision transformers pillow requests
```

## Usage

### Basic usage

```python
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torch
from torchvision import transforms
import requests
from io import BytesIO

# Load the model and tokenizer
model = AutoModel.from_pretrained("AoiNoGeso/japanese-clip-stair", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AoiNoGeso/japanese-clip-stair")

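# Image preprocessing helper
def preprocess_image(image, size=224):
    transform = transforms.Compose([
        # Resize to a fixed square (aspect ratio is not preserved)
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        # Normalize with the standard ImageNet mean and std
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    if image.mode != 'RGB':
        image = image.convert('RGB')
    # Add a batch dimension: (3, size, size) -> (1, 3, size, size)
    return transform(image).unsqueeze(0)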

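# Prepare the image and the candidate texts
image_url = "https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
image = Image.open(BytesIO(response.content))
pixel_values = preprocess_image(image)

texts = ["犬", "猫", "象", "鳥"]  # dog, cat, elephant, bird
text_inputs = tokenizer(texts, padding=True, return_tensors="pt")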

# Run inference
with torch.no_grad():
    outputs = model(
        pixel_values=pixel_values,
        input_ids=text_inputs.input_ids,
        attention_mask=text_inputs.attention_mask
    )
    
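    # Convert the image-to-text logits into probabilities over the candidate texts
    probs = outputs['logits_per_image'].softmax(dim=-1)

    # Print the probability for each text
    for text, prob in zip(texts, probs[0]):
        print(f"{text}: {prob.item():.4f} ({prob.item() * 100:.2f}%)")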
```
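
When the candidate list is long, printing every label is less useful than showing the best few. The helper below is a small sketch of top-k ranking. `rank_labels` is a hypothetical name, and the probabilities are made up for illustration. With the model you would pass `probs[0].tolist()` from the example above.

```python
def rank_labels(labels, probs, top_k=3):
    """Return the top_k (label, probability) pairs, highest probability first."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up probabilities for illustration only (not real model output).
example_labels = ["犬", "猫", "象", "鳥"]
example_probs = [0.05, 0.10, 0.80, 0.05]

for label, prob in rank_labels(example_labels, example_probs, top_k=2):
    print(f"{label}: {prob:.2%}")
```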

### Getting image and text features separately

```python
with torch.no_grad():
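    # Get image features only
    image_features = model.get_image_features(pixel_values)

    # Get text features only
    text_features = model.get_text_features(
        text_inputs.input_ids,
        text_inputs.attention_mask
    )

    # Compute similarity manually. The result may differ from
    # logits_per_image if the model applies a learned scale in its forward pass.
    similarity = torch.matmul(image_features, text_features.T)
    probs = similarity.softmax(dim=-1)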
```
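
The manual computation above applies softmax directly to raw dot products. CLIP-style models commonly multiply similarities by a learned temperature-like scale before softmax. Whether this model does so, and with what value, is not stated here, so the `scale=100.0` below is purely hypothetical. The sketch only illustrates how much such a scale sharpens the resulting distribution.

```python
import math


def softmax(scores, scale=1.0):
    """Softmax over scores after multiplying them by a temperature-like scale."""
    scaled = [s * scale for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative cosine similarities between one image and four captions
# (made-up numbers, not real model output).
similarities = [0.31, 0.24, 0.18, 0.22]

unscaled = softmax(similarities)               # nearly uniform
scaled = softmax(similarities, scale=100.0)    # hypothetical scale; sharply peaked

print([round(p, 3) for p in unscaled])
print([round(p, 3) for p in scaled])
```

If probabilities from the manual path look much flatter than those from `logits_per_image`, a missing scale factor or missing L2 normalization is one plausible explanation to check.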

## Model performance

The model was trained on the STAIR Captions dataset and is optimized for Japanese image-caption tasks.

## Limitations

- Images are resized to 224x224
- The model is optimized for Japanese text
- PyTorch and torchvision are required
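
The `Resize((size, size))` call in `preprocess_image` stretches non-square images to 224x224 rather than preserving their aspect ratio. If that distortion matters for your inputs, one option is to crop to a centered square first. The helper below is a hypothetical sketch (`center_crop_box` is not part of this repository), and whether cropping actually helps this particular model has not been verified.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# The tuple follows PIL's (left, upper, right, lower) convention, so it could
# be passed to image.crop(box) before calling preprocess_image.
print(center_crop_box(640, 480))  # (80, 0, 560, 480)
```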

## License

Apache 2.0

## Citation

```bibtex
@dataset{stair_captions,
  title={STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset},
  author={Yoshikawa, Yuya and Shigeto, Yutaro and Takeuchi, Akikazu},
  year={2017}
}
```

## Usage examples

See `usage_example.py` for more detailed usage examples.

## Troubleshooting

### KeyError: 'japanese-clip'

If you hit this error, update Transformers to the latest version with the following command:

```bash
pip install --upgrade transformers
```

If that does not resolve it, make sure you pass the `trust_remote_code=True` parameter when loading the model:

```python
model = AutoModel.from_pretrained("AoiNoGeso/japanese-clip-stair", trust_remote_code=True)
```
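
Before and after upgrading, it can help to confirm which package versions are actually installed in the active environment. The snippet below is a small sketch using only the standard library. `installed_version` is a hypothetical helper name, and it reports versions without judging whether they are new enough.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this model card depends on.
for name in ("transformers", "torch", "torchvision"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```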