Overview

light-splade-japanese-28M is a Japanese SPLADE (SParse Lexical AnD Expansion) model for sparse information retrieval. This model transforms Japanese text into interpretable sparse vector representations that can be used for semantic search and document retrieval tasks.

Key Features:

  • Japanese-optimized: Trained specifically on Japanese text from the mMARCO dataset
  • Sparse retrieval: Generates interpretable sparse vectors for efficient search
  • Lightweight: Only 28M parameters for fast inference, even on CPU
  • Ready-to-use: Compatible with both the light-splade and transformers packages

Model Details

Architecture

This model is based on a compact BERT architecture optimized for Japanese text processing (the snippet after this list shows how to read these values back from the published checkpoint):

  • Model Type: SPLADE encoder for sparse retrieval
  • Architecture: BERT-based with Japanese tokenization
  • Parameters: 27,747,968 total parameters
  • Hidden Size: 384 dimensions
  • Layers: 8 transformer layers
  • Attention Heads: 8 heads per layer
  • Max Sequence Length: 2048 tokens
  • Vocabulary Size: 32,768 tokens (same as tohoku-nlp/bert-base-japanese-v3)
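
These figures can be read back from the published checkpoint with the transformers library. The following is a minimal sketch, assuming the config exposes the standard BERT attribute names:

from transformers import AutoConfig, AutoModelForMaskedLM

MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"

# Read the architecture hyperparameters from the checkpoint config.
config = AutoConfig.from_pretrained(MODEL_PATH)
print(config.hidden_size)           # 384
print(config.num_hidden_layers)     # 8
print(config.num_attention_heads)   # 8
print(config.vocab_size)            # 32768

# Count the parameters to confirm the total above.
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH)
print(sum(p.numel() for p in model.parameters()))  # 27,747,968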

Tokenization

  • Tokenizer: BertJapaneseTokenizer from Hugging Face Transformers
  • Word Segmentation: MeCab with unidic-lite dictionary via fugashi
  • Subword Tokenization: WordPiece algorithm (see the short example after this list)
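
To see this two-stage tokenization in action, the tokenizer can be loaded directly (a minimal sketch; it assumes fugashi and unidic-lite are installed, as BertJapaneseTokenizer requires):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bizreach-inc/light-splade-japanese-28M")

# MeCab (via fugashi with the unidic-lite dictionary) segments the sentence
# into words, then WordPiece splits rare words into subword units.
print(tokenizer.tokenize("日本の首都は東京です。"))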

Training

Training Data

The model was trained using the SPLADE++ algorithm on the Japanese portion of the mMARCO dataset.
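
For background, SPLADE-style training pairs a ranking loss with a FLOPS regularizer that drives most vocabulary activations to zero. The sketch below illustrates only that regularizer term as described in the SPLADE papers; it is not the light-splade training code:

import torch


def flops_regularizer(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch_size, vocab_size) batch of sparse representations.
    # Square the mean activation of each vocabulary term and sum over the
    # vocabulary, which pushes rarely useful terms toward exactly zero.
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)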

Training Statistics:

  • Training queries: 384,247
  • Training documents: 7,821,973
  • Training framework: light-splade

Evaluation Data

Evaluation Statistics:

  • Evaluation queries: 1,812
  • Evaluation documents: 191,372
  • Dataset split: mMARCO Japanese dev set

Performance

The model achieves the following results on the mMARCO-ja dev set (a brief sketch of the MRR@10 arithmetic follows the table):

Metric                Value
MRR@10                0.3830
NDCG@10               0.4335
Recall@10             0.6318
Recall@100            0.8684
Recall@1000           0.9504
Avg. non-zero terms   120
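
For reference, MRR@10 is the mean of the reciprocal rank of the first relevant document within each query's top 10 results. A minimal sketch of that arithmetic (illustrative only; not the evaluation script behind the numbers above):

def mrr_at_10(first_relevant_ranks: list[int | None]) -> float:
    # first_relevant_ranks: 1-based rank of the first relevant document for
    # each query, or None if no relevant document appears in the top 10.
    reciprocals = [1.0 / r if r is not None and r <= 10 else 0.0 for r in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)

# Three queries: first relevant hit at rank 1, at rank 4, and none in the top 10.
print(mrr_at_10([1, 4, None]))  # (1.0 + 0.25 + 0.0) / 3 ≈ 0.4167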

Intended Uses

Primary Use Cases

  • Semantic Search: Convert Japanese queries and documents into sparse representations for retrieval
  • Information Retrieval: Build search systems with interpretable sparse vectors
  • Document Ranking: Rank documents based on semantic similarity to queries
  • Cross-lingual Retrieval: Leverage Japanese text understanding for search applications

Usage Example with the light-splade package

Here's how to use the model with the light-splade package:

import torch
from light_splade import SpladeEncoder

# Initialize the encoder
encoder = SpladeEncoder(model_path="bizreach-inc/light-splade-japanese-28M")

# Define the corpus and tokenize it
corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]
token_outputs = encoder.tokenizer(corpus, padding=True, return_tensors="pt")

# Generate sparse representation
with torch.inference_mode():
    sparse_vecs = encoder.get_sparse(
        input_ids=token_outputs["input_ids"],
        attention_mask=token_outputs["attention_mask"]
    )

print(sparse_vecs[0])
print(sparse_vecs[1])

Usage Example with the transformers package only

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def dense_to_sparse(dense: torch.Tensor, idx2token: dict[int, str]) -> list[dict[str, float]]:
    # Collect the row/column indices and weights of all non-zero entries.
    rows, cols = dense.nonzero(as_tuple=True)
    rows = rows.tolist()
    cols = cols.tolist()
    weights = dense[rows, cols].tolist()

    # Map each non-zero vocabulary index back to its token string.
    sparse_vecs = [{} for _ in range(dense.size(0))]
    for row, col, weight in zip(rows, cols, weights):
        sparse_vecs[row][idx2token[col]] = round(weight, 2)

    # Sort each sparse vector by descending weight for readability.
    for i in range(len(sparse_vecs)):
        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))
    return sparse_vecs


MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"
device = "cuda" if torch.cuda.is_available() else "cpu"
transformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]
token_outputs = tokenizer(corpus, padding=True, return_tensors="pt")
token_outputs = {key: value.to(device) for key, value in token_outputs.items()}
attention_mask = token_outputs["attention_mask"]

with torch.inference_mode():
    outputs = transformer(**token_outputs)
    # SPLADE pooling: log-saturated ReLU of the MLM logits, masked with the
    # attention mask, then max-pooled over the sequence dimension.
    dense, _ = torch.max(
        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),
        dim=1,
    )
sparse_vecs = dense_to_sparse(dense, idx2token)

print(sparse_vecs[0])
print(sparse_vecs[1])

Example Output:

{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}
{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}
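
Retrieval with these representations reduces to a dot product over the terms shared by a query vector and a document vector. A minimal sketch using the dict format shown above (the sparse_dot helper is ours, not part of either package):

def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    # Sum the products of weights for terms present in both sparse vectors.
    return sum(weight * doc_vec[token] for token, weight in query_vec.items() if token in doc_vec)

# In practice the query is encoded exactly like the documents; here the two
# example vectors are scored against each other just to show the operation.
print(sparse_dot(sparse_vecs[0], sparse_vecs[1]))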

Limitations

  • Language Support: This model is optimized for Japanese text only
  • Domain: Performance may vary on domains significantly different from mMARCO
  • Sparse Representation: Sparse vectors may miss some semantic nuances that dense representations capture
  • Vocabulary: Limited to the 32,768 token vocabulary; out-of-vocabulary terms may not be handled optimally

Note

  • The model was trained on the open mMARCO dataset, which may contain biases present in the source material
  • Users should be aware of potential biases when deploying in production systems
  • Consider evaluation on domain-specific datasets before deployment

Citation

If you use this model in your research, please cite:

@misc{light-splade-japanese-28m,
  title={Light SPLADE Japanese 28M},
  author={Bizreach Inc.},
  year={2025},
  url={https://huggingface.co/bizreach-inc/light-splade-japanese-28M}
}

License

This model is distributed under the Apache License 2.0.
