Overview

light-splade-japanese-28M is a Japanese SPLADE (SParse Lexical AnD Expansion) model for sparse information retrieval. This model transforms Japanese text into interpretable sparse vector representations that can be used for semantic search and document retrieval tasks.

Key Features:

  • Japanese-optimized: Trained specifically on Japanese text from the mMARCO dataset
  • Sparse retrieval: Generates interpretable sparse vectors for efficient search
  • Lightweight: Only 28M parameters for fast inference, even on CPU
  • Ready-to-use: Compatible with both the light-splade and transformers packages

Model Details

Architecture

This model is based on a compact BERT architecture optimized for Japanese text processing (the snippet after this list shows how to read these values back from the published checkpoint):

  • Model Type: SPLADE encoder for sparse retrieval
  • Architecture: BERT-based with Japanese tokenization
  • Parameters: 27,747,968 total parameters
  • Hidden Size: 384 dimensions
  • Layers: 8 transformer layers
  • Attention Heads: 8 heads per layer
  • Max Sequence Length: 2048 tokens
  • Vocabulary Size: 32,768 tokens (same as tohoku-nlp/bert-base-japanese-v3)
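
These figures can be read back from the published checkpoint with the transformers library. The following is a minimal sketch, assuming the config exposes the standard BERT attribute names:

from transformers import AutoConfig, AutoModelForMaskedLM

MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"

# Read the architecture hyperparameters from the checkpoint config.
config = AutoConfig.from_pretrained(MODEL_PATH)
print(config.hidden_size)           # 384
print(config.num_hidden_layers)     # 8
print(config.num_attention_heads)   # 8
print(config.vocab_size)            # 32768

# Count the parameters to confirm the total above.
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH)
print(sum(p.numel() for p in model.parameters()))  # 27,747,968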

Tokenization

  • Tokenizer: BertJapaneseTokenizer from Hugging Face Transformers
  • Word Segmentation: MeCab with unidic-lite dictionary via fugashi
  • Subword Tokenization: WordPiece algorithm (see the short example after this list)
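
To see this two-stage tokenization in action, the tokenizer can be loaded directly (a minimal sketch; it assumes fugashi and unidic-lite are installed, as BertJapaneseTokenizer requires):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bizreach-inc/light-splade-japanese-28M")

# MeCab (via fugashi with the unidic-lite dictionary) segments the sentence
# into words, then WordPiece splits rare words into subword units.
print(tokenizer.tokenize("日本の首都は東京です。"))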

Training

Training Data

The model was trained using the SPLADE++ algorithm on the Japanese portion of the mMARCO dataset.
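
For background, SPLADE-style training pairs a ranking loss with a FLOPS regularizer that drives most vocabulary activations to zero. The sketch below illustrates only that regularizer term as described in the SPLADE papers; it is not the light-splade training code:

import torch


def flops_regularizer(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch_size, vocab_size) batch of sparse representations.
    # Square the mean activation of each vocabulary term and sum over the
    # vocabulary, which pushes rarely useful terms toward exactly zero.
    return torch.sum(torch.mean(torch.abs(reps), dim=0) ** 2)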

Training Statistics:

  • Training queries: 384,247
  • Training documents: 7,821,973
  • Training framework: light-splade

Evaluation Data

Evaluation Statistics:

  • Evaluation queries: 1,812
  • Evaluation documents: 191,372
  • Dataset split: mMARCO Japanese dev set

Performance

The model achieves the following results on the mMARCO-ja dev set (a brief sketch of the MRR@10 arithmetic follows the table):

Metric                Value
MRR@10                0.3830
NDCG@10               0.4335
Recall@10             0.6318
Recall@100            0.8684
Recall@1000           0.9504
Avg. non-zero terms   120
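
For reference, MRR@10 is the mean of the reciprocal rank of the first relevant document within each query's top 10 results. A minimal sketch of that arithmetic (illustrative only; not the evaluation script behind the numbers above):

def mrr_at_10(first_relevant_ranks: list[int | None]) -> float:
    # first_relevant_ranks: 1-based rank of the first relevant document for
    # each query, or None if no relevant document appears in the top 10.
    reciprocals = [1.0 / r if r is not None and r <= 10 else 0.0 for r in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)

# Three queries: first relevant hit at rank 1, at rank 4, and none in the top 10.
print(mrr_at_10([1, 4, None]))  # (1.0 + 0.25 + 0.0) / 3 ≈ 0.4167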

Intended Uses

Primary Use Cases

  • Semantic Search: Convert Japanese queries and documents into sparse representations for retrieval
  • Information Retrieval: Build search systems with interpretable sparse vectors
  • Document Ranking: Rank documents based on semantic similarity to queries
  • Cross-lingual Retrieval: Leverage Japanese text understanding for search applications

Usage Example with the light-splade package

Here's how to use the model with the light-splade package:

import torch
from light_splade import SpladeEncoder

# Initialize the encoder
encoder = SpladeEncoder(model_path="bizreach-inc/light-splade-japanese-28M")

# Define the corpus and tokenize it
corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]
token_outputs = encoder.tokenizer(corpus, padding=True, return_tensors="pt")

# Generate sparse representation
with torch.inference_mode():
    sparse_vecs = encoder.get_sparse(
        input_ids=token_outputs["input_ids"],
        attention_mask=token_outputs["attention_mask"]
    )

print(sparse_vecs[0])
print(sparse_vecs[1])

Usage Example with the transformers package only

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def dense_to_sparse(dense: torch.Tensor, idx2token: dict[int, str]) -> list[dict[str, float]]:
    # Collect the row/column indices and weights of all non-zero entries.
    rows, cols = dense.nonzero(as_tuple=True)
    rows = rows.tolist()
    cols = cols.tolist()
    weights = dense[rows, cols].tolist()

    # Map each non-zero vocabulary index back to its token string.
    sparse_vecs = [{} for _ in range(dense.size(0))]
    for row, col, weight in zip(rows, cols, weights):
        sparse_vecs[row][idx2token[col]] = round(weight, 2)

    # Sort each sparse vector by descending weight for readability.
    for i in range(len(sparse_vecs)):
        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))
    return sparse_vecs


MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"
device = "cuda" if torch.cuda.is_available() else "cpu"
transformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。"
]
token_outputs = tokenizer(corpus, padding=True, return_tensors="pt")
token_outputs = {key: value.to(device) for key, value in token_outputs.items()}
attention_mask = token_outputs["attention_mask"]

with torch.inference_mode():
    outputs = transformer(**token_outputs)
    # SPLADE pooling: log-saturated ReLU of the MLM logits, masked with the
    # attention mask, then max-pooled over the sequence dimension.
    dense, _ = torch.max(
        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),
        dim=1,
    )
sparse_vecs = dense_to_sparse(dense, idx2token)

print(sparse_vecs[0])
print(sparse_vecs[1])

Example Output:

{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}
{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}
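
Retrieval with these representations reduces to a dot product over the terms shared by a query vector and a document vector. A minimal sketch using the dict format shown above (the sparse_dot helper is ours, not part of either package):

def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    # Sum the products of weights for terms present in both sparse vectors.
    return sum(weight * doc_vec[token] for token, weight in query_vec.items() if token in doc_vec)

# In practice the query is encoded exactly like the documents; here the two
# example vectors are scored against each other just to show the operation.
print(sparse_dot(sparse_vecs[0], sparse_vecs[1]))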

Limitations

  • Language Support: This model is optimized for Japanese text only
  • Domain: Performance may vary on domains significantly different from mMARCO
  • Sparse Representation: Sparse vectors may miss some semantic nuances that dense representations capture
  • Vocabulary: Limited to the 32,768 token vocabulary; out-of-vocabulary terms may not be handled optimally

Note

  • The model was trained on the open mMARCO dataset, which may contain biases present in the source material
  • Users should be aware of potential biases when deploying in production systems
  • Consider evaluation on domain-specific datasets before deployment

Citation

If you use this model in your research, please cite:

@misc{light-splade-japanese-28m,
  title={Light SPLADE Japanese 28M},
  author={Bizreach Inc.},
  year={2025},
  url={https://huggingface.co/bizreach-inc/light-splade-japanese-28M}
}

License

This model is distributed under the Apache License 2.0.
