Overview
light-splade-japanese-28M is a Japanese SPLADE (SParse Lexical AnD Expansion) model for sparse information retrieval. This model transforms Japanese text into interpretable sparse vector representations that can be used for semantic search and document retrieval tasks.
Key Features:
- Japanese-optimized: Specifically trained on Japanese text using mMARCO dataset
- Sparse retrieval: Generates interpretable sparse vectors for efficient search
- Lightweight: Only 28M parameters, enabling fast inference even on CPU
- Ready-to-use: Compatible with the `light-splade` and `transformers` packages
Model Details
Architecture
This model is based on a compact BERT architecture optimized for Japanese text processing:
- Model Type: SPLADE encoder for sparse retrieval
- Architecture: BERT-based with Japanese tokenization
- Parameters: 27,747,968 total parameters
- Hidden Size: 384 dimensions
- Layers: 8 transformer layers
- Attention Heads: 8 heads per layer
- Max Sequence Length: 2048 tokens
- Vocabulary Size: 32,768 tokens (same as tohoku-nlp/bert-base-japanese-v3)
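The figures above can be checked directly from the published configuration. Below is a minimal sketch using `transformers.AutoConfig`; the attribute names assume a standard BERT-style config and should be verified against the model's `config.json`.

```python
from transformers import AutoConfig

# Load the model configuration from the Hub and print the main dimensions.
# Attribute names assume a standard BERT-style config.
config = AutoConfig.from_pretrained("bizreach-inc/light-splade-japanese-28M")

print(config.hidden_size)               # expected: 384
print(config.num_hidden_layers)         # expected: 8
print(config.num_attention_heads)       # expected: 8
print(config.vocab_size)                # expected: 32768
print(config.max_position_embeddings)   # expected: 2048 (max sequence length)
```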
Tokenization
- Tokenizer: `BertJapaneseTokenizer` from Hugging Face Transformers
- Word Segmentation: MeCab with the unidic-lite dictionary via fugashi
- Subword Tokenization: WordPiece algorithm
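As a quick illustration of this pipeline (MeCab word segmentation followed by WordPiece subwords), the sketch below loads the tokenizer via `AutoTokenizer`; it assumes `fugashi` and `unidic-lite` are installed, since `BertJapaneseTokenizer` needs them for MeCab segmentation.

```python
from transformers import AutoTokenizer

# Loads BertJapaneseTokenizer under the hood; requires fugashi and unidic-lite
# for MeCab-based word segmentation.
tokenizer = AutoTokenizer.from_pretrained("bizreach-inc/light-splade-japanese-28M")

tokens = tokenizer.tokenize("日本の首都は東京です。")
print(tokens)  # MeCab word segmentation followed by WordPiece subwords

ids = tokenizer("日本の首都は東京です。")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # includes [CLS] / [SEP] special tokens
```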
Training
Training Data
The model was trained using the SPLADE++ algorithm on the Japanese portion of the mMARCO dataset.
Training Statistics:
- Training queries: 384,247
- Training documents: 7,821,973
- Training framework: light-splade
Evaluation Data
Evaluation Statistics:
- Evaluation queries: 1,812
- Evaluation documents: 191,372
- Dataset split: mMARCO Japanese dev set
Performance
The model achieves the following results on the mMARCO-ja dev set:
| Metric | Value |
|---|---|
| MRR@10 | 0.3830 |
| NDCG@10 | 0.4335 |
| Recall@10 | 0.6318 |
| Recall@100 | 0.8684 |
| Recall@1000 | 0.9504 |
| Avg. non-zero terms | 120 |
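For reference, the ranking metrics reported above can be computed as follows. This is a minimal, self-contained sketch over toy data; the ranked lists and relevance judgments are illustrative, not from mMARCO.

```python
def mrr_at_k(ranked_ids: list[list[str]], relevant_ids: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


def recall_at_k(ranked_ids: list[list[str]], relevant_ids: list[set[str]], k: int = 10) -> float:
    """Fraction of relevant documents retrieved within the top k, averaged over queries."""
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        total += len(set(ranking[:k]) & relevant) / len(relevant)
    return total / len(ranked_ids)


# Toy example: two queries with one relevant document each.
ranked = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
relevant = [{"d1"}, {"d9"}]
print(mrr_at_k(ranked, relevant, k=10))     # (1/2 + 1/3) / 2 ≈ 0.42
print(recall_at_k(ranked, relevant, k=10))  # both relevant docs retrieved -> 1.0
```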
Intended Uses
Primary Use Cases
- Semantic Search: Convert Japanese queries and documents into sparse representations for retrieval
- Information Retrieval: Build search systems with interpretable sparse vectors
- Document Ranking: Rank documents based on semantic similarity to queries
- Cross-lingual Retrieval: Leverage Japanese text understanding for search applications
Usage Example with the `light-splade` package
Here's how to use the model with the `light-splade` package:
```python
import torch
from light_splade import SpladeEncoder

# Initialize the encoder
encoder = SpladeEncoder(model_path="bizreach-inc/light-splade-japanese-28M")

# Tokenize input text
corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。",
]
token_outputs = encoder.tokenizer(corpus, padding=True, return_tensors="pt")

# Generate sparse representations
with torch.inference_mode():
    sparse_vecs = encoder.get_sparse(
        input_ids=token_outputs["input_ids"],
        attention_mask=token_outputs["attention_mask"],
    )

print(sparse_vecs[0])
print(sparse_vecs[1])
```
Usage Example with the `transformers` package only
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def dense_to_sparse(dense: torch.Tensor, idx2token: dict[int, str]) -> list[dict[str, float]]:
    """Convert a batch of dense vocabulary-sized vectors into token -> weight dicts."""
    rows, cols = dense.nonzero(as_tuple=True)
    rows = rows.tolist()
    cols = cols.tolist()
    weights = dense[rows, cols].tolist()
    sparse_vecs = [{} for _ in range(dense.size(0))]
    for row, col, weight in zip(rows, cols, weights):
        sparse_vecs[row][idx2token[col]] = round(weight, 2)
    # Sort each vector by weight in descending order for readability
    for i in range(len(sparse_vecs)):
        sparse_vecs[i] = dict(sorted(sparse_vecs[i].items(), key=lambda x: x[1], reverse=True))
    return sparse_vecs


MODEL_PATH = "bizreach-inc/light-splade-japanese-28M"
device = "cuda" if torch.cuda.is_available() else "cpu"

transformer = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

corpus = [
    "日本の首都は東京です。",
    "大阪万博は2025年に開催されます。",
]

token_outputs = tokenizer(corpus, padding=True, return_tensors="pt")
attention_mask = token_outputs["attention_mask"].to(device)
token_outputs = {key: value.to(device) for key, value in token_outputs.items()}

with torch.inference_mode():
    outputs = transformer(**token_outputs)
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention, max over the sequence
    dense, _ = torch.max(
        torch.log(1 + torch.relu(outputs.logits)) * attention_mask.unsqueeze(-1),
        dim=1,
    )

sparse_vecs = dense_to_sparse(dense, idx2token)
print(sparse_vecs[0])
print(sparse_vecs[1])
```
Example Output:
```
{'首都': 1.83, '日本': 1.82, '東京': 1.78, '中立': 0.73, '都会': 0.69, '駒': 0.68, '州都': 0.67, '首相': 0.64, '足立': 0.62, 'です': 0.61, '都市': 0.54, 'ユニ': 0.54, '京都': 0.52, '国': 0.51, '発表': 0.49, '成田': 0.48, '太陽': 0.45, '藤原': 0.45, '私立': 0.42, '王国': 0.4...}
{'202': 1.61, '開催': 1.49, '大阪': 1.34, '万博': 1.19, '東京': 1.15, '年': 1.1, 'いつ': 1.05, '##5': 1.03, '203': 0.86, '月': 0.8, '期間': 0.79, '高槻': 0.79, '京都': 0.7, '神戸': 0.62, '2024': 0.54, '夢': 0.52, '206': 0.52, '姫路': 0.51, '行わ': 0.49, 'こう': 0.49, '芸術': 0.48...}
```
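These token-to-weight dictionaries can be scored with a simple dot product over shared tokens, which is how a query is typically matched against indexed documents. The helper below is an illustrative sketch, not part of the `light-splade` API; the query weights are made up for the example, while `sparse_vecs` refers to the document vectors produced in the transformers-only example above.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product over the tokens shared by two sparse vectors."""
    return sum(weight * doc_vec.get(token, 0.0) for token, weight in query_vec.items())


# Hypothetical encoded query ("日本の首都はどこ?") scored against the two document vectors above.
query_vec = {"首都": 1.9, "日本": 1.7, "どこ": 1.1}  # illustrative weights, not model output
scores = sorted(
    ((i, sparse_dot(query_vec, doc_vec)) for i, doc_vec in enumerate(sparse_vecs)),
    key=lambda x: x[1],
    reverse=True,
)
print(scores)  # the first document ("日本の首都は東京です。") should score highest
```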
Limitations
- Language Support: This model is optimized for Japanese text only
- Domain: Performance may vary on domains significantly different from mMARCO
- Sparse Representation: The model produces sparse vectors which may not capture all semantic nuances compared to dense representations
- Vocabulary: Limited to the 32,768 token vocabulary; out-of-vocabulary terms may not be handled optimally
Note
- The model was trained on the open mMARCO dataset, which may contain biases present in the source material
- Users should be aware of potential biases when deploying in production systems
- Consider evaluation on domain-specific datasets before deployment
Citation
If you use this model in your research, please cite:
```bibtex
@misc{light-splade-japanese-28m,
  title={Light SPLADE Japanese 28M},
  author={Bizreach Inc.},
  year={2025},
  url={https://huggingface.co/bizreach-inc/light-splade-japanese-28M}
}
```
License
This model is distributed under the Apache License 2.0.