# impresso-project/impresso-ad-classification-xlm-one-class

Multilingual ad detector built on top of an XLM-R-based genre model and a lightweight binary decision layer. It combines the model's "Promotion" probability with smart chunking, pooling, and rules (prices, phones, cues, etc.) to classify any input text as AD or NOT_AD.

Optimized for FR / DE / LB, but works on "other" languages too.
## Quickstart

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,  # required to use the bundled pipeline.py
)

pipe("Appartement 3 pièces… Fr. 2'100.–, tél. 079 ...")
# [{'label': 'AD', 'score': 0.87, 'promotion_prob': 0.83, 'threshold_used': 0.80,
#   'xgenre_top_label': 'Promotion', 'xgenre_top_prob': 0.88, ...diagnostics...}]
```
Batch input:

```python
pipe(["texte A", "Text B"])  # → list of dicts in the same order
```

Dict input (with metadata):

```python
pipe([
    {"ft": "Annonce en français ...", "lg": "fr"},
    {"ft": "Mietwohnung ...", "lg": "de"},
])
```
## What this repo contains

- `model.safetensors` / `pytorch_model.bin` + `config.json` + tokenizer files
- `pipeline.py` – custom HF pipeline that turns the multi-genre model into a binary ad detector (AD / NOT_AD)
- `best_params_final.json` – default inference knobs (loaded automatically if present)
- (optional) `meta_classifier.pkl` – extra stacking model (if you provide one)
The Hugging Face inference widget doesn't run custom code; use Python as shown above or build a small Space.
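If you want a hosted demo, a Space only needs a small app that wraps the pipeline. A minimal sketch, assuming Gradio (the `app.py` layout and the plain text/JSON interface are illustrative choices, not part of this repo):

```python
# app.py – minimal Gradio wrapper around the custom pipeline (illustrative sketch)
import gradio as gr
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)

def classify(text: str) -> dict:
    # The pipeline returns a list with one result dict per input text.
    return pipe(text)[0]

gr.Interface(fn=classify, inputs="text", outputs="json").launch()
```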
## How it works (inference)

- Normalize text → optional chunking by word count
- Encode chunks → get logits from the base model
- Pool across chunks (`max` / `mean` / `logits_*` methods)
- Read the "Promotion" probability + other label probs
- Apply adaptive thresholding (language- & length-aware) and rules (prices, phones, cues, etc.)
- Optional meta-classifier stacking (if `meta_classifier.pkl` is present)
- Output: AD / NOT_AD with diagnostics (a simplified sketch of the flow follows below)
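For intuition, here is a simplified, standalone sketch of the chunk → encode → pool → read-off steps with a fixed-threshold decision. It is not the bundled `pipeline.py` (which adds adaptive thresholds, rules, and optional stacking); loading the base model via `AutoModelForSequenceClassification` and the `"Promotion"` entry in `label2id` are assumptions based on the description above.

```python
# Simplified illustration only – the real logic lives in pipeline.py.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "impresso-project/impresso-ad-classification-xlm-one-class"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

def promotion_prob(text: str, chunk_words: int = 150, pool: str = "max") -> float:
    # 1) chunk by word count
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)] or [text]
    # 2) encode chunks and get per-chunk label probabilities
    enc = tok(chunks, truncation=True, max_length=512, padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**enc).logits.softmax(-1)      # shape: (n_chunks, n_labels)
    # 3) pool across chunks
    pooled = probs.max(dim=0).values if pool == "max" else probs.mean(dim=0)
    # 4) read off the 'Promotion' class probability
    return pooled[model.config.label2id["Promotion"]].item()

# 5) a fixed threshold stands in for the adaptive, rule-aware decision of pipeline.py
label = "AD" if promotion_prob("Appartement 3 pièces… Fr. 2'100.–, tél. 079 ...") >= 0.60 else "NOT_AD"
```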
## Parameters you can tweak
All are usable as keyword args on the pipeline call:
- `ad_threshold: float` – base threshold (per-language overrides below)
- `lang_thresholds: str` – e.g. `"fr:0.58,de:0.62,lb:0.58,other:0.60"`
- `short_len: int`, `short_bonus: float` – make short texts easier to flag
- `min_words: int` – skip items with too few words
- `chunk_words: int` – 0 = no chunking; otherwise words per chunk
- `max_length: int` – tokenizer max tokens per chunk (default 512)
- `pool: str` – one of `max`, `mean`, `logits_max`, `logits_mean`, `logits_weighted`
- `temperature: float` – divides logits before softmax for calibration
- `meta_clf: str` – filename of a scikit-learn pickle in the repo (optional)
- `return_diagnostics: bool` – include rule flags & confidences (default True)
Example:

```python
pipe(
    ["Annonce FR ...", "Wohnung DE ..."],
    ad_threshold=0.60,
    lang_thresholds="fr:0.58,de:0.62,lb:0.58,other:0.60",
    chunk_words=150,
    pool="logits_weighted",
    temperature=1.0,
)
```
## JSONL I/O helper

`pipeline.py` also exposes a convenience method to mirror the CLI workflow:
```python
from transformers import pipeline

p = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)
p.predict_jsonl("input.jsonl", "results.jsonl")
```
- Input JSONL: one object per line, with at least `{ "ft": "<text>" }` (optionally `"lg": "fr|de|lb|..."`)
- Output JSONL: same rows plus `promotion_prob`, `promotion_prob_final` (as `score`), `is_ad_pred`, etc.
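A small end-to-end example of the round trip, assuming `p` from the snippet above; the field names read back are the ones documented here:

```python
import json

# Write a minimal input file, one JSON object per line.
rows = [
    {"ft": "Annonce en français ...", "lg": "fr"},
    {"ft": "Mietwohnung ...", "lg": "de"},
]
with open("input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

p.predict_jsonl("input.jsonl", "results.jsonl")

# Read predictions back: same rows plus the added fields.
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        print(rec["lg"], rec["is_ad_pred"], rec["promotion_prob"])
```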
## Output schema
Each result item is a dict:
```python
{
  "label": "AD" | "NOT_AD",
  "score": <final_prob>,                  # after rules/ensembling
  "promotion_prob": <raw model prob>,     # 'Promotion' class
  "threshold_used": <effective threshold>,
  "xgenre_top_label": <top genre>,
  "xgenre_top_prob": <top genre prob>,
  # ... diagnostics (rule flags & confidences) unless return_diagnostics=False
}
```
If `min_words` is set and the text is shorter, you'll get:

```python
{"label": "SKIPPED", "score": None, ...}
```
## Installation

```bash
pip install -U transformers huggingface_hub torch
# optional: scikit-learn (only if you use meta_classifier.pkl)
pip install -U scikit-learn
```
## Intended uses & caveats
- Detecting classified ads / promotional notices in multilingual corpora.
- Works well on FR/DE/LB; for other languages, tune `lang_thresholds` (a simple sweep is sketched below).
- Heuristic rules help recall, but may trigger on number-heavy texts (obituaries, legal notices, etc.). Review edge cases for your domain.
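If you apply the model to a language outside FR/DE/LB, a small labelled sample is enough to pick the `other` threshold empirically. A minimal sketch, where the `dev` format and the candidate grid are assumptions:

```python
# dev: list of (text, is_ad) pairs in the target language.
def sweep_other_threshold(pipe, dev, candidates=(0.50, 0.55, 0.60, 0.65, 0.70)):
    best = None
    for thr in candidates:
        preds = pipe(
            [text for text, _ in dev],
            lang_thresholds=f"fr:0.58,de:0.62,lb:0.58,other:{thr}",
        )
        # Accuracy of the binary AD / NOT_AD decision at this threshold.
        acc = sum((pred["label"] == "AD") == is_ad for pred, (_, is_ad) in zip(preds, dev)) / len(dev)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best  # (best_threshold, accuracy)
```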
## Files of interest

- `pipeline.py` – custom pipeline (loaded with `trust_remote_code=True`)
- `best_params_final.json` – default knobs; auto-loaded if present
- `meta_classifier.pkl` – optional meta-stacker
- `README.md` – this file
## TL;DR: run it

```python
from transformers import pipeline

pipe = pipeline('text-classification',
                model='impresso-project/impresso-ad-classification-xlm-one-class',
                trust_remote_code=True)
pipe("Annonce: 3 pièces à louer, Fr. 2'100.–/mois, tél. 079 ...")
```