# impresso-project/impresso-ad-classification-xlm-one-class

Multilingual ad detector built on top of an XLM-R-based genre model and a lightweight binary decision layer. It combines the model's "Promotion" probability with smart chunking, pooling, and rules (prices, phones, cues, etc.) to classify any input text as AD or NOT_AD.

Optimized for FR / DE / LB, but works on "other" languages too.
## Quickstart

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,  # required to use the bundled pipeline.py
)

pipe("Appartement 3 pièces… Fr. 2'100.–, tél. 079 ...")
# [{'label': 'AD', 'score': 0.87, 'promotion_prob': 0.83, 'threshold_used': 0.80,
#   'xgenre_top_label': 'Promotion', 'xgenre_top_prob': 0.88, ...diagnostics...}]
```
Batch input:

```python
pipe(["texte A", "Text B"])  # → list of dicts in the same order
```

Dict input (with metadata):

```python
pipe([
    {"ft": "Annonce en français ...", "lg": "fr"},
    {"ft": "Mietwohnung ...", "lg": "de"},
])
```
## What this repo contains

- `model.safetensors` / `pytorch_model.bin` + `config.json` + tokenizer files
- `pipeline.py` – custom HF pipeline that turns the multi-genre model into a binary ad detector (AD / NOT_AD)
- `best_params_final.json` – default inference knobs (loaded automatically if present)
- (optional) `meta_classifier.pkl` – extra stacking model (if you provide one)
The Hugging Face inference widget doesn't run custom code; use Python as shown above or build a small Space.
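If you want a hosted demo, a Space only needs a small app that wraps the pipeline. A minimal sketch, assuming Gradio (the `app.py` layout and the plain text/JSON interface are illustrative choices, not part of this repo):

```python
# app.py – minimal Gradio wrapper around the custom pipeline (illustrative sketch)
import gradio as gr
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)

def classify(text: str) -> dict:
    # The pipeline returns a list with one result dict per input text.
    return pipe(text)[0]

gr.Interface(fn=classify, inputs="text", outputs="json").launch()
```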
## How it works (inference)

- Normalize text → optional chunking by word count
- Encode chunks → get logits from the base model
- Pool across chunks (`max` / `mean` / `logits_*` methods)
- Read the "Promotion" probability + other label probs
- Apply adaptive thresholding (language- & length-aware) and rules (prices, phones, cues, etc.)
- Optional meta-classifier stacking (if `meta_classifier.pkl` is present)
- Output: AD / NOT_AD with diagnostics (a simplified sketch of the flow follows below)
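For intuition, here is a simplified, standalone sketch of the chunk → encode → pool → read-off steps with a fixed-threshold decision. It is not the bundled `pipeline.py` (which adds adaptive thresholds, rules, and optional stacking); loading the base model via `AutoModelForSequenceClassification` and the `"Promotion"` entry in `label2id` are assumptions based on the description above.

```python
# Simplified illustration only – the real logic lives in pipeline.py.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "impresso-project/impresso-ad-classification-xlm-one-class"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

def promotion_prob(text: str, chunk_words: int = 150, pool: str = "max") -> float:
    # 1) chunk by word count
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)] or [text]
    # 2) encode chunks and get per-chunk label probabilities
    enc = tok(chunks, truncation=True, max_length=512, padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**enc).logits.softmax(-1)      # shape: (n_chunks, n_labels)
    # 3) pool across chunks
    pooled = probs.max(dim=0).values if pool == "max" else probs.mean(dim=0)
    # 4) read off the 'Promotion' class probability
    return pooled[model.config.label2id["Promotion"]].item()

# 5) a fixed threshold stands in for the adaptive, rule-aware decision of pipeline.py
label = "AD" if promotion_prob("Appartement 3 pièces… Fr. 2'100.–, tél. 079 ...") >= 0.60 else "NOT_AD"
```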
## Parameters you can tweak
All are usable as keyword args on the pipeline call:
- `ad_threshold: float` – base threshold (per-language overrides below)
- `lang_thresholds: str` – e.g. `"fr:0.58,de:0.62,lb:0.58,other:0.60"`
- `short_len: int`, `short_bonus: float` – make short texts easier to flag
- `min_words: int` – skip items with too few words
- `chunk_words: int` – 0 = no chunking; otherwise words per chunk
- `max_length: int` – tokenizer max tokens per chunk (default 512)
- `pool: str` – one of `max`, `mean`, `logits_max`, `logits_mean`, `logits_weighted`
- `temperature: float` – divides logits before softmax for calibration
- `meta_clf: str` – filename of a scikit-learn pickle in the repo (optional)
- `return_diagnostics: bool` – include rule flags & confidences (default True)
Example:

```python
pipe(
    ["Annonce FR ...", "Wohnung DE ..."],
    ad_threshold=0.60,
    lang_thresholds="fr:0.58,de:0.62,lb:0.58,other:0.60",
    chunk_words=150,
    pool="logits_weighted",
    temperature=1.0,
)
```
## JSONL I/O helper

`pipeline.py` also exposes a convenience method to mirror the CLI workflow:
```python
from transformers import pipeline

p = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)
p.predict_jsonl("input.jsonl", "results.jsonl")
```
- Input JSONL: one object per line, with at least `{ "ft": "<text>" }` (optionally `"lg": "fr|de|lb|..."`)
- Output JSONL: same rows plus `promotion_prob`, `promotion_prob_final` (as `score`), `is_ad_pred`, etc.
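A small end-to-end example of the round trip, assuming `p` from the snippet above; the field names read back are the ones documented here:

```python
import json

# Write a minimal input file, one JSON object per line.
rows = [
    {"ft": "Annonce en français ...", "lg": "fr"},
    {"ft": "Mietwohnung ...", "lg": "de"},
]
with open("input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

p.predict_jsonl("input.jsonl", "results.jsonl")

# Read predictions back: same rows plus the added fields.
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        print(rec["lg"], rec["is_ad_pred"], rec["promotion_prob"])
```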
## Output schema
Each result item is a dict:
```python
{
  "label": "AD" | "NOT_AD",
  "score": <final_prob>,                  # after rules/ensembling
  "promotion_prob": <raw model prob>,     # 'Promotion' class
  "threshold_used": <effective threshold>,
  "xgenre_top_label": <top genre>,
  "xgenre_top_prob": <top genre prob>,
  # ... diagnostics (rule flags & confidences) unless return_diagnostics=False
}
```
If `min_words` is set and the text is shorter, you'll get:

```python
{"label": "SKIPPED", "score": None, ...}
```
## Installation

```bash
pip install -U transformers huggingface_hub torch
# optional: scikit-learn (only if you use meta_classifier.pkl)
pip install -U scikit-learn
```
## Intended uses & caveats
- Detecting classified ads / promotional notices in multilingual corpora.
- Works well on FR/DE/LB; for other languages, tune `lang_thresholds` (a simple sweep is sketched below).
- Heuristic rules help recall, but may trigger on number-heavy texts (obituaries, legal notices, etc.). Review edge cases for your domain.
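If you apply the model to a language outside FR/DE/LB, a small labelled sample is enough to pick the `other` threshold empirically. A minimal sketch, where the `dev` format and the candidate grid are assumptions:

```python
# dev: list of (text, is_ad) pairs in the target language.
def sweep_other_threshold(pipe, dev, candidates=(0.50, 0.55, 0.60, 0.65, 0.70)):
    best = None
    for thr in candidates:
        preds = pipe(
            [text for text, _ in dev],
            lang_thresholds=f"fr:0.58,de:0.62,lb:0.58,other:{thr}",
        )
        # Accuracy of the binary AD / NOT_AD decision at this threshold.
        acc = sum((pred["label"] == "AD") == is_ad for pred, (_, is_ad) in zip(preds, dev)) / len(dev)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best  # (best_threshold, accuracy)
```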
## Files of interest

- `pipeline.py` – custom pipeline (loaded with `trust_remote_code=True`)
- `best_params_final.json` – default knobs; auto-loaded if present
- `meta_classifier.pkl` – optional meta-stacker
- `README.md` – this file
## TL;DR: run it

```python
from transformers import pipeline

pipe = pipeline('text-classification',
                model='impresso-project/impresso-ad-classification-xlm-one-class',
                trust_remote_code=True)
pipe("Annonce: 3 pièces à louer, Fr. 2'100.–/mois, tél. 079 ...")
```