
impresso-project / impresso-ad-classification-xlm-one-class

Multilingual ad detector built on top of an XLM-R-based genre model and a lightweight binary decision layer. It combines the model's "Promotion" probability with smart chunking, pooling, and rules (prices, phones, cues, etc.) to classify any input text as AD or NOT_AD.

Optimized for FR / DE / LB, but works on "other" languages too.


🔌 Quickstart

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,   # required to use the bundled pipeline.py
)

pipe("Appartement 3 pièces… Fr. 2'100.–, tél. 079 ...")
# [{'label': 'AD', 'score': 0.87, 'promotion_prob': 0.83, 'threshold_used': 0.80,
#   'xgenre_top_label': 'Promotion', 'xgenre_top_prob': 0.88, ...diagnostics...}]
```

Batch input

```python
pipe(["texte A", "Text B"])  # → list of dicts in the same order
```

Dict input (with metadata)

```python
pipe([
  {"ft": "Annonce en français ...", "lg": "fr"},
  {"ft": "Mietwohnung ...", "lg": "de"},
])
```

📦 What this repo contains

  • model.safetensors / pytorch_model.bin + config.json + tokenizer files
  • pipeline.py — custom HF pipeline that turns the multi-genre model into a binary ad detector (AD / NOT_AD)
  • best_params_final.json — default inference knobs (loaded automatically if present)
  • (optional) meta_classifier.pkl — extra stacking model (if you provide one)

The Hugging Face inference widget doesn't run custom code; use Python as shown above, or build a small Space.


🧠 How it works (inference)

  1. Normalize text → optional chunking by word count
  2. Encode chunks → get logits from the base model
  3. Pool across chunks (max / mean / logits_* methods)
  4. Read the "Promotion" probability + other label probs
  5. Apply adaptive thresholding (language- and length-aware) and rules (prices, phones, cues, etc.)
  6. Optional meta-classifier stacking (if meta_classifier.pkl is present)
  7. Output: AD / NOT_AD with diagnostics
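The chunk / pool / decide core of the steps above can be sketched in plain Python. This is an illustrative simplification, not the code in pipeline.py: the helper names are made up, and the real pipeline adds logits-level pooling, temperature scaling, and the rule layer.

```python
import re


def chunk_by_words(text: str, chunk_words: int) -> list:
    """Step 1: normalize whitespace, then split into word-count chunks (0 = no chunking)."""
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    if chunk_words <= 0 or len(words) <= chunk_words:
        return [" ".join(words)]
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def pool_chunk_probs(chunk_probs: list, pool: str = "max") -> list:
    """Step 3: pool per-chunk class probabilities across chunks (probability-space only)."""
    if pool == "max":
        return [max(col) for col in zip(*chunk_probs)]
    if pool == "mean":
        return [sum(col) / len(col) for col in zip(*chunk_probs)]
    raise ValueError(f"unknown pool: {pool}")


def decide(promotion_prob: float, threshold: float) -> str:
    """Step 5, minus the rules: binary decision on the pooled 'Promotion' probability."""
    return "AD" if promotion_prob >= threshold else "NOT_AD"
```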

βš™οΈ Parameters you can tweak

All are usable as keyword args on the pipeline call:

  • ad_threshold: float β€” base threshold (per-language overrides below)
  • lang_thresholds: str β€” e.g. "fr:0.58,de:0.62,lb:0.58,other:0.60"
  • short_len: int, short_bonus: float β€” make short texts easier to flag
  • min_words: int β€” skip items with too few words
  • chunk_words: int β€” 0 = no chunking; else words per chunk
  • max_length: int β€” tokenizer max tokens per chunk (default 512)
  • pool: str β€” one of max, mean, logits_max, logits_mean, logits_weighted
  • temperature: float β€” divides logits before softmax for calibration
  • meta_clf: str β€” filename of a scikit-learn pickle in the repo (optional)
  • return_diagnostics: bool β€” include rule flags & confidences (default True)

Example

```python
pipe(
    ["Annonce FR ...", "Wohnung DE ..."],
    ad_threshold=0.60,
    lang_thresholds="fr:0.58,de:0.62,lb:0.58,other:0.60",
    chunk_words=150,
    pool="logits_weighted",
    temperature=1.0,
)
```
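How ad_threshold, lang_thresholds, short_len, and short_bonus might combine can be sketched as follows. The helper names and exact combination logic are assumptions for illustration; the authoritative behavior is in pipeline.py and best_params_final.json.

```python
def parse_lang_thresholds(spec: str) -> dict:
    """Parse a 'fr:0.58,de:0.62,...' spec into a language -> threshold map."""
    out = {}
    for pair in spec.split(","):
        lang, value = pair.split(":")
        out[lang.strip()] = float(value.strip())
    return out


def effective_threshold(lang, n_words, base, lang_thresholds,
                        short_len=0, short_bonus=0.0):
    """Language- and length-aware threshold: per-language override,
    lowered by short_bonus for texts up to short_len words."""
    t = lang_thresholds.get(lang, lang_thresholds.get("other", base))
    if short_len and n_words <= short_len:
        t -= short_bonus  # short texts become easier to flag
    return t
```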

📄 JSONL I/O helper

pipeline.py also exposes a convenience method to mirror the CLI workflow:

```python
from transformers import pipeline

p = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)
p.predict_jsonl("input.jsonl", "results.jsonl")
```

Input JSONL: one object per line, with at least { "ft": "<text>" } (optionally "lg": "fr|de|lb|...").
Output JSONL: the same rows plus promotion_prob, promotion_prob_final (as score), is_ad_pred, etc.
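If you are preparing input files by hand, a minimal model-free sketch of the expected JSONL shape (one object per line, "ft" required, "lg" optional) looks like this:

```python
import json


def write_jsonl(path, rows):
    """Write one JSON object per line, e.g. {"ft": "<text>", "lg": "fr"}."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```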


πŸ” Output schema

Each result item is a dict:

{
  "label": "AD" | "NOT_AD",
  "score": <final_prob>,              # after rules/ensembling
  "promotion_prob": <raw model prob>, # 'Promotion' class
  "threshold_used": <effective threshold>,
  "xgenre_top_label": <top genre>,
  "xgenre_top_prob": <top genre prob>,
  # ... diagnostics (rule flags & confidences) unless return_diagnostics=False
}

If min_words is set and the text is shorter, you’ll get:

{"label": "SKIPPED", "score": None, ...}
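Downstream code should therefore expect three label values, not two. A small sketch (the helper name is ours, not part of the repo) that partitions results accordingly:

```python
def split_results(results):
    """Partition pipeline output dicts into ads, non-ads, and skipped items."""
    ads = [r for r in results if r["label"] == "AD"]
    not_ads = [r for r in results if r["label"] == "NOT_AD"]
    skipped = [r for r in results if r["label"] == "SKIPPED"]
    return ads, not_ads, skipped
```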

🛠 Installation

```bash
pip install -U transformers huggingface_hub torch
# optional: scikit-learn (only if you use meta_classifier.pkl)
pip install -U scikit-learn
```

✅ Intended uses & caveats

  • Detecting classified ads / promotional notices in multilingual corpora.
  • Works well on FR/DE/LB; for other languages, tune lang_thresholds.
  • Heuristic rules help recall, but may trigger on number-heavy texts (obituaries, legal notices, etc.). Review edge cases for your domain.
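To illustrate why number-heavy texts can trip the rules: cues of the price/phone kind boil down to regexes like the ones below. These patterns are purely illustrative (the actual rules live in pipeline.py), and an obituary or legal notice containing a phone number would flag just as readily as an ad.

```python
import re

# Illustrative cue patterns, NOT the ones shipped in pipeline.py:
PRICE_RE = re.compile(r"(?:Fr\.|CHF|EUR)\s?\d[\d'.,]*")          # e.g. "Fr. 2'100.-"
PHONE_RE = re.compile(r"\b0\d{2}[\s./-]?\d{3}[\s./-]?\d{2}[\s./-]?\d{2}\b")  # Swiss-style


def cue_flags(text: str) -> dict:
    """Return which surface cues fired for a given text."""
    return {
        "has_price": bool(PRICE_RE.search(text)),
        "has_phone": bool(PHONE_RE.search(text)),
    }
```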

πŸ“ Files of interest

  • pipeline.py β€” custom pipeline (loaded with trust_remote_code=True)
  • best_params_final.json β€” default knobs; auto-loaded if present
  • meta_classifier.pkl β€” optional meta-stacker
  • README.md β€” this file

🏁 TL;DR: run it

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="impresso-project/impresso-ad-classification-xlm-one-class",
    trust_remote_code=True,
)

pipe("Annonce: 3 pièces à louer, Fr. 2'100.–/mois, tél. 079 ...")
```