Farsisham 🇮🇷: Part-of-Speech (POS) Model for the Persian Language
🤗 Quick Start
Load and use the pre-trained POS tagger from Hugging Face:
from farsisham.pos_tagger import POSTagger
# Load the model
tagger = POSTagger.from_pretrained("AmirHossein1455/farsisham")
# Tag a Persian sentence
text = "سلام من امیرحسین هستم"
tagged_sentence = tagger.tag_sentence(text)
print(tagged_sentence)
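Once a sentence is tagged, a common next step is to summarize or filter the tags. The sketch below assumes, without confirmation from the library, that `tag_sentence` returns a list of `(token, tag)` pairs. The sample list and the helpers `count_tags` and `tokens_with_tag` are hypothetical stand-ins, so the snippet runs without the model installed.

```python
from collections import Counter

# Hypothetical output shape: a list of (token, tag) pairs.
# The tags below are illustrative placeholders, not real model output.
sample_tagged = [("سلام", "N"), ("من", "PRO"), ("امیرحسین", "N"), ("هستم", "V")]


def count_tags(tagged):
    """Count how often each POS tag occurs in a tagged sentence."""
    return Counter(tag for _token, tag in tagged)


def tokens_with_tag(tagged, wanted):
    """Return the tokens that carry the requested tag, in sentence order."""
    return [token for token, tag in tagged if tag == wanted]


print(count_tags(sample_tagged))
print(tokens_with_tag(sample_tagged, "V"))
```

If the real return type differs (for example a dictionary or a formatted string), the helpers would need to be adapted accordingly.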
Alternatively, download the model manually from the Hugging Face Model Hub and load it locally.
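One way to do the manual download is with the `huggingface_hub` package, sketched below. The helper name `download_model` and the target directory are hypothetical. Whether `POSTagger.from_pretrained` accepts the resulting local directory is an assumption that the project does not document here.

```python
def download_model(repo_id="AmirHossein1455/farsisham", local_dir="farsisham_model"):
    """Download every file in the model repository and return the local path.

    `local_dir` is a hypothetical target folder chosen for this sketch.
    """
    # Imported lazily so this module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    path = download_model()
    print(f"Model files downloaded to: {path}")
```

The returned path could then be passed to the tagger in place of the repository name, provided the library supports loading from a local directory.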
🧠 Training a Custom Model
Train your own POS tagger using a custom corpus:
from farsisham.pos_tagger import POSTagger
tagger = POSTagger()
tagger.train("path/to/your/corpus.txt")
📊 Model Evaluation
The POS tagger was evaluated on a test set of 71 samples. Key performance metrics:
Overall Accuracy: 90%
Macro Average: Precision: 0.81, Recall: 0.78, F1-score: 0.77
Weighted Average: Precision: 0.93, Recall: 0.90, F1-score: 0.91
Per-label performance:
ADJ: Precision: 0.80, Recall: 0.80, F1-score: 0.80 (Support: 5)
ADV: Precision: 1.00, Recall: 0.80, F1-score: 0.89 (Support: 5)
CON: Precision: 0.50, Recall: 1.00, F1-score: 0.67 (Support: 1)
DET: Precision: 1.00, Recall: 0.50, F1-score: 0.67 (Support: 4)
N: Precision: 0.85, Recall: 1.00, F1-score: 0.92 (Support: 17)
P: Precision: 1.00, Recall: 1.00, F1-score: 1.00 (Support: 7)
PRO: Precision: 1.00, Recall: 0.83, F1-score: 0.91 (Support: 6)
PUNC: Precision: 1.00, Recall: 1.00, F1-score: 1.00 (Support: 12)
QUA: Precision: 0.00, Recall: 0.00, F1-score: 0.00 (Support: 0)
V: Precision: 0.92, Recall: 0.86, F1-score: 0.89 (Support: 14)
Note: “Support” indicates the number of samples per label in the test set.
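To make the relationship between the per-label figures and the two averages concrete, the sketch below recomputes them from the rounded values listed above. The macro average is a plain mean over all ten labels, and the weighted average weights each label by its support. The helper names are illustrative, and because the inputs are already rounded to two decimals, the results match the reported averages only to within rounding.

```python
# Per-label scores copied from the list above: (precision, recall, f1, support)
scores = {
    "ADJ": (0.80, 0.80, 0.80, 5),
    "ADV": (1.00, 0.80, 0.89, 5),
    "CON": (0.50, 1.00, 0.67, 1),
    "DET": (1.00, 0.50, 0.67, 4),
    "N": (0.85, 1.00, 0.92, 17),
    "P": (1.00, 1.00, 1.00, 7),
    "PRO": (1.00, 0.83, 0.91, 6),
    "PUNC": (1.00, 1.00, 1.00, 12),
    "QUA": (0.00, 0.00, 0.00, 0),
    "V": (0.92, 0.86, 0.89, 14),
}


def macro_average(table, index):
    """Unweighted mean of one metric across all labels."""
    return sum(row[index] for row in table.values()) / len(table)


def weighted_average(table, index):
    """Mean of one metric across labels, weighted by each label's support."""
    total_support = sum(row[3] for row in table.values())
    return sum(row[index] * row[3] for row in table.values()) / total_support


for name, index in (("precision", 0), ("recall", 1), ("f1", 2)):
    macro = macro_average(scores, index)
    weighted = weighted_average(scores, index)
    print(f"{name}: macro={macro:.3f} weighted={weighted:.3f}")
```

Two things follow from this arithmetic. The supports sum to 71, matching the stated test-set size. The reported macro figures are consistent with averaging over all ten labels, so QUA (zero support, all scores 0.00) lowers the macro average while leaving the weighted average unaffected.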
🎯 Intended Use
Farsisham is designed for:
Researchers developing Persian NLP applications.
Developers building tools like chatbots, text analyzers, or translation systems.
Educators and linguists studying Persian language structures.
⚠️ Limitations
The POS tagger’s performance may vary with out-of-domain text or informal Persian.
The lemmatizer relies on a provided wordlist, which may not cover all vocabulary.
Limited support for low-resource labels (e.g., QUA) due to small training data.
📄 License
Licensed under the MIT License.