HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
Highlights
HTR-ConvText is a novel hybrid architecture for Handwritten Text Recognition (HTR) that effectively balances local feature extraction with global contextual modeling. Designed to overcome the limitations of standard CTC-based decoding and data-hungry Transformers, HTR-ConvText delivers state-of-the-art performance with the following key features:
- Hybrid CNN-ViT Architecture: Seamlessly integrates a ResNet backbone with MobileViT blocks (MVP) and Conditional Positional Encoding, enabling the model to capture fine-grained stroke details while maintaining global spatial awareness.
- Hierarchical ConvText Encoder: A U-Net-like encoder structure that interleaves Multi-Head Self-Attention with Depthwise Convolutions (see the sketch after this list). This design efficiently models both long-range dependencies and local structural patterns.
- Textual Context Module (TCM): An innovative training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional independence weakness of CTC decoding without adding any latency during inference.
- State-of-the-Art Performance: Outperforms existing methods on major benchmarks including IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), excelling in particular in low-resource scenarios and on scripts with complex diacritics.
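To make the encoder design concrete, here is a minimal PyTorch sketch of how one ConvText block might interleave self-attention with a depthwise convolution. The class name, wiring, and defaults are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class ConvTextBlock(nn.Module):
    """Illustrative sketch: global context via self-attention,
    local stroke structure via a depthwise 1-D convolution."""
    def __init__(self, dim=512, heads=8, kernel_size=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # groups=dim makes the convolution depthwise (one filter per channel).
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                          # x: (batch, seq, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]              # long-range dependencies
        h = self.norm2(x).transpose(1, 2)          # (batch, dim, seq) for Conv1d
        return x + self.dwconv(h).transpose(1, 2)  # local structure
```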
Model Overview
HTR-ConvText configurations and specifications:
| Feature | Specification |
|---|---|
| Architecture Type | Hybrid CNN + Vision Transformer (Encoder-Only) |
| Parameters | ~65.9M |
| Backbone | ResNet-18 + MobileViT w/ Positional Encoding (MVP) |
| Encoder Layers | 8 ConvText Blocks (Hierarchical) |
| Attention Heads | 8 |
| Embedding Dimension | 512 |
| Image Input Size | 512×64 |
| Inference Strategy | Standard CTC Decoding (TCM is removed at inference) |
For more details, including ablation studies and theoretical proofs, please refer to our Technical Report.
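For orientation, the table translates into roughly the following configuration. The argument names below are illustrative assumptions, not the repository's actual CLI or API:

```python
# Hypothetical configuration mirroring the specification table above.
config = dict(
    img_size=(512, 64),       # input line images, width x height
    embed_dim=512,
    num_heads=8,
    num_blocks=8,             # hierarchical ConvText encoder blocks
    backbone="resnet18_mvp",  # ResNet-18 + MobileViT w/ positional encoding
    decode="ctc",             # TCM head is dropped at inference
)
```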
Performance
We evaluated HTR-ConvText across four diverse datasets. The model achieves new state-of-the-art results, with the lowest Character Error Rate (CER) and Word Error Rate (WER), without requiring massive synthetic pre-training. Test-set CER (%, lower is better):
| Dataset | Language | Ours | HTR-VT | OrigamiNet | TrOCR | CRNN |
|---|---|---|---|---|---|---|
| IAM | English | 4.0 | 4.7 | 4.8 | 7.3 | 7.8 |
| LAM | Italian | 2.7 | 2.8 | 3.0 | 3.6 | 3.8 |
| READ2016 | German | 3.6 | 3.9 | - | - | 4.7 |
| VNOnDB | Vietnamese | 3.45 | 4.26 | 7.6 | - | 10.53 |
Quickstart
Installation
1. Clone the repository:
   ```bash
   git clone https://github.com/0xk0ry/HTR-ConvText.git
   cd HTR-ConvText
   ```
2. Create and activate a Python 3.9+ Conda environment:
   ```bash
   conda create -n htr-convtext python=3.9 -y
   conda activate htr-convtext
   ```
3. Install PyTorch using the wheel that matches your CUDA driver (swap the index URL for CPU-only builds):
   ```bash
   pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
   ```
4. Install the remaining project requirements (everything except PyTorch, which you installed in step 3):
   ```bash
   pip install -r requirements.txt
   ```
The code was tested on Python 3.9 and PyTorch 2.9.1.
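After installing, a quick sanity check (a generic snippet, not part of the repository) confirms that the PyTorch build sees your GPU:

```python
import torch

print(torch.__version__)           # expect a 2.x build
print(torch.cuda.is_available())   # True if the CUDA wheel matches your driver
```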
Data Preparation
We provide split files (train.ln, val.ln, test.ln) for IAM, READ2016, LAM, and VNOnDB under data/. Organize your data as follows:
```
./data/iam/
├── train.ln
├── val.ln
├── test.ln
└── lines/
    ├── a01-000u-00.png
    ├── a01-000u-00.txt
    └── ...
```
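The snippet below is a hypothetical sanity check for this layout; it assumes each line of a split file holds one sample stem such as a01-000u-00, which you should verify against the provided .ln files:

```python
from pathlib import Path

# Hypothetical check: every sample listed in train.ln should have a
# matching image/transcript pair under lines/.
root = Path("./data/iam")
for stem in (root / "train.ln").read_text().split():
    sample = root / "lines" / stem
    assert sample.with_suffix(".png").exists(), f"missing image: {stem}"
    assert sample.with_suffix(".txt").exists(), f"missing label: {stem}"
print("split is consistent with lines/")
```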
Training
We provide comprehensive scripts in the ./run/ directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:
```bash
# Using the provided script
bash run/iam.sh

# OR running directly via Python
python train.py \
    --use-wandb \
    --dataset iam \
    --tcm-enable \
    --exp-name "htr-convtext-iam" \
    --img-size 512 64 \
    --train-bs 32 \
    --val-bs 8 \
    --data-path /path/to/iam/lines/ \
    --train-data-list data/iam/train.ln \
    --val-data-list data/iam/val.ln \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```
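Conceptually, enabling the TCM adds a training-only auxiliary objective on top of the CTC loss. The sketch below shows how such a combination typically looks; the function name, head interface, and weighting term alpha are assumptions, not the repository's exact code:

```python
import torch.nn.functional as F

def training_loss(log_probs, targets, input_lens, target_lens,
                  tcm_logits=None, tcm_targets=None, alpha=0.1):
    """CTC loss plus an optional auxiliary linguistic-context term.
    log_probs: (T, batch, num_classes) log-softmax outputs."""
    loss = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                      blank=0, zero_infinity=True)
    if tcm_logits is not None:  # TCM branch exists only during training
        loss = loss + alpha * F.cross_entropy(tcm_logits, tcm_targets)
    return loss
```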
Inference / Evaluation
To evaluate a pre-trained checkpoint on the test set:
```bash
python test.py \
    --resume ./checkpoints/best_CER.pth \
    --dataset iam \
    --img-size 512 64 \
    --data-path /path/to/iam/lines/ \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```
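Since inference uses standard CTC decoding, the following generic greedy decoder illustrates the collapse-repeats-then-drop-blanks rule. The blank index and charset mapping are assumptions; check the dataset code for the actual mapping:

```python
import torch

def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: take the argmax per time step,
    collapse consecutive repeats, then remove blank tokens."""
    ids = logits.argmax(dim=-1).tolist()  # logits: (T, num_classes)
    chars, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            chars.append(charset[i])
        prev = i
    return "".join(chars)
```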
Citation
If you find our work helpful, please cite our paper:
```bibtex
@misc{truc2025htrconvtex,
      title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
      author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
      year={2025},
      eprint={2512.05021},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05021},
}
```
Acknowledgement
This project is inspired by and adapted from HTR-VT. We gratefully acknowledge the authors for their open-source contributions.
Evaluation results
All metrics are self-reported on the respective test sets:

| Dataset | Test CER (%) | Test WER (%) |
|---|---|---|
| IAM | 4.00 | 12.90 |
| LAM | 2.70 | 7.00 |
| READ2016 | 3.60 | 15.70 |
| HANDS-VNOnDB | 3.45 | 8.90 |