HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
Highlights
HTR-ConvText is a novel hybrid architecture for Handwritten Text Recognition (HTR) that effectively balances local feature extraction with global contextual modeling. Designed to overcome the limitations of standard CTC-based decoding and data-hungry Transformers, HTR-ConvText delivers state-of-the-art performance with the following key features:
- Hybrid CNN-ViT Architecture: Seamlessly integrates a ResNet backbone with MobileViT blocks (MVP) and Conditional Positional Encoding, enabling the model to capture fine-grained stroke details while maintaining global spatial awareness.
- Hierarchical ConvText Encoder: A U-Net-like encoder structure that interleaves Multi-Head Self-Attention with Depthwise Convolutions (see the sketch after this list). This design efficiently models both long-range dependencies and local structural patterns.
- Textual Context Module (TCM): An innovative training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional independence weakness of CTC decoding without adding any latency during inference.
- State-of-the-Art Performance: Outperforms existing methods on major benchmarks including IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), excelling in particular in low-resource scenarios and on scripts with complex diacritics.
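To make the encoder design concrete, here is a minimal PyTorch sketch of how one ConvText block might interleave self-attention with a depthwise convolution. The class name, wiring, and defaults are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class ConvTextBlock(nn.Module):
    """Illustrative sketch: global context via self-attention,
    local stroke structure via a depthwise 1-D convolution."""
    def __init__(self, dim=512, heads=8, kernel_size=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # groups=dim makes the convolution depthwise (one filter per channel).
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                          # x: (batch, seq, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]              # long-range dependencies
        h = self.norm2(x).transpose(1, 2)          # (batch, dim, seq) for Conv1d
        return x + self.dwconv(h).transpose(1, 2)  # local structure
```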
Model Overview
HTR-ConvText configurations and specifications:
| Feature | Specification |
|---|---|
| Architecture Type | Hybrid CNN + Vision Transformer (Encoder-Only) |
| Parameters | ~65.9M |
| Backbone | ResNet-18 + MobileViT w/ Positional Encoding (MVP) |
| Encoder Layers | 8 ConvText Blocks (Hierarchical) |
| Attention Heads | 8 |
| Embedding Dimension | 512 |
| Image Input Size | 512×64 |
| Inference Strategy | Standard CTC Decoding (TCM is removed at inference) |
For more details, including ablation studies and theoretical proofs, please refer to our Technical Report.
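For orientation, the table translates into roughly the following configuration. The argument names below are illustrative assumptions, not the repository's actual CLI or API:

```python
# Hypothetical configuration mirroring the specification table above.
config = dict(
    img_size=(512, 64),       # input line images, width x height
    embed_dim=512,
    num_heads=8,
    num_blocks=8,             # hierarchical ConvText encoder blocks
    backbone="resnet18_mvp",  # ResNet-18 + MobileViT w/ positional encoding
    decode="ctc",             # TCM head is dropped at inference
)
```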
Performance
We evaluated HTR-ConvText across four diverse datasets. The model achieves new state-of-the-art results, with the lowest Character Error Rate (CER) and Word Error Rate (WER), without requiring massive synthetic pre-training. Test-set CER (%, lower is better):
| Dataset | Language | Ours | HTR-VT | OrigamiNet | TrOCR | CRNN |
|---|---|---|---|---|---|---|
| IAM | English | 4.0 | 4.7 | 4.8 | 7.3 | 7.8 |
| LAM | Italian | 2.7 | 2.8 | 3.0 | 3.6 | 3.8 |
| READ2016 | German | 3.6 | 3.9 | - | - | 4.7 |
| VNOnDB | Vietnamese | 3.45 | 4.26 | 7.6 | - | 10.53 |
Quickstart
Installation
1. Clone the repository:
   ```bash
   git clone https://github.com/0xk0ry/HTR-ConvText.git
   cd HTR-ConvText
   ```
2. Create and activate a Python 3.9+ Conda environment:
   ```bash
   conda create -n htr-convtext python=3.9 -y
   conda activate htr-convtext
   ```
3. Install PyTorch using the wheel that matches your CUDA driver (swap the index URL for CPU-only builds):
   ```bash
   pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
   ```
4. Install the remaining project requirements (everything except PyTorch, which you installed in step 3):
   ```bash
   pip install -r requirements.txt
   ```
The code was tested on Python 3.9 and PyTorch 2.9.1.
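After installing, a quick sanity check (a generic snippet, not part of the repository) confirms that the PyTorch build sees your GPU:

```python
import torch

print(torch.__version__)           # expect a 2.x build
print(torch.cuda.is_available())   # True if the CUDA wheel matches your driver
```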
Data Preparation
We provide split files (train.ln, val.ln, test.ln) for IAM, READ2016, LAM, and VNOnDB under data/. Organize your data as follows:
```
./data/iam/
├── train.ln
├── val.ln
├── test.ln
└── lines/
    ├── a01-000u-00.png
    ├── a01-000u-00.txt
    └── ...
```
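The snippet below is a hypothetical sanity check for this layout; it assumes each line of a split file holds one sample stem such as a01-000u-00, which you should verify against the provided .ln files:

```python
from pathlib import Path

# Hypothetical check: every sample listed in train.ln should have a
# matching image/transcript pair under lines/.
root = Path("./data/iam")
for stem in (root / "train.ln").read_text().split():
    sample = root / "lines" / stem
    assert sample.with_suffix(".png").exists(), f"missing image: {stem}"
    assert sample.with_suffix(".txt").exists(), f"missing label: {stem}"
print("split is consistent with lines/")
```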
Training
We provide comprehensive scripts in the ./run/ directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:
```bash
# Using the provided script
bash run/iam.sh

# OR running directly via Python
python train.py \
    --use-wandb \
    --dataset iam \
    --tcm-enable \
    --exp-name "htr-convtext-iam" \
    --img-size 512 64 \
    --train-bs 32 \
    --val-bs 8 \
    --data-path /path/to/iam/lines/ \
    --train-data-list data/iam/train.ln \
    --val-data-list data/iam/val.ln \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```
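Conceptually, enabling the TCM adds a training-only auxiliary objective on top of the CTC loss. The sketch below shows how such a combination typically looks; the function name, head interface, and weighting term alpha are assumptions, not the repository's exact code:

```python
import torch.nn.functional as F

def training_loss(log_probs, targets, input_lens, target_lens,
                  tcm_logits=None, tcm_targets=None, alpha=0.1):
    """CTC loss plus an optional auxiliary linguistic-context term.
    log_probs: (T, batch, num_classes) log-softmax outputs."""
    loss = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                      blank=0, zero_infinity=True)
    if tcm_logits is not None:  # TCM branch exists only during training
        loss = loss + alpha * F.cross_entropy(tcm_logits, tcm_targets)
    return loss
```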
Inference / Evaluation
To evaluate a pre-trained checkpoint on the test set:
```bash
python test.py \
    --resume ./checkpoints/best_CER.pth \
    --dataset iam \
    --img-size 512 64 \
    --data-path /path/to/iam/lines/ \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```
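Since inference uses standard CTC decoding, the following generic greedy decoder illustrates the collapse-repeats-then-drop-blanks rule. The blank index and charset mapping are assumptions; check the dataset code for the actual mapping:

```python
import torch

def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: take the argmax per time step,
    collapse consecutive repeats, then remove blank tokens."""
    ids = logits.argmax(dim=-1).tolist()  # logits: (T, num_classes)
    chars, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            chars.append(charset[i])
        prev = i
    return "".join(chars)
```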
Citation
If you find our work helpful, please cite our paper:
```bibtex
@misc{truc2025htrconvtex,
      title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
      author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
      year={2025},
      eprint={2512.05021},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05021},
}
```
Acknowledgement
This project is inspired by and adapted from HTR-VT. We gratefully acknowledge the authors for their open-source contributions.
Evaluation results
All metrics are self-reported on the respective test sets:

| Dataset | Test CER (%) | Test WER (%) |
|---|---|---|
| IAM | 4.00 | 12.90 |
| LAM | 2.70 | 7.00 |
| READ2016 | 3.60 | 15.70 |
| HANDS-VNOnDB | 3.45 | 8.90 |