Czech Building Law LoRA Adapter for Mistral-7B

🏗️ A LoRA adapter fine-tuned on a Czech Building Law (Stavební zákon) Q&A dataset for question-answering tasks. It enables Mistral-7B-Instruct-v0.3 to answer questions about Czech building regulations, construction permits, and related legal matters.

⚠️ Note: This is an educational project with limitations. Mistral-7B has gaps in Czech language understanding. For production use, a Czech-native model like OpenEuroLLM would be more suitable.

Model Details

Model Description

This LoRA (Low-Rank Adaptation) adapter was fine-tuned on a dataset of 576 Czech Building Law question-answer pairs. It adapts Mistral-7B-Instruct-v0.3 to answer questions about Czech construction regulations, building permits, territorial planning, and related legal matters.

The model was trained as part of an AI Developer course project and is intended for educational and research purposes. While functional, it has limitations due to Mistral's non-native Czech language capabilities.

Model Sources

  • Repository: https://huggingface.co/rostislavpeska/mistral-czech-building-law-lora
  • Base model: mistralai/Mistral-7B-Instruct-v0.3
  • Training dataset: rostislavpeska/stavebni-zakon-dataset

Uses

Direct Use

This adapter is designed for:

  • Answering questions about Czech building law (Stavební zákon)
  • Providing information on construction permits and procedures
  • Explaining territorial planning regulations
  • Educational purposes for learning Czech legal terminology
  • Research on legal domain adaptation for LLMs

Ideal users:

  • Students learning about Czech building regulations
  • Developers creating chatbots for construction-related queries
  • Researchers studying legal NLP in Czech language

Downstream Use

This adapter can be integrated into:

  • Legal chatbots for construction companies
  • Educational platforms teaching Czech building law
  • Document analysis tools for building permits
  • Q&A systems for architectural firms

Note: For production systems, further fine-tuning or using a Czech-native base model (e.g., OpenEuroLLM) is recommended.
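For such integrations it is often convenient to merge the adapter into the base model so the downstream system serves a single standalone checkpoint. A minimal sketch using PEFT's merge_and_unload (the output directory name is illustrative; merging requires loading the base weights unquantized):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bf16 -- merging cannot be done on 4-bit weights
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "rostislavpeska/mistral-czech-building-law-lora")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()

# Save a checkpoint that plain transformers can load without PEFT installed
merged.save_pretrained("./mistral-czech-building-law-merged")
AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3"
).save_pretrained("./mistral-czech-building-law-merged")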

Out-of-Scope Use

❌ NOT suitable for:

  • Legal advice - This is NOT a replacement for professional legal counsel
  • Official legal documents - Responses may contain inaccuracies
  • Critical decision-making - Always verify with official sources and legal experts
  • Production systems without review - Requires human oversight
  • Non-Czech building law - Trained specifically on Czech regulations
  • Real-time legal changes - May not reflect the latest amendments

⚠️ Always consult licensed legal professionals for official guidance.

Bias, Risks, and Limitations

Technical Limitations

  • Base model gaps: Mistral-7B is not optimized for Czech language, leading to potential grammatical errors or unnatural phrasing
  • Dataset size: Only 576 training samples - limited coverage of all building law scenarios
  • Domain specificity: Trained only on Czech building law (Stavební zákon)
  • Temporal limitations: Training data may not reflect the most recent legal amendments
  • LoRA constraints: Adapter size limits the model's capacity to learn complex legal reasoning

Recommended Alternative

OpenEuroLLM would be a superior base model for this task due to native Czech language support. If you have access to OpenEuroLLM and would like to collaborate on improving this project, please reach out!

Bias Considerations

  • Responses reflect the training dataset's interpretation of building law
  • May contain biases present in the original Q&A dataset
  • Legal language complexity may not be fully captured

Safety Risks

  • Hallucination: Model may generate plausible but incorrect legal information
  • Oversimplification: Complex legal matters may be oversimplified
  • Misinterpretation: Users may misinterpret responses as official legal advice

Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Outputs should be reviewed by qualified legal professionals before being relied upon (see Out-of-Scope Use above).

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Load base model with 4-bit quantization (for GPU efficiency)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "rostislavpeska/mistral-czech-building-law-lora",
    torch_dtype=torch.bfloat16,
)
model.eval()

# Ask a question
test_messages = [
    {"role": "user", "content": "Kdy potřebuji stavební povolení?"}
]

inputs = tokenizer.apply_chat_template(
    test_messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # also returns the attention mask
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)

Training Details

Training Data

Dataset: rostislavpeska/stavebni-zakon-dataset

  • Total samples: 576 Q&A pairs
  • Language: Czech
  • Domain: Czech Building Law (Stavební zákon)
  • Format: Conversational Q&A pairs in chat template format
  • Split: 80/20 train/test (460 training, 116 testing)
  • Token length: 64-1731 tokens per sample (avg: 271.9 tokens)

The dataset contains questions and answers about:

  • Building permits (stavební povolení)
  • Territorial planning (územní plánování)
  • Construction regulations (stavební předpisy)
  • Legal procedures and requirements
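The 80/20 split described above can be reproduced with the datasets library; a sketch (the original split seed is not documented, so the exact sample assignment below is an assumption):

from datasets import load_dataset

# Dataset referenced in this card (576 Q&A pairs)
dataset = load_dataset("rostislavpeska/stavebni-zakon-dataset", split="train")
print(dataset.features)  # inspect column names before formatting

# Reproduce the documented 80/20 train/test split (~460 / ~116 samples)
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = split["train"], split["test"]
print(len(train_ds), len(test_ds))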

Training Procedure

Preprocessing

Q&A pairs were rendered into the Mistral chat template format (see Format under Training Data) before tokenization; no further preprocessing is documented.

Training Hyperparameters

  • Training regime: bf16 mixed precision (bfloat16)
  • Quantization: 4-bit NF4 with double quantization (QLoRA)
  • Optimizer: paged_adamw_8bit
  • Learning rate: 2e-4
  • Learning rate scheduler: cosine
  • Batch size: 2 per device
  • Gradient accumulation steps: 4 (effective batch size: 8)
  • Number of epochs: 3
  • LoRA rank (r): 64
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • LoRA target modules: all-linear
  • Warmup steps: 50
  • Gradient checkpointing: Enabled
  • Max sequence length: 2048 tokens
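These values map onto a PEFT + TRL configuration roughly as follows. This is a reconstruction from the hyperparameters listed above, not the original training script, and argument names vary slightly across TRL versions:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization with double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA applied to every linear layer
peft_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./mistral-czech-building-law-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # effective batch size 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=2048,               # renamed max_length in newer TRL releases
)

# trainer = SFTTrainer(model=base_model, args=training_args,
#                      train_dataset=train_ds, peft_config=peft_config)
# trainer.train()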

Speeds, Sizes, Times

  • Training time: ~12.4 minutes (0.21 hours)
  • Training date: November 2, 2025
  • Adapter size: ~336 MB (safetensors format)
  • Trainable parameters: 167,772,160 (2.26% of base model)
  • Hardware: NVIDIA GeForce RTX 4070 Ti SUPER (16GB VRAM)
  • Peak VRAM usage: ~12-14 GB
  • Training framework: PyTorch with Hugging Face Transformers, PEFT, and TRL
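The trainable-parameter count above can be verified by hand: a LoRA adapter of rank r on a linear layer of shape (d_in, d_out) adds r * (d_in + d_out) parameters. Applying this to Mistral-7B's seven linear projections per decoder layer (shapes taken from the public model config):

# Mistral-7B: hidden=4096, KV dim=1024 (grouped-query attention), MLP dim=14336
r, num_layers = 64, 32
linear_shapes = [
    (4096, 4096),    # q_proj
    (4096, 1024),    # k_proj
    (4096, 1024),    # v_proj
    (4096, 4096),    # o_proj
    (4096, 14336),   # gate_proj
    (4096, 14336),   # up_proj
    (14336, 4096),   # down_proj
]

per_layer = sum(r * (d_in + d_out) for d_in, d_out in linear_shapes)
total = per_layer * num_layers
print(total)  # 167772160 -- matches the count reported above

# PEFT reports the percentage relative to base + adapter parameters
base_params = 7_248_000_000  # approximate Mistral-7B-v0.3 parameter count
print(f"{100 * total / (base_params + total):.2f}%")  # ~2.26%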

Evaluation

Testing Data, Factors & Metrics

Testing Data

Test split: 116 samples (20% of total dataset)

From the same dataset: rostislavpeska/stavebni-zakon-dataset

Factors

Evaluation focuses on:

  • Domain accuracy: Correctness of legal information
  • Language quality: Czech grammar and fluency
  • Relevance: Appropriateness of responses to questions
  • Citation: Proper references to legal codes and regulations

Metrics

  • Evaluation loss: Primary metric during training
  • Qualitative assessment: Manual review of response quality
  • Domain expert review recommended: Legal professionals should validate outputs
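Since evaluation loss is the primary metric, here is a hedged sketch of computing it (and the implied perplexity) on the held-out split. It assumes model and tokenizer are loaded as in the getting-started example and that each test sample carries a "messages" field in chat format; the column name is an assumption:

import math
import torch

def eval_loss(model, tokenizer, samples, max_length=2048):
    # Mean per-sample loss; an approximation of token-weighted eval loss
    model.eval()
    losses = []
    for sample in samples:
        text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
        enc = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=max_length
        ).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    mean = sum(losses) / len(losses)
    return mean, math.exp(mean)  # perplexity = exp(mean loss)

# loss, ppl = eval_loss(model, tokenizer, test_ds)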

Results

The model successfully generates contextually relevant responses to Czech building law questions. However, as this is an educational project with a limited dataset and non-native base model, comprehensive quantitative evaluation has not been performed.

Observed strengths:

  • Maintains conversational context
  • References relevant legal codes
  • Provides structured responses

Observed limitations:

  • Occasional grammatical imperfections due to base model's Czech limitations
  • May oversimplify complex legal scenarios
  • Limited by training data coverage

Summary

An educational QLoRA adapter that produces on-topic, structured Czech answers about building law; results are qualitative only, and outputs require expert review before any practical use.

Model Examination

No formal interpretability analysis has been conducted. This is an educational project with limited scope.

Future work could include:

  • Attention weight visualization for legal reasoning
  • Error analysis on failure cases
  • Comparison with Czech-native base models (e.g., OpenEuroLLM)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA GeForce RTX 4070 Ti SUPER (16GB)
  • Hours used: ~0.21 hours (12.4 minutes)
  • Cloud Provider: Local machine (not cloud)
  • Compute Region: Czech Republic / Central Europe
  • Carbon Emitted: Minimal due to short training time and local infrastructure
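A back-of-the-envelope check supports the "minimal" claim. The GPU power draw and grid carbon intensity below are assumptions, not measurements:

# RTX 4070 Ti SUPER has a 285 W TDP; assume full draw for the whole run
power_kw = 0.285
hours = 0.21
energy_kwh = power_kw * hours            # ~0.06 kWh

# Czech grid carbon intensity, roughly 0.4-0.5 kg CO2e/kWh (assumption)
intensity_kg_per_kwh = 0.45
emissions_kg = energy_kwh * intensity_kg_per_kwh
print(f"~{energy_kwh:.3f} kWh, ~{emissions_kg:.3f} kg CO2e")  # ~0.060 kWh, ~0.027 kg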

Technical Specifications

Model Architecture and Objective

Base Model: Mistral-7B-Instruct-v0.3

  • Architecture: Transformer decoder with grouped-query attention (sliding-window attention was dropped as of v0.2)
  • Parameters: ~7.2 billion
  • Context length: 32k tokens

Adaptation Method: QLoRA (Quantized Low-Rank Adaptation)

  • Trainable parameters: ~168 million (2.26% of base model)
  • Quantization: 4-bit NF4 with double quantization
  • LoRA rank: 64
  • Target modules: All linear layers

Objective: Supervised fine-tuning for Czech building law question-answering

Compute Infrastructure

Hardware

  • GPU: NVIDIA GeForce RTX 4070 Ti SUPER
  • VRAM: 16GB GDDR6X
  • RAM: 32GB
  • CPU: 24 Logical Processors
  • Storage: Local SSD
  • Location: Local workstation (not cloud)

Software

  • OS: Windows
  • Python: 3.12.10
  • PyTorch: 2.7.1+cu118
  • Transformers: ≥4.40.0
  • PEFT: ≥0.10.0
  • BitsAndBytes: ≥0.43.0
  • TRL: ≥0.8.0
  • Accelerate: ≥0.28.0
  • CUDA: 11.8
  • Training framework: Jupyter Notebook with SFTTrainer

Citation

If you use this model, please cite:

BibTeX:

@misc{peska2025czechbuildinglaw,
  author = {Peška, Rostislav},
  title = {Czech Building Law LoRA Adapter for Mistral-7B},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/rostislavpeska/mistral-czech-building-law-lora}},
  note = {Educational AI Developer Course Project}
}

APA:

Peška, R. (2025). Czech Building Law LoRA Adapter for Mistral-7B [LoRA adapter]. HuggingFace. https://huggingface.co/rostislavpeska/mistral-czech-building-law-lora

Glossary

  • LoRA (Low-Rank Adaptation): Efficient fine-tuning method that trains small adapter modules
  • QLoRA: LoRA with 4-bit quantization for reduced memory usage
  • Stavební zákon: Czech Building Law
  • Stavební povolení: Building permit
  • Územní plánování: Territorial/spatial planning
  • PEFT: Parameter-Efficient Fine-Tuning
  • BitsAndBytes: Library for efficient quantization
  • SFTTrainer: Supervised Fine-Tuning Trainer from TRL library

More Information

Project Context

This adapter was developed as part of an AI Developer course project to demonstrate:

  • Fine-tuning LLMs for specialized domains
  • Efficient training with limited resources (QLoRA)
  • Working with Czech language legal data
  • Practical deployment to HuggingFace Hub

Limitations & Future Work

Known issues:

  • Mistral-7B has gaps in Czech language understanding
  • Limited dataset size (576 samples)
  • May not reflect the latest legal amendments

Improvements wanted:

  • OpenEuroLLM base model for better Czech language support
  • Expanded dataset with more scenarios
  • Multi-turn conversation capabilities
  • Integration with official legal databases

Collaboration Welcome!

If you have access to OpenEuroLLM or expertise in Czech legal NLP, I'd love to collaborate on improving this project. This is a non-profit educational initiative, and contributions are welcome!

Model Card Authors

Mgr. Rostislav Peška

  • Email: [email protected]
  • Phone: +420 754 506 863
  • Role: Developer & Trainer (AI Developer Course Project)

Model Card Contact

For questions, collaborations, or feedback, use the contact details listed above.

Areas of interest:

  • Collaboration on OpenEuroLLM-based version
  • Czech legal NLP research
  • 3D model generation
  • Deployment and integration support

This is an educational project. Always consult licensed legal professionals for official building law guidance.
