---
license: mit
language:
- pt
tags:
- masked-language-modeling
- legal-domain
- bert
- portuguese
datasets:
- seu_dataset_legal # replace with the actual dataset name, if available
metrics:
- loss
pipeline_tag: fill-mask
base_model: bert-base-uncased
library_name: transformers
---

# BERT Model Adapted for the Legal Domain

This repository contains a BERT model adapted to the Brazilian legal domain, fine-tuned with a Masked Language Modeling (MLM) objective. The model was trained on a specialized legal dataset to provide a robust foundation for downstream applications in the legal field, such as text classification, information extraction, and related tasks.

---

## Model Details

* **Base Model**: bert-base-uncased
* **Training Task**: Masked Language Modeling (MLM)
* **Domain**: Legal (Brazilian Portuguese)
* **Objective**: Adaptive fine-tuning for better generalization on legal texts
* **Architecture**: BertForMaskedLM

---

## Model Usage

This model can be used directly with the Hugging Face `fill-mask` pipeline. Example applications include filling gaps in legal texts to check coherence and consistency, or to explore domain-specific terminology. For direct access to the model's prediction scores without the pipeline, see the appendix at the end of this card.

### Inference Example

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
fill_mask = pipeline("fill-mask", model="fabricioalmeida/bert-with-mlm-legal")

# Example sentence (Portuguese: "The contract was signed between the parties on [MASK].")
input_text = "O contrato foi firmado entre as partes no dia [MASK]."

# Perform inference
results = fill_mask(input_text)

for result in results:
    print(f"Option: {result['token_str']}, Score: {result['score']:.4f}")
```

## Training History

The model was trained through adaptive fine-tuning on a legal dataset, with the following loss curve:

| Step  | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 2000  | 0.992100      | 0.824256        |
| 4000  | 0.812500      | 0.710587        |
| 6000  | 0.740800      | 0.656129        |
| 8000  | 0.699100      | 0.621186        |
| 10000 | 0.668100      | 0.594372        |
| 12000 | 0.641700      | 0.577950        |
| 14000 | 0.624800      | 0.569022        |
| 16000 | 0.603600      | 0.559712        |
| 18000 | 0.598100      | 0.544894        |
| 20000 | 0.588800      | 0.538299        |
| 22000 | 0.578800      | 0.525268        |
| 24000 | 0.573700      | 0.528776        |

## Repository Structure

- `config.json`: Model configuration.
- `pytorch_model.bin`: Trained model weights.
- `tokenizer_config.json`: Tokenizer configuration.
- `vocab.txt`: Vocabulary used during training.

---

## How to Cite

If you use this model in your research or application, please cite it as follows:

```
@misc{bert-juridico,
  author       = {CARMO, A. F.},
  title        = {LegalBERT-Anotado: aplicando fine-tuning orientado a tokens no domínio jurídico brasileiro},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/fabricioalmeida/bert-with-mlm-legal}}
}
```

---

## Contact

For questions or suggestions, please contact [fabrycio30@gmail.com](mailto:fabrycio30@gmail.com).
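
---

## Appendix: Inference Without the Pipeline

If you need raw prediction scores rather than the pipeline's formatted output (for example, to rank more candidates or embed the model in a larger system), you can load the tokenizer and model directly. The sketch below is a minimal illustration, assuming PyTorch and the same Hub repository named above; it is not part of the original training code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "fabricioalmeida/bert-with-mlm-legal"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

text = "O contrato foi firmado entre as partes no dia [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Locate the [MASK] token and convert its logits to probabilities
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
probs = torch.softmax(logits[0, mask_pos], dim=-1)

# Print the five most likely fillers for the masked position
top = torch.topk(probs, k=5)
for token_id, score in zip(top.indices, top.values):
    print(f"Option: {tokenizer.decode([token_id.item()])}, Score: {score:.4f}")
```

This reproduces what the `fill-mask` pipeline does internally and makes it easy to inspect the probability of arbitrary candidate tokens at the masked position.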