# SOBertBase

## Model Description
SOBertBase is a 109M-parameter BERT model trained on 27 billion tokens of StackOverflow answer and comment text using the Megatron Toolkit.
SOBert is pre-trained on 19 GB of data presented as 15 million samples, where each sample contains an entire post and all of its corresponding comments. We also include
all code in each answer, so the model is bimodal in nature (natural language and code). We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as "unknown".
Additionally, SOBert is trained with a maximum sequence length of 2048, based on the empirical length distribution of StackOverflow posts, and a relatively
large batch size of 0.5M tokens. A larger 762M-parameter model can be found [here](https://huggingface.co/mmukh/SOBertLarge). More details can be found in the paper
[Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268).
#### How to use
```python
from transformers import MegatronBertModel, PreTrainedTokenizerFast

# Load the pre-trained SOBertBase model and its SentencePiece/BPE tokenizer
model = MegatronBertModel.from_pretrained("mmukh/SOBertBase")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertBase")
```
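A minimal usage sketch (not part of the original card): embed a post together with its comment and code text, mirroring the bimodal, full-post inputs used in pre-training. The `max_length=2048` matches the pre-training sequence length described above, while the example text and the mean-pooling step are illustrative choices, not the authors' prescribed usage.

```python
import torch
from transformers import MegatronBertModel, PreTrainedTokenizerFast

model = MegatronBertModel.from_pretrained("mmukh/SOBertBase")
tokenizer = PreTrainedTokenizerFast.from_pretrained("mmukh/SOBertBase")

# Hypothetical input: an answer with its code and a trailing comment,
# similar in shape to the pre-training samples (post + comments + code)
post = (
    "You can reverse a list in place with list.reverse() or get a new one with slicing:\n"
    "xs = [1, 2, 3]\n"
    "ys = xs[::-1]\n"
    "Comment: Note that reversed(xs) returns an iterator, not a list."
)

inputs = tokenizer(post, return_tensors="pt", truncation=True, max_length=2048)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into a single post embedding (illustrative only)
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```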
### BibTeX entry and citation info
```bibtex
@article{mukherjee2023stack,
  title={Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models},
  author={Mukherjee, Manisha and Hellendoorn, Vincent J},
  journal={arXiv preprint arXiv:2306.03268},
  year={2023}
}
```