Glaurung Small 001
A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis. Part of the Glaurung project: a modern reverse engineering framework with first-class AI integration.
Overview
Glaurung Small 001 is a transformer model designed specifically for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats, covering multiple CPU architectures (x86-64, ARM64, etc.) and Linux distributions (Alpine, Ubuntu, Debian, Rocky).
This is the small variant (160M parameters, 12 layers), optimized for faster inference. For a larger-capacity model, see glaurung-large-001 (371M parameters, 24 layers).
Key Features
- Custom Binary Tokenizer: BPE tokenizer that creates efficient multi-byte tokens from binary data
- Binary-Aware: Trained on actual executable files, not hex strings
- Multi-Architecture: Understands patterns from various CPU architectures and file formats
- Latin-1 Encoding: Preserves all byte values (0-255) without loss
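The latin-1 point deserves a quick illustration. A minimal, model-independent sketch of why this encoding is used: it maps every byte value to exactly one character, so decoding and re-encoding is lossless.
# Latin-1 maps each byte value 0-255 to exactly one character, so a
# bytes -> str -> bytes round-trip recovers the original data.
data = bytes(range(256))               # every possible byte value
text = data.decode('latin-1')          # 256 characters, one per byte
assert text.encode('latin-1') == data  # lossless round-trip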
Model Details
- Architecture: RoBERTa for Masked Language Modeling
- Hidden Size: 768
- Layers: 12
- Attention Heads: 12
- Vocabulary Size: 65,536 tokens
- Tokenizer: binary-tokenizer-005
- Max Position Embeddings: 520
- Special Tokens:
  - <|start|> (0): Beginning of sequence
  - <|end|> (1): End token
  - <|sep|> (2): Separator/EOS
  - <|cls|> (3): Classification token
  - <|pad|> (4): Padding
  - <|mask|> (5): Mask token for MLM
  - <|unk|> (6): Unknown token
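If you want to confirm these IDs at runtime, the standard tokenizer attributes expose them (a minimal check; the expected values in the comments are the ones listed above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
print(tokenizer.mask_token, tokenizer.mask_token_id)  # <|mask|> 5, per the list above
print(tokenizer.pad_token, tokenizer.pad_token_id)    # <|pad|> 4
print(len(tokenizer))                                 # 65536, per the model card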
Glaurung Ecosystem
This model is part of the Glaurung project ecosystem:
Main Project
- Glaurung - A modern reverse engineering framework designed to replace Ghidra with first-class AI integration throughout the analysis pipeline. Built with Rust's performance and Python's accessibility, featuring AI agents integrated at every level from format detection to decompilation.
Model Family
- glaurung-small-001 (this model) - 160M parameters, 12 layers, faster inference
- glaurung-large-001 - 371M parameters, 24 layers
Tokenizer
- binary-tokenizer-005 - 65K vocabulary BPE tokenizer trained on multi-byte patterns
Installation & Loading
pip install transformers torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel, pipeline
# Method 1: Load with pipeline for fill-mask tasks
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-small-001', device=-1)
# Method 2: Load model and tokenizer directly for fill-mask
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
# Method 3: Load base model for feature extraction/embeddings
model_base = AutoModel.from_pretrained('mjbommar/glaurung-small-001')
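As an optional sanity check after loading (using the `model` object from Method 2 above), you can count the parameters; the exact total depends on which head is attached, but it should land near the 160M quoted above.
# Count parameters of the MaskedLM model loaded in Method 2
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 160M for this small variant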
Usage Guide
1. Loading Binary Data (Critical!)
Binary files MUST be read as raw bytes and decoded with latin-1, which maps every byte value (0-255) to a single character:
# CORRECT: Read as bytes, decode with latin-1
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # the first 512 bytes are enough for the examples below
text = binary_data.decode('latin-1', errors='ignore')
# WRONG: Never use hex strings or other encodings
# hex_string = "7f454c46..."               # ✗ Will not work
# utf8_text = binary_data.decode('utf-8')  # ✗ Will lose bytes
2. Understanding the BPE Tokenizer
The tokenizer creates multi-byte tokens from common binary patterns:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
# Example: ELF header tokenization
elf_header = b'\x7fELF\x02\x01\x01\x00'
text = elf_header.decode('latin-1')
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()
# Decode tokens individually to see multi-byte patterns
for token_id in token_ids[1:5]:  # Skip special tokens
    decoded = tokenizer.decode([token_id], skip_special_tokens=True)
    print(f"Token {token_id}: {repr(decoded)}")
# Output:
# Token 45689: '\x7fEL' # ELF magic compressed to one token!
# Token 3665: 'F\x02' # Format byte + 64-bit flag
# Token 458: '\x01\x01' # Little-endian + version
# Token 600: '\x00\x00\x00\x00\x00\x00\x00\x00\x00' # Padding
3. Fill-Mask Task (Token-Level Prediction)
Important: Masking works at the TOKEN level, not byte level!
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
# Read binary file
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
text = binary_data.decode('latin-1', errors='ignore')
# Tokenize
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()
# Mask the second token (first content token after <|start|>)
masked_ids = token_ids.copy()
original_token = masked_ids[1] # Save original
masked_ids[1] = tokenizer.mask_token_id
# Prepare input
tokens_masked = {
    'input_ids': torch.tensor([masked_ids]),
    'attention_mask': torch.tensor([[1] * len(masked_ids)])
}
# Predict
with torch.no_grad():
    outputs = model(**tokens_masked)
predictions = outputs.logits[0, 1].softmax(dim=-1)
top5 = predictions.topk(5)
# Show results
print(f"Original: {repr(tokenizer.decode([original_token]))}")
for score, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode([token_id.item()], skip_special_tokens=True)
    print(f"Predicted: {repr(token_text)} (confidence: {score:.2%})")
# Example output:
# Original: '\x7fEL'
# Predicted: '\x7fEL' (confidence: 79.07%) ✓ Correct!
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 13.62%)
4. Using Pipeline for Fill-Mask
The pipeline handles tokenization automatically but requires understanding multi-byte tokens:
from transformers import pipeline
# Load pipeline
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-small-001', device=-1)
# Read binary
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(100)
text = binary_data.decode('latin-1', errors='ignore')
# Create masked input at token boundaries
# First, tokenize to understand token boundaries
tokenizer = fill_mask.tokenizer
tokens = tokenizer(text)
decoded_tokens = [tokenizer.decode([tid], skip_special_tokens=True) for tid in tokens['input_ids']]
# Reconstruct with mask at token boundary
masked_text = ''.join([
    decoded_tokens[0],               # <|start|> (decodes to an empty string here)
    fill_mask.tokenizer.mask_token,  # Mask the ELF magic
    ''.join(decoded_tokens[2:])      # Rest of tokens
])
# Predict
predictions = fill_mask(masked_text, top_k=3)
for pred in predictions:
    print(f"{repr(pred['token_str'])}: {pred['score']:.2%}")
5. Feature Extraction & Embedding Similarity
Compare binary files by their learned embeddings:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from pathlib import Path
# Load for embeddings (not MaskedLM)
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
model = AutoModel.from_pretrained('mjbommar/glaurung-small-001')
model.eval()
def get_binary_embedding(file_path, max_bytes=512):
    """Extract embedding for a binary file using mean pooling"""
    with open(file_path, 'rb') as f:
        binary_data = f.read(max_bytes)
    text = binary_data.decode('latin-1', errors='ignore')
    # Tokenize
    tokens = tokenizer(text, return_tensors='pt',
                       padding=True, truncation=True, max_length=512)
    # Get embeddings with mean pooling
    with torch.no_grad():
        outputs = model(**tokens)
    # Mean pooling (better than CLS token for this model)
    attention_mask = tokens['attention_mask']
    hidden_states = outputs.last_hidden_state
    # Mask padding tokens
    mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
    sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
    sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
    embedding = sum_embeddings / sum_mask
    return embedding
# Compare multiple binaries
files = ['/usr/bin/ls', '/usr/bin/cat', '/usr/bin/echo', '/etc/passwd']
embeddings = {}
for file_path in files:
    if Path(file_path).exists():
        name = Path(file_path).name
        embeddings[name] = get_binary_embedding(file_path)
# Calculate similarities
print("Cosine Similarity Matrix:")
names = list(embeddings.keys())
for name1 in names:
    similarities = []
    for name2 in names:
        sim = F.cosine_similarity(embeddings[name1], embeddings[name2], dim=-1).item()
        similarities.append(f"{sim:.3f}")
    print(f"{name1:10s}: {' '.join(similarities)}")
# Expected output:
# ELF executables (ls, cat, echo) will have high similarity (0.85-0.95)
# Text file (passwd) will have low similarity (0.25-0.30) to ELF files
Real-World Example: ELF Header Analysis
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
# Analyze ELF executable structure
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read enough for context
print(f"Raw bytes (hex): {binary_data[:16].hex()}")
# Output: 7f454c46020101000000000000000000
# Convert to latin-1 for model
text = binary_data.decode('latin-1', errors='ignore')
# Tokenize to see learned patterns
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()
# Show what tokens the model learned
print("\nTokenized ELF header:")
for i in range(1, min(5, len(token_ids) - 1)):  # First few content tokens
    token_text = tokenizer.decode([token_ids[i]], skip_special_tokens=True)
    print(f"Token {i}: {token_ids[i]:5d} = {repr(token_text)}")
# Output:
# Token 1: 45689 = '\x7fEL' - ELF magic compressed to one token!
# Token 2: 3665 = 'F\x02' - 'F' + 64-bit flag
# Token 3: 458 = '\x01\x01' - Little-endian + version
# Token 4: 600 = '\x00\x00\x00\x00\x00\x00\x00\x00\x00' - Padding
# Test model's understanding by masking each token
print("\nTesting model predictions:")
for position in [1, 2, 3]:  # Test first 3 content tokens
    masked_ids = token_ids.copy()
    original_token = masked_ids[position]
    masked_ids[position] = tokenizer.mask_token_id

    # Create input tensors
    tokens_masked = {
        'input_ids': torch.tensor([masked_ids]),
        'attention_mask': torch.tensor([[1] * len(masked_ids)])
    }

    # Get prediction
    with torch.no_grad():
        outputs = model(**tokens_masked)
    predictions = outputs.logits[0, position].softmax(dim=-1)
    predicted_token = predictions.argmax().item()
    confidence = predictions.max().item()

    # Show results
    original_text = tokenizer.decode([original_token], skip_special_tokens=True)
    predicted_text = tokenizer.decode([predicted_token], skip_special_tokens=True)
    correct = "✓" if predicted_token == original_token else "✗"
    print(f"Position {position}: {correct}")
    print(f"  Original:  {repr(original_text)}")
    print(f"  Predicted: {repr(predicted_text)} (confidence: {confidence:.1%})")
# Expected Output:
# Position 1: ✓
#   Original:  '\x7fEL'
#   Predicted: '\x7fEL' (confidence: 79.1%)
# Position 2: ✓
#   Original:  'F\x02'
#   Predicted: 'F\x02' (confidence: 97.9%)
# Position 3: ✓
#   Original:  '\x01\x01'
#   Predicted: '\x01\x01' (confidence: 88.7%)
Training Details
- MLM Objective: 20% masking probability
- Training Data: Binary executables from various architectures
- Optimization: AdamW with warmup, dropout 0.01
- Special Design: Increased position embeddings (520) to handle RoBERTa's position offset
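The position-embedding detail can be checked against the published config. A minimal sketch that only reads standard RoBERTa config fields; the expected values in the comments come from the model card above:
from transformers import AutoConfig

config = AutoConfig.from_pretrained('mjbommar/glaurung-small-001')
# RoBERTa-style models offset position ids past the padding index, so the
# position table (520) is sized larger than the usable 512-token window.
print(config.max_position_embeddings)  # expected: 520
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)  # 768 12 12
print(config.vocab_size)               # expected: 65536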
Limitations
- Maximum sequence length: 512 tokens
- Optimized for executable files (ELF, PE, Mach-O)
- Mean pooling recommended for embeddings (pooler layer not specifically trained)
Citation
If using this model in research:
@software{glaurung-small-001,
  title  = {Glaurung Small 001: Binary Analysis Transformer},
  author = {Glaurung Project},
  year   = {2024},
  url    = {https://github.com/mjbommar/glaurung-models}
}