Scripts to produce PPL and KLD diagrams?

#10 · opened by Thireus

Sure, @thireus, I've documented my methodology for collecting perplexity and KL-divergence statistics, plus speed benchmarks, in my quant cooker's guide, about halfway down the page.

The dependencies section shows how to download the exact wiki.test.raw file I'm using for perplexity measurements.

I'd also recommend keeping logs of all your runs, e.g. `llama-perplexity .... 2>&1 | tee -a mymodel-myquant.log` and so on.

As for generating those plots, I used R1-0528 to vibe code a matplotlib Python app to my requirements that reads in a datafile. It works roughly like this:

  1. Run the perplexity commands against all the quants and save the log files.
  2. Parse the log files to extract the results and save them in JSON format.
  3. Run the Python script to read in the datafile and plot the results.
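
As a minimal sketch of step 2 (assuming your llama-perplexity build prints a final summary line of the form `Final estimate: PPL = 3.2119 +/- 0.01697` — check your own logs, since the exact wording may differ between builds), the parsing can be as simple as:

```python
import re
from pathlib import Path

# Matches the summary line llama-perplexity prints at the end of a run,
# e.g. "Final estimate: PPL = 3.2119 +/- 0.01697"
PPL_RE = re.compile(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)")

def parse_logs(log_dir):
    """Collect one {"name", "ppl"} entry per *.log file in log_dir.

    The size/bpw/legend fields still need to be filled in by hand
    (or from the model-load output) before plotting.
    """
    entries = []
    for log_path in sorted(Path(log_dir).glob("*.log")):
        m = PPL_RE.search(log_path.read_text())
        if m is None:
            continue  # run didn't finish; skip rather than guess
        entries.append({
            "name": log_path.stem,                    # e.g. "mymodel-IQ4_KS_R4"
            "ppl": f"{m.group(1)} +/- {m.group(2)}",  # same string format the plot script expects
        })
    return entries
```

A `json.dump(parse_logs("logs"), f, indent=2)` then writes the datafile in the same shape as the JSON below.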

I always spot-check results and try not to rely on an AI to parse actual data directly; I prefer to use AI to write a script that reads in datafiles I've created or checked myself. That said, here are the current scripts I'm using, which you could feed into R1-0528 and adjust for your needs:

👈 `ppl-r1-0528.py`
import json
import matplotlib.pyplot as plt
from adjustText import adjust_text
import numpy as np
from matplotlib.lines import Line2D

# Read JSON data from file
with open('ppl-r1-0528.json', 'r') as f:
    data = json.load(f)

# Filter out incomplete entries and extract mean perplexity and error
filtered_data = []
for entry in data:
    if 'ppl' in entry and 'size' in entry and 'bpw' in entry and 'legend' in entry:
        # Parse perplexity string to get mean and error
        ppl_parts = entry['ppl'].split()
        mean_ppl = float(ppl_parts[0])
        error = float(ppl_parts[2])  # The value after "+/-"

        filtered_data.append({
            'name': entry['name'],
            'mean_ppl': mean_ppl,
            'error': error,
            'size': float(entry['size']),
            'bpw': float(entry['bpw']),
            'legend': entry['legend']
        })

# Sort by size (smallest to largest)
sorted_data = sorted(filtered_data, key=lambda x: x['size'])

# Prepare plot data
names = [d['name'] for d in sorted_data]
sizes = [d['size'] for d in sorted_data]
ppls = [d['mean_ppl'] for d in sorted_data]
errors = [d['error'] for d in sorted_data]
bpws = [d['bpw'] for d in sorted_data]
legends = [d['legend'] for d in sorted_data]

# Find minimum perplexity (best model)
min_ppl = min(ppls)

# Calculate ln(PPL/min_ppl) for each point
ln_ratios = [np.log(p / min_ppl) for p in ppls]
# Calculate error for ln ratio: d(ln(p)) = dp/p
ln_ratio_errors = [e / p for e, p in zip(errors, ppls)]

# Create annotation labels (show original perplexity values)
labels = [
    f"{name}\nppl: {ppl:.4f}\nbpw: {bpw:.3f}"
    for name, ppl, bpw in zip(names, ppls, bpws)
]

# Identify which points to include (exclude min_ppl point)
include_mask = [p != min_ppl for p in ppls]

# Create data for connecting line (replace excluded point with NaN to break line)
line_sizes = [size if mask else np.nan for size, mask in zip(sizes, include_mask)]
line_ln_ratios = [ln if mask else np.nan for ln, mask in zip(ln_ratios, include_mask)]

# Filter data for plotting (exclude min_ppl point)
plot_sizes = [size for i, size in enumerate(sizes) if include_mask[i]]
plot_ln_ratios = [ln for i, ln in enumerate(ln_ratios) if include_mask[i]]
plot_ln_ratio_errors = [err for i, err in enumerate(ln_ratio_errors) if include_mask[i]]
plot_legends = [leg for i, leg in enumerate(legends) if include_mask[i]]
plot_labels = [label for i, label in enumerate(labels) if include_mask[i]]

# Apply solarized style
plt.style.use('Solarize_Light2')

# Create figure
fig, ax = plt.subplots(figsize=(12, 8))

# Set labels
ax.set_xlabel('Model Size (GiB)', fontsize=12)
ax.set_ylabel('ln(PPL(Q)/PPL(base)) wiki.test.raw *lower is better*', fontsize=12)

# Set title and subtitle with increased padding
main_title = "DeepSeek-R1-0528 ik_llama.cpp"
subtitle = f"GGUF Perplexity Comparison against baseline pure q8_0 PPL of {min_ppl} (not plotted)"

ax.set_title(main_title, fontsize=16, pad=5)
ax.text(0.5, 1.05, subtitle, transform=ax.transAxes,
        ha='center', fontsize=13, style='italic', color='#586e75')

# Add grid
ax.grid(True, linestyle='--', alpha=0.7)

# Set axis limits based on remaining points
#padding_x = 0.2
padding_x = 2.0
if plot_sizes:
    ax.set_xlim(min(plot_sizes)-padding_x, max(plot_sizes)+padding_x)
if plot_ln_ratios:
    ymax_val = max(ln + err for ln, err in zip(plot_ln_ratios, plot_ln_ratio_errors))
    ax.set_ylim(0, ymax_val * 1.1)

# Plot dotted connecting line (with break at excluded point)
#ax.plot(line_sizes, line_ln_ratios, ':', color='#586e75', linewidth=1.5, zorder=1)

# Define unique markers and color map for legend groups
markers = ['o', 's', '^', 'D', 'v', '*', 'p', 'h', 'X', 'd', 'P', '>']
unique_legends = sorted(set(plot_legends))  # Only include legends from remaining points
colors = plt.cm.Set2(np.linspace(0, 1, len(unique_legends)))

# Create mapping from legend to color and marker
legend_color_map = {legend: colors[i] for i, legend in enumerate(unique_legends)}
legend_marker_map = {legend: markers[i % len(markers)] for i, legend in enumerate(unique_legends)}

# Plot each point with error bars, using group-specific color and marker
for i, (size, ln_ratio, ln_error, legend) in enumerate(zip(plot_sizes, plot_ln_ratios, plot_ln_ratio_errors, plot_legends)):
    # Get color and marker for this legend group
    color = legend_color_map[legend]
    marker = legend_marker_map[legend]

    # Add error bar
    ax.errorbar(
        size,
        ln_ratio,
        yerr=ln_error,
        fmt='none',  # Don't plot main line
        ecolor=color,
        elinewidth=1.5,
        capsize=4,
        alpha=0.7,
        zorder=2
    )

    # Add scatter point with marker based on legend
    ax.scatter(
        size,
        ln_ratio,
        marker=marker,
        color=color,
        s=100,
        edgecolor='#586e75',  # Solarized base01 for outline
        linewidth=0.8,
        zorder=3
    )

# Create text annotations without boxes
texts = []
for size, ln_ratio, label in zip(plot_sizes, plot_ln_ratios, plot_labels):
    texts.append(
        plt.text(
            size,
            ln_ratio,
            label,
            fontsize=9,
            ha='center',
            va='bottom',
            zorder=4
        )
    )

# Adjust text positions to avoid overlaps
adjust_text(
    texts,
    x=plot_sizes,
    y=plot_ln_ratios,
    arrowprops=dict(
        arrowstyle='->',
        color='#586e75',  # Solarized base01
        alpha=0.7,
        linewidth=1.0
    ),
    expand=(1.2, 1.8),
    ensure_inside_axes=True,
    min_arrow_len=0.15,
    prevent_crossings=False,
    only_move={'points': 'xy', 'text': 'xy'},
    max_move=150
)

# Add horizontal line at 0 for reference (ln(1)=0)
ax.axhline(y=0, color='#93a1a1', linestyle='-', linewidth=0.5, alpha=0.5, zorder=0)

# Create custom legend for legend groups with group-specific colors
legend_handles = [
    Line2D([0], [0],
           marker=legend_marker_map[legend],
           color=legend_color_map[legend],
           markersize=10,
           label=legend,
           linewidth=0,
           markeredgecolor='gray')
    for legend in unique_legends
]

# Add legend to plot
ax.legend(
    handles=legend_handles,
    title='Flavour',
    loc='upper right',
    framealpha=0.9
)

# Save figure
out_filename = 'ppl-r1-0528.png'
plt.tight_layout()
plt.savefig(out_filename, dpi=150, bbox_inches='tight')
print(f"Plot saved to {out_filename}")
print(f"Reference: Minimum perplexity = {min_ppl:.4f} (excluded from plot)")
👈 `ppl-r1-0528.json`
[
  {
    "name": "q4_0",
    "ppl": "3.2895 +/- 0.01755",
    "size": 352.656,
    "bpw": 4.508,
    "legend": "pure"
  },
  {
    "name": "q8_0",
    "ppl": "3.2119 +/- 0.01697",
    "size": 665.308,
    "bpw": 8.504,
    "legend": "pure"
  },
  {
    "name": "IQ4_KS_R4",
    "ppl": "3.2286 +/- 0.01710",
    "size": 367.774,
    "bpw": 4.701,
    "legend": "ubergarm"
  },
  {
    "name": "IQ3_K_R4",
    "ppl": "3.2730 +/- 0.01738",
    "size": 300.938,
    "bpw": 3.847,
    "legend": "ubergarm"
  },
  {
    "name": "IQ2_K_R4",
    "ppl": "3.5069 +/- 0.01893",
    "size": 219.019,
    "bpw": 2.799,
    "legend": "ubergarm"
  },
  {
    "name": "IQ2_KT",
    "ppl": "3.6378 +/- 0.01997",
    "size": 196.696,
    "bpw": 2.514,
    "legend": "ubergarm"
  },
  {
    "name": "IQ1_S_R4",
    "ppl": "4.8831 +/- 0.02878",
    "size": 130.203,
    "bpw": 1.664,
    "legend": "ubergarm"
  }
]
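
For reference, the ln(PPL ratio) transform and error bar that the script applies can be checked by hand against the entries above (q8_0 as the baseline, IQ4_KS_R4 as the quant). Note that, like the script, this only propagates the quant's own uncertainty and ignores the baseline's error bar:

```python
import math

# Values copied from ppl-r1-0528.json above
base_ppl = 3.2119            # q8_0 baseline (the min-PPL point the script excludes)
ppl, err = 3.2286, 0.01710   # IQ4_KS_R4

ln_ratio = math.log(ppl / base_ppl)  # what gets plotted on the y-axis
ln_err = err / ppl                   # delta method: d(ln p) = dp / p

print(f"ln ratio = {ln_ratio:.5f} +/- {ln_err:.5f}")
```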

The KLD script is basically the same thing, and you can use this Python with a prompt to vibe code whatever you need for whichever statistics you decide to plot.
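
Since the KLD variant mostly differs in which field it parses, adapting the filter loop is largely a matter of swapping the key. A hypothetical sketch (the `kld` field name and values here are illustrative, assuming you store the statistic in the same `"mean +/- err"` string format as `ppl`):

```python
# Hypothetical entry shape for a KLD datafile (values are made up for illustration)
entry = {"name": "IQ4_KS_R4", "kld": "0.0123 +/- 0.0004", "size": 367.774}

# Same split-on-whitespace parsing the PPL script uses
kld_parts = entry["kld"].split()
mean_kld = float(kld_parts[0])
error = float(kld_parts[2])  # the value after "+/-"
```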

I'd be happy to share my unpublished ubergarm-kld-test-corpus if you DM me on reddit u/VoidAlchemy or anywhere else I can share a link to a private gist. The reasoning is to keep the test corpus away from AI scrapers, so it doesn't end up in any training data or imatrix calibration corpus, which could over-fit to that specific dataset and give artificially low benchmark scores.
