@@ -28,6 +28,315 @@ This repository contains **PDT adapter/head weights** trained against the GPT-OS
 Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While "Decomposition-and-Fill" methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from coherence drift due to the lack of cross-stream communication. In this work, we introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight Speculative Note Conditioning (SNC) adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a speculative consensus problem, where sibling streams broadcast semantic "notes" to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
 ## How to use
 1. Install the reference implementation (runtime + scripts):
@@ -48,6 +357,25 @@ The complete training artifacts and dataset archives are mirrored publicly in GC
 - **WandB run:** `https://wandb.ai/ljrweb-self/parallel-decoder-transformer/runs/fmuea63a`
 ## Citation
 ```bibtex

+## Example: PDT notes artifact (truncated)
+This is a real sample from the dataset pipeline (`survey_200141_ff0a0b4f.json`). Long lists are truncated, with the omitted entries marked as `... <N more items>`, to keep the model card readable.
+```json
+{
+  "sample_id": "survey_200141_ff0a0b4f",
+  "domain": "survey",
+  "plan_path": "outputs/structured_plans/pdt_10k/survey/survey_200141_ff0a0b4f.json",
+  "sectional_independence": true,
+  "lag_delta": 1,
+  "note_cadence_M": 6,
+  "true_notes_example": {
+    "stream_id": "stream_1",
+    "ENT": [
+      {
+        "id": "E1",
+        "name": "Croatan",
+        "aliases": [
+          "Croatoan"
+        ],
+        "type": "Ethnic Group",
+        "canonical": true
+      },
+      {
+        "id": "E2",
+        "name": "Dare County",
+        "aliases": [
+          "Alligator River",
+          "Croatan Sound",
+          "Roanoke Island",
+          "... <2 more items>"
+        ],
+        "type": "Location",
+        "canonical": true
+      },
+      {
+        "id": "E3",
+        "name": "werowances",
+        "aliases": [
+          "chiefs"
+        ],
+        "type": "Leadership Title",
+        "canonical": true
+      },
+      "... <4 more items>"
+    ],
+    "FACT": [
+      {
+        "subj_id": "E1",
+        "predicate": "lived in",
+        "object": "coastal areas of what is now North Carolina",
+        "evidence_span": {
+          "start": 45,
+          "end": 87,
+          "text": "coastal areas of what is now North Carolina"
+        },
+        "certainty": 1.0
+      },
+      {
+        "subj_id": "E1",
+        "predicate": "might have been",
+        "object": "a branch of the larger Roanoke people or allied with them",
+        "evidence_span": {
+          "start": 92,
+          "end": 141,
+          "text": "a branch of the larger Roanoke people or allied with them"
+        },
+        "certainty": 0.8
+      },
+      {
+        "subj_id": "E2",
+        "predicate": "encompasses",
+        "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
+        "evidence_span": {
+          "start": 177,
+          "end": 265,
+          "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
+        },
+        "certainty": 1.0
+      },
+      "... <5 more items>"
+    ],
+    "COVERAGE": [
+      {
+        "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
+        "status": "missing"
+      },
+      {
+        "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
+        "status": "missing"
+      },
+      {
+        "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
+        "status": "missing"
+      },
+      "... <6 more items>"
+    ]
+  },
+  "speculative_variant_example": {
+    "variant_id": "survey_200141_ff0a0b4f_variant_0",
+    "noise_config": {
+      "paraphrase_ratio": 0.15,
+      "drop_ratio": 0.05,
+      "hallucination_ratio": 0.05,
+      "shuffle_notes": true
+    },
+    "lag_delta": 1,
+    "notes_example": {
+      "stream_id": "stream_1",
+      "ENT": [
+        {
+          "id": "E1",
+          "name": "Croatan",
+          "aliases": [
+            "Croatoan",
+            "Croatian"
+          ],
+          "type": "Ethnic Group",
+          "canonical": true
+        },
+        {
+          "id": "E2",
+          "name": "Dare County",
+          "aliases": [
+            "Alligator River",
+            "Croatan Sound",
+            "Roanoke Island",
+            "... <1 more items>"
+          ],
+          "type": "Location",
+          "canonical": true
+        },
+        {
+          "id": "E3",
+          "name": "werowances",
+          "aliases": [
+            "chiefs",
+            "leaders"
+          ],
+          "type": "Leadership Title",
+          "canonical": true
+        },
+        "... <1 more items>"
+      ],
+      "FACT": [
+        {
+          "subj_id": "E1",
+          "predicate": "lived in",
+          "object": "coastal areas of what is now North Carolina",
+          "evidence_span": {
+            "start": 45,
+            "end": 87,
+            "text": "coastal areas of what is now North Carolina"
+          },
+          "certainty": 1.0
+        },
+        {
+          "subj_id": "E1",
+          "predicate": "might have been",
+          "object": "a branch of the larger Roanoke people or allied with them",
+          "evidence_span": {
+            "start": 92,
+            "end": 141,
+            "text": "a branch of the larger Roanoke people or allied with them"
+          },
+          "certainty": 0.8
+        },
+        {
+          "subj_id": "E2",
+          "predicate": "encompasses",
+          "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
+          "evidence_span": {
+            "start": 177,
+            "end": 265,
+            "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
+          },
+          "certainty": 1.0
+        },
+        "... <1 more items>"
+      ],
+      "COVERAGE": [
+        {
+          "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
+          "status": "missing"
+        },
+        {
+          "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
+          "status": "missing"
+        },
+        {
+          "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
+          "status": "missing"
+        },
+        "... <1 more items>"
+      ]
+    }
+  },
+  "versioned_notes_snapshot_0": {
+    "snapshot_id": 0,
+    "source": "procedural_bus",
+    "lag_delta": 1,
+    "note_cadence_M": 6,
+    "ent_count": 9,
+    "fact_count": 10,
+    "notes_example": {
+      "stream_id": "stream_1",
+      "ENT": [
+        {
+          "id": "E1",
+          "name": "Croatan",
+          "aliases": [
+            "Croatoan"
+          ],
+          "type": "Ethnic Group",
+          "canonical": true
+        },
+        {
+          "id": "E2",
+          "name": "Dare County",
+          "aliases": [
+            "Alligator River",
+            "Croatan Sound",
+            "Roanoke Island",
+            "... <1 more items>"
+          ],
+          "type": "Location",
+          "canonical": true
+        },
+        {
+          "id": "E3",
+          "name": "werowances",
+          "aliases": [
+            "chiefs"
+          ],
+          "type": "Leadership Title",
+          "canonical": true
+        },
+        "... <1 more items>"
+      ],
+      "FACT": [
+        {
+          "subj_id": "E1",
+          "predicate": "lived in",
+          "object": "coastal areas of what is now North Carolina",
+          "evidence_span": {
+            "start": 45,
+            "end": 87,
+            "text": "coastal areas of what is now North Carolina"
+          },
+          "certainty": 1.0
+        },
+        {
+          "subj_id": "E1",
+          "predicate": "might have been",
+          "object": "a branch of the larger Roanoke people or allied with them",
+          "evidence_span": {
+            "start": 92,
+            "end": 141,
+            "text": "a branch of the larger Roanoke people or allied with them"
+          },
+          "certainty": 0.8
+        },
+        {
+          "subj_id": "E2",
+          "predicate": "encompasses",
+          "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
+          "evidence_span": {
+            "start": 177,
+            "end": 265,
+            "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
+          },
+          "certainty": 1.0
+        },
+        "... <1 more items>"
+      ],
+      "COVERAGE": [
+        {
+          "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
+          "status": "missing"
+        },
+        {
+          "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
+          "status": "missing"
+        },
+        {
+          "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
+          "status": "missing"
+        },
+        "... <1 more items>"
+      ]
+    }
+  },
+  "rollback": {
+    "triggered": false,
+    "l_tokens": 0,
+    "events": []
+  }
+}
+```
+To reproduce this view locally:
+```bash
+uv run python scripts/pretty_notes_artifact.py survey_200141_ff0a0b4f.json
+```
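As a rough illustration of how the true and speculative notes in the artifact above differ, the sketch below lists aliases that appear only in the speculative variant. The helper names (`entity_aliases`, `added_aliases`) are hypothetical and not part of the reference implementation. The inline sample copies only entities `E1` and `E3` from the example, and the placeholder handling assumes the truncated `... <N more items>` view shown here.

```python
from typing import Any


def entity_aliases(notes: dict[str, Any]) -> dict[str, set[str]]:
    """Map entity id -> set of aliases, skipping truncation placeholders."""
    aliases: dict[str, set[str]] = {}
    for ent in notes.get("ENT", []):
        if not isinstance(ent, dict):
            continue  # e.g. the "... <4 more items>" strings in the truncated view
        aliases[ent["id"]] = {
            a for a in ent.get("aliases", []) if not a.startswith("... <")
        }
    return aliases


def added_aliases(true_notes: dict[str, Any], spec_notes: dict[str, Any]) -> dict[str, list[str]]:
    """Return aliases present in the speculative notes but not in the true notes."""
    true_map = entity_aliases(true_notes)
    extra: dict[str, list[str]] = {}
    for ent_id, spec_set in entity_aliases(spec_notes).items():
        new = spec_set - true_map.get(ent_id, set())
        if new:
            extra[ent_id] = sorted(new)
    return extra


# Inline sample: E1 and E3 copied from the artifact above (other fields omitted).
true_notes = {
    "ENT": [
        {"id": "E1", "name": "Croatan", "aliases": ["Croatoan"]},
        {"id": "E3", "name": "werowances", "aliases": ["chiefs"]},
        "... <4 more items>",
    ]
}
spec_notes = {
    "ENT": [
        {"id": "E1", "name": "Croatan", "aliases": ["Croatoan", "Croatian"]},
        {"id": "E3", "name": "werowances", "aliases": ["chiefs", "leaders"]},
        "... <1 more items>",
    ]
}

result = added_aliases(true_notes, spec_notes)
print(result)  # {'E1': ['Croatian'], 'E3': ['leaders']}
```

On the truncated view the comparison is only partial, so for a complete diff it would need to run against the full artifact file rather than the pretty-printed output.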
 ## How to use
 1. Install the reference implementation (runtime + scripts):
 - **WandB run:** `https://wandb.ai/ljrweb-self/parallel-decoder-transformer/runs/fmuea63a`
+## Why the dataset is structured this way
+PDT is trained on **streamed, structured supervision** produced by a 5-stage pipeline. The stages most relevant to the training format are:
+- **Stage 2 (Plans):** a 3-stream decomposition plan is generated for each document.
+- **Stage 3 (Notes):** we generate **true notes (teacher)** and **speculative notes (student input)** in a consistent schema:
+  - `ENT`: entity table (stable ids)
+  - `FACT`: grounded tuples with `evidence_span`
+  - `COVERAGE`: plan-item status targets (`covered|partial|missing`)
+  - `versioned_notes`: lagged, versioned snapshots mirroring the Dynamic Notes Bus semantics
+- **Stage 5 (KD Export):** these artifacts are converted into `kd_*.jsonl` where each line is a **stream-level** training example.
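The schema above lends itself to simple consistency checks. The following is a minimal sketch, assuming a notes object shaped like the `true_notes_example` in the artifact: it checks that each `FACT` refers to a known `ENT` id, that each fact carries evidence text, and that each `COVERAGE` status is one of `covered|partial|missing`. The function name `validate_notes` is hypothetical and the pipeline's own validation, if any, may differ.

```python
ALLOWED_STATUS = {"covered", "partial", "missing"}


def validate_notes(notes: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: list[str] = []
    # Collect entity ids, ignoring any truncation placeholder strings.
    ent_ids = {e["id"] for e in notes.get("ENT", []) if isinstance(e, dict)}

    for i, fact in enumerate(notes.get("FACT", [])):
        if not isinstance(fact, dict):
            continue
        if fact.get("subj_id") not in ent_ids:
            problems.append(f"FACT[{i}]: subj_id {fact.get('subj_id')!r} not in ENT")
        span = fact.get("evidence_span") or {}
        if not span.get("text"):
            problems.append(f"FACT[{i}]: missing evidence_span text")

    for i, item in enumerate(notes.get("COVERAGE", [])):
        if not isinstance(item, dict):
            continue
        if item.get("status") not in ALLOWED_STATUS:
            problems.append(f"COVERAGE[{i}]: unexpected status {item.get('status')!r}")
    return problems


good = {
    "ENT": [{"id": "E1", "name": "Croatan"}],
    "FACT": [
        {
            "subj_id": "E1",
            "predicate": "lived in",
            "evidence_span": {"text": "coastal areas of what is now North Carolina"},
        }
    ],
    "COVERAGE": [{"plan_item_id": "Define who the Croatan were", "status": "missing"}],
}
bad = {
    "ENT": [{"id": "E1", "name": "Croatan"}],
    "FACT": [{"subj_id": "E9", "predicate": "lived in", "evidence_span": {"text": "x"}}],
    "COVERAGE": [{"plan_item_id": "Define who the Croatan were", "status": "done"}],
}

print(validate_notes(good))
print(validate_notes(bad))
```

Run against the truncated view, the `subj_id` check can report false positives because some entities are elided, so it is better suited to the full artifact.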
+This layout is required to support the **teacher→student curriculum** described in the training guide. The curriculum stages are numbered separately from the pipeline stages above:
+- **Stage 0:** planner/notes-head bootstrap (trunk frozen)
+- **Stage 1:** stream adapters + SNC cross-attention bootstrap (speculation frozen; teacher notes forced)
+- **Stage 2:** enable speculation + notes-bus usage (teacher-heavy mixing)
+- **Stage 3:** train agreement + coverage heads for self-correction/rollback behavior (trunk still frozen)
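One way to picture the difference between "teacher notes forced" (Stage 1) and "teacher-heavy mixing" (Stage 2) is a per-example coin flip between true and speculative notes. This is only a sketch under stated assumptions: the function `choose_notes` and the `teacher_prob` parameter are hypothetical, and the value `0.8` is a placeholder, not a setting taken from the training guide.

```python
import random


def choose_notes(teacher_notes, speculative_notes, teacher_prob: float, rng: random.Random):
    """Return teacher notes with probability `teacher_prob`, else speculative notes."""
    if not 0.0 <= teacher_prob <= 1.0:
        raise ValueError("teacher_prob must be in [0, 1]")
    # rng.random() is in [0, 1), so teacher_prob=1.0 always picks the teacher notes.
    return teacher_notes if rng.random() < teacher_prob else speculative_notes


rng = random.Random(0)

# Stage 1 style: teacher notes forced (probability 1.0).
forced = [choose_notes("teacher", "speculative", 1.0, rng) for _ in range(100)]

# Stage 2 style: teacher-heavy mixing (0.8 is an illustrative placeholder).
mixed = [choose_notes("teacher", "speculative", 0.8, rng) for _ in range(1000)]

print(forced.count("teacher"), mixed.count("teacher"))
```

Passing the random generator explicitly keeps the sampling reproducible, which helps when comparing runs across curriculum stages.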
 ## Citation
 ```bibtex