loganrobbins committed on
Commit
b6a24df
·
verified ·
1 Parent(s): 31fee9a

Model card: add dataset schema + example artifact + curriculum explanation

Files changed (1)
  1. README.md +328 -0
README.md CHANGED
@@ -28,6 +28,315 @@ This repository contains **PDT adapter/head weights** trained against the GPT-OS
28
  Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While "Decomposition-and-Fill" methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from coherence drift due to the lack of cross-stream communication. In this work, we introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight Speculative Note Conditioning (SNC) adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a speculative consensus problem, where sibling streams broadcast semantic "notes" to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
29
 
30

31
  ## How to use
32
 
33
  1. Install the reference implementation (runtime + scripts):
@@ -48,6 +357,25 @@ The complete training artifacts and dataset archives are mirrored publicly in GC
48
 
49
  - **WandB run:** `https://wandb.ai/ljrweb-self/parallel-decoder-transformer/runs/fmuea63a`
50

51
  ## Citation
52
 
53
  ```bibtex
 
29
 
30
 
31
+
32
+ ## Example: PDT notes artifact (truncated)
33
+
34
+ This is a real sample from the dataset pipeline (`survey_200141_ff0a0b4f.json`). Long lists and strings are truncated, with markers such as `... <2 more items>`, to keep the model card readable.
35
+
36
+ ```json
37
+ {
38
+ "sample_id": "survey_200141_ff0a0b4f",
39
+ "domain": "survey",
40
+ "plan_path": "outputs/structured_plans/pdt_10k/survey/survey_200141_ff0a0b4f.json",
41
+ "sectional_independence": true,
42
+ "lag_delta": 1,
43
+ "note_cadence_M": 6,
44
+ "true_notes_example": {
45
+ "stream_id": "stream_1",
46
+ "ENT": [
47
+ {
48
+ "id": "E1",
49
+ "name": "Croatan",
50
+ "aliases": [
51
+ "Croatoan"
52
+ ],
53
+ "type": "Ethnic Group",
54
+ "canonical": true
55
+ },
56
+ {
57
+ "id": "E2",
58
+ "name": "Dare County",
59
+ "aliases": [
60
+ "Alligator River",
61
+ "Croatan Sound",
62
+ "Roanoke Island",
63
+ "... <2 more items>"
64
+ ],
65
+ "type": "Location",
66
+ "canonical": true
67
+ },
68
+ {
69
+ "id": "E3",
70
+ "name": "werowances",
71
+ "aliases": [
72
+ "chiefs"
73
+ ],
74
+ "type": "Leadership Title",
75
+ "canonical": true
76
+ },
77
+ "... <4 more items>"
78
+ ],
79
+ "FACT": [
80
+ {
81
+ "subj_id": "E1",
82
+ "predicate": "lived in",
83
+ "object": "coastal areas of what is now North Carolina",
84
+ "evidence_span": {
85
+ "start": 45,
86
+ "end": 87,
87
+ "text": "coastal areas of what is now North Carolina"
88
+ },
89
+ "certainty": 1.0
90
+ },
91
+ {
92
+ "subj_id": "E1",
93
+ "predicate": "might have been",
94
+ "object": "a branch of the larger Roanoke people or allied with them",
95
+ "evidence_span": {
96
+ "start": 92,
97
+ "end": 141,
98
+ "text": "a branch of the larger Roanoke people or allied with them"
99
+ },
100
+ "certainty": 0.8
101
+ },
102
+ {
103
+ "subj_id": "E2",
104
+ "predicate": "encompasses",
105
+ "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
106
+ "evidence_span": {
107
+ "start": 177,
108
+ "end": 265,
109
+ "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
110
+ },
111
+ "certainty": 1.0
112
+ },
113
+ "... <5 more items>"
114
+ ],
115
+ "COVERAGE": [
116
+ {
117
+ "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
118
+ "status": "missing"
119
+ },
120
+ {
121
+ "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
122
+ "status": "missing"
123
+ },
124
+ {
125
+ "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
126
+ "status": "missing"
127
+ },
128
+ "... <6 more items>"
129
+ ]
130
+ },
131
+ "speculative_variant_example": {
132
+ "variant_id": "survey_200141_ff0a0b4f_variant_0",
133
+ "noise_config": {
134
+ "paraphrase_ratio": 0.15,
135
+ "drop_ratio": 0.05,
136
+ "hallucination_ratio": 0.05,
137
+ "shuffle_notes": true
138
+ },
139
+ "lag_delta": 1,
140
+ "notes_example": {
141
+ "stream_id": "stream_1",
142
+ "ENT": [
143
+ {
144
+ "id": "E1",
145
+ "name": "Croatan",
146
+ "aliases": [
147
+ "Croatoan",
148
+ "Croatian"
149
+ ],
150
+ "type": "Ethnic Group",
151
+ "canonical": true
152
+ },
153
+ {
154
+ "id": "E2",
155
+ "name": "Dare County",
156
+ "aliases": [
157
+ "Alligator River",
158
+ "Croatan Sound",
159
+ "Roanoke Island",
160
+ "... <1 more items>"
161
+ ],
162
+ "type": "Location",
163
+ "canonical": true
164
+ },
165
+ {
166
+ "id": "E3",
167
+ "name": "werowances",
168
+ "aliases": [
169
+ "chiefs",
170
+ "leaders"
171
+ ],
172
+ "type": "Leadership Title",
173
+ "canonical": true
174
+ },
175
+ "... <1 more items>"
176
+ ],
177
+ "FACT": [
178
+ {
179
+ "subj_id": "E1",
180
+ "predicate": "lived in",
181
+ "object": "coastal areas of what is now North Carolina",
182
+ "evidence_span": {
183
+ "start": 45,
184
+ "end": 87,
185
+ "text": "coastal areas of what is now North Carolina"
186
+ },
187
+ "certainty": 1.0
188
+ },
189
+ {
190
+ "subj_id": "E1",
191
+ "predicate": "might have been",
192
+ "object": "a branch of the larger Roanoke people or allied with them",
193
+ "evidence_span": {
194
+ "start": 92,
195
+ "end": 141,
196
+ "text": "a branch of the larger Roanoke people or allied with them"
197
+ },
198
+ "certainty": 0.8
199
+ },
200
+ {
201
+ "subj_id": "E2",
202
+ "predicate": "encompasses",
203
+ "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
204
+ "evidence_span": {
205
+ "start": 177,
206
+ "end": 265,
207
+ "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
208
+ },
209
+ "certainty": 1.0
210
+ },
211
+ "... <1 more items>"
212
+ ],
213
+ "COVERAGE": [
214
+ {
215
+ "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
216
+ "status": "missing"
217
+ },
218
+ {
219
+ "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
220
+ "status": "missing"
221
+ },
222
+ {
223
+ "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
224
+ "status": "missing"
225
+ },
226
+ "... <1 more items>"
227
+ ]
228
+ }
229
+ },
230
+ "versioned_notes_snapshot_0": {
231
+ "snapshot_id": 0,
232
+ "source": "procedural_bus",
233
+ "lag_delta": 1,
234
+ "note_cadence_M": 6,
235
+ "ent_count": 9,
236
+ "fact_count": 10,
237
+ "notes_example": {
238
+ "stream_id": "stream_1",
239
+ "ENT": [
240
+ {
241
+ "id": "E1",
242
+ "name": "Croatan",
243
+ "aliases": [
244
+ "Croatoan"
245
+ ],
246
+ "type": "Ethnic Group",
247
+ "canonical": true
248
+ },
249
+ {
250
+ "id": "E2",
251
+ "name": "Dare County",
252
+ "aliases": [
253
+ "Alligator River",
254
+ "Croatan Sound",
255
+ "Roanoke Island",
256
+ "... <1 more items>"
257
+ ],
258
+ "type": "Location",
259
+ "canonical": true
260
+ },
261
+ {
262
+ "id": "E3",
263
+ "name": "werowances",
264
+ "aliases": [
265
+ "chiefs"
266
+ ],
267
+ "type": "Leadership Title",
268
+ "canonical": true
269
+ },
270
+ "... <1 more items>"
271
+ ],
272
+ "FACT": [
273
+ {
274
+ "subj_id": "E1",
275
+ "predicate": "lived in",
276
+ "object": "coastal areas of what is now North Carolina",
277
+ "evidence_span": {
278
+ "start": 45,
279
+ "end": 87,
280
+ "text": "coastal areas of what is now North Carolina"
281
+ },
282
+ "certainty": 1.0
283
+ },
284
+ {
285
+ "subj_id": "E1",
286
+ "predicate": "might have been",
287
+ "object": "a branch of the larger Roanoke people or allied with them",
288
+ "evidence_span": {
289
+ "start": 92,
290
+ "end": 141,
291
+ "text": "a branch of the larger Roanoke people or allied with them"
292
+ },
293
+ "certainty": 0.8
294
+ },
295
+ {
296
+ "subj_id": "E2",
297
+ "predicate": "encompasses",
298
+ "object": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks",
299
+ "evidence_span": {
300
+ "start": 177,
301
+ "end": 265,
302
+ "text": "the Alligator River, Croatan Sound, Roanoke Island, Ocracoke Island, and parts of the Outer Banks"
303
+ },
304
+ "certainty": 1.0
305
+ },
306
+ "... <1 more items>"
307
+ ],
308
+ "COVERAGE": [
309
+ {
310
+ "plan_item_id": "Define who the Croatan were, where they lived historically, and where related people live today.",
311
+ "status": "missing"
312
+ },
313
+ {
314
+ "plan_item_id": "Describe political leadership (werowances) and their responsibilities regarding wealth and decision-making.",
315
+ "status": "missing"
316
+ },
317
+ {
318
+ "plan_item_id": "Summarize core religious beliefs about a chief god, petty gods, immortality of the soul, heaven/Popogusso, and roles of priests and conjurors.",
319
+ "status": "missing"
320
+ },
321
+ "... <1 more items>"
322
+ ]
323
+ }
324
+ },
325
+ "rollback": {
326
+ "triggered": false,
327
+ "l_tokens": 0,
328
+ "events": []
329
+ }
330
+ }
331
+ ```
332
+
333
+ To reproduce this view locally:
334
+
335
+ ```bash
336
+ uv run python scripts/pretty_notes_artifact.py survey_200141_ff0a0b4f.json
337
+ ```
338
+
339
+
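The `noise_config` block in the artifact lists `paraphrase_ratio`, `drop_ratio`, `hallucination_ratio`, and `shuffle_notes`, but the card does not say how the pipeline applies them. The sketch below is one plausible reading of the two purely mechanical knobs, `drop_ratio` and `shuffle_notes`, assuming each fact is dropped independently with probability `drop_ratio`. The function `perturb_facts` and its `seed` parameter are hypothetical; paraphrasing and hallucination are omitted because they would need a generator model.

```python
import random
from typing import Any, Dict, List


def perturb_facts(
    facts: List[Dict[str, Any]],
    drop_ratio: float,
    shuffle_notes: bool,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Return a noised copy of `facts` (assumed semantics, not the pipeline's).

    Each fact is kept with probability 1 - drop_ratio; the survivors are
    optionally shuffled. The input list is left unmodified.
    """
    rng = random.Random(seed)  # seeded so a variant can be regenerated
    kept = [f for f in facts if rng.random() >= drop_ratio]
    if shuffle_notes:
        rng.shuffle(kept)
    return kept


if __name__ == "__main__":
    facts = [{"subj_id": "E1", "predicate": f"p{i}"} for i in range(10)]
    # Values taken from the noise_config shown in the artifact.
    noisy = perturb_facts(facts, drop_ratio=0.05, shuffle_notes=True)
    print(len(facts), "->", len(noisy))
```

Seeding the generator per variant is a design choice of this sketch: it would let a `variant_id` map to a reproducible perturbation, though the card does not confirm that the pipeline works this way.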
340
  ## How to use
341
 
342
  1. Install the reference implementation (runtime + scripts):
 
357
 
358
  - **WandB run:** `https://wandb.ai/ljrweb-self/parallel-decoder-transformer/runs/fmuea63a`
359
 
360
+ ## Why the dataset is structured this way
361
+
362
+ PDT is trained on **streamed, structured supervision** produced by a 5-stage pipeline. The stages that determine the schema are:
363
+
364
+ - **Stage 2 (Plans):** a 3-stream decomposition plan is generated for each document.
365
+ - **Stage 3 (Notes):** we generate **true notes (teacher)** and **speculative notes (student input)** in a consistent schema:
366
+   - `ENT`: entity table (stable ids)
367
+   - `FACT`: grounded tuples with `evidence_span`
368
+   - `COVERAGE`: plan-item status targets (`covered|partial|missing`)
369
+   - `versioned_notes`: lagged, versioned snapshots mirroring the Dynamic Notes Bus semantics
370
+ - **Stage 5 (KD Export):** these artifacts are converted into `kd_*.jsonl` where each line is a **stream-level** training example.
371
+
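The schema above implies a few invariants that can be checked mechanically: every `FACT` should point at an `ENT` id, every `FACT` should carry an `evidence_span`, and every `COVERAGE` status should be one of `covered`, `partial`, or `missing`. The sketch below checks these on a plain dictionary shaped like the example artifact. The function name `validate_notes` and the error messages are hypothetical; the reference implementation may validate differently or not at all.

```python
from typing import Any, Dict, List

ALLOWED_STATUS = {"covered", "partial", "missing"}  # from the COVERAGE schema


def validate_notes(notes: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    errors: List[str] = []
    # Non-dict entries (e.g. truncation markers in a pretty view) are skipped.
    ent_ids = {e["id"] for e in notes.get("ENT", []) if isinstance(e, dict)}

    for i, fact in enumerate(notes.get("FACT", [])):
        if not isinstance(fact, dict):
            continue
        if fact.get("subj_id") not in ent_ids:
            errors.append(f"FACT[{i}]: unknown subj_id {fact.get('subj_id')!r}")
        if "evidence_span" not in fact:
            errors.append(f"FACT[{i}]: missing evidence_span")

    for i, cov in enumerate(notes.get("COVERAGE", [])):
        if not isinstance(cov, dict):
            continue
        if cov.get("status") not in ALLOWED_STATUS:
            errors.append(f"COVERAGE[{i}]: invalid status {cov.get('status')!r}")
    return errors


GOOD_NOTES = {
    "stream_id": "stream_1",
    "ENT": [{"id": "E1", "name": "Croatan"}],
    "FACT": [{
        "subj_id": "E1",
        "predicate": "lived in",
        "object": "coastal areas of what is now North Carolina",
        "evidence_span": {"start": 45, "end": 87,
                          "text": "coastal areas of what is now North Carolina"},
        "certainty": 1.0,
    }],
    "COVERAGE": [{"plan_item_id": "Define who the Croatan were.",
                  "status": "missing"}],
}

print(validate_notes(GOOD_NOTES))
# → []
```

A check of this kind could be run on each line of a `kd_*.jsonl` export before training, assuming each line embeds notes in the same shape; the card does not document the exported line format.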
372
+ This layout is required to support the **teacher→student curriculum** described in the training guide. Curriculum stages are numbered independently of the pipeline stages above:
373
+
374
+ - **Stage 0:** planner/notes-head bootstrap (trunk frozen)
375
+ - **Stage 1:** stream adapters + SNC cross-attention bootstrap (speculation frozen; teacher notes forced)
376
+ - **Stage 2:** enable speculation + notes-bus usage (teacher-heavy mixing)
377
+ - **Stage 3:** train agreement + coverage heads for self-correction/rollback behavior (trunk still frozen)
378
+
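The four curriculum stages can be written down as data, which makes the "trunk always frozen" property easy to assert. The sketch below is illustrative only: the class `CurriculumStage`, the component names in `focus`, and the `notes_source` labels are hypothetical names chosen to mirror the bullets above, and `None` marks details the card does not state. It does not reproduce the training guide's configuration format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CurriculumStage:
    index: int
    focus: Tuple[str, ...]        # components the stage concentrates on
    notes_source: Optional[str]   # None where the card does not say
    trunk_frozen: bool = True     # the backbone is never updated


CURRICULUM = (
    CurriculumStage(0, ("planner_head", "notes_head"), None),
    CurriculumStage(1, ("stream_adapters", "snc_cross_attention"), "teacher_forced"),
    CurriculumStage(2, ("speculation", "notes_bus"), "teacher_heavy_mix"),
    CurriculumStage(3, ("agreement_head", "coverage_head"), None),
)


def describe(stage: CurriculumStage) -> str:
    """One-line summary of a stage, suitable for a training log."""
    src = stage.notes_source or "unspecified"
    return (
        f"stage {stage.index}: focus={', '.join(stage.focus)}; "
        f"notes={src}; trunk_frozen={stage.trunk_frozen}"
    )


if __name__ == "__main__":
    # Sanity check implied by the card: no stage unfreezes the trunk.
    assert all(s.trunk_frozen for s in CURRICULUM)
    for s in CURRICULUM:
        print(describe(s))
```

Keeping the stages in an immutable tuple of frozen dataclasses is a choice made for this sketch so the schedule cannot be mutated mid-run; whether earlier components stay trainable in later stages is not stated in the card and is deliberately left out.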
379
  ## Citation
380
 
381
  ```bibtex