ABA with GPT? A Roadmap to Building an Open Source AI Chatbot to Help Mitigate Loneliness
Imagine rehearsing a sticky conversation at 2 a.m. with a partner who never gets tired, never judges, and always gives you another rep. For many autistic adults, that partner is an LLM. As WIRED reported in May 2023, some users already treat ChatGPT or other LLMs as a patient practice buddy and a translator between communication styles. What we need now is to turn that ad hoc use case into a safe, auditable, open stack that the open source community can build and improve together.
We think a concrete pattern can be developed that turns an LLM into an ABA-style social skills aid for autistic adults. To demonstrate the concept with an open model, developers could start with open models such as GPT-OSS 120B, or reach a myriad of other options through the HuggingChat model router. You could then layer Retrieval-Augmented Generation over vetted Social Stories and micro-skills, and let users choose a reasoning style that fits the moment. Instant responses guide in-the-moment roleplay of social scenes, while brief, structured reasoning supports reflection and metacognition afterward. User autonomy, active consent, and cultural fit are first principles for the “ABA with GPT” project, in the spirit of “Nothing About Us Without Us.” To keep user support healthy, the INTIMA benchmark serves as a key guide for the roadmap ahead. The desired result here is not therapy or self-diagnosis, but an easy-to-access digital coach that helps people practice real conversations, then carry the gains into everyday life.
Why This Matters for Autistic Adults
Formal supports for autistic people attenuate markedly after the transition from school-linked services to adulthood, even as the demand for scaffolding persists. In the United States and the United Kingdom, longitudinal indicators and parliamentary evidence document a “services cliff,” highlighting gaps in postsecondary navigation, employment supports, and community participation for autistic adults. Among U.S. children, the CDC estimates that about 1 in 31 eight-year-olds are identified with autism. Canada reported 1 in 50 children and youth identified in 2019. Adult estimates converge near one percent in England based on direct assessment, and modeling for the United States suggests rates that may exceed two percent. All of those autistic children will one day become autistic adults, and the data point to a large adult cohort (and future adult cohort) with heterogeneous needs and uneven access to structured practice across English-speaking jurisdictions.
ABA-informed use cases for GPT are specific and bounded. The objective is skills rehearsal, not diagnosis or therapeutic treatment. [Evidence from adjacent interventions shows that structured roleplay can improve interview performance and downstream employment outcomes for autistic adults and transition-age youth](https://pmc.ncbi.nlm.nih.gov/articles/PMC4167908/). An LLM can deliver analogous, low-stakes rehearsal for day-to-day tasks: clarifying a misunderstanding, planning an exit from an overstimulating event, preparing for an interview, or even ordering at a cafe or restaurant. The procedural loop for “ABA with GPT” is simple and repeatable: set a narrow goal, practice for a few minutes, reflect briefly, then attempt a small in vivo step. Scenario templates include sensory annotations and cultural context.
Romantic and parasocial attachment to chatbots is now well described in reporting and the HCI literature. Given the unknowns and alleged risks, direct intimacy is beyond the scope of what we are trying to encourage researchers to develop. Our proposal and research roadmap draw a bright line: the “ABA with GPT” coach would not escalate intimacy, would not mirror parasocial framings, and would always offer off-ramps, including “not today” and “loop in a person I trust.” It should function as a skills coach that complements, rather than substitutes for, human support.
No Judgment and Infinite Patience in Practice
A nonjudgmental LLM is a core operational constraint that lowers anticipatory threat and makes repetition possible. In practice, a future system would adopt two deliberately asymmetric frames that map to distinct neurodivergent cognitive demands. The two modes depend on the use case: Coach mode helps before an in-person social event, and Reflect mode helps the user look back and improve for future attempts (RLHF, but for people!). In Coach mode, interaction is time-boxed and concrete. Prompts are short, action-oriented, and limited to the next feasible move, which keeps intrinsic load manageable in line with basic insights from cognitive load theory. In Reflect mode, the tempo slows and the user is invited into metacognition, but only briefly. The model debriefs the user, asking what felt useful, what felt costly, and what small action should be attempted next. That cadence treats scripts as scaffolds rather than rules, and it keeps rehearsal transferable across settings in the spirit of instructional scaffolding.
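To make the asymmetry concrete, here is a minimal sketch of how the two frames might be encoded as configuration. The field names, time budgets, and word limits are illustrative assumptions, not validated values:

```python
from dataclasses import dataclass, field

# Minimal sketch of the two asymmetric frames described above.
# All names and defaults are illustrative assumptions, not a spec.

@dataclass
class ModeConfig:
    name: str
    time_box_seconds: int          # hard cap on a practice or debrief block
    max_words_per_turn: int        # keeps intrinsic load manageable
    reasoning: str                 # "instant" or "brief"
    debrief_questions: list[str] = field(default_factory=list)

COACH = ModeConfig(
    name="coach",
    time_box_seconds=300,          # short, concrete, action oriented
    max_words_per_turn=60,         # limited to the next feasible move
    reasoning="instant",
)

REFLECT = ModeConfig(
    name="reflect",
    time_box_seconds=180,          # slow the tempo, but only briefly
    max_words_per_turn=120,
    reasoning="brief",
    debrief_questions=[
        "What felt useful?",
        "What felt costly?",
        "What small action should be attempted next?",
    ],
)
```

Keeping the two frames as explicit, inspectable configuration also makes the asymmetry auditable: anyone can check that Coach mode stays short and that Reflect mode stays brief.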
Although the “ABA with GPT” framework centers neurodivergent needs, its value is broadly human and can apply to anyone. Most people carry intrusive, messy, or taboo thoughts that sit beneath the surface of polite conversation. A well-designed coach does not pathologize that inner noise; it provides a private, nonjudgmental space to examine it with curiosity and to translate it into specific, prosocial actions. Framing the LLM this way helps non-autistic readers recognize themselves in the project’s use cases. The goal is not to sanitize inner life, but to integrate it, set boundaries, and take one small next step in the world. Naming that common ground closes the gap between atypical and typical, argues for humane, stigma-reducing design, and makes the product useful for both allistic and neurodivergent users.
Reinforcement is specific and behavior-referenced. Effort, clarity, and boundary setting are acknowledged kindly. Performative markers like eye contact or theatrical affect are not commented on unless multimodality is in use. When a suggested phrase is ill-fitting, the coach offers adjacent alternatives that preserve the user’s communicative identity. Directness, shorter turns, and sensory honesty are treated as legitimate stylistic choices, not deficits to be sanded down. Boundary hygiene needs to be explicit, direct, and in plain language. The coach maintains appropriate boundaries, avoids escalating intimacy, and does not mirror parasocial framing. When signs of distress or crisis appear, it redirects users toward human support. The interface presents clear off-ramps in plain language, including “not today,” “loop in a person I trust,” and “end session,” to make disengagement simple and transparent. When users seek clarification about these boundaries, the system ought to link to public explanations of parasocial interaction and to the model’s own system card for additional context. We also think it would be worthwhile to invert the benevolent personality and offer a hyper-critical persona as well. Many people in real life are especially harsh in personal contexts, and it would be useful to “red team” a hyper-critical scenario so users can rehearse for something more brusque.
Evaluation would follow from these commitments. Follow-up scholarship needs to measure whether the system remained a coach rather than a companion, whether handoffs occurred at the right time, and whether users reported lower anticipatory anxiety and higher willingness to attempt one small in-person step. We think boundary and handoff quality should be tracked as first-class metrics, not footnotes, so that helpfulness does not drift into dependency.
Retrieval-Augmented Generation over Social Stories and Micro-Skills
The point of RAG here is not to stuff user prompts with context but to anchor brief coaching moves in concrete, inspectable scenarios. For an adult use case, we would treat Social Stories in Carol Gray’s sense as the primary evidence objects, then layer adjacent resources that encode social commonsense and norms as secondary signals. Gray’s definition emphasizes descriptive, respectful narratives built to a formal set of criteria and used with people of all ages, rather than prescriptive scripts. We believe open source AI’s contribution is to translate that tradition for adults who are navigating work, dating, housing, and community spaces, and to make the retrieval chain auditable from end to end.
It’s quite fortunate that some datasets already exist to begin work on a project like “ABA with GPT.” For norms and culturally inflected rules of thumb, Social Chemistry 101 provides a large, annotated catalogue of everyday moral and social judgments that can contextualize why a phrase may be perceived as acceptable in one setting and awkward in another. For affective stance and supportive phrasing, Meta’s EmpatheticDialogues supplies conversational examples with emotion and support-strategy labels. Neither corpus was written explicitly for autistic adults, but both offer reusable atoms for retrieval and for counter-examples when we need to flag mismatched assumptions.
Social Chemistry 101 and EmpatheticDialogues are a great start, but there’s a training corpus that does not yet exist and should. We propose a neurodivergent, adult-oriented Social Stories corpus with a strict schema: each card encodes setting, roles, objective, sensory profile, cultural and linguistic register, tone preset, and optional disclosure intent; two short dialogue exemplars, one direct and one gentle; two or three “what if” branches that change a single variable; a one-line reflection question; and explicit licensing and provenance (Apache 2.0 is ideal). The schema yields precise retrieval keys and also makes gaps visible. Because Social Chemistry’s authors warn that their rules of thumb are research artifacts, not advice, an ideal “ABA with GPT” system surfaces that disclaimer whenever a norm snippet is shown and prioritizes our adult corpus as the first source of truth (dataset page and disclaimer).
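As a concrete illustration, a single card in the proposed corpus might be encoded like the sketch below. The field names mirror the schema above, while the types, defaults, and the `retrieval_keys` helper are our assumptions rather than a finalized spec:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of one "scenario card" in the proposed neurodivergent adult Social
# Stories corpus. Field names follow the schema in the text; exact types and
# the retrieval_keys() helper are assumptions for illustration.

@dataclass
class ScenarioCard:
    setting: str                      # e.g. "open plan office, Friday afternoon"
    roles: list[str]                  # e.g. ["user", "supervisor"]
    objective: str                    # the narrow goal for this rep
    sensory_profile: str              # e.g. "loud HVAC, fluorescent lighting"
    register: str                     # cultural and linguistic register
    tone_preset: str                  # e.g. "direct" or "gentle"
    disclosure_intent: Optional[str]  # optional, e.g. "mention autism to HR only"
    exemplar_direct: str              # short dialogue exemplar, direct style
    exemplar_gentle: str              # short dialogue exemplar, gentle style
    what_if_branches: list[str]       # two or three branches, one variable changed
    reflection_question: str          # one line
    license: str = "Apache-2.0"       # explicit licensing
    provenance: str = ""              # who wrote or reviewed the card

    def retrieval_keys(self) -> dict[str, str]:
        """Precise keys the RAG layer can filter on before any vector search."""
        return {
            "setting": self.setting,
            "register": self.register,
            "tone": self.tone_preset,
        }
```

Because every field is explicit, reviewers can see at a glance which settings, registers, and sensory profiles the corpus still lacks.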
Open Weights vs. Closed APIs for Public-Interest Coaching
An important point to consider is that closed-source, black-box models, such as GPT-5, Claude Sonnet and Opus, and even Grok 4, might also be capable of providing “ABA with GPT”-type services on their platforms. They can, and probably do, have the relevant training data. However, the more relevant question is not whether openness is more fashionable or ethical, but whether open-weight systems are better instruments for an ABA-style coaching tool than closed, black-box APIs. In an open-source setting, the answer turns on three criteria: epistemic accountability, contextual fit, and regulatory posture.
- Epistemic accountability. A skills coach that claims to anchor suggestions in specific sources must be auditable end to end. Open weights and open artifacts allow third parties to examine prompts, refusal rules, and retrieval pipelines, then reproduce failures and propose fixes. That aligns with the Open Source Initiative’s working criteria for open-source AI, which emphasize the ability to study, modify, and redistribute with enough information to substantially reproduce a system. In disability support, the OSI’s working criteria could be how clinicians, advocates, and researchers verify that boundary policies and provenance displays match the interface claims.
- Contextual fit and privacy. Adults rehearsing sensitive scenarios will often prefer local or VPC‑scoped inference. Open models are compatible with on‑device execution using mature stacks such as llama.cpp, which reduces exposure and simplifies consent. More importantly, openness permits principled modification of refusal phrases, tone constraints, and escalation logic so that deployments can reflect local norms without covert hacks. Closed APIs can sometimes approximate contextual fit with prompt engineering, but they rarely permit the level of control that a high‑stakes support tool requires.
- Regulatory posture. Documentation, provenance, and role clarity are first-class controls in mainstream guidance such as the NIST AI Risk Management Framework. The EU AI Act adds transparency and technical documentation duties for general-purpose systems beginning this year, with layered enforcement thereafter. Open artifacts make binding obligations cheaper to meet because evidence of process and behavior can be published and independently replicated by third parties. Public evaluation regimes that disclose prompt-level artifacts, like Stanford’s HELM family, further support falsifiable claims about boundary maintenance and escalation handling.
Bonus: countervailing considerations. Some proprietary models currently lead on broad reasoning and coverage. In our opinion, those advantages narrow when RAG is strong and the generator’s role is constrained to short, situated coaching moves. There are also legitimate concerns about misuse of open weights, often cited by policymakers and even some Big Tech leaders. The three open-source criteria above address those concerns by publishing guardrails, unit tests for refusal behavior, and an evaluation slice that scores companionship boundaries. The generator is treated as a swappable component behind a model-agnostic interface so that deployments can compare open and closed options with the same retrieval layer.
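A minimal sketch of that model-agnostic interface might look like the following. The `Generator` protocol, the stub backend, and the prompt wording are illustrative assumptions, not a published API:

```python
from typing import Protocol

# Sketch of the model-agnostic interface described above: the retrieval layer
# and boundary tests stay fixed while open and closed generators are swapped
# behind one contract. Names and the EchoGenerator stub are illustrative.

class Generator(Protocol):
    def generate(self, system_prompt: str, user_turn: str, context: list[str]) -> str:
        ...

class EchoGenerator:
    """Stand-in backend used to exercise the retrieval and evaluation harness
    without any model at all; a local llama.cpp adapter or a hosted-API adapter
    would implement the same method."""
    def generate(self, system_prompt: str, user_turn: str, context: list[str]) -> str:
        cited = context[0] if context else "no retrieved card"
        return f"(stub) next small step, grounded in: {cited}"

def coaching_turn(gen: Generator, user_turn: str, retrieved: list[str]) -> str:
    system_prompt = "You are a skills coach, not a companion."
    return gen.generate(system_prompt, user_turn, retrieved)

if __name__ == "__main__":
    print(coaching_turn(EchoGenerator(), "Help me plan an exit line.", ["cafe card #12"]))
```

Because the retrieval layer and boundary tests sit outside the `Generator` contract, the same harness can score an open-weight deployment and a closed API side by side.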
Taken together, these considerations support a default presumption for open weights in an ABA-style LLM coach, with a standing invitation to test closed models in the same harness and publish the results. This stance prioritizes accountability and fit without foreclosing the use of closed systems where they can be shown to add measurable value under the same evaluation conditions.
Chain-of-Thought vs. Instant Reasoning for ABA-Style Coaching
The choice between instant answers and visible chain-of-thought is not aesthetic; it is instrumental. An ABA-style coach must balance three variables at once: cognitive load in the moment, trust calibrated by transparency, and the risk that verbose reasoning will crowd out the user’s own planning. In live roleplay, brevity often wins. After the rep, a short explanation can help users build transferable heuristics without turning practice into a seminar.
Current systems make the CoT trade-off configurable. OpenAI’s o-series introduced adjustable “reasoning effort,” beginning with o3-mini and extended in o3 and o4-mini, which lets the model think longer before answering and integrate tool use when needed. OpenAI has also experimented with how much of that thought process to show, for example updating o3-mini to reveal a more detailed chain-of-thought summary for users who opt into higher reasoning settings. The company’s roadmap now includes GPT-5 and a GPT-5 Thinking configuration that unify earlier modes behind one picker, bringing visible or budgeted reasoning into the default experience as a controllable option. In parallel, OpenAI’s GPT-OSS open-weight releases show how a community can study and adapt reasoning behavior when weights are available.
Google’s Gemini family frames the same idea as adaptive or budgeted thinking. Gemini 2.5 exposes developer controls to set a thinking budget, or to let the model calibrate it, and describes Deep Think style modes for complex tasks. Anthropic’s Claude 3.7 Sonnet adds an extended thinking mode with a visible scratch pad and a tunable token budget, offering near‑instant replies when the budget is off and more deliberate work when it is on (Anthropic post; Ars overview). In the open‑source ecosystem, DeepSeek’s R1 family embraces fully visible reasoning by default and has pushed vendors toward more transparent options (Axios on o3‑mini and DeepSeek).
For an ABA-style coach, an ideal operational rule would be simple. In session, prefer instant guidance that fits inside working memory and keeps latency low. During reflection, permit a brief chain-of-thought, capped at a few seconds of “thinking,” that names the relevant cue, the intent behind a phrasing choice, and one alternative if the first attempt felt off. Long, discursive reasoning belongs in a separate Reflect view that the user can open on purpose, not inside the coaching turn. A more in-depth CoT would be opt-in, not chosen silently by a model router. This balance between instantaneous response and CoT respects a person’s cognitive load, prevents the coach from sounding like a lecturer, and leaves space for the user’s own analysis.
There are real risks to showing too much. Verbose chains can feel paternalistic, they can leak internal rules that invite prompt gaming, and, as Anthropic cautions in its analysis of reasoning traces, they are not always faithful to how the model actually produced its answer. The AI safety literature has also begun to catalog cases where advanced systems pursue instrumental goals in hidden ways, which argues for parsimony in what is surfaced and for strong boundary tests. Design, therefore, should decouple thinking from seeing. Expose a thinking slider with Instant, Brief, and Extended settings, log the setting with outcomes, and default to Instant in Coach mode. Make reflection opt-in and time-boxed. When users ask “why that suggestion,” show a compact rationale plus provenance for any retrieved snippet. A well-designed chain-of-thought mechanism gives users agency without overloading them. It also keeps the performance gains from budgeted reasoning, visible across o3, Gemini 2.5, and Claude 3.7, while preserving the core value of practice: short reps, clear cues, and a next small step.
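One way to encode the thinking slider is sketched below. The token budgets and defaults are assumptions for illustration; an adapter would translate these fields into whatever controls a given backend actually exposes (a reasoning-effort level, a thinking budget, or nothing at all):

```python
from dataclasses import dataclass

# Sketch of the "thinking slider" described above. The budgets below are
# assumptions about reasonable values, not any vendor's documented API.

@dataclass(frozen=True)
class ThinkingSetting:
    label: str
    show_reasoning: bool      # whether a rationale is surfaced to the user
    max_thinking_tokens: int  # 0 means answer immediately

THINKING_SLIDER = {
    "instant": ThinkingSetting("instant", show_reasoning=False, max_thinking_tokens=0),
    "brief": ThinkingSetting("brief", show_reasoning=True, max_thinking_tokens=256),
    "extended": ThinkingSetting("extended", show_reasoning=True, max_thinking_tokens=2048),
}

def setting_for(mode: str, user_opted_in: bool) -> ThinkingSetting:
    """Default to Instant in Coach mode and Brief in Reflect mode; use Extended
    only when the user opts in explicitly, never via a silent model router."""
    if mode == "coach":
        return THINKING_SLIDER["instant"]
    if user_opted_in:
        return THINKING_SLIDER["extended"]
    return THINKING_SLIDER["brief"]
```

Logging the chosen setting alongside outcomes is what makes the Instant-versus-Brief comparison in the evaluation plan possible.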
System Blueprint and Persona Design
Developers ought to treat the system as a studio rather than a monolith. The social skills coach should operate inside a narrow contract that makes each turn legible to the user and testable by others. Persona is a stance, not a character. The coach needs to be specific, patient, and respectful of user autonomy. It never flatters, never imitates romance, and never moralizes. Directness is always welcomed and encouraged, ideally in plain language, and sensory honesty is expected.
When a suggestion clashes with the user’s communication style, the coach proposes adjacent phrasings that keep the intent while preserving identity. Every turn ends with a check: Did the previous conversation help? Was the preparation too much? What should you change next time? Small feedback closes the loop without turning practice into paperwork.
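A compressed, illustrative rendering of that contract as a system prompt and end-of-turn check might look like this. The exact wording is an assumption and would be co-written with the advisory panel described later:

```python
# Illustrative persona contract. Wording is a placeholder, not the shipped
# system prompt, and would be reviewed by autistic advisors before use.

PERSONA_PROMPT = """You are a social-skills rehearsal coach, not a companion.
- Be specific, patient, and respectful of the user's autonomy.
- Never flatter, never imitate romance, never moralize.
- Welcome directness, plain language, and sensory honesty.
- If a suggestion clashes with the user's style, offer adjacent phrasings
  that keep the intent and preserve identity.
- Offer off-ramps at any point: "not today", "loop in a person I trust",
  "end session"."""

END_OF_TURN_CHECK = [
    "Did that help?",
    "Was the preparation too much?",
    "What should change next time?",
]
```

Keeping the persona as a small, versioned artifact means boundary claims in the interface can be checked against the prompt that actually ships.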
An Evaluation Plan You Can Replicate
Third-party evaluation of “ABA with GPT” should interrogate whether a short, auditable coach changes behavior in the real world, not only preferences on a screen. Researchers could separate outcomes into proximal, intermediate, and transfer zones, and publish the full protocol so that others can reproduce and critique it.
Proximal outcomes in session. During roleplay, developers should score task completion against scenario rubrics and capture perceived effort using a short form of the NASA TLX, for example by logging latency and word count to monitor cognitive load. Boundary adherence is scored on a companionship-behavior slice, modeled on the INTIMA benchmark, that tests distance keeping, consent language, and escalation handling.
Intermediate outcomes between sessions. Before the first practice and after a defined block, participants could complete brief, appropriate instruments, for example a social self-efficacy measure and the Social Interaction Anxiety Scale (SIAS). To avoid retrospective bias, we add ecological momentary assessment check-ins that trigger after a planned attempt in daily life (EMA overview). Each EMA asks what was tried, what helped, what cost energy, and whether a human support was contacted.
Transfer to real contexts. Users could register concrete targets in advance, for example attending a meetup, initiating a clarification with a supervisor, or exiting a noisy event without apology. At one and four weeks you could record whether the attempt occurred, whether it felt safer, and whether the user wanted to continue. These are small, behavior anchored metrics that matter to adults.
Designs and transparency. Researchers could pre-register the study on the Open Science Framework (OSF) and release sanitized traces. The primary experiment is a within-subjects factorial comparison: Instant vs. Brief reasoning in Coach mode, and RAG on vs. off. For adaptive use over time, we add a SMART design so that participants who do not respond to one setting are reassigned in a principled way (SMART overview). Analyses use hierarchical models to account for repeated measures and person-level heterogeneity. Reporting follows the spirit of CONSORT-AI and includes failures, not only improvements.
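A small sketch of the within-subjects assignment might look like the following. The seeding scheme and factor labels are assumptions for illustration, not the registered protocol:

```python
import itertools
import random

# Sketch of the within-subjects factorial described above: every participant
# sees all four cells (reasoning x retrieval) in a counterbalanced order.
# The seeding scheme is an assumption chosen for reproducibility.

REASONING = ["instant", "brief"]   # Coach-mode reasoning setting
RAG = ["on", "off"]                # retrieval over the scenario corpus

def conditions_for(participant_id: str, study_seed: int = 2025) -> list[tuple[str, str]]:
    cells = list(itertools.product(REASONING, RAG))
    rng = random.Random(f"{study_seed}:{participant_id}")
    rng.shuffle(cells)             # per-participant counterbalancing of cell order
    return cells

if __name__ == "__main__":
    for pid in ["p001", "p002"]:
        print(pid, conditions_for(pid))
```

Publishing this assignment code alongside the pre-registration makes the counterbalancing itself auditable, not just the results.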
Finally, researchers can publish an open evaluation set with anonymized prompts, scoring criteria, and boundary vignettes so that any lab or clinic can run the same slice and compare results. Visual summaries can include heatmaps of boundary scores by model and setting, but the raw artifacts should always be available for inspection.
Ethics, Consent, and Harm Minimization
“Nothing About Us Without Us” is a core part of the value system that binds every layer of the “ABA with GPT” roadmap. Autistic self-advocacy groups such as the Autistic Self Advocacy Network insist that decisions about autism include autistic people as equal partners. In research, that imperative is operationalized through community-based participatory research, where community members share power over questions, methods, analysis, and dissemination. The “ABA with GPT” coach must inherit these affirming commitments in its model spec, at the innermost root level of the project. Autonomy and consent come first, masking is not a goal, and any benefit must be felt in everyday life.
We envision governance as shared and documented. A standing advisory panel of autistic adults would have a direct seat at the table on boundary rules, refusal phrases, and escalation logic. Partners are paid, credited as co-authors where appropriate, and given meaningful control over what is shipped. Data governance follows privacy-first practice while borrowing from rights-based frameworks. Respect for persons, beneficence, and justice are translated into product choices rather than paperwork, in the spirit of canonical ethics guidance such as the Belmont principles. For collective data stewardship, deployments that involve Indigenous communities should align with data sovereignty standards like the First Nations principles of OCAP, which place ownership, control, access, and possession with the community itself.
Risk is managed by design, not by disclaimers. An “ABA with GPT” system should treat over-attachment, substitution for human supports, and over-reliance during crises as foreseeable hazards. Mitigations should include session caps, visible off-ramps, crisis routing, and a companionship-behavior benchmark that audits proper distance keeping. Accessibility needs to remain first class and a core component of the project. User interfaces should be designed to respect sensory accommodations, support multiple communication modes, and avoid value judgments about tone or affect. We also think the system card should explain consent flows, data retention windows, how retrieval provenance works, and when the coach will hand off to a human for support.
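As a sketch of how those mitigations might be wired together, consider the following. The session cap, keyword list, and routing labels are deliberately crude placeholders standing in for a vetted crisis classifier and locale-specific resources:

```python
from datetime import datetime, timedelta

# Sketch of "risk managed by design": a session cap, always-visible off-ramps,
# and crisis routing. The keyword check is a placeholder for a vetted crisis
# classifier; the cap and labels are assumptions for illustration.

SESSION_CAP = timedelta(minutes=20)
OFF_RAMPS = ["Not today", "Loop in a person I trust", "End session"]
CRISIS_HINTS = ("hurt myself", "can't go on", "no way out")  # placeholder only

def route_turn(started_at: datetime, user_text: str) -> str:
    if any(hint in user_text.lower() for hint in CRISIS_HINTS):
        return "handoff_to_human"      # surface crisis resources, stop coaching
    if datetime.now() - started_at > SESSION_CAP:
        return "offer_off_ramp"        # show OFF_RAMPS and suggest a break
    return "continue_coaching"
```

The point is architectural rather than algorithmic: routing decisions live in auditable code paths that the system card can describe and the evaluation suite can test.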
Conclusion
We don’t think society needs another grand promise about AI and companionship, or a panacea for the global loneliness pandemic. We need a tool that works when a person is tired, anxious, and still determined to try, and that nudges them toward real, meaningful relationships. The “ABA with GPT” roadmap is intentionally modest in shape but ambitious in its commitments. Open weights, so that reasoning constraints, refusal rules, and guardrails can be inspected. Retrieval, so that suggestions point to concrete, licensed artifacts rather than to a model’s hunch. Configurable reasoning, so that in-session guidance remains short and usable, while reflection can show just enough structure to teach a transferable move.
Everything flows from one first principle: provenance comes first. Source tracking is a headline feature, not a footnote. Boundaries are quantified and verified, never hand-waved. Privacy defaults to local storage or end-to-end encryption. Evaluation tracks what users do in the real world, not how the interface feels on a screen. Governance puts autistic adults in the driver’s seat, with fair pay and veto power, so the coach stays a coach instead of drifting into faux friendship.
Our tentative roadmap is community-built, and that’s by design. If our collective mission of building an open source conversational tool resonates with you, please step in. Draft a scenario card, or translate a corpus slice. Port the “ABA with GPT” prototype to hardware you trust, or run the eval suite on your favorite model and publish the failures along with the wins. The goal is not to crown one model king, but to ship a patient, accountable, and open practice partner that people can actually use to improve their lives.