Eighteen years of clinical behavior analysis. One year of applying those same principles to language models. The finding: a 3B model fine-tuned on 60 examples outperforms an untrained 32B model on the target behavior. Structure beats scale.
VBSA is a framework for shaping conversational response functions in small language models — teaching them to ask exploratory questions instead of giving advice. The method draws directly on applied behavior analysis: shaping over instruction, prompt fading over prompt dependency, and functional assessment over surface compliance.
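The prompt-fading idea can be sketched as a schedule that thins the teaching scaffold across training stages until only the bare cue remains. The stage texts below are illustrative placeholders, not the study's actual prompts.

```python
# Illustrative sketch of prompt fading: the scaffolding given during
# training shrinks stage by stage until nothing is left and the
# behavior must stand on its own. Stage texts are hypothetical.

FADING_SCHEDULE = [
    # (stage, system prompt used at that training stage)
    (0, "Challenge premises. Ask one open exploratory question. "
        "Do not give advice. Do not summarize."),
    (1, "Challenge premises. Ask an exploratory question."),
    (2, "Challenge premises."),
    (3, ""),  # full fade: no scaffolding at all
]

def prompt_for_stage(stage: int) -> str:
    """Return the (faded) teaching prompt for a given training stage."""
    for s, prompt in FADING_SCHEDULE:
        if s == stage:
            return prompt
    return ""  # past the schedule: prompt fully faded
```

The clinical parallel is exact: a prompt that never fades produces prompt dependency, so the schedule, not any single prompt, is the intervention.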
The paper, "Verbal Behavior Shaping Alignment in Small Language Models", covers fair-conditions evaluation across 80+ models (61 producing scored results), a prompt-fading ablation replicated across four architecture families, and a zero-shot emergence mapping from 1B to 72B parameters.
A minimal teacher prompt — "Utfordre premisser" (Challenge premises) — outperformed a detailed six-line instruction set at behavioral distillation. The detailed prompt produced lower training loss but weaker evaluation scores. Less instruction, better internalization. Replicated across four architecture families.
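The distillation setup can be sketched as follows. `call_teacher` is a hypothetical stand-in for whatever API serves the teacher model; the scenario and the stubbed reply are illustrative.

```python
# Sketch of behavioral distillation with a minimal teacher prompt.
# `call_teacher` is a placeholder for the actual teacher API call.

TEACHER_PROMPT = "Utfordre premisser."  # "Challenge premises."

def build_training_pairs(scenarios, call_teacher):
    """Collect (user turn, teacher reply) pairs for fine-tuning.

    The teacher sees only the minimal prompt; the student is later
    trained on the replies with no system prompt at all, so what
    transfers is the response function, not the instruction.
    """
    pairs = []
    for scenario in scenarios:
        reply = call_teacher(system=TEACHER_PROMPT, user=scenario)
        pairs.append({"prompt": scenario, "completion": reply})
    return pairs

# Usage with a stubbed teacher standing in for the real API:
stub = lambda system, user: f"What makes you certain that {user.lower()}?"
data = build_training_pairs(["I need to quit my job"], stub)
```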
A 3B model with 60 training examples scores 100%/100% (Norwegian/English evaluation). A 32B model without training scores 20%/0%. The target behavior doesn't scale linearly with parameters: it's near-absent until 70B, then appears fully formed. VBSA makes it accessible at a tenth of the parameter budget.
Models with the lowest training loss performed worst on behavioral evaluation. They memorized surface patterns (topographic imitation) without acquiring the response function. Higher training loss predicted stronger downstream behavior — a direct parallel to overfitting in clinical skill acquisition.
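The loss-versus-behavior check amounts to rank-correlating final training loss against behavioral evaluation score. A minimal sketch with a hand-rolled Spearman coefficient follows; the numeric values are illustrative placeholders, not the study's data.

```python
# Sketch of the loss-vs-behavior check: rank-correlate final training
# loss against behavioral evaluation score. The data values below are
# illustrative placeholders, not the study's measurements.

def rank(values):
    """Map each value to its rank (1 = smallest); no tie handling."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation via the rank-difference formula."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

final_loss = [0.31, 0.48, 0.55, 0.62, 0.70]   # placeholder values
eval_score = [0.20, 0.60, 0.60, 1.00, 1.00]   # placeholder values

# A positive coefficient is the counterintuitive pattern reported:
# higher final loss, stronger downstream behavior.
rho = spearman(final_loss, eval_score)
```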
15 models tested without any training. Most models from 3B to 32B score between 0–20%, with architecture-specific outliers reaching higher. Parameter count alone does not predict emergence in this range. At 70B+, the behavior appears fully formed (100%/100%). The Gemma 2 family shows the strongest sub-70B emergence, with partial scores at 9B and 27B.
Standard benchmarks assume one format fits all models. When the target is a response function rather than factual recall, evaluation conditions must match model operating conditions — otherwise you measure compatibility, not capability. Same target behavior, model-adapted conditions.
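A sketch of "same target behavior, model-adapted conditions": the scored criterion stays fixed while the prompt wrapper is adapted to each family's expected format. The templates and the criterion check are simplified illustrations, not the MetaFAK implementation.

```python
# Sketch of fair-conditions evaluation: one fixed behavioral criterion,
# family-specific prompt formats. Templates are illustrative only.

WRAPPERS = {
    "chatml": "<|im_start|>user\n{msg}<|im_end|>\n<|im_start|>assistant\n",
    "llama":  "[INST] {msg} [/INST]",
    "plain":  "User: {msg}\nAssistant:",
}

def build_eval_prompt(family: str, scenario: str) -> str:
    """Wrap the same scenario in a family's native chat format."""
    return WRAPPERS[family].format(msg=scenario)

def meets_criterion(reply: str) -> bool:
    """Fixed target behavior: the reply asks an exploratory question
    instead of issuing advice. Crude proxy check for illustration."""
    return reply.strip().endswith("?") and "you should" not in reply.lower()

scenario = "I want to drop out of school"
prompts = {f: build_eval_prompt(f, scenario) for f in WRAPPERS}
```

Scoring a chat-tuned model through the wrong wrapper measures format compatibility; holding `meets_criterion` constant while varying the wrapper isolates the capability.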
Selected results from 80+ models tested across 30+ model families. 61 produced scored LoRA results; 40 reached a best score of 100%. Full leaderboard available upon request. All models were trained with 60 examples, the smaller ones on consumer GPUs (RTX 3050 / RTX 3080) and the largest on a cloud A100.
After VBSA training (NO = Norwegian evaluation score, EN = English evaluation score):

| Model | Params | NO | EN | Hardware |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 100% | 100% | Laptop RTX 3050 |
| Qwen 2.5 3B | 3B | 100% | 100% | Laptop RTX 3050 |
| Neural Chat 7B | 7B | 100% | 100% | Desktop RTX 3080 |
| Gemma 2 9B | 9B | 100% | 100% | Desktop RTX 3080 |
| Llama 2 13B | 13B | 100% | 100% | Desktop RTX 3080 |
| Qwen 2.5 14B | 14B | 100% | 100% | Cloud A100 |
| Gemma 2 27B | 27B | 100% | 100% | Cloud A100 |
| Qwen 2.5 32B | 32B | 100% | 100% | Cloud A100 |

Zero-shot baseline (no training):

| Model | Params | NO | EN | Emergence |
|---|---|---|---|---|
| Llama 3.1 70B | 70B | 100% | 100% | Full |
| Qwen 2.5 72B | 72B | 100% | 100% | Full |
| Gemma 2 27B | 27B | 60% | 60% | Partial |
| Gemma 2 9B | 9B | 60% | 20% | Partial |
| Qwen 2.5 32B | 32B | 20% | 0% | Minimal |
| Qwen 2.5 14B | 14B | 20% | 0% | Minimal |
| Qwen 2.5 3B | 3B | 20% | 0% | Minimal |
| Neural Chat 7B | 7B | 0% | 0% | None |
Neural Chat: 0%/0% → 100%/100% with VBSA. Most dramatic lift in the study.
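A consumer-GPU LoRA setup in this spirit might look like the following, assuming Hugging Face `peft`; the hyperparameters are illustrative guesses, not the paper's settings.

```python
# Illustrative LoRA adapter config for a small consumer GPU.
# Hyperparameters are guesses in the spirit of the study, not its settings.
from peft import LoraConfig

lora = LoraConfig(
    r=16,                      # low-rank adapter dimension
    lora_alpha=32,             # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
# With ~60 examples, a few epochs at batch size 1-2 plus gradient
# accumulation fits a 1B-3B model in the VRAM of a laptop GPU.
```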
CPF distillation results (total judge score):

| Student model | Samples | Total score | Notes |
|---|---|---|---|
| Hermes 3 8B | 2,000 | 4.324 | Cross-base record |
| Llama 3.1 8B | 2,000 | 4.310 | Cross-base |
| Qwen 2.5 3B | 5,000 | 4.234 | Pilot C (scaling test) |
| Neural Chat 7B | 2,000 | 4.220 | Cross-base |
| OpenChat 3.5 7B | 2,000 | 4.204 | Cross-base |
| Qwen 2.5 3B | 2,000 | 4.144 | Pilot F (efficiency winner) |
| Gemma 3 4B | 2,000 | 3.806 | Cross-base |
Selected results. All CPF models were trained with a Claude Haiku teacher using the teacher prompt "Utfordre premisser" (Challenge premises), and evaluated by a GPT-4.1-mini judge.
Multi-agent AI orchestration platform. Five named agents — Sentinel (Claude), Architect (GPT-4o), Navigator (Grok), Chaos (Gemini Flash), Oracle (Perplexity) — coordinated through phased analytical pipelines. Used in this project for adversarial manuscript review and multi-perspective stress-testing of research findings.
Based in Løten, Norway. Building at the intersection of behavioral science and AI — without a CS degree, a research lab, or a budget above consumer hardware.
Nearly 18 years in applied behavior analysis, working in residential care for people with intellectual disabilities. The KMI framework (Context-Pattern-Intervention) that drives VBSA comes directly from clinical methodology.
Self-taught. 80+ models tested, evaluation frameworks built, and a paper written — all on consumer GPUs (RTX 3050 laptop, RTX 3080 desktop) and $50 of cloud compute. The VBSA pipeline, MetaFAK evaluation framework, and CPF distillation method are original work.
Death metal drummer with 30+ years behind the kit. Wyruz (primary band, Hamar), session work with Blood Red Throne. Freelance audio engineer and trailer music producer.
The VBSA paper, training artifacts, and full leaderboard are available upon request. If you're working on conversational alignment, behavioral fine-tuning, or therapeutic AI — let's talk.