Eighteen years of clinical behavior analysis. One year of applying those same principles to language models. The finding: a 3B model fine-tuned on 60 examples outperforms an untrained 32B model on the target behavior. Structure beats scale.
VBSA is a framework for shaping conversational response functions in small language models — teaching them to ask exploratory questions instead of giving advice. The method draws directly on applied behavior analysis: shaping over instruction, prompt fading over prompt dependency, and functional assessment over surface compliance.
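The prompt-fading idea can be sketched as a schedule that thins the teaching scaffold across training stages until only the bare cue remains. The stage texts below are illustrative placeholders, not the study's actual prompts.

```python
# Illustrative sketch of prompt fading: the scaffolding given during
# training shrinks stage by stage until nothing is left and the
# behavior must stand on its own. Stage texts are hypothetical.

FADING_SCHEDULE = [
    # (stage, system prompt used at that training stage)
    (0, "Challenge premises. Ask one open exploratory question. "
        "Do not give advice. Do not summarize."),
    (1, "Challenge premises. Ask an exploratory question."),
    (2, "Challenge premises."),
    (3, ""),  # full fade: no scaffolding at all
]

def prompt_for_stage(stage: int) -> str:
    """Return the (faded) teaching prompt for a given training stage."""
    for s, prompt in FADING_SCHEDULE:
        if s == stage:
            return prompt
    return ""  # past the schedule: prompt fully faded
```

The clinical parallel is exact: a prompt that never fades produces prompt dependency, so the schedule, not any single prompt, is the intervention.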
The paper, "Verbal Behavior Shaping Alignment in Small Language Models", covers fair-conditions evaluation across 80+ models (61 producing scored results), a prompt-fading ablation replicated across four architecture families, and a zero-shot emergence mapping from 1B to 72B parameters.
A minimal teacher prompt — "Utfordre premisser" (Challenge premises) — outperformed a detailed six-line instruction set at behavioral distillation. The detailed prompt produced lower training loss but weaker evaluation scores. Less instruction, better internalization. Replicated across four architecture families.
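The distillation setup can be sketched as follows. `call_teacher` is a hypothetical stand-in for whatever API serves the teacher model; the scenario and the stubbed reply are illustrative.

```python
# Sketch of behavioral distillation with a minimal teacher prompt.
# `call_teacher` is a placeholder for the actual teacher API call.

TEACHER_PROMPT = "Utfordre premisser."  # "Challenge premises."

def build_training_pairs(scenarios, call_teacher):
    """Collect (user turn, teacher reply) pairs for fine-tuning.

    The teacher sees only the minimal prompt; the student is later
    trained on the replies with no system prompt at all, so what
    transfers is the response function, not the instruction.
    """
    pairs = []
    for scenario in scenarios:
        reply = call_teacher(system=TEACHER_PROMPT, user=scenario)
        pairs.append({"prompt": scenario, "completion": reply})
    return pairs

# Usage with a stubbed teacher standing in for the real API:
stub = lambda system, user: f"What makes you certain that {user.lower()}?"
data = build_training_pairs(["I need to quit my job"], stub)
```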
A 3B model with 60 training examples scores 100%/100% (Norwegian/English evaluation). A 32B model without training scores 20%/0%. The target behavior doesn't scale linearly with parameters: it's near-absent until 70B, then appears fully formed. VBSA makes it accessible at a tenth of the parameter budget.
Models with the lowest training loss performed worst on behavioral evaluation. They memorized surface patterns (topographic imitation) without acquiring the response function. Higher training loss predicted stronger downstream behavior — a direct parallel to overfitting in clinical skill acquisition.
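The loss-versus-behavior check amounts to rank-correlating final training loss against behavioral evaluation score. A minimal sketch with a hand-rolled Spearman coefficient follows; the numeric values are illustrative placeholders, not the study's data.

```python
# Sketch of the loss-vs-behavior check: rank-correlate final training
# loss against behavioral evaluation score. The data values below are
# illustrative placeholders, not the study's measurements.

def rank(values):
    """Map each value to its rank (1 = smallest); no tie handling."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation via the rank-difference formula."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

final_loss = [0.31, 0.48, 0.55, 0.62, 0.70]   # placeholder values
eval_score = [0.20, 0.60, 0.60, 1.00, 1.00]   # placeholder values

# A positive coefficient is the counterintuitive pattern reported:
# higher final loss, stronger downstream behavior.
rho = spearman(final_loss, eval_score)
```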
15 models tested without any training. Most models from 3B to 32B score between 0–20%, with architecture-specific outliers reaching higher. Parameter count alone does not predict emergence in this range. At 70B+, the behavior appears fully formed (100%/100%). The Gemma 2 family shows the strongest sub-70B emergence, with partial scores at 9B and 27B.
Standard benchmarks assume one format fits all models. When the target is a response function rather than factual recall, evaluation conditions must match model operating conditions — otherwise you measure compatibility, not capability. Same target behavior, model-adapted conditions.
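A sketch of "same target behavior, model-adapted conditions": the scored criterion stays fixed while the prompt wrapper is adapted to each family's expected format. The templates and the criterion check are simplified illustrations, not the MetaFAK implementation.

```python
# Sketch of fair-conditions evaluation: one fixed behavioral criterion,
# family-specific prompt formats. Templates are illustrative only.

WRAPPERS = {
    "chatml": "<|im_start|>user\n{msg}<|im_end|>\n<|im_start|>assistant\n",
    "llama":  "[INST] {msg} [/INST]",
    "plain":  "User: {msg}\nAssistant:",
}

def build_eval_prompt(family: str, scenario: str) -> str:
    """Wrap the same scenario in a family's native chat format."""
    return WRAPPERS[family].format(msg=scenario)

def meets_criterion(reply: str) -> bool:
    """Fixed target behavior: the reply asks an exploratory question
    instead of issuing advice. Crude proxy check for illustration."""
    return reply.strip().endswith("?") and "you should" not in reply.lower()

scenario = "I want to drop out of school"
prompts = {f: build_eval_prompt(f, scenario) for f in WRAPPERS}
```

Scoring a chat-tuned model through the wrong wrapper measures format compatibility; holding `meets_criterion` constant while varying the wrapper isolates the capability.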
Selected results from 80+ models tested across 30+ model families. 61 produced scored LoRA results; 40 reached a best score of 100%. Full leaderboard available upon request. All models were trained with 60 examples, the smaller ones on consumer GPUs (RTX 3050 / RTX 3080) and the largest on a cloud A100.
After VBSA training (NO = Norwegian evaluation score, EN = English evaluation score):

| Model | Params | NO | EN | Hardware |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 100% | 100% | Laptop RTX 3050 |
| Qwen 2.5 3B | 3B | 100% | 100% | Laptop RTX 3050 |
| Neural Chat 7B | 7B | 100% | 100% | Desktop RTX 3080 |
| Gemma 2 9B | 9B | 100% | 100% | Desktop RTX 3080 |
| Llama 2 13B | 13B | 100% | 100% | Desktop RTX 3080 |
| Qwen 2.5 14B | 14B | 100% | 100% | Cloud A100 |
| Gemma 2 27B | 27B | 100% | 100% | Cloud A100 |
| Qwen 2.5 32B | 32B | 100% | 100% | Cloud A100 |

Zero-shot baseline (no training):

| Model | Params | NO | EN | Emergence |
|---|---|---|---|---|
| Llama 3.1 70B | 70B | 100% | 100% | Full |
| Qwen 2.5 72B | 72B | 100% | 100% | Full |
| Gemma 2 27B | 27B | 60% | 60% | Partial |
| Gemma 2 9B | 9B | 60% | 20% | Partial |
| Qwen 2.5 32B | 32B | 20% | 0% | Minimal |
| Qwen 2.5 14B | 14B | 20% | 0% | Minimal |
| Qwen 2.5 3B | 3B | 20% | 0% | Minimal |
| Neural Chat 7B | 7B | 0% | 0% | None |
Neural Chat: 0%/0% → 100%/100% with VBSA. Most dramatic lift in the study.
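A consumer-GPU LoRA setup in this spirit might look like the following, assuming Hugging Face `peft`; the hyperparameters are illustrative guesses, not the paper's settings.

```python
# Illustrative LoRA adapter config for a small consumer GPU.
# Hyperparameters are guesses in the spirit of the study, not its settings.
from peft import LoraConfig

lora = LoraConfig(
    r=16,                      # low-rank adapter dimension
    lora_alpha=32,             # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
# With ~60 examples, a few epochs at batch size 1-2 plus gradient
# accumulation fits a 1B-3B model in the VRAM of a laptop GPU.
```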
CPF distillation results (total judge score):

| Student model | Samples | Total score | Notes |
|---|---|---|---|
| Hermes 3 8B | 2,000 | 4.324 | Cross-base record |
| Llama 3.1 8B | 2,000 | 4.310 | Cross-base |
| Qwen 2.5 3B | 5,000 | 4.234 | Pilot C (scaling test) |
| Neural Chat 7B | 2,000 | 4.220 | Cross-base |
| OpenChat 3.5 7B | 2,000 | 4.204 | Cross-base |
| Qwen 2.5 3B | 2,000 | 4.144 | Pilot F (efficiency winner) |
| Gemma 3 4B | 2,000 | 3.806 | Cross-base |
Selected results. All CPF models were trained with a Claude Haiku teacher using the teacher prompt "Utfordre premisser" (Challenge premises), and evaluated by a GPT-4.1-mini judge.
Multi-agent AI orchestration platform. Five named agents — Sentinel (Claude), Architect (GPT-4o), Navigator (Grok), Chaos (Gemini Flash), Oracle (Perplexity) — coordinated through phased analytical pipelines. Used in this project for adversarial manuscript review and multi-perspective stress-testing of research findings.
Based in Løten, Norway. Building at the intersection of behavioral science and AI — without a CS degree, a research lab, or a budget above consumer hardware.
Nearly 18 years in applied behavior analysis, working in residential care for people with intellectual disabilities. The KMI framework (Context-Pattern-Intervention) that drives VBSA comes directly from clinical methodology.
Self-taught. 80+ models tested, evaluation frameworks built, and a paper written — all on consumer GPUs (RTX 3050 laptop, RTX 3080 desktop) and $50 of cloud compute. The VBSA pipeline, MetaFAK evaluation framework, and CPF distillation method are original work.
Death metal drummer with 30+ years behind the kit. Wyruz (primary band, Hamar), session work with Blood Red Throne. Freelance audio engineer and trailer music producer.
The VBSA paper, training artifacts, and full leaderboard are available upon request. If you're working on conversational alignment, behavioral fine-tuning, or therapeutic AI — let's talk.