AVA v2 — a 42 MB adapter that scores 82.0% ARC-Challenge on the full 1,172-question set, beating Llama 3.2 3B-Instruct (78.6%). Trained in 100 minutes on a single consumer card. No cloud, no cluster, no budget.
Most AI labs assume 80 GB GPUs and cluster compute. AVA assumes one laptop, no cloud, no budget. Everything in this project — training, evaluation, serving — fits in 4 GB of VRAM and runs on a $400 laptop GPU.
Q8_0 GGUF served via llama-server. Letter-argmax MCQ scoring. 95% Wilson CI. Full per-benchmark JSON in the repo.
| Benchmark | Category | n | Accuracy | 95% CI | Range |
|---|---|---|---|---|---|
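The 95% CI column uses the Wilson score interval. A minimal sketch of the computation, plugged with the headline ARC-Challenge numbers:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 82.0% over the full 1,172-question ARC-Challenge set:
lo, hi = wilson_ci(round(0.820 * 1172), 1172)
print(f"{lo:.3f} to {hi:.3f}")  # roughly 0.797 to 0.841
```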
The full Qwen3.5-2B base loads in NF4 4-bit through BitsAndBytes — about 1.4 GB resident. A LoRA adapter (rank 16, alpha 32) attaches to every attention and MLP projection. We update 10.9M parameters out of 1.89B and freeze the rest. Peak VRAM during training stays at 1.81 GB.
20,741 prompt-response pairs, hand-mixed: GSM8K-style step-by-step math, ARC + SciQ + OpenBookQA-style science, instruction following, tool use, and teacher-distilled chains for harder reasoning. v1 used 5K examples and showed zero ARC gain. v2 used 4× the data and jumped +13 pp.
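As a rough illustration of what one such pair looks like (the field names here are hypothetical; the actual corpus recipes live in the repo):

```python
# Hypothetical record shape, for illustration only; the real schema is in the repo's recipes.
record = {
    "messages": [
        {"role": "user", "content": "A train covers 60 km in 45 minutes. What is its speed in km/h?"},
        {"role": "assistant", "content": "45 minutes is 0.75 h, so speed = 60 / 0.75 = 80 km/h."},
    ],
    "source": "gsm8k-style-math",  # provenance tag: math, science, instruction, tool use, or distilled
}
```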
Compiling SDPA attention through Triton on Windows cut the per-step time from 25.0 s to roughly 2.3 s, a 10.7× speedup. At that rate the full 20K corpus trains for one epoch (2,593 steps at an effective batch of 8) in 100 minutes. Gradient checkpointing and 8-bit paged AdamW keep memory pinned under 2 GB on a 4 GB card.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",  # required when FLA is installed
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 10,911,744 / 1,889,488,896 = 0.58%
```
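Continuing from the snippet above, a hedged sketch of the matching Trainer wiring. The batch split, `output_dir`, and `train_ds` are assumptions; the released training scripts are in the repo.

```python
# Sketch only, not the exact released config.
# `train_ds` stands in for the tokenized 20,741-example corpus.
from transformers import Trainer, TrainingArguments

model.config.use_cache = False                 # required with gradient checkpointing
args = TrainingArguments(
    output_dir="ava-v2-lora",
    num_train_epochs=1,                        # one pass over the corpus
    per_device_train_batch_size=2,             # assumption: 2 x 4 accumulation
    gradient_accumulation_steps=4,             #   = effective batch 8 -> 2,593 steps
    gradient_checkpointing=True,               # trades compute for memory
    optim="paged_adamw_8bit",                  # 8-bit paged AdamW from bitsandbytes
    bf16=True,
    logging_steps=50,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```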
```bash
# 1. download a GGUF from huggingface.co/NAME0x0/AVA-v2-GGUF
# 2. build the local model from the bundled Modelfile
ollama create ava-v2 -f Modelfile

# 3. chat
ollama run ava-v2

# works on CPU, Apple Silicon, AMD, NVIDIA. No Python needed.
```
```python
# pip install -e .[bench]; pip install peft
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
m = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    quantization_config=bnb,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
m = PeftModel.from_pretrained(m, "NAME0x0/AVA-v2").merge_and_unload()

msgs = [{"role": "user", "content": "Why does ice float on water?"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tok(text, return_tensors="pt").to(m.device)
out = m.generate(**ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```
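The letter-argmax scoring from the eval setup reduces to comparing next-token logits over the four answer letters. A minimal sketch against the merged model above; the released harness drives llama-server instead, and the prompt and tokenization details here are assumptions:

```python
# Score one multiple-choice item by argmax over the answer-letter logits.
q = ("Which gas do plants absorb for photosynthesis?\n"
     "A. Oxygen\nB. Carbon dioxide\nC. Nitrogen\nD. Helium\nAnswer:")
enc = tok(q, return_tensors="pt").to(m.device)
with torch.no_grad():
    logits = m(**enc).logits[0, -1]  # next-token distribution after "Answer:"
letters = ["A", "B", "C", "D"]
# Assumption: score the leading-space letter variant, the common tokenization after "Answer:"
letter_ids = [tok.encode(" " + l, add_special_tokens=False)[-1] for l in letters]
print(letters[int(logits[letter_ids].argmax())])  # expected: B
```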
```text
# LoRA adapter (42 MB, safetensors)
huggingface.co/NAME0x0/AVA-v2

# pre-quantized GGUFs for Ollama / llama.cpp
huggingface.co/NAME0x0/AVA-v2-GGUF
├─ AVA-v2-Q4_K_M.gguf   ~1.5 GB   recommended
└─ AVA-v2-Q8_0.gguf     ~2.0 GB   near-lossless (used for the 17-bench eval)

# one-line load in code
from peft import PeftModel
PeftModel.from_pretrained(base, "NAME0x0/AVA-v2")
```
Distill Qwen 3.6 35B-A3B into a 6-8B ternary MoE student with native Gated DeltaNet attention and MCP-based tool routing. Ternary FFN experts + BF16 router + one BF16 shared expert (MoTE). 3:1 hybrid attention with linear-time decode. BitDistiller 3-stage QAT.
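To make the "ternary FFN experts + BF16 router" idea concrete, here is an illustrative absmean ternarization layer in the BitNet style. This is a sketch of the general technique under assumed conventions, not AVA v3 code:

```python
# Illustrative only; the actual v3 plan (BitDistiller QAT, MoTE routing) lives in the repo docs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Linear layer whose weights are quantized to {-1, 0, +1} * scale on the fly.
    The straight-through estimator keeps it trainable during quantization-aware training."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().mean()                                   # absmean scale
        w_q = torch.clamp((self.weight / (scale + 1e-8)).round(), -1, 1) * scale
        w = self.weight + (w_q - self.weight).detach()                     # straight-through estimator
        return F.linear(x, w, self.bias)
```

In a MoTE block along these lines, the routed expert FFNs would use `TernaryLinear` while the router and the single shared expert stay in plain BF16 `nn.Linear`.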
Click any phase below to expand its plan.
Per-benchmark methodology, prompts, refusals, sample budgets, per-subject breakdowns, and the 4h 18m wall-clock log.
Per-run history. Every config, every loss curve, every dead-end. The corpus jump from 5K to 20K and what fell out of it.
Ternary MoE student from a Qwen 3.6 35B-A3B teacher. Gated DeltaNet 3:1 hybrid. MCP via FastMCP 3.0 + XGrammar.
The released AVA v2 LoRA adapter in standard PEFT format. Tokenizer, training report, and the full reproducibility table.
Training scripts, evaluation harness, GGUF conversion, all configs, every corpus recipe. Reproduce the numbers end-to-end.
Three install paths (Ollama, Python, HF). On 4 GB+ VRAM the full benchmark sweep runs in ~4h 18m.
Quickstart, reproduce, Windows setup, experiment progression, roadmap, architecture, benchmark rationale. Every doc is short and answers one question.
The project costs $0 to keep running, but funding accelerates AVA v3 — longer training runs, broader benchmarks, more tool integrations. GitHub Sponsors is live.