AVA v2 · 2B · QLoRA
Released: AVA v2 · Author: Muhammad Afsah Mumtaz · Hardware: 1× RTX A2000 Laptop · 4 GB VRAM · License: Apache-2.0 / Qwen

Capable AI on a 4 GB laptop GPU.

AVA v2 is a 42 MB LoRA adapter that scores 82.0% on the full 1,172-question ARC-Challenge set, beating Llama 3.2 3B-Instruct (78.6%). It was trained in 100 minutes on a single consumer card. No cloud, no cluster, no budget.

─ training summary ─
base model: Qwen3.5-2B
attention: SDPA + Triton
quantization: NF4 (BitsAndBytes)
optimizer: paged_adamw_8bit
adapter: LoRA r=16, α=32
lr / schedule: 1.5e-4 cosine
trainable: 10.9M / 1.89B · 0.58%
corpus: 20,741 examples
peak VRAM: 1.81 GB
final loss: 0.4145
wall time: 100 min · 2,593 steps
adapter size: 42 MB
ARC-Challenge: 82.0%
MMLU 5-shot: 59.2%
Adapter: 42 MB
Training peak: 1.81 GB
Train wall: 100 min
§01 THE CONSTRAINT One laptop · No cloud · No budget

Most AI labs assume 80 GB GPUs and cluster compute. AVA assumes one laptop, no cloud, no budget. Everything in this project — training, evaluation, serving — fits in 4 GB of VRAM and runs on a $400 laptop GPU.

typical lab · 8× H100, 640 GB VRAM
  • GPUs: 8× H100 80 GB
  • VRAM total: 640 GB
  • Cluster: NVLink / InfiniBand
  • Cloud cost: ~$25,000 / mo
  • Engineers: research team
  • Setup: cluster, scheduler, ops

AVA · 1× RTX A2000, 4 GB VRAM
  • GPUs: 1× RTX A2000 Laptop
  • VRAM total: 4 GB
  • Cluster: none — Windows + WSL
  • Cloud cost: $0 / mo
  • Engineers: 1 person
  • Setup: pip install
§02 THE NUMBERS 17 benchmarks · 16,872 tasks · 4h 18m wall

Q8_0 GGUF served via llama-server. Letter-argmax MCQ scoring. 95% Wilson CI. Full per-benchmark JSON in the repo.

Benchmark · Category · n · Accuracy · 95% CI Range
ARC-Challenge · science reasoning · n = 1,172
GSM8K · math reasoning · n = 1,319
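The accuracy intervals reported here are 95% Wilson score intervals. A minimal sketch of the computation (standard formula, not the repo's evaluation code):

python · wilson score interval (sketch)
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a benchmark accuracy of correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g. 82.0% on the 1,172-question ARC-Challenge set → roughly (0.797, 0.841)
print(wilson_ci(961, 1172))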
§03 HOW IT WORKS QLoRA · curated 20K · Triton SDPA
01 · QLoRA

4-bit base, 0.58% trainable

The full Qwen3.5-2B base loads in NF4 4-bit through BitsAndBytes — about 1.4 GB resident. A LoRA adapter (rank 16, alpha 32) attaches to every attention and MLP projection. We update 10.9M parameters out of 1.89B and freeze the rest. Peak VRAM during training stays at 1.81 GB.

Adapter: 42 MB
02 · Curated 20K

Math, science, reasoning, instructions

20,741 prompt-response pairs, hand-mixed: GSM8K-style step-by-step math, ARC + SciQ + OpenBookQA-style science, instruction following, tool use, and teacher-distilled chains for harder reasoning. v1 used 5K examples and showed zero ARC gain. v2 used 4× the data and jumped +13 pp.

Lift over base: +16.0 pp ARC-C
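The corpus lives as prompt-response pairs in JSONL; beyond that, the schema isn't shown on this page. A minimal loading sketch, where the field names prompt / response and the filename corpus_v2.jsonl are assumptions rather than the repo's actual layout:

python · loading a mixed JSONL corpus (sketch)
import json

def load_corpus(path: str) -> list[list[dict]]:
    """Read prompt-response pairs from JSONL into chat-format message lists.
    The keys "prompt" / "response" are assumptions; the repo's schema may differ."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            examples.append([
                {"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["response"]},
            ])
    return examples

pairs = load_corpus("corpus_v2.jsonl")  # v2: 20,741 examples across math, science, reasoning, tools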
03 · Triton SDPA

10.7× faster on Windows

Compiling SDPA attention through Triton on Windows took the per-step time from 25.0 s to 5.8 s — a 10.7× speedup. The full 20K corpus fits in one epoch (2,593 steps) in 100 minutes. Gradient checkpointing and 8-bit paged AdamW keep memory pinned under 2 GB on a 4 GB card.

Steps / sec: 0.43
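How the Triton-compiled SDPA path is wired up isn't shown on this page; one common route on recent PyTorch is torch.compile, whose inductor backend lowers attention and surrounding ops into Triton kernels. The snippet below is that assumption, not necessarily what finetune_v2_full.py does.

python · one way to get Triton-compiled attention (assumption)
import torch

# Assumption: compiling the (already PEFT-wrapped) model generates Triton
# kernels for SDPA and surrounding ops; the repo's actual mechanism may differ.
model = torch.compile(model, mode="max-autotune")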
finetune_v2_full.py · NF4 + LoRA setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",   # required when FLA is installed
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 10,911,744 / 1,889,488,896  =  0.58%
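For context, a set of Trainer arguments consistent with the hyperparameters quoted above. The 1 × 8 split of the effective batch (20,741 examples over 2,593 steps ≈ 8 per optimizer step) and the logging/save cadence are assumptions, and whether the repo uses the Trainer API at all is not shown here.

python · training arguments matching the quoted hyperparameters (sketch)
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ava-v2",
    num_train_epochs=1,             # one pass over the 20,741-example corpus
    per_device_train_batch_size=1,  # assumption: 1 × 8 accumulation ≈ effective batch 8
    gradient_accumulation_steps=8,
    learning_rate=1.5e-4,           # 1.5e-4 cosine, as in the training summary
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=50,               # assumption: cadence not stated on this page
    save_strategy="epoch",
)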
RTX A2000 LAPTOP  ·  4 GB VRAM  ·  32 GB RAM  ·  WINDOWS
corpus · 20,741 pairs
  • math · science
  • reasoning · tools
  • JSONL on disk
    ─►
trainer · Qwen3.5-2B · NF4 4-bit
  • q · k · v · o · gate · up · down projections
  • attn: Triton-compiled SDPA
  • optim: paged_adamw_8bit · bf16
  • gradient checkpointing
    ─►
adapter · LoRA r=16 · 42 MB
  • safetensors
  • 0.58% trainable
  • exported as Q4_K_M GGUF · Q8_0 GGUF

peak 1.81 GB · wall 100 min · 2,593 steps · final loss 0.4145 · cost $0 / month
§04 TRY AVA V2 Local · CPU or GPU · < 2 GB VRAM
Q4_K_M ~1.5 GB · Q8_0 ~2.0 GB
terminal · no python required
# 1. download a GGUF from huggingface.co/NAME0x0/AVA-v2-GGUF
# 2. build the local model from the bundled Modelfile
ollama create ava-v2 -f Modelfile

# 3. chat
ollama run ava-v2

# works on CPU, Apple Silicon, AMD, NVIDIA. No Python needed.
python · transformers + peft
# pip install -e .[bench]; pip install peft
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
m  = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3.5-2B", quantization_config=bnb,
        device_map="auto", dtype=torch.bfloat16,
        attn_implementation="sdpa")
m  = PeftModel.from_pretrained(m, "NAME0x0/AVA-v2").merge_and_unload()

msgs = [{"role": "user", "content": "Why does ice float on water?"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids  = tok(text, return_tensors="pt").to(m.device)
out  = m.generate(**ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
huggingface · model card + adapter
# LoRA adapter (42 MB, safetensors)
huggingface.co/NAME0x0/AVA-v2

# pre-quantized GGUFs for Ollama / llama.cpp
huggingface.co/NAME0x0/AVA-v2-GGUF
  ├─ AVA-v2-Q4_K_M.gguf    ~1.5 GB    recommended
  └─ AVA-v2-Q8_0.gguf      ~2.0 GB    near-lossless (used for the 17-bench eval)

# one-line load in code
from peft import PeftModel
PeftModel.from_pretrained(base, "NAME0x0/AVA-v2")
§05 HONEST LIMITS What v2 is not yet good at
MATH
Math is the weak area. GSM8K reaches 35.3% greedy and 44.0% with k=5 self-consistency. MATH-500 sits at 18.8%. Self-consistency buys roughly +9 pp but doesn't close the gap with stronger 2-3B math models.
35.3% / 18.8%
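The k=5 self-consistency number comes from sampling several chains of thought and majority-voting the final answer. A minimal sketch, assuming the last number in each completion is the final answer (the repo's extraction logic may differ):

python · k-sample self-consistency for GSM8K-style problems (sketch)
import re
from collections import Counter
import torch

@torch.no_grad()
def self_consistency(model, tok, prompt: str, k: int = 5) -> str:
    """Sample k chains of thought and majority-vote the extracted answers."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    outs = model.generate(**ids, max_new_tokens=512, do_sample=True,
                          temperature=0.7, num_return_sequences=k)
    votes = []
    for seq in outs:
        text = tok.decode(seq[ids["input_ids"].shape[1]:], skip_special_tokens=True)
        nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if nums:
            votes.append(nums[-1])  # assumption: last number is the final answer
    return Counter(votes).most_common(1)[0][0] if votes else ""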
TOOLS
Tool use was trained for but remains mostly latent. The agentic GSM8K harness invoked the calculator on only 0.6% of problems. The fine-tune corpus had roughly 55 tool-use examples in a ~20K-example mix, so the model defaults to direct chain-of-thought.
0.6% invoke rate
MULTILINGUAL
Multilingual transfer is partial. MGSM lands at 42.8% in English, 32.0% Spanish, 28.4% French. The model can read other languages but reasoning quality degrades outside English.
en 42.8 → fr 28.4
COMMONSENSE
HellaSwag sits below most peers: 56.8% on the full 10,042-question set. Narrative commonsense is under-represented in the v2 corpus, and it shows.
56.8%
PROTOCOL
Eval uses letter-argmax MCQ scoring. Directionally comparable to leaderboards but not numerically identical to lm-eval-harness. Treat the AVA numbers as internally consistent and the relative ranking as the headline.
Wilson 95% CI
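Letter-argmax means the next-token scores of the candidate answer letters are compared directly, with no free-form generation. The actual harness runs against llama-server over the Q8_0 GGUF; the transformers version below only illustrates the idea, and the prompt and tokenization details are assumptions.

python · letter-argmax MCQ scoring (illustration)
import torch

@torch.no_grad()
def letter_argmax(model, tok, prompt: str, letters=("A", "B", "C", "D")) -> str:
    """Return the answer letter with the highest next-token logit.
    Assumes the prompt ends right before the answer letter (e.g. "Answer:")."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    next_logits = model(**ids).logits[0, -1]
    # leading space so each letter maps to a single token in typical BPE vocabularies
    cand = [tok.encode(" " + l, add_special_tokens=False)[0] for l in letters]
    return letters[int(torch.argmax(next_logits[cand]))]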
§06 WHAT'S NEXT — AVA V3 Currently P0 · scaffolding

5× the capacity per GB of VRAM.

Distill Qwen 3.6 35B-A3B into a 6-8B ternary MoE student with native Gated DeltaNet attention and MCP-based tool routing: ternary FFN experts, a BF16 router, and one BF16 shared expert (MoTE); a 3:1 hybrid attention layout with linear-time decode; BitDistiller three-stage QAT.

§07 RESEARCH TRACK Logs · reports · weights