Released AVA v2 Author Muhammad Afsah Mumtaz Hardware 1× RTX A2000 Laptop · 4 GB VRAM License Apache-2.0 / Qwen

Capable AI on a 4 GB laptop GPU.

AVA v2 — a 42 MB adapter that scores 82.0% ARC-Challenge on the full 1,172-question set, beating Llama 3.2 3B-Instruct (78.6%). Trained in 100 minutes on a single consumer card. No cloud, no cluster, no budget.

View benchmarks GitHub

─ training summary ─

base modelQwen3.5-2B

attentionSDPA + Triton

quantizationNF4 (BitsAndBytes)

optimizerpaged_adamw_8bit

adapterLoRA r=16, α=32

lr / schedule1.5e-4 cosine

trainable10.9M / 1.89B · 0.58%

corpus20,741 examples

peak VRAM1.81 GB

final loss0.4145

wall time100 min · 2,593 steps

adapter size42 MB

§01 THE CONSTRAINT One laptop · No cloud · No budget

Most AI labs assume 80 GB GPUs and cluster compute. AVA assumes one laptop, no cloud, no budget. Everything in this project — training, evaluation, serving — fits in 4 GB of VRAM and runs on a $400 laptop GPU.

typical lab

8× H100, 640 GB VRAM

GPUs 8× H100 80GB
VRAM total 640 GB
Cluster NVLink / Infiniband
Cloud cost ~$25,000 / mo
Engineers research team
Setup cluster, scheduler, ops

ava

1× RTX A2000, 4 GB VRAM

GPUs 1× RTX A2000 Laptop
VRAM total 4 GB
Cluster none — Windows + WSL
Cloud cost $0 / mo
Engineers 1 person
Setup pip install

§03 HOW IT WORKS QLoRA · curated 20K · Triton SDPA

01 · QLoRA

4-bit base, 0.58% trainable

The full Qwen3.5-2B base loads in NF4 4-bit through BitsAndBytes — about 1.4 GB resident. A LoRA adapter (rank 16, alpha 32) attaches to every attention and MLP projection. We update 10.9M parameters out of 1.89B and freeze the rest. Peak VRAM during training stays at 1.81 GB.

Adapter42 MB

02 · Curated 20K

Math, science, reasoning, instructions

20,741 prompt-response pairs, hand-mixed: GSM8K-style step-by-step math, ARC + SciQ + OpenBookQA-style science, instruction following, tool use, and teacher-distilled chains for harder reasoning. v1 used 5K examples and showed zero ARC gain. v2 used 4× the data and jumped +13 pp.

Lift over base+16.0 pp ARC-C

03 · Triton SDPA

10.7× faster on Windows

Compiling SDPA attention through Triton on Windows took the per-step time from 25.0 s to 5.8 s — a 10.7× speedup. The full 20K corpus fits in one epoch (2,593 steps) in 100 minutes. Gradient checkpointing and 8-bit paged AdamW keep memory pinned under 2 GB on a 4 GB card.

Steps / sec0.43

finetune_v2_full.py · NF4 + LoRA setup

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",   # required when FLA is installed
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 10,911,744 / 1,889,488,896  =  0.58%

RTX A2000 LAPTOP · 4 GB VRAM · 32 GB RAM · WINDOWS

corpus 20,741 pairs

math · science
reasoning · tools
JSONL on disk

─►

trainer Qwen3.5-2B · NF4 4-bit

q · k · v · o · gate · up · down projections
attn: Triton-compiled SDPA
optim: paged_adamw_8bit · bf16
gradient checkpointing

─►

adapter LoRA r=16 · 42 MB

safetensors
0.58% trainable

Q4_K_M GGUF Q8_0 GGUF

peak 1.81 GB wall 100 min · 2,593 steps final loss 0.4145 cost $0 / month

§04 TRY AVA V2 Local · CPU or GPU · < 2 GB VRAM

Q4_K_M ~1.5 GB Q8_0 ~2.0 GB

terminal · no python required

# 1. download a GGUF from huggingface.co/NAME0x0/AVA-v2-GGUF
# 2. build the local model from the bundled Modelfile
ollama create ava-v2 -f Modelfile

# 3. chat
ollama run ava-v2

# works on CPU, Apple Silicon, AMD, NVIDIA. No Python needed.

python · transformers + peft

# pip install -e .[bench]; pip install peft
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
m  = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3.5-2B", quantization_config=bnb,
        device_map="auto", dtype=torch.bfloat16,
        attn_implementation="sdpa")
m  = PeftModel.from_pretrained(m, "NAME0x0/AVA-v2").merge_and_unload()

msgs = [{"role": "user", "content": "Why does ice float on water?"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids  = tok(text, return_tensors="pt").to(m.device)
out  = m.generate(**ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))

huggingface · model card + adapter

# LoRA adapter (42 MB, safetensors)
huggingface.co/NAME0x0/AVA-v2

# pre-quantized GGUFs for Ollama / llama.cpp
huggingface.co/NAME0x0/AVA-v2-GGUF
  ├─ AVA-v2-Q4_K_M.gguf    ~1.5 GB    recommended
  └─ AVA-v2-Q8_0.gguf      ~2.0 GB    near-lossless (used for the 17-bench eval)

# one-line load in code
from peft import PeftModel
PeftModel.from_pretrained(base, "NAME0x0/AVA-v2")

§05 HONEST LIMITS What v2 is not yet good at

MATH

Math is the weak area. GSM8K reaches 35.3% greedy and 44.0% with k=5 self-consistency. MATH-500 sits at 18.8%. Self-consistency buys roughly +9 pp but doesn't close the gap with stronger 2-3B math models.

35.3% / 18.8%

TOOLS

Tool-use was trained but mostly latent. The agentic GSM8K harness invoked the calculator on only 0.6% of problems. The fine-tune corpus had ~55 tool-use examples vs 20K math examples; the model defaults to direct chain-of-thought.

0.6% invoke rate

MULTILINGUAL

Multilingual transfer is partial. MGSM lands at 42.8% in English, 32.0% Spanish, 28.4% French. The model can read other languages but reasoning quality degrades outside English.

en 42.8 → fr 28.4

COMMONSENSE

HellaSwag is below most peers. 56.8% on the full 10,042-question set. Narrative commonsense is under-represented in the v2 corpus and shows it.

56.8%

PROTOCOL

Eval uses letter-argmax MCQ scoring. Directionally comparable to leaderboards but not numerically identical to lm-eval-harness. Treat the AVA numbers as internally consistent and the relative ranking as the headline.

Wilson 95% CI

Capable AI on a 4 GB laptop GPU.

8× H100, 640 GB VRAM

1× RTX A2000, 4 GB VRAM

4-bit base, 0.58% trainable

Math, science, reasoning, instructions

10.7× faster on Windows

5× capacity per VRAM.

Full evaluation report

EXPERIMENT_LOG.md

AVA v3 DESIGN.md

HuggingFace model card

GitHub repository

Reproduce locally

Full documentation

Sponsor AVA development