İvme-Conversate-22M-Base


İvme (Turkish for "acceleration") is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.

The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.


Model Details

| Parameter | Value |
| --- | --- |
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |
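The parameter count can be reproduced from the table with a little arithmetic. The sketch below assumes a head dimension of hidden / query heads (64), three weight matrices in the SwiGLU block, two RMSNorm gains per layer plus one final norm, and tied embeddings counted once; these layout details are inferred from the table, not read from model.py.

```python
# Reconstructing the parameter count from the Model Details table.
# Assumptions (not confirmed by model.py): head_dim = hidden / query heads,
# SwiGLU has gate/up/down matrices, two RMSNorms per layer plus a final one.
vocab, hidden, ffn, layers = 16_384, 384, 1_024, 10
q_heads, kv_heads = 6, 2
head_dim = hidden // q_heads                  # 64

embed = vocab * hidden                        # tied with the output head, counted once
attn = (
    hidden * q_heads * head_dim               # Q projection
    + 2 * hidden * kv_heads * head_dim        # K and V projections (2 shared KV heads)
    + q_heads * head_dim * hidden             # output projection
)
mlp = 3 * hidden * ffn                        # gate, up and down projections
norms = 2 * hidden                            # RMSNorm gains before attention and MLP
per_layer = attn + mlp + norms
total = embed + layers * per_layer + hidden   # + final RMSNorm gain

print(f"{total:,}")  # 22,028,160
```

Under these assumptions roughly 6.3M of the 22M parameters sit in the tied embedding matrix, and each query group of 3 heads shares one KV head.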

Benchmarks

All benchmarks were run with the EleutherAI lm-evaluation-harness, 0-shot. WikiText-2 is reported as byte_perplexity so that the score can be compared across models with different tokenizers.

| Benchmark | Score | Notes |
| --- | --- | --- |
| WikiText-2 (byte_perplexity) ↓ | 2.96 | Lower is better |
| BLiMP ↑ | 61.40% | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | 30.85% | acc_norm, 0-shot |
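To make the byte_perplexity number easier to interpret, here is a minimal sketch of how byte-level perplexity relates to bits per byte and to ordinary token-level perplexity. The function names are illustrative and not part of the harness; the definitions assume the usual convention that byte perplexity is the exponential of the negative log-likelihood averaged over UTF-8 bytes.

```python
import math

def byte_perplexity(total_nll_nats: float, n_bytes: int) -> float:
    """exp of the total negative log-likelihood (in nats) divided by the byte count."""
    return math.exp(total_nll_nats / n_bytes)

def bits_per_byte(byte_ppl: float) -> float:
    """The same quantity expressed on a log2 scale."""
    return math.log2(byte_ppl)

def token_ppl_to_byte_ppl(token_ppl: float, n_tokens: int, n_bytes: int) -> float:
    """Rescale a token-level perplexity by the tokens-per-byte ratio of the tokenizer."""
    return token_ppl ** (n_tokens / n_bytes)

# The reported WikiText-2 score, expressed in bits per byte.
print(round(bits_per_byte(2.96), 3))  # 1.566
```

Because the normalizer is the byte count of the text rather than the token count, a tokenizer that produces fewer or more tokens does not by itself move the score.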

Training

Data Mix (~1.57B tokens, Chinchilla-optimal)

Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.

| Source | Tokens | Share |
| --- | --- | --- |
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
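As a quick consistency check on the mix, the sketch below recomputes the shares from the rounded token counts and derives the tokens-per-parameter ratio. The variable names are illustrative; the token counts are the approximate figures from the table, so the total comes out near 1.58B rather than exactly 1.57B.

```python
# Approximate token counts from the table, in millions.
mix_m_tokens = {
    "epfml/FineWeb-HQ": 710,
    "bigcode/python-stack-v1-functions-filtered": 160,
    "HuggingFaceTB/finemath": 235,
    "HuggingFaceTB/cosmopedia": 395,
    "wikimedia/wikipedia": 80,
}

total_m = sum(mix_m_tokens.values())
shares = {name: round(100 * n / total_m) for name, n in mix_m_tokens.items()}

params = 22_028_160
tokens_per_param = 1.57e9 / params  # using the headline ~1.57B figure

print(total_m)                     # 1580
print(list(shares.values()))       # [45, 10, 15, 25, 5]
print(round(tokens_per_param, 1))  # 71.3
```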

Hyperparameters

| Setting | Value |
| --- | --- |
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |
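The Warmup-Stable-Decay schedule can be sketched as a multiplier applied to each optimizer's peak learning rate. The step counts come from the table; the linear warmup, the linear decay to zero, and the function name wsd_scale are assumptions, since the card does not state the decay shape.

```python
def wsd_scale(step: int, total_steps: int = 1447,
              warmup_steps: int = 100, decay_frac: float = 0.20) -> float:
    """Learning-rate multiplier in [0, 1] for a 0-indexed training step.

    Assumed shape: linear warmup, flat plateau, linear decay towards zero.
    """
    decay_steps = int(decay_frac * total_steps)   # 289 steps
    decay_start = total_steps - decay_steps       # step 1158
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # warmup phase
    if step < decay_start:
        return 1.0                                # stable phase
    return max(0.0, (total_steps - step) / decay_steps)  # decay phase

# Peak learning rates from the table, scaled by the shared schedule.
for step in (0, 99, 600, 1158, 1446):
    s = wsd_scale(step)
    print(step, round(0.02 * s, 6), round(3e-4 * s, 8))
```

With these numbers the plateau covers steps 100 through 1157, so most of the run trains at the peak rates, and the final 289 steps anneal both optimizers together.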

Hardware

Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately 20 minutes.


Tokenizer

Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

Special tokens: <|pad|>, <|bos|>, <|eos|>, <|unk|>, <|user|>, <|assistant|>, <|system|>
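ByteLevel pre-tokenization means the BPE merges operate on UTF-8 bytes rather than Unicode characters, so any input string is representable without falling back to an unknown token. The sketch below reimplements the GPT-2 style byte-to-character table that byte-level pre-tokenizers are commonly built on; it is an illustration in plain Python, not the tokenizer's actual code, and the helper names are hypothetical.

```python
def bytes_to_unicode() -> dict:
    """Map each of the 256 byte values to a distinct printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own code point.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in keep}
    # Remaining bytes (controls, space, etc.) are shifted to code points 256 and up.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table

def to_byte_level(text: str) -> str:
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))

def from_byte_level(symbols: str) -> str:
    inverse = {c: b for b, c in bytes_to_unicode().items()}
    return bytes(inverse[c] for c in symbols).decode("utf-8")

print(to_byte_level("a b"))   # aĠb  (the space byte maps to Ġ)
print(to_byte_level("İvme"))  # Ä°vme (İ is two UTF-8 bytes)
```

The mapping is lossless, so decoding the symbols recovers the original string exactly.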


Usage

import torch
from tokenizers import Tokenizer

# Custom code, not a standard HF AutoModel (see model.py).
# IvmeConfig is not used directly; it is likely needed so torch.load can unpickle ckpt["cfg"].
from model import IvmeConfig, IvmeConversate

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn
model = IvmeConversate(cfg).cuda()
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
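For readers unfamiliar with the temperature and top_k arguments, here is a minimal pure-Python sketch of temperature-scaled top-k sampling for a single decoding step. It illustrates the general technique only; the actual generate method in model.py may differ, and the function name sample_top_k is hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Pick one token id from the top_k highest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]   # < 1.0 sharpens, > 1.0 flattens
    k = min(top_k, len(scaled))
    # Indices of the k largest scaled logits; everything else gets zero probability.
    top = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:k]
    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy vocabulary of five tokens; only the two best candidates can be drawn.
print(sample_top_k([0.1, 2.5, -1.0, 2.0, 0.3], temperature=0.8, top_k=2))
```

Setting top_k=1 reduces this to greedy decoding, which is a convenient way to get deterministic output when comparing checkpoints.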

Limitations

  • Base model only — not instruction tuned, will not follow instructions or answer questions
  • English only (v1)
  • Limited factual knowledge, a consequence of the small parameter count and the short Chinchilla-optimal training budget (~1.57B tokens)
  • Repetition at higher temperatures without repetition_penalty
  • 1024 token context window
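The repetition issue noted above is usually mitigated by penalizing tokens that have already been generated. The sketch below shows the common rule of dividing positive logits and multiplying negative logits by the penalty; whether and how model.py implements repetition_penalty is an assumption here, and the function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated token ids made less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty   # shrink positive logits towards zero
        else:
            out[token_id] *= penalty   # push negative logits further down
    return out

# Tokens 0 and 1 were already produced, token 2 was not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the logits unchanged; values slightly above 1.0 are typically applied before temperature scaling and top-k filtering.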

What's Next

  • İvme-Conversate-22M-Instruct — SFT on smol-smoltalk for instruction following
  • İvme-Conversate-v2 — extended training (~15B tokens), reordered curriculum
  • Turkish support — v2 will add EN+TR with a dedicated bilingual tokenizer
  • İvme-Classify — encoder-only series for classification tasks

Citation

@misc{ivme-conversate-22m,
  author       = {IvmeLabs},
  title        = {İvme-Conversate-22M-Base},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}

Built by IvmeLabs. Small models, deliberate choices.
