OpenYourMind-NVIDIA-Nemotron-3-Ultra-550B-A55B-abliterated-uncensored-NVFP4
Overview
NVFP4 weights of Nemotron-3-Ultra-550B-A55B-abliterated-uncensored — an abliterated, uncensored
variant of NVIDIA's Nemotron-3-Ultra-550B-A55B
(550B total / 55B active). The model keeps Nemotron-3's hybrid Mamba-2 / Attention / Latent-MoE
reasoning stack fully intact — including the MTP speculative-decoding head and the enable_thinking
reasoning mode — so this checkpoint is a drop-in replacement for the original at the architecture level
and serves out-of-the-box on vLLM.
The pipeline:
- Refusal Ablation: A residual-stream refusal direction was extracted by diff-in-means on a labeled harmful/harmless prompt set, read at the end of the model's own reasoning trace (`</think>`), then baked into the weights as an offline orthogonal projection on the residual-write modules. This was done with our own custom abliteration framework, which operates directly on the packed NVFP4 tensors (dequantize → project → requantize), with no full-precision decompress and no training.
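The two ideas named in this step, diff-in-means extraction and orthogonal projection, can be illustrated on toy data. The following is a minimal sketch, not the framework used for this checkpoint: it works on random full-precision NumPy arrays, skips the dequantize and requantize stages entirely, and the function names `refusal_direction` and `project_out` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Diff-in-means direction between two activation sets, normalized to unit length."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project_out(weight, direction):
    """Remove `direction` from the output space of a residual-write matrix.

    `weight` has shape (d_model, d_in) and maps a module's input into the
    residual stream. The result equals (I - r r^T) @ weight, so the module
    can no longer write along r.
    """
    return weight - np.outer(direction, direction @ weight)

# Toy data only: random activations with an artificial shift along axis 0.
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_ablated = project_out(W, r)

# Largest remaining component of the module's output along r (numerically zero).
residual_component = np.abs(r @ W_ablated).max()
print(residual_component)
```

Because the projection is applied to the weights once, offline, inference needs no hooks or extra compute.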
Key Properties:
- Uncensored across the standard refusal axes
- Reasoning preserved (hybrid Mamba/Attention/MoE + MTP; `enable_thinking` works)
- Coherence & factual accuracy preserved (layer-profiled, capability-orthogonalized edit)
- Native NVFP4 — drop-in shape/format compatibility with the base release; serves on vLLM as-is
Architecture
| Property | Value |
|---|---|
| Architecture | NemotronHForCausalLM (model_type: nemotron_h) |
| Total / Active Parameters | 550B / 55B |
| Layers | 108 — 48 Mamba-2 · 48 Latent-MoE · 12 Attention (hybrid) |
| Hidden Size | 8192 |
| Routed / Shared Experts | 512 routed (22 active/token, 2048-dim latent space) · 1 shared |
| Attention | 64 heads / 2 KV heads |
| Multi-Token Prediction | 1 MTP layer (native speculative decoding) |
| Vocabulary | 131,072 |
| Context Length | up to 1M tokens (256K default) |
| Quantization | NVFP4 (modelopt, mixed-precision: select layers FP8/BF16) |
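One consequence of the hybrid layout is that only the 12 attention layers hold a KV cache that grows with sequence length; Mamba-2 layers keep a fixed-size state. The sketch below estimates that cache for one sequence. Two values are assumptions not stated in the table: `head_dim=128` (hidden size divided by head count) and a 2-byte (BF16) cache; check `config.json` before relying on the result.

```python
def kv_cache_bytes(tokens, attn_layers=12, kv_heads=2, head_dim=128, bytes_per_value=2):
    """Bytes of K and V cache for one sequence, counting attention layers only.

    attn_layers and kv_heads come from the table above; head_dim and
    bytes_per_value are assumptions (8192 / 64 heads, BF16 cache).
    """
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token

default_ctx = 262_144
gib = kv_cache_bytes(default_ctx) / 1024**3
print(f"{gib:.1f} GiB of KV cache at {default_ctx} tokens")  # prints 3.0 GiB under these assumptions
```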
Files
113 NVFP4 safetensors shards + config.json, model.safetensors.index.json, tokenizer,
chat_template.jinja, generation_config.json. Total on disk: ~329 GB.
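To check a download against the shard count above, one option is to read `model.safetensors.index.json`, whose `weight_map` maps each tensor name to the shard file that stores it. This is a minimal sketch: `summarize_index` is a hypothetical helper, and the toy index uses made-up tensor and shard names.

```python
import json

def summarize_index(index):
    """Return (number of distinct shard files, metadata total_size or None)."""
    shards = set(index["weight_map"].values())
    total_size = index.get("metadata", {}).get("total_size")
    return len(shards), total_size

# Toy stand-in for model.safetensors.index.json (names and sizes are made up).
toy_index = json.loads("""
{
  "metadata": {"total_size": 3000},
  "weight_map": {
    "layers.0.a": "model-00001-of-00002.safetensors",
    "layers.0.b": "model-00001-of-00002.safetensors",
    "layers.1.a": "model-00002-of-00002.safetensors"
  }
}
""")

shard_count, total_size = summarize_index(toy_index)
print(shard_count, total_size)

# For a real download, load the file instead:
# with open("model.safetensors.index.json") as f:
#     print(summarize_index(json.load(f)))
```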
Usage (vLLM)
vllm serve OpenYourMind/OpenYourMind-NVIDIA-Nemotron-3-Ultra-550B-A55B-abliterated-uncensored-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--max-model-len 262144 \
--reasoning-parser nemotron_v3
Fits a single node of 4× B200 / B300 (or 8× H100). On large-VRAM Blackwell it also runs at
--tensor-parallel-size 2. Requires vLLM ≥ 0.22.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = client.chat.completions.create(
model="OpenYourMind/OpenYourMind-NVIDIA-Nemotron-3-Ultra-550B-A55B-abliterated-uncensored-NVFP4",
messages=[{"role": "user", "content": "Your prompt here"}],
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(r.choices[0].message.content)
Best Practices
- Sampling: `temperature=1.0`, `top_p=0.95` (the values in `generation_config.json`). A mild `repetition_penalty` (~1.1) is recommended for long generations.
- Thinking mode: set `enable_thinking=True` in `chat_template_kwargs`; reasoning streams inside `<think>…</think>` before the answer. Do not feed previous-turn reasoning back into multi-turn history.
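Both recommendations can be combined in a small client-side helper. The sketch below is one possible arrangement, with hypothetical names `strip_reasoning` and `request_kwargs`. It assumes the reply text may still contain a `<think>` block; when the server runs with a reasoning parser, the reasoning may arrive in a separate field and the content may already be clean. `repetition_penalty` is passed through `extra_body` because it is a vLLM sampling parameter rather than a standard OpenAI one.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(text):
    """Drop <think>...</think> blocks so only the final answer enters the history."""
    return THINK_BLOCK.sub("", text).strip()

def request_kwargs(messages, thinking=True):
    """Keyword arguments for client.chat.completions.create using the card's sampling values."""
    return {
        "messages": messages,
        "temperature": 1.0,
        "top_p": 0.95,
        "extra_body": {
            "repetition_penalty": 1.1,  # vLLM-specific sampling parameter
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }

history = [{"role": "user", "content": "What is 2 + 2?"}]
raw_reply = "<think>Simple arithmetic.</think>\n4"  # made-up reply for illustration
history.append({"role": "assistant", "content": strip_reasoning(raw_reply)})
history.append({"role": "user", "content": "And doubled?"})

kwargs = request_kwargs(history)  # pass as client.chat.completions.create(model=..., **kwargs)
print(history[1]["content"])
```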
Hardware
NVFP4 weights are ~329 GB. Single-node 4× B200 / 4× B300 (or 8× H100) for full context; expert-parallel recommended. Smaller deployments work at reduced context with tensor-parallel-size 2 on high-VRAM Blackwell cards.
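A rough way to read these configurations is to split the weights evenly across GPUs and see what remains for KV cache, Mamba state, and activations. The sketch below is back-of-envelope arithmetic only. The per-GPU memory figures are assumed vendor headline numbers, not values from this card, and real deployments lose further memory to runtime overhead and uneven sharding.

```python
WEIGHTS_GB = 329  # on-disk size stated in this card

# Assumed per-GPU memory in GB (headline figures, not from the card).
GPU_MEMORY_GB = {"H100": 80, "B200": 192, "B300": 288}

def headroom_gb(gpu, count):
    """Per-GPU memory left after an even split of the weights across `count` GPUs."""
    return GPU_MEMORY_GB[gpu] - WEIGHTS_GB / count

for gpu, count in [("B200", 4), ("B300", 4), ("H100", 8), ("B200", 2), ("B300", 2)]:
    print(f"{count}x {gpu}: {headroom_gb(gpu, count):.1f} GB per GPU left after weights")
```

Under these assumptions the two-GPU B200 case leaves the least headroom, which is consistent with the reduced-context advice above.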
Support & Community
- Discord: https://discord.gg/rhUZY5GEZr
- Bitcoin Donations: `bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv`
- Full Weights: If there is interest in BF16 weights, ping me on Discord.
Notes
- License: OpenMDW-1.1 (inherits from the base model)
- Base Model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
- Format: NVFP4 (mixed-precision)
- Architecture: Nemotron-3 Ultra (hybrid Mamba-2 / Attention / Latent-MoE, 550B/A55B)
Thanks
- NVIDIA — for the Nemotron-3 open models.
Disclaimer
Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, the OpenMDW-1.1 license terms, and your deployment requirements.