GPU in your YAML: Llama 70B for the price of coffee
One line in nexlayer.yaml pins a production LLM to your deployment: no CUDA setup, no model downloads, no cold-start waits. Mode 2 large-pinned inference at $1.25 an hour.
If you have shipped an AI product recently, you already know the tax. Provision a GPU node. Install CUDA. Pull 70 gigabytes of weights. Warm the KV cache. Keep a card hot 24/7 so the first request does not hang. Per-token prices on hosted APIs look reasonable until the invoice lands at $2,000 a month for a chatbot that could have run on a fraction of a card.
We fixed it. You choose the model, Nexlayer launches your environment, and the GPU is already warm.
application:
  name: my-chat-backend
  pods:
    - name: api
      path: /api
      image: myorg/chat-backend:v1
      servicePorts: [8000]
      gpu:
        enabled: true
        model: llama-3.3-70b
        priority: inference
Deploy. The scheduler pins Llama 3.3 70B to a card. It injects NXL_INFERENCE_URL into your pod's env before your container starts. Your code calls that URL with OpenAI-compatible requests. You do not see CUDA. You do not see the weights. You do not wait for a model pull.
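From app code, the hot path looks like any other OpenAI client. A minimal sketch using the openai Python SDK, assuming the injected URL is the API root and that the endpoint accepts a placeholder key; neither detail is spelled out above:

import os
from openai import OpenAI

# Nexlayer injects NXL_INFERENCE_URL into the pod env before the container starts.
client = OpenAI(
    base_url=os.environ["NXL_INFERENCE_URL"],
    api_key="unused",  # assumption: the pinned endpoint ignores the key
)

reply = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Say hello from the pinned card."}],
)
print(reply.choices[0].message.content)

Point the same client at a Mode 3 slug and nothing else changes.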
What you actually get
Three modes, one syntax.
Mode 3 — Shared pinned. Small and mid-size models (Llama 3.1 8B, Qwen 2.5 Coder 7B, Phi 3.5 Mini, embeddings) packed onto a card alongside other tenants. 500 credits per hour. That is fifty cents.
Mode 2 — Large pinned. 70B-class reasoning and chat models (Llama 3.3 70B, DeepSeek-R1 Distill 70B, Qwen 2.5 Coder 32B). Dedicated slot on a 96GB card. 1,250 credits per hour. That is a dollar and a quarter.
Mode 1 — Dedicated (Enterprise). Your own model server. gpu.model: custom, set memoryGB, bring vLLM or TGI or raw CUDA. Full card, no neighbors. 2,500 credits per hour.
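For Mode 1 the block is the same shape with the slug swapped for custom. A sketch, assuming memoryGB lives under gpu next to the other fields; the prose above names the key but not its placement:

gpu:
  enabled: true
  model: custom
  memoryGB: 96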
1,000 credits equal one dollar. A Pro subscription at $29 a month covers sixty hours on a shared card. Scale at $299 covers 240 hours on a large pinned card. Enterprise at $2,999 covers a full month of a dedicated card for one workload or a fleet of smaller ones.
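The plan math checks out: $299 is 299,000 credits, a Mode 2 slot burns 1,250 an hour, and 299,000 divided by 1,250 lands just shy of 240.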
The hardware
NVIDIA RTX PRO 6000. 96GB GDDR7. Blackwell architecture. 240 TFLOPS of FP16. One card fits a 70B FP8 model with room for a healthy KV cache, or four or five small models at once. We run this across our fleet today. We are adding more as we grow.
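The fit is plain arithmetic: 70 billion parameters at one byte each in FP8 is about 70GB of weights, leaving roughly 26GB of the 96GB card for KV cache and activations.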
Why this matters if you are building
If you are a chatbot startup, your per-user inference cost just dropped from twenty dollars a month on a hosted API to less than a dollar on Nexlayer. You keep the margin.
If you are a coding assistant, autocomplete on qwen-2.5-coder-7b runs at sub-50ms time-to-first-token on shared pinned. Good enough to put in the editor.
If you are running an agent, deepseek-r1-distill-llama-70b-fp8 gives you GPT-4 class reasoning at the price point of a hobby project. Route agent calls at the YAML level, not through an OpenRouter layer.
If you are doing embeddings, nomic-embed is a pinned Mode 3 model. MTEB-competitive, 768-dim, priced like a rounding error.
If you have your own model — a fine-tune, a niche multimodal thing, an unreleased weight set you got from a research lab — Mode 1 on Enterprise gives you a raw card and your choice of runtime. We stay out of the way.
Free tier, actually
Every account starts on Free with 5,000 credits a month. gpu.model: auto lets the scheduler drop you on a Mode 3 small model. Ten hours of actual GPU time, no card on file, no call with sales. Build a demo, burn through credits, upgrade when it matters.
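The whole free-tier config is the block from the top of this post with the slug swapped:

gpu:
  enabled: true
  model: auto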
What is next
A model-browser UI that matches the catalog on our pricing page, so you can pick a slug without reading docs. Streaming response adapters for agents that want server-sent events instead of the OpenAI protocol. More cards: H100 class for the workloads that actually need it. A Mode-1-lite for users who want to bring their own model on the Scale plan.
The GPU section of our pricing page has the full catalog: chat, reasoning, code, embeddings, and custom. Every slug, every mode, every plan. If you are ready to stop paying the CUDA tax, the entire onboarding is one YAML field.