Serverless

Most cost-effective AI Inference

4 out of 5 developers ranked us the most cost-effective GenAI inference provider - with access to popular models and zero rate limits.

- Llama 4 Scout 17B 16E Instruct (Meta) - Text Generation
- Qwen 2.5 Coder 3B Instruct (Qwen) - Text Generation
- Qwen 2.5 Coder 7B Instruct (Qwen) - Text Generation
- Qwen 2.5 Coder 32B Instruct (Qwen) - Text Generation
- Qwen QwQ 32B (Qwen) - Text Generation
- Llama 3.3 70B Instruct (Meta) - Text Generation
Lower cost, more power
Our fully optimised stack cuts out the inefficiencies you pay for elsewhere. You get high-performance serverless at a fraction of the usual cost - savings we pass directly to you.
Engineered for AI workloads
Get all the cost and performance benefits of our fully integrated stack, purpose-built for AI workloads at any scale.
Scale without the overhead
From testing to production, scale your AI workloads with no bottlenecks and no setup - just results.

Models & Pricing

Prices are per 1 million tokens, covering both input and output, for Chat, Multimodal, Language and Code models. Image model pricing is based on image size and step count; a worked example follows the table below.
Request Access
| Serverless Endpoint | Type | Price (per 1M tokens unless noted) |
| --- | --- | --- |
| Qwen QwQ 32B | Text Generation | $0.18 input / $0.20 output |
| Qwen 2.5 Coder 32B Instruct | Text Generation | $0.06 input / $0.20 output |
| Qwen 2.5 Coder 7B Instruct | Text Generation | $0.01 input / $0.03 output |
| Qwen 2.5 Coder 3B Instruct | Text Generation | $0.01 input / $0.03 output |
| Llama 4 Scout 17B 16E Instruct | Text Generation | $0.09 input / $0.29 output |
| Llama 3.3 70B Instruct | Text Generation | $0.40 |
| Llama 3.2 11B Vision Instruct | Image-Text-to-Text | $0.06 |
| Llama 3.1 8B Instruct | Text Generation | $0.06 |
| DeepSeek R1 Distill Llama 70B | Text Generation | $0.75 |
| DeepSeek R1 Distill Llama 8B | Text Generation | $0.05 |
| DeepSeek R1 Distill Qwen 32B | Text Generation | $0.30 |
| DeepSeek R1 Distill Qwen 14B | Text Generation | $0.14 |
| DeepSeek R1 Distill Qwen 7B (Math) | Text Generation | $0.40 |
| DeepSeek R1 Distill Qwen 1.5B (Math) | Text Generation | $0.18 |
| Stable Diffusion XL 1.0 | Text-to-Image | $0.003 per megapixel (@ 20 steps) |
| Flux.1 [schnell] | Text-to-Image | $0.0013 per megapixel (@ 4 steps) |
| Mixtral 8x22B Instruct v0.1 | Text Generation | $1.20 |
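As a rough illustration of the per-megapixel image pricing above, the cost of a single generation can be estimated from the output resolution. A minimal sketch - the model keys are illustrative labels, not official endpoint identifiers, and actual billing may round differently:

```python
# Estimate the cost of one image generation from the per-megapixel
# list prices above. Rates assume the default step counts shown in
# the table (20 steps for SDXL 1.0, 4 steps for Flux.1 [schnell]).
PRICE_PER_MEGAPIXEL = {
    "stable-diffusion-xl-1.0": 0.003,   # $/megapixel @ 20 steps
    "flux.1-schnell": 0.0013,           # $/megapixel @ 4 steps
}

def image_cost(model: str, width: int, height: int) -> float:
    """Return the estimated dollar cost for a single generated image."""
    megapixels = (width * height) / 1_000_000
    return PRICE_PER_MEGAPIXEL[model] * megapixels

# A 1024x1024 image is ~1.05 megapixels:
print(f"${image_cost('stable-diffusion-xl-1.0', 1024, 1024):.5f}")  # ~$0.00315
```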
[Image: grid of featured models - Llama 3.2 11B Instruct (Meta), Llama 3 70B Instruct (Meta), Mixtral 8x22B Instruct (Mistral AI), AMD LLAMA 135M (AMD), Stable Diffusion 3 Medium (Stability AI), and Flux.1 [schnell] (Black Forest Labs).]

Savings by design, not compromise

Our vertically integrated stack is optimised at every layer, from hardware to orchestration - driving down compute costs and delivering consistent performance. The result? Real savings that we pass directly to our customers, without sacrificing speed, scale, or security.
Get Started

Serverless without trade-offs

Serverless without compromise: your models remain yours, and your data is never reused or used for training. Get full tenant isolation, built-in compliance, and high-performance compute - all delivered instantly, without the complexity of managing infrastructure.
Learn More
[Image: close-up of an AMD Instinct GPU system, showing "AMD Instinct" branding on metallic heat sinks over circuit boards and processors.]
OUR HARDWARE

Access cutting-edge hardware

Run your workloads on the latest AI accelerators, with instant access to AMD MI300X, AMD MI250X, and NVIDIA H100 GPUs.

Get Started
AMD MI300X

Harness the power of AMD's MI300X GPUs for unparalleled compute performance and efficiency.

Contact Sales
AMD MI250X

Instant access to AMD MI250X GPUs to drive results for all your computational needs.

Contact Sales
NVIDIA H100

Experience the pinnacle of AI performance with NVIDIA H100 GPUs, available instantly.

Contact Sales

Zero rate limits, maximum reliability

No rate limits, no cold starts, and no waiting - just fast, reliable inference with automatic scaling built to handle any AI workload. We handle scaling, monitoring, and operations behind the scenes, so your team can focus on building.

Start Now
[Diagram: Nscale's vertically integrated stack - Serverless, Marketplace, Training, Inference, and GPU nodes, running in Nscale's datacentres powered by 100% renewable energy. Layers include an LLM library, pre-configured software and infrastructure, job management and scheduling, container orchestration, and optimised libraries, compilers, tools, and runtime.]

FAQs

What is Nscale Serverless Inference?

Nscale Serverless Inference is a fully managed platform that enables AI model inference without requiring complex infrastructure management. It provides instant access to leading Generative AI models with a simple pay-per-use pricing model.

Who is this service for?

This service is designed for developers, startups, enterprises, and research teams who want to deploy AI-powered applications quickly and cost-effectively without handling infrastructure complexities.

What AI models are available?

At launch, Nscale supports popular open-source models for text generation, image generation, and computer vision. We continuously expand our offerings based on user feedback.

How does the pricing work?

Nscale follows a pay-per-request model:
- Text models: Billed based on input and output tokens (see the worked example after this list).
- Image models: Pricing depends on output image resolution.
- Vision models: Charged based on processing requirements.
- New users receive free credits to explore the platform.
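
For token-billed models, a back-of-the-envelope request cost follows directly from the per-million-token rates in the pricing table. A minimal sketch, assuming the list prices above (metered usage and rounding may differ in practice):

```python
# Estimate the cost of a single text request at the list prices
# above (dollars per 1M tokens). Model keys are illustrative
# labels, not official endpoint identifiers.
RATES = {
    "llama-4-scout-17b-16e-instruct": {"input": 0.09, "output": 0.29},
    "qwen-2.5-coder-32b-instruct": {"input": 0.06, "output": 0.20},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost for one request."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# 2,000 prompt tokens + 500 completion tokens on Llama 4 Scout:
print(f"${request_cost('llama-4-scout-17b-16e-instruct', 2_000, 500):.6f}")
# -> $0.000325
```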

What are the key benefits of using Nscale Serverless Inference?

- No infrastructure hassles: We handle scaling, monitoring, and resource allocation.
- Cost-effective: Our vertically integrated stack minimises compute costs.
- Scalable & Reliable: Automatic scaling ensures optimal performance.
- Secure & Private: No request or response data is logged or used for training.
- OpenAI API & SDK compatibility: Easily integrate with existing tools (see the sketch below).
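
Because the platform is OpenAI API-compatible, an existing OpenAI client can be pointed at it by swapping the base URL. A minimal sketch using the official OpenAI Python SDK - the base URL and model identifier below are illustrative placeholders, not confirmed values:

```python
# Minimal chat completion using the OpenAI Python SDK, pointed at an
# OpenAI-compatible serverless endpoint. The base URL and model ID
# are assumed placeholders - substitute the real values from your
# account dashboard.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-nscale.com/v1",  # placeholder URL
    api_key="YOUR_NSCALE_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # illustrative model ID
    messages=[
        {"role": "user", "content": "Summarise serverless inference in one line."},
    ],
)
print(response.choices[0].message.content)
```

Any tool that lets you override the OpenAI base URL should work the same way.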

How does scaling work?

Nscale automatically adjusts capacity based on real-time demand. There’s no need for manual configuration, making it easy to scale applications seamlessly.

Access thousands of GPUs tailored to your requirements.