Deploying Qwen 3 on an OpenAI-compatible endpoint: a practical walkthrough
Most production AI codebases look roughly the same. There is an OpenAI client instantiated somewhere, a set of prompt templates tuned over weeks of iteration, and a billing page that nobody wants to look at too closely. The model behind that client has become load-bearing infrastructure, and changing it feels risky.
This guide covers how to run Qwen 3 behind an endpoint that speaks the same API format as OpenAI, so that your existing application code continues to work with minimal modification. We will go from model selection through to a production inference endpoint on H100 GPUs, with code you can copy directly into your project.
Why this is worth your time
Open-weight models now handle the majority of production NLP tasks (classification, extraction, summarisation, and conversational AI such as customer support chat) at quality levels that are difficult to distinguish from proprietary alternatives in blind evaluation. That does not extend universally to advanced or research-grade reasoning, complex tool planning, or multi-step code generation, where proprietary frontier models may still hold a measurable advantage.
The Qwen 3 family, released by Alibaba's Qwen team, is a strong example: the 8B instruction-tuned variant delivers solid performance across standard benchmarks while running comfortably on a single H100.
The practical obstacle to adopting these models has always been the integration cost. If your codebase is built around OpenAI's API format, switching to a different model traditionally means rewriting your client code, adjusting your prompt templates, and re-testing everything end to end. An OpenAI-compatible API wrapper removes most of that work.
Hyperfusion's inference endpoints use the OpenAI chat completions format natively. If your code calls client.chat.completions.create() through the OpenAI Python SDK, you can point it at a Qwen 3 endpoint by changing the base URL, the API key, and the model name. The request format, response structure, and streaming behaviour stay the same.
What you need
A Hyperfusion account (sign up at console.hyperfusion.io), Python 3.8 or later, and the openai Python package (pip install openai). That is the complete dependency list.
Step 1: choose your model
Hyperfusion has native Hugging Face integration, which means you can deploy any compatible open-weight model directly from the Hub. For this guide, we will use Qwen/Qwen3-8B-Instruct as a starting point because it offers a reasonable balance of quality and throughput for most production use cases. If you need stronger reasoning capability, the 72B variant is available on the same infrastructure.
From the Hyperfusion console, select your model and target hardware. The platform's sizing tool will recommend a GPU configuration based on the model's memory footprint and your expected throughput. For Qwen 3 8B, a single H100 handles roughly 800 to 1,200 tokens per second under batched load, depending on batch size and sequence length.
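As a rough sanity check on any sizing recommendation, the weight footprint can be estimated from the parameter count and the numeric precision. The sketch below is back-of-envelope arithmetic, not the platform's sizing method: it assumes 16-bit weights (2 bytes per parameter) and the 80 GB H100 variant, and it ignores KV cache, activations, and runtime overhead, all of which add to the real requirement. The function names are illustrative.

```python
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    # One billion parameters at one byte each is 1 GB, so billions * bytes = GB.
    return params_billions * bytes_per_param


def weights_fit_on_one_gpu(params_billions: float,
                           gpu_memory_gb: float = H100_MEMORY_GB) -> bool:
    """True if the 16-bit weights alone are smaller than one GPU's memory."""
    return weight_memory_gb(params_billions) < gpu_memory_gb


for size in (8, 72):
    print(f"{size}B: ~{weight_memory_gb(size):.0f} GB of weights, "
          f"fits on one GPU: {weights_fit_on_one_gpu(size)}")
```

Under these assumptions an 8B model needs about 16 GB for weights, which leaves most of an 80 GB card for KV cache and batching, while a 72B model needs about 144 GB and has to be split across several GPUs.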
Step 2: deploy the endpoint
Once you have selected the model and confirmed the pricing (which is fixed for the workload rather than per-token), the platform provisions your endpoint. This typically takes two to five minutes for cached models. You receive an endpoint URL and an API key.
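Because provisioning takes a few minutes, a deployment script may want to wait until the endpoint answers before running anything against it. The following is a minimal sketch of one way to do that, not a documented Hyperfusion feature: wait_until_ready and make_probe are hypothetical helpers, and the probe simply sends a one-token chat request through the OpenAI SDK and treats any exception as "not ready yet".

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # Assumption: any error (connection refused, 404, 503) means
            # the endpoint is still provisioning.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


def make_probe(client, model: str) -> Callable[[], bool]:
    """Build a probe from an openai.OpenAI client and a model name."""
    def probe() -> bool:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # keep the readiness check cheap
        )
        return True
    return probe
```

With the defaults (30 attempts, 10 seconds apart) the loop gives up after roughly five minutes, which matches the upper end of the provisioning time quoted above. Typical use would be wait_until_ready(make_probe(client, "Qwen/Qwen3-8B-Instruct")).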
Step 3: call it from your existing code
If you are currently using the OpenAI Python SDK, the migration looks like this:
# Before: calling OpenAI
from openai import OpenAI

client = OpenAI(api_key="sk-your-openai-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain transformers"}]
)

# After: calling Qwen 3 on Hyperfusion
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperfusion.io/v1",
    api_key="hf-your-hyperfusion-key"
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}]
)
Three things changed: base_url, api_key, and the model name. The rest of your application, including your prompt templates, streaming logic, and error handling, stays the same.
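Since only the endpoint settings differ between providers, one option is to move them into configuration so that switching does not touch application code at all. This is a sketch under assumptions: the LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL variable names and both helper functions are hypothetical, chosen for illustration only.

```python
import os
from typing import Mapping, Optional


def endpoint_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read endpoint settings from the environment (LLM_* names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        # None makes the SDK fall back to its default OpenAI base URL.
        "base_url": env.get("LLM_BASE_URL"),
        "api_key": env["LLM_API_KEY"],
        "model": env.get("LLM_MODEL", "gpt-4"),
    }


def make_client(settings: dict):
    """Create an OpenAI SDK client from the settings above."""
    # Imported lazily so the settings can be loaded and tested without the SDK.
    from openai import OpenAI
    return OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
```

Application code then passes settings["model"] to client.chat.completions.create(), and moving between OpenAI and the Qwen 3 endpoint becomes a change to three environment variables.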
Step 4: streaming responses
If you are already using OpenAI's streaming, it works the same way:
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
The server-sent events format is identical to OpenAI's implementation, so any frontend code consuming streamed chunks will work without modification.
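When the full reply is needed as a single string (for logging or post-processing, say), the streamed pieces can be accumulated. The helper below is a small sketch, with collect_stream as a hypothetical name. It also skips chunks whose choices list is empty, which some OpenAI-compatible servers emit (for example a trailing usage chunk); whether a given endpoint does so is an assumption to verify.

```python
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Join the text deltas of a chat completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # No choices in this chunk (e.g. a usage-only chunk): nothing to add.
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content is None or empty on role-only and final chunks
            parts.append(piece)
    return "".join(parts)
```

It takes the same stream object returned by client.chat.completions.create(..., stream=True), so text = collect_stream(stream) can replace the printing loop wherever the complete text is wanted.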
A note on prompts
Qwen 3 uses a chat template that maps cleanly to the OpenAI messages format. System messages, user messages, and assistant messages all behave as expected. If your prompts were written for GPT-4, they will generally produce good results with Qwen 3 as well, though you should run an evaluation pass to check for quality differences on your specific use case.
One thing worth being aware of: Qwen 3's instruction following is strong but not identical to GPT-4. For tasks that require very precise output formatting (strict JSON schemas, for example), you may need to adjust your system prompt. In practice, most teams find that their existing prompts work with minimal or no changes.
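For strict output formats, a validation step on the application side is a useful safety net whichever model is behind the endpoint. The sketch below is one possible approach rather than a recommended or official one: parse_json_reply is a hypothetical helper that tolerates a reply wrapped in a Markdown code fence, parses it, and checks that the expected keys are present. A caller can retry or fall back when it returns None.

```python
import json
from typing import Iterable, Optional

FENCE = "`" * 3  # a Markdown code fence, which models sometimes wrap JSON in


def parse_json_reply(text: str, required_keys: Iterable[str]) -> Optional[dict]:
    """Return the reply as a dict, or None if it is not valid for our purposes."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (possibly tagged "json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data
```

Running this check over a sample of real replies during the evaluation pass gives a concrete failure rate to weigh before deciding whether the system prompt needs adjusting.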
Benchmark numbers from our H100 cluster
We ran these benchmarks on Hyperfusion's H100 infrastructure in the UAE:
| Model | Tokens/sec (batch 32) | Tokens/sec (batch 1) | Time to first token |
|---|---|---|---|
| Qwen 3 8B Instruct | ~950 | ~380 | 42ms |
| Qwen 3 72B Instruct | ~210 | ~85 | 120ms (4x H100, tensor parallel) |
Latency from Europe to our UAE endpoints sits at roughly 80 to 90ms round trip, which is workable for interactive applications though not ideal for real-time voice.
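To see what these figures mean for a user in Europe, they can be combined into a simple estimate. This is a rough model with stated assumptions, not a measurement: it takes 85 ms as the midpoint of the quoted round trip, assumes the time-to-first-token figure is measured at the server and excludes network time, counts a single round trip, and ignores connection setup. The function names are illustrative.

```python
def first_token_ms(round_trip_ms: float = 85.0, ttft_ms: float = 42.0) -> float:
    """Time until the first streamed token reaches the caller."""
    return round_trip_ms + ttft_ms


def full_reply_seconds(output_tokens: int, round_trip_ms: float = 85.0,
                       ttft_ms: float = 42.0,
                       tokens_per_sec: float = 380.0) -> float:
    """Time until the last token arrives for a single (batch 1) request."""
    generation_s = output_tokens / tokens_per_sec
    return first_token_ms(round_trip_ms, ttft_ms) / 1000.0 + generation_s


print(f"first token: {first_token_ms():.0f} ms")          # 127 ms
print(f"200-token reply: {full_reply_seconds(200):.2f} s")  # about 0.65 s
```

With the Qwen 3 8B numbers from the table, the network round trip is about two thirds of the wait for the first token but a small share of the total time for a 200-token reply, which fits the observation that chat is comfortable while real-time voice is not.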
Cost comparison
With OpenAI, you pay per token and costs scale linearly with usage. A production workload processing 10 million tokens per day on GPT-4 can cost $300 to $600 a day. The same workload on Qwen 3 8B through Hyperfusion, with outcome-based pricing, has a fixed monthly cost that you know before you commit.
The exact numbers depend on your workload profile, but the pricing calculator at hyperfusion.io/pricing lets you model your specific case. No account required.
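The comparison comes down to simple arithmetic, sketched below. The $30 and $60 per million tokens are the rates implied by the $300 to $600 figure above for 10 million tokens a day. The fixed monthly fee is a made-up placeholder for illustration, not a Hyperfusion quote; substitute the number the pricing calculator gives you.

```python
def per_token_daily_cost(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Daily spend under per-token pricing."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_day(fixed_monthly_usd: float, usd_per_million_tokens: float,
                              days_per_month: int = 30) -> float:
    """Daily volume above which a fixed monthly fee is cheaper than per-token pricing."""
    return fixed_monthly_usd / days_per_month / usd_per_million_tokens * 1_000_000


EXAMPLE_FIXED_MONTHLY_USD = 5_000  # placeholder only, not a real price

for rate in (30, 60):
    daily = per_token_daily_cost(10_000_000, rate)
    threshold = break_even_tokens_per_day(EXAMPLE_FIXED_MONTHLY_USD, rate)
    print(f"${rate}/M tokens: ${daily:,.0f} per day, "
          f"break-even at {threshold:,.0f} tokens per day")
```

Because per-token cost grows linearly with volume while the fixed fee does not, the break-even threshold is the only number that matters: above it the fixed price wins, below it per-token pricing does.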
When to stay on OpenAI
Being straightforward about this: if your application depends heavily on GPT-4's reasoning at the top end (complex multi-step logic, advanced code generation, highly nuanced creative writing), then switching to Qwen 3 8B will involve a quality trade-off. The 72B model closes much of that gap, but it is worth testing on your actual prompts before committing.
For classification, extraction, summarisation, conversational AI, and most production NLP tasks, Qwen 3 performs well enough that the cost and control advantages make the switch worthwhile.
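For tasks with short, checkable outputs such as classification or extraction, testing on your actual prompts can be as simple as measuring how often the candidate model agrees with the one you use today. The sketch below assumes that setup; agreement_rate and make_ask are hypothetical helpers, and exact-match comparison after normalising case and whitespace is a deliberately crude metric that will not suit free-form text.

```python
from typing import Callable, Sequence


def normalised_match(a: str, b: str) -> bool:
    """Compare two answers ignoring case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


def agreement_rate(prompts: Sequence[str],
                   ask_reference: Callable[[str], str],
                   ask_candidate: Callable[[str], str],
                   same: Callable[[str, str], bool] = normalised_match) -> float:
    """Fraction of prompts on which both models give the same answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = sum(1 for p in prompts if same(ask_reference(p), ask_candidate(p)))
    return matches / len(prompts)


def make_ask(client, model: str) -> Callable[[str], str]:
    """Wrap an openai.OpenAI client and model name as a prompt -> answer function."""
    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content or ""
    return ask
```

A call such as agreement_rate(prompts, make_ask(openai_client, "gpt-4"), make_ask(qwen_client, "Qwen/Qwen3-8B-Instruct")) gives a single number to track; the disagreements themselves are usually more informative and worth reading by hand.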
Getting started
Sign up at console.hyperfusion.io, deploy a Qwen 3 endpoint, and run your test suite against it. The whole process takes about fifteen minutes. If you have questions about model selection, sizing, or performance tuning, the engineering team is reachable through the console's support channel and typically responds within a few hours.
