The core of any request is the payload, which defines the model and the conversational roles to guide the AI.
The payload must include the model (specifying the LLM, e.g., qwen/qwen3-32b).
Every message in the request must include a distinct role (system, user, or assistant) and corresponding content.
The following section details the primary methods for issuing API requests.
Single queries are ideal for standalone questions or commands that do not require conversation history. The user role is employed to send a direct request to the model.
To preserve context across multi-step interactions, the assistant role is employed. This allows the model to reference its previous responses, enabling a coherent flow for long-running conversations.
The system role allows us to define the model’s behavior, tone, or persona. This is essential for tailoring responses to specific use cases (e.g., a friendly travel advisor, a strict code reviewer, or a creative storyteller).
We enable streaming to receive partial results in real-time as the model generates them. This approach is ideal for long responses, interactive applications, or when incremental progress display is required.
Asynchronous programming is used to execute multiple independent queries in parallel. This is ideal for applications requiring high throughput, such as batch processing, chatbots, or multi-user services.
Capability |
Core Technical Benefit |
|
Single queries |
Ideal for standalone Q&A using the user role. |
|
Maintaining context |
Preserves conversation history for multi-step interactions (assistant role). |
|
Model behavior customization |
Defines model persona/tone for specific use cases (system role). |
|
Real-time output (Streaming) |
Eliminates perceived latency by delivering partial results instantly. |
|
High-throughput execution (Async support) |
Executes multiple independent queries in parallel for high-scale applications. |