
Starting today, POST /v1/predictions no longer holds the connection open until the model finishes. It now returns immediately with a prediction in the pending state. The model runs in the background, and you check for the result by polling.
If your integration reads output from the initial POST response, it will be null now. Here is what changed and how to update.
How model execution works on WriftAI
WriftAI runs all kinds of models. Text in, image out. Image in, video out. Audio in, transcript out. The inputs and outputs look different for every model type, and so does how long the inference takes.
When a request comes in for a model that has not run recently, WriftAI allocates a new runtime. That runtime has to be downloaded and prepared, and the model weights loaded into memory, before inference can start.
Those runtimes are not small. A model runtime can be tens of gigabytes. Runtimes for large vision and language models can run to hundreds of gigabytes. Preparing a cold runtime can take several minutes before a single token of inference has happened. Once a runtime is warm, subsequent requests skip straight to inference. But that first cold request pays the full cost.
Given this, the previous sync model was never really right. Clients were blocking on an HTTP connection waiting for what they thought was an inference result, but in reality they were waiting for a runtime to be allocated, a model to be downloaded, weights to be loaded into memory, and then inference on top of that. Making this async was not a design preference. It is what the underlying execution model actually looks like. The API now reflects that.
What changed
POST /v1/predictions returns immediately:
```bash
curl https://api.wrift.ai/v1/predictions \
  -H "Authorization: Bearer $WRIFTAI_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "owner/model_name",
    "input": { "prompt": "Summarize quantum computing." }
  }'
```

```json
{
  "id": "a1b2c3d4-5678-9abc-def0-1234567890ab",
  "status": "pending",
  "output": null,
  "error": null,
  "created_at": "2024-06-10T09:00:00.000Z"
}
```
The full lifecycle:
- pending: queued, not yet assigned to hardware
- started: runtime provisioning, model loading into memory, inference running
- succeeded: done, output is populated
- failed: error is populated
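To make the lifecycle concrete, here is a minimal Python sketch of how a client might act on these four states. Only the status strings and the output and error fields come from the response shown above. The names `Status`, `PredictionFailed` and `read_result` are illustrative and not part of any WriftAI SDK.

```python
from enum import Enum
from typing import Any, Tuple


class Status(str, Enum):
    PENDING = "pending"
    STARTED = "started"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class PredictionFailed(Exception):
    """Hypothetical exception for a prediction that ended in the failed state."""


def read_result(prediction: dict) -> Tuple[bool, Any]:
    """Return (done, output) for a prediction object.

    done is False while the prediction is pending or started, in which
    case output is always None. A failed prediction raises instead.
    """
    status = Status(prediction["status"])  # ValueError on an unknown status
    if status is Status.SUCCEEDED:
        return True, prediction["output"]
    if status is Status.FAILED:
        raise PredictionFailed(prediction["error"])
    return False, None
```

Returning a (done, output) pair rather than the bare output keeps "still running" distinct from a prediction that finished with an empty output.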
Poll GET /v1/predictions/{id} to check progress:
```bash
curl https://api.wrift.ai/v1/predictions/a1b2c3d4-5678-9abc-def0-1234567890ab \
  -H "Authorization: Bearer $WRIFTAI_ACCESS_TOKEN"
```
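If your integration is in Python rather than shell, the same two calls might look roughly like this using only the standard library. This is a sketch, not an official client: the endpoint, headers and body mirror the curl commands above, while the helper names (`build_create_request`, `build_get_request`, `send`) and the 30 second socket timeout are our own assumptions.

```python
import json
import urllib.request

API_BASE = "https://api.wrift.ai/v1"


def build_create_request(model: str, model_input: dict, token: str) -> urllib.request.Request:
    """Build the POST /v1/predictions request without sending it."""
    body = json.dumps({"model": model, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predictions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def build_get_request(prediction_id: str, token: str) -> urllib.request.Request:
    """Build the GET /v1/predictions/{id} request without sending it."""
    return urllib.request.Request(
        f"{API_BASE}/predictions/{prediction_id}",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example usage (needs a real token and network access):
#   created = send(build_create_request(
#       "owner/model_name", {"prompt": "Summarize quantum computing."}, token))
#   prediction_id = created["id"]   # created["output"] is null at this point
#   latest = send(build_get_request(prediction_id, token))
```

The migration point is in the usage comment: keep the id from the creation response and read output only from a later GET.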
Polling
Poll at a reasonable interval. Every couple of seconds is fine for most models. Tighter than that does not speed anything up and will count against your rate limit.
Check status before reading output. Only read output when status is succeeded. If status is failed, read error. Put a ceiling on your polling loop. A prediction that stays in started for longer than you expect is worth treating as failed and retrying rather than waiting indefinitely.
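Put together, a polling loop along those lines might look like the Python sketch below. It is written against a `fetch` callable that you supply (for example, a function that issues GET /v1/predictions/{id} and returns the decoded JSON), so it makes no assumptions about your HTTP client. The function name, the `PredictionTimeout` exception and the 600 second default ceiling are illustrative choices, not values WriftAI prescribes. The 2 second interval follows the guidance above.

```python
import time
from typing import Callable


class PredictionTimeout(Exception):
    """Hypothetical exception: the ceiling was hit before a terminal status."""


def wait_for_prediction(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch() until status is succeeded or failed, or give up.

    Returns the final prediction dict. The caller still has to check
    status: read output on succeeded, error on failed.
    """
    deadline = clock() + max_wait
    while True:
        prediction = fetch()
        if prediction["status"] in ("succeeded", "failed"):
            return prediction
        # Stop if the next poll would land past the ceiling.
        if clock() + interval > deadline:
            raise PredictionTimeout(
                f"still {prediction['status']} after about {max_wait} seconds"
            )
        sleep(interval)
```

On `PredictionTimeout` you can treat the prediction as failed and create a new one, as described above. The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.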
What did not change
The request structure is unchanged: authentication, the model field, the input object, and the prediction ID format all work as before. The only breaking change is that output is null on the creation response.
Where this is going
Sync-like waiting
Async is the right default for the reasons above. But some integrations genuinely want to block on a result without managing a polling loop. We are going to add support for holding the connection open and returning the result directly when inference completes, with proper handling if the connection drops mid-wait. The intent is that you should not have to feel the async model if you do not want to.
Webhook delivery
Polling works, but it is not always an ideal production pattern. We are working on webhook support so that updates are pushed to a server you control as soon as they happen. That is coming later this year.
Cold starts
The runtime provisioning time is the biggest source of latency on cold requests and the most impactful thing we can improve. A few directions we are actively working on:
- Caching runtimes closer to the hardware that is likely to serve them again, reducing or eliminating the download on repeated use. For popular models with consistent traffic this already helps. The harder problem is infrequently used models where there is no good prediction of when or where the next request will come from.
- Starting runtime initialization before the download has fully completed. The runtime does not need to be fully present before it can start doing useful work. Certain layers can be prioritized and execution can begin earlier in the sequence.
- Caching the CUDA context between runs. Initializing the CUDA runtime and allocating GPU memory is not free, and that cost is paid on every cold start independent of how large the model is. We are experimenting with holding that context warm across requests. The target is that a cold runtime gets a pre-initialized CUDA environment rather than building one from scratch.
- Keeping runtimes permanently warm by setting a minimum replica count. At least one runtime stays allocated at all times, so the next request skips the cold start entirely. The cost runs continuously regardless of traffic, which can get expensive for low-volume models. We already do this for dedicated deployments. Extending it as a configurable option for private model owners is on the roadmap.
We are working on all four. If you run models regularly, you may start to notice the difference in the coming months.