How WriftAI Works
Understand the execution lifecycle of a model on WriftAI
WriftAI takes your models and turns them into callable runtimes.
You send prediction requests; WriftAI decides where to run them, manages scaling, and returns outputs, logs, and metrics.
What Happens When You Call a Model?
Local Runs Work Differently
Local models run immediately; the cold/warm lifecycle and scale-to-zero behavior apply only to cloud runtimes.
WriftAI receives your request
You send inputs to a model endpoint using the web interface, HTTP, or an SDK. WriftAI validates the request and associates it with a specific model version.
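As a sketch of the HTTP path, a prediction request might look like the snippet below. The endpoint URL, auth header, payload shape, and response fields are placeholders for illustration, not WriftAI's documented API.

```python
# Hypothetical sketch: submitting a prediction over HTTP.
# The URL, header, payload shape, and response fields are assumptions.
import requests

API_TOKEN = "wrift_..."  # placeholder token

response = requests.post(
    "https://api.wriftai.example/v1/models/acme/text-demo/versions/VERSION_ID/predictions",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"input": {"prompt": "Write a haiku about autoscaling."}},
    timeout=30,
)
response.raise_for_status()
prediction = response.json()
print(prediction["id"], prediction["status"])  # assumed response fields
```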
A runtime is allocated or reused
WriftAI routes the request to a runtime for that version:
- If the runtime is cold, compute is provisioned and the model is loaded.
- If the runtime is warm, the request is handled immediately.
- Under higher load, multiple runtimes can be created in parallel.
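The practical difference between cold and warm shows up as latency on the client side. The sketch below times two back-to-back calls, assuming a blocking prediction endpoint; the URL, token, and payload shape are hypothetical.

```python
# Hypothetical sketch: observing the cold-start penalty by timing two
# back-to-back requests. Endpoint, token, and payload are assumptions.
import time
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def timed_predict(prompt: str) -> float:
    """Return wall-clock seconds for one blocking prediction call."""
    start = time.monotonic()
    r = requests.post(PREDICT_URL, headers=HEADERS,
                      json={"input": {"prompt": prompt}}, timeout=300)
    r.raise_for_status()
    return time.monotonic() - start

first = timed_predict("hello")   # likely slower if the runtime was cold (provision + load)
second = timed_predict("hello")  # likely fast: the runtime is now warm
print(f"first call: {first:.1f}s, second call: {second:.1f}s")
```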
The model runs and emits logs/metrics
The model processes the inputs. Outputs, logs, and metrics are sent back as they are produced.
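If predictions are exposed as pollable resources, incremental logs and output can be read back as they are produced. The sketch below assumes such a resource with `status`, `logs`, and `output` fields; these names are assumptions, not a documented contract.

```python
# Hypothetical sketch: tailing logs and output by re-fetching a prediction
# resource. The field names and status values are assumptions.
import time
import requests

HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def tail_prediction(prediction_url: str) -> None:
    """Print new log output as it appears, until the prediction finishes."""
    printed = 0
    while True:
        prediction = requests.get(prediction_url, headers=HEADERS, timeout=30).json()
        logs = prediction.get("logs") or ""
        print(logs[printed:], end="")  # only the part we haven't printed yet
        printed = len(logs)
        if prediction.get("status") in ("succeeded", "failed", "canceled"):  # assumed states
            print(prediction.get("output"))
            break
        time.sleep(1.0)
```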
Scaling adjusts
As traffic changes, WriftAI may keep runtimes warm, scale them out, or scale them down toward zero when idle.
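From the client's point of view, scale-out is invisible: a burst of concurrent requests is simply absorbed. The sketch below fires such a burst; the endpoint, token, and payload are hypothetical placeholders.

```python
# Hypothetical sketch: a burst of concurrent requests. Under load like this,
# the platform can create runtimes in parallel, then scale back toward zero
# once traffic subsides. Endpoint, token, and payload are assumptions.
from concurrent.futures import ThreadPoolExecutor
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def predict(i: int) -> int:
    r = requests.post(PREDICT_URL, headers=HEADERS,
                      json={"input": {"prompt": f"request {i}"}}, timeout=300)
    return r.status_code

# Fire 50 requests from 16 client threads and count the successes.
with ThreadPoolExecutor(max_workers=16) as pool:
    statuses = list(pool.map(predict, range(50)))
print(statuses.count(200), "requests returned 200")
```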
Autoscaling
Scaling happens automatically; no manual provisioning or DevOps work is required.
Runtime States
WriftAI runtimes move through a small set of states based on traffic and scaling decisions:
| State | Description |
|---|---|
| Cold | No instance is running. The next prediction incurs startup time while compute is provisioned and the model is loaded. |
| Warm | At least one instance is active and ready. Predictions are served with minimal latency. |
| Scaled Out | Multiple instances are running to handle concurrent predictions or high throughput. |
| Idle | A runtime has no active requests and is a candidate to be scaled down to zero. |
This behavior is what lets WriftAI handle bursty workloads without you managing servers or GPU pools, or paying for GPUs when they aren’t being used.
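One practical consequence of these states is client-side timeout budgeting: the first request after a scale-to-zero period can take much longer than steady-state requests. The sketch below sizes timeouts accordingly; the endpoint, token, and specific budget numbers are assumptions for illustration.

```python
# Hypothetical sketch: sizing client timeouts around runtime states.
# A Cold runtime adds startup time to the first request, so that call
# gets a larger timeout than steady-state (warm) calls. All values assumed.
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

COLD_BUDGET = 120  # seconds; assumed worst case for provisioning + model load
WARM_BUDGET = 15   # seconds; assumed steady-state ceiling once instances are warm

def predict(payload: dict, expect_cold: bool = False) -> dict:
    """One prediction call with a timeout sized to the expected runtime state."""
    timeout = COLD_BUDGET if expect_cold else WARM_BUDGET
    r = requests.post(PREDICT_URL, headers=HEADERS, json={"input": payload}, timeout=timeout)
    r.raise_for_status()
    return r.json()
```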