How WriftAI Works
Understand the execution lifecycle of a model on WriftAI
WriftAI takes your models and turns them into callable runtimes.
You send prediction requests; WriftAI decides where to run them, manages scaling, and returns outputs, logs, and metrics.
What Happens When You Call a Model?
Local Runs Work Differently
Local models run immediately; the cold/warm lifecycle and scale-to-zero behavior apply only to cloud runtimes.
WriftAI receives your request
You send inputs to a model endpoint using the web interface, HTTP, or an SDK. WriftAI validates the request and associates it with a specific model version.
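As a sketch of the HTTP path, a prediction request might look like the snippet below. The endpoint URL, auth header, payload shape, and response fields are placeholders for illustration, not WriftAI's documented API.

```python
# Hypothetical sketch: submitting a prediction over HTTP.
# The URL, header, payload shape, and response fields are assumptions.
import requests

API_TOKEN = "wrift_..."  # placeholder token

response = requests.post(
    "https://api.wriftai.example/v1/models/acme/text-demo/versions/VERSION_ID/predictions",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"input": {"prompt": "Write a haiku about autoscaling."}},
    timeout=30,
)
response.raise_for_status()
prediction = response.json()
print(prediction["id"], prediction["status"])  # assumed response fields
```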
A runtime is allocated or reused
WriftAI routes the request to a runtime for that version:
- If the runtime is cold, compute is provisioned and the model is loaded.
- If the runtime is warm, the request is handled immediately.
- Under higher load, multiple runtimes can be created in parallel.
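The practical difference between cold and warm shows up as latency on the client side. The sketch below times two back-to-back calls, assuming a blocking prediction endpoint; the URL, token, and payload shape are hypothetical.

```python
# Hypothetical sketch: observing the cold-start penalty by timing two
# back-to-back requests. Endpoint, token, and payload are assumptions.
import time
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def timed_predict(prompt: str) -> float:
    """Return wall-clock seconds for one blocking prediction call."""
    start = time.monotonic()
    r = requests.post(PREDICT_URL, headers=HEADERS,
                      json={"input": {"prompt": prompt}}, timeout=300)
    r.raise_for_status()
    return time.monotonic() - start

first = timed_predict("hello")   # likely slower if the runtime was cold (provision + load)
second = timed_predict("hello")  # likely fast: the runtime is now warm
print(f"first call: {first:.1f}s, second call: {second:.1f}s")
```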
The model runs and emits logs/metrics
The model processes the inputs. Outputs, logs, and metrics are sent back as they are produced.
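If predictions are exposed as pollable resources, incremental logs and output can be read back as they are produced. The sketch below assumes such a resource with `status`, `logs`, and `output` fields; these names are assumptions, not a documented contract.

```python
# Hypothetical sketch: tailing logs and output by re-fetching a prediction
# resource. The field names and status values are assumptions.
import time
import requests

HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def tail_prediction(prediction_url: str) -> None:
    """Print new log output as it appears, until the prediction finishes."""
    printed = 0
    while True:
        prediction = requests.get(prediction_url, headers=HEADERS, timeout=30).json()
        logs = prediction.get("logs") or ""
        print(logs[printed:], end="")  # only the part we haven't printed yet
        printed = len(logs)
        if prediction.get("status") in ("succeeded", "failed", "canceled"):  # assumed states
            print(prediction.get("output"))
            break
        time.sleep(1.0)
```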
Scaling adjusts
As traffic changes, WriftAI may keep runtimes warm, scale them out, or scale them down toward zero when idle.
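From the client's point of view, scale-out is invisible: a burst of concurrent requests is simply absorbed. The sketch below fires such a burst; the endpoint, token, and payload are hypothetical placeholders.

```python
# Hypothetical sketch: a burst of concurrent requests. Under load like this,
# the platform can create runtimes in parallel, then scale back toward zero
# once traffic subsides. Endpoint, token, and payload are assumptions.
from concurrent.futures import ThreadPoolExecutor
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

def predict(i: int) -> int:
    r = requests.post(PREDICT_URL, headers=HEADERS,
                      json={"input": {"prompt": f"request {i}"}}, timeout=300)
    return r.status_code

# Fire 50 requests from 16 client threads and count the successes.
with ThreadPoolExecutor(max_workers=16) as pool:
    statuses = list(pool.map(predict, range(50)))
print(statuses.count(200), "requests returned 200")
```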
Autoscaling
Scaling happens automatically; no manual provisioning or DevOps work is required.
Runtime States
WriftAI runtimes move through a small set of states based on traffic and scaling decisions:
| State | Description |
|---|---|
| Cold | No instance is running. The next prediction incurs startup time while compute is provisioned and the model is loaded. |
| Warm | At least one instance is active and ready. Predictions are served with minimal latency. |
| Scaled Out | Multiple instances are running to handle concurrent predictions or high throughput. |
| Idle | A runtime has no active requests and is a candidate to be scaled down to zero. |
This behavior is what lets WriftAI handle bursty workloads without you managing servers or GPU pools, or paying for GPUs when they aren’t being used.
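One practical consequence of these states is client-side timeout budgeting: the first request after a scale-to-zero period can take much longer than steady-state requests. The sketch below sizes timeouts accordingly; the endpoint, token, and specific budget numbers are assumptions for illustration.

```python
# Hypothetical sketch: sizing client timeouts around runtime states.
# A Cold runtime adds startup time to the first request, so that call
# gets a larger timeout than steady-state (warm) calls. All values assumed.
import requests

PREDICT_URL = "https://api.wriftai.example/v1/models/acme/text-demo/predictions"  # assumed
HEADERS = {"Authorization": "Bearer wrift_..."}  # placeholder token

COLD_BUDGET = 120  # seconds; assumed worst case for provisioning + model load
WARM_BUDGET = 15   # seconds; assumed steady-state ceiling once instances are warm

def predict(payload: dict, expect_cold: bool = False) -> dict:
    """One prediction call with a timeout sized to the expected runtime state."""
    timeout = COLD_BUDGET if expect_cold else WARM_BUDGET
    r = requests.post(PREDICT_URL, headers=HEADERS, json={"input": payload}, timeout=timeout)
    r.raise_for_status()
    return r.json()
```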