How Vision works.
Real-time vision is a local-compute architecture. This page is the architectural reference — the pipeline, the integration surface, the local deployment topology, and the sovereignty model that follows.
Real-time vision requires local compute. Video is too large to send to a remote endpoint and too sensitive to leave the perimeter. Vision is built for the architectural reality that follows from those two facts. The pipeline runs on GPU-equipped nodes within your organization's perimeter, is configured by prompt in plain language, and emits structured events to the systems your operations team already runs.
A pipeline of specialized models plus a configurable reasoning layer.
The architecture is a pipeline of specialized models plus a configurable reasoning layer. Specialized models do what they are best at — fast, precise detection and tracking. The VLM does what only a VLM can do — zero-shot semantic reasoning, configured by prompt. The two layers work together so the pipeline stays both fast and flexible.
- Component 01: Workflows
No-code configuration in plain language — when a condition is met, where in the camera grid, analyze with this prompt. Triggers and prompts are the source of truth. When the rule changes, the prompt changes; the pipeline stays the same.
- Component 02: Real-time pipeline
Specialized detector, tracker, and cross-camera embeddings running at frame speed. Persistent identity within and across cameras — an entity is the same entity from the moment it appears to the moment it leaves the scene.
- Component 03: VLM reasoning, three levels
Vision-language models invoked only when the pipeline requires semantic reasoning: Level A on a single object crop (attribute classification), Level B on a full frame (spatial relations), Level C on a temporal sequence (what happened over time).
- Component 04: Plugins
Opt-in vertical capabilities for domain-specific tasks — license plate OCR, PPE detection, human pose, face identity, person re-identification. Activated per use case.
- Component 05: Event stream
Structured events flow through a pub/sub broker to your downstream systems — security operations dashboards, ticketing, observability stack, audit pipelines — with OpenTelemetry traces and full audit history.
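The Workflow component above can be sketched as a small data model: a trigger condition, a region, a prompt, and a VLM level. This is a minimal illustration, assuming hypothetical field names — not Alquimia's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a Workflow definition. The trigger and the prompt
# are the source of truth: changing the rule means changing these fields,
# not the pipeline. Field names are illustrative only.

@dataclass(frozen=True)
class Workflow:
    when: str       # trigger condition, e.g. "new entity detected (class: person)"
    where: str      # region of the camera grid the trigger applies to
    prompt: str     # plain-language question handed to the VLM
    vlm_level: str  # "A" (object crop), "B" (full frame), or "C" (temporal sequence)

role_check = Workflow(
    when="new entity detected (class: person)",
    where="main entry, anywhere in frame",
    prompt="Is this person a guard wearing an orange vest, or a visitor without a vest?",
    vlm_level="A",
)
```

Note that the prompt is data, not code: updating the protocol means editing one string.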
Every integration is a standard interface.
Vision integrates with the systems your operations team already runs. The pattern is the same across the board: every integration is a standard interface, not a vendor-specific binding. The day a camera is replaced, an event broker is migrated, or a VLM is swapped, the rest of the architecture keeps working.
- 01 Camera ingestion
Standard streaming protocols (RTSP, ONVIF) — any IP camera, any network video recorder.
- 02 Event bus
Pub/sub messaging — any compatible broker.
- 03 Observability
OpenTelemetry — any OTel-compatible observability stack.
- 04 Downstream systems
Webhook + structured events — security operations dashboards, ticketing, incident management, audit pipelines.
- 05 VLM providers
Standard inference APIs + custom adapters — any vision-language model, open weights you self-host or hosted endpoints.
- 06 Specialized models
Standard model serving primitives — any detector, tracker, embedding model.
- 07 Plugins
Plugin SDK — vertical capabilities (OCR, PPE, pose, face, re-identification).
- 08 Container orchestration
Kubernetes API — full distributions in the datacenter, lightweight distributions at the edge.
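The "standard interface" pattern above is easiest to see at the event bus: any consumer that can parse a structured JSON envelope can subscribe, regardless of broker. The sketch below shows what such an envelope might look like; the field names (`entity_id`, `workflow`, `evidence_frames`) are assumptions for illustration, not a published Alquimia schema.

```python
import json
from datetime import datetime, timezone

# Illustrative structured event as it might appear on the bus.
# Evidence frames are carried as references, not pixels — video
# itself never rides the event stream.

def make_event(entity_id: str, workflow: str, answer: str, frames: list[str]) -> str:
    event = {
        "entity_id": entity_id,
        "workflow": workflow,
        "answer": answer,
        "evidence_frames": frames,  # frame references for audit lookup
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

payload = make_event("person-0042", "visitor-review",
                     "reviewed by guard-07", ["cam2/f18231"])
```

Because the envelope is plain JSON over pub/sub or webhook, replacing the broker or adding a new downstream subscriber changes nothing upstream.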
A local-compute architecture by design.
Vision runs on GPU-equipped nodes within your organization's perimeter — in your datacenter or at the edge near the cameras. The real-time pipeline requires local compute. Three deployment topologies are in production use.
Datacenter mode
GPU-equipped nodes in your organization's datacenter. Suitable when cameras and operations both feed into a central facility.
Edge mode
GPU-equipped nodes near the cameras — factory floor, government building, security perimeter, retail location. Suitable when video must be processed close to the feed and the datacenter is too distant for frame-speed inference.
Hybrid mode
Edge nodes for real-time inference on the feed, datacenter nodes for cross-camera aggregation and consultation. Suitable for distributed organizations with feeds across multiple sites.
Four layers. Standard interfaces at every layer.
Every layer in the deployment stack uses standard interfaces, so the choice of model, runtime distribution, or observability stack remains the organization's — not Alquimia's.
- L01 Vision models
Any VLM and any specialized detection / tracking model. The pipeline does not require any specific model or vendor.
- L02 Runtime layer
Kubernetes-native — full distributions in the datacenter, lightweight distributions at the edge.
- L03 Compute proximity
Local to the cameras. Nodes with their own GPUs, in your datacenter or at the edge near the feed. The real-time pipeline requires local compute.
- L04 Data sovereignty
Video, inference, and events stay inside your perimeter by architectural design. The pipeline does not allow a data path through Alquimia or any third party.
Alquimia Agentic Platform deploys on any conformant Kubernetes — including major public clouds — because language-model inference can live wherever your organization runs Kubernetes. Vision is different.
The pipeline runs on GPU-equipped nodes within your perimeter, in your datacenter or at the edge near the cameras. This is a consequence of what real-time vision requires: video is too large to send to a remote endpoint and too sensitive to leave the perimeter.
Specific deployment choices — which Kubernetes distribution, which detection or vision-language model, which observability stack — are decisions each organization makes against its own constraints. Where we have customers using Vision in production, we have shared the patterns in our use cases. Architectural choices on the broader sovereign AI stack are covered in our Insights track.
Every Workflow invocation produces an inspectable record.
The pipeline captures the six artifacts that make any Workflow invocation reproducible end to end.
- 01 Entity identity
Who or what was detected, tracked persistently across the camera grid.
- 02 Workflow trigger
The condition that fired the analysis.
- 03 VLM prompt and level
The reasoning that was requested and at what depth.
- 04 VLM output
The answer the model produced.
- 05 Event publication
The structured event sent to downstream systems, with timestamp and source frames.
- 06 Evidence frames
The specific frames that informed the decision, retained for audit on demand.
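The six artifacts above can be read as one record per Workflow invocation. The sketch below makes that concrete as a single data structure; the class and field names are hypothetical, chosen to mirror the list, not taken from Alquimia's implementation.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical inspectable record: one instance per Workflow invocation,
# carrying the six artifacts (prompt and level form one artifact,
# split into two fields for clarity).

@dataclass
class InvocationRecord:
    entity_identity: str                 # 01: who or what was detected
    workflow_trigger: str                # 02: the condition that fired
    vlm_prompt: str                      # 03: the reasoning requested...
    vlm_level: str                       # ...and at what depth (A, B, or C)
    vlm_output: str                      # 04: the answer the model produced
    event_publication: str               # 05: the structured event sent downstream
    evidence_frames: list = field(default_factory=list)  # 06: frames kept for audit

record = InvocationRecord(
    entity_identity="person-0042",
    workflow_trigger="tracked visitor left the scene",
    vlm_prompt="Was this person reviewed by a guard?",
    vlm_level="C",
    vlm_output="Reviewed by guard-07 at 14:03:21",
    event_publication="event-9d1f",
    evidence_frames=["cam1/f17755", "cam2/f18231"],
)
```

Serializing such a record (e.g. via `asdict`) is what makes an invocation reproducible end to end: every input to the decision is in one place.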
Because Vision is a local-compute architecture, sovereignty is a property of the design. Video does not leave the perimeter. Inference happens on hardware your organization owns. Events flow to systems your organization operates. The audit committee's question — where is this video processed, and who has access to it? — has a single, unambiguous answer: here, on our own infrastructure.
The observability surface composes the platform-level traces (entity identity, Workflow trigger, VLM call, event publication) with the runtime-level metrics (GPU utilization, queue depths, inference latency). When a customer interaction or an incident requires investigation, the team can trace from the structured event, down through the Workflow that produced it, into the frames that informed it, onto the GPU node that ran the inference.
For reproducible behavioral metrics on vision models, the architecture integrates with Gaussia — the open evaluation suite crafted by Alquimia.
Two Workflows. A single frame, end to end.
// Hypothetical illustration
Take the security checkpoint use case — the canonical example shown on the Vision Home and expanded as a full deep page in our use case archive.
Guards in orange vests review visitors who arrive without vests. The protocol: every visitor must be reviewed by a guard before continuing through the checkpoint. The goal is a real-time event stream that confirms compliance for every visitor. Two Workflows configured in plain language are enough.
- WHEN: A new entity is detected (class: person)
- WHERE: Main entry, anywhere in frame
- ANALYZE: Use VLM Level A on a crop of the person. Prompt: "Is this person a guard wearing an orange vest, or a visitor without a vest?"
- RESULT: Each person is tagged with their role, once, the moment they appear. The role travels with the entity through the rest of its time in the scene.
- WHEN: A tracked entity leaves the scene (class: person, role: visitor, duration > 60 seconds)
- WHERE: Anywhere in the camera grid
- ANALYZE: Use VLM Level C on five crops sampled across the entity's time on scene. Prompt: "Was this person reviewed by a guard? At what moment, and by whom?"
- RESULT: A structured event is published with the visitor's identity, the answer, the timestamp, and a reference to the guard who reviewed them.
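Written out as plain configuration, the two Workflows are short. The keys below mirror the WHEN / WHERE / ANALYZE cards; the structure is an illustrative sketch, not Alquimia's actual config format.

```python
# The two checkpoint Workflows as hypothetical plain-data configuration.

WORKFLOWS = [
    {  # Workflow 1: classify role on first appearance
        "when": "new entity detected (class: person)",
        "where": "main entry, anywhere in frame",
        "analyze": {
            "vlm_level": "A",  # single object crop
            "prompt": "Is this person a guard wearing an orange vest, "
                      "or a visitor without a vest?",
        },
    },
    {  # Workflow 2: confirm review when the visitor leaves
        "when": "tracked entity leaves the scene "
                "(class: person, role: visitor, duration > 60 seconds)",
        "where": "anywhere in the camera grid",
        "analyze": {
            "vlm_level": "C",  # temporal sequence, five sampled crops
            "prompt": "Was this person reviewed by a guard? "
                      "At what moment, and by whom?",
        },
    },
]
```

When the protocol changes — say, vests become yellow — only the prompt strings change; the pipeline underneath is untouched.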
Frame ingest to event publication, end to end.
- 01 Frame arrives
A frame arrives at the local node, processed by the detector. A new entity of class person is detected.
- 02 Identity assigned
The tracker assigns a persistent identity, maintained across cameras through cross-camera embeddings.
- 03 Workflow 1 fires
On first appearance, the VLM at Level A classifies the role — guard or visitor — using a crop of the person. The role attaches to the tracked entity.
- 04 Temporal sequence accrues
Throughout the visitor's time in the scene, every frame contributes to the temporal sequence the pipeline retains for that entity.
- 05 Workflow 2 fires
When the visitor leaves the scene, the VLM at Level C samples five crops across the entity's time on scene and answers the protocol question — was this person reviewed by a guard, when, and by whom.
- 06 Event published
A structured event is published. It flows in parallel to the security operations dashboard, the audit pipeline, and the ticketing system. Each downstream system subscribes to the event stream and receives the same record.
- 07 Evidence retained
The evidence frames are retained for the period defined in the governance policy.
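The steps above can be sketched as a toy control loop, with the detector, tracker, and VLM replaced by trivial stand-ins — a hypothetical illustration of the flow, not the real pipeline. Only `published` represents what flows outward; everything else stays on the local node.

```python
# Stub walk-through of the end-to-end flow. Frames are dicts; the
# "detector", "tracker", and "VLM" are one-line stand-ins.

published = []

def detect(frame):                 # step 1: specialized detector (stub)
    return frame.get("person")

def classify_role(entity):         # step 3: VLM Level A on a crop (stub)
    return "guard" if entity.startswith("guard") else "visitor"

def review_check(history):         # step 5: VLM Level C on the sequence (stub)
    return any(classify_role(e) == "guard" for e in history)

tracked: dict[str, list] = {}      # steps 2 and 4: persistent identity + sequence
seen_last = None

frames = [{"person": "guard-07"}, {"person": "visitor-42"},
          {"person": "visitor-42"}, {}]  # empty frame: the visitor has left

for frame in frames:
    entity = detect(frame)
    if entity is not None:
        tracked.setdefault(entity, []).append(entity)   # sequence accrues
        seen_last = entity
    elif seen_last and classify_role(seen_last) == "visitor":
        # Workflow 2 fires on exit: was the visitor reviewed by a guard?
        reviewed = review_check([e for h in tracked.values() for e in h])
        published.append({"entity": seen_last, "reviewed": reviewed})  # step 6
        seen_last = None
```

In the real pipeline each stand-in is a model call on the GPU node, and `published` is the event stream crossing to the systems your organization owns.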
Every step happens on local compute. No frame leaves the perimeter. The event stream is the only thing that flows outward, and it flows to systems your organization owns.
Bring your camera grid. We'll walk you through it.
We work with organizations running real-time vision on their own infrastructure — security operations, government, manufacturing, public safety, and compliance teams. A short call is enough to see if Alquimia Vision is the right fit for your case.
Get in touch