How Vision works.
Real-time vision is a local-compute architecture. This page is the architectural reference — the pipeline, the integration surface, the local deployment topology, and the sovereignty model that follows.
Real-time vision requires local compute. Video is too large to send to a remote endpoint and too sensitive to leave the perimeter. Vision is built for the architectural reality that follows from those two facts. The pipeline runs on GPU-equipped nodes within your organization's perimeter, is configured by prompt in plain language, and emits structured events to the systems your operations team already runs.
A pipeline of specialized models plus a configurable reasoning layer.
The architecture is a pipeline of specialized models plus a configurable reasoning layer. Specialized models do what they are best at — fast, precise detection and tracking. The VLM does what only a VLM can do — zero-shot semantic reasoning, configured by prompt. The two layers work together so the pipeline stays both fast and flexible.
- Component 01: Workflows
No-code configuration in plain language — when a condition is met, where in the camera grid, analyze with this prompt. Triggers and prompts are the source of truth. When the rule changes, the prompt changes; the pipeline stays the same.
- Component 02: Real-time pipeline
Specialized detector, tracker, and cross-camera embeddings running at frame speed. Persistent identity within and across cameras — an entity is the same entity from the moment it appears to the moment it leaves the scene.
- Component 03: VLM reasoning, three levels
Vision-language models invoked only when the pipeline requires semantic reasoning: Level A on a single object crop (attribute classification), Level B on a full frame (spatial relations), Level C on a temporal sequence (what happened over time).
- Component 04: Plugins
Opt-in vertical capabilities for domain-specific tasks — license plate OCR, PPE detection, human pose, face identity, person re-identification. Activated per use case.
- Component 05: Event stream
Structured events flow through a pub/sub broker to your downstream systems — security operations dashboards, ticketing, observability stack, audit pipelines — with OpenTelemetry traces and full audit history.
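The Workflow component above can be sketched as a small data model: a trigger condition, a region, a prompt, and a VLM level. This is a minimal illustration, assuming hypothetical field names — not Alquimia's actual configuration schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a Workflow definition. The trigger and the prompt
# are the source of truth: changing the rule means changing these fields,
# not the pipeline. Field names are illustrative only.

@dataclass(frozen=True)
class Workflow:
    when: str       # trigger condition, e.g. "new entity detected (class: person)"
    where: str      # region of the camera grid the trigger applies to
    prompt: str     # plain-language question handed to the VLM
    vlm_level: str  # "A" (object crop), "B" (full frame), or "C" (temporal sequence)

role_check = Workflow(
    when="new entity detected (class: person)",
    where="main entry, anywhere in frame",
    prompt="Is this person a guard wearing an orange vest, or a visitor without a vest?",
    vlm_level="A",
)
```

Note that the prompt is data, not code: updating the protocol means editing one string.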
Every integration is a standard interface.
Vision integrates with the systems your operations team already runs. The pattern is the same across the board: every integration is a standard interface, not a vendor-specific binding. The day a camera is replaced, an event broker is migrated, or a VLM is swapped, the rest of the architecture keeps working.
- 01 Camera ingestion
Standard streaming protocols (RTSP, ONVIF) — any IP camera, any network video recorder.
- 02 Event bus
Pub/sub messaging — any compatible broker.
- 03 Observability
OpenTelemetry — any OTel-compatible observability stack.
- 04 Downstream systems
Webhook + structured events — security operations dashboards, ticketing, incident management, audit pipelines.
- 05 VLM providers
Standard inference APIs + custom adapters — any vision-language model, open weights you self-host or hosted endpoints.
- 06 Specialized models
Standard model serving primitives — any detector, tracker, embedding model.
- 07 Plugins
Plugin SDK — vertical capabilities (OCR, PPE, pose, face, re-identification).
- 08 Container orchestration
Kubernetes API — full distributions in the datacenter, lightweight distributions at the edge.
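The "standard interface" pattern above is easiest to see at the event bus: any consumer that can parse a structured JSON envelope can subscribe, regardless of broker. The sketch below shows what such an envelope might look like; the field names (`entity_id`, `workflow`, `evidence_frames`) are assumptions for illustration, not a published Alquimia schema.

```python
import json
from datetime import datetime, timezone

# Illustrative structured event as it might appear on the bus.
# Evidence frames are carried as references, not pixels — video
# itself never rides the event stream.

def make_event(entity_id: str, workflow: str, answer: str, frames: list[str]) -> str:
    event = {
        "entity_id": entity_id,
        "workflow": workflow,
        "answer": answer,
        "evidence_frames": frames,  # frame references for audit lookup
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

payload = make_event("person-0042", "visitor-review",
                     "reviewed by guard-07", ["cam2/f18231"])
```

Because the envelope is plain JSON over pub/sub or webhook, replacing the broker or adding a new downstream subscriber changes nothing upstream.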
A local-compute architecture by design.
Vision runs on GPU-equipped nodes within your organization's perimeter — in your datacenter or at the edge near the cameras. The real-time pipeline requires local compute. Three deployment topologies are in production use.
Datacenter mode
GPU-equipped nodes in your organization's datacenter. Suitable when cameras and operations both feed into a central facility.
Edge mode
GPU-equipped nodes near the cameras — factory floor, government building, security perimeter, retail location. Suitable when video must be processed close to the feed and the datacenter is too distant for frame-speed inference.
Hybrid mode
Edge nodes for real-time inference on the feed, datacenter nodes for cross-camera aggregation and consultation. Suitable for distributed organizations with feeds across multiple sites.
Four layers. Standard interfaces at every layer.
Every layer in the deployment stack uses standard interfaces, so the choice of model, runtime distribution, or observability stack remains the organization's — not Alquimia's.
- L01 Vision models
Any VLM and any specialized detection / tracking model. The pipeline does not require any specific model or vendor.
- L02 Runtime layer
Kubernetes-native — full distributions in the datacenter, lightweight distributions at the edge.
- L03 Compute proximity
Local to the cameras. Nodes with their own GPUs, in your datacenter or at the edge near the feed. The real-time pipeline requires local compute.
- L04 Data sovereignty
Video, inference, and events stay inside your perimeter by architectural design. The pipeline does not allow a data path through Alquimia or any third party.
Alquimia Agentic Platform deploys on any conformant Kubernetes — including major public clouds — because language-model inference can live wherever your organization runs Kubernetes. Vision is different.
The pipeline runs on GPU-equipped nodes within your perimeter, in your datacenter or at the edge near the cameras. This is a consequence of what real-time vision requires: video is too large to send to a remote endpoint and too sensitive to leave the perimeter.
Specific deployment choices — which Kubernetes distribution, which detection or vision-language model, which observability stack — are decisions each organization makes against its own constraints. Where we have customers using Vision in production, we have shared the patterns in our use cases. Architectural choices on the broader sovereign AI stack are covered in our Insights track.
Every Workflow invocation produces an inspectable record.
The pipeline captures the six artifacts that make any Workflow invocation reproducible end to end.
- 01 Entity identity
Who or what was detected, tracked persistently across the camera grid.
- 02 Workflow trigger
The condition that fired the analysis.
- 03 VLM prompt and level
The reasoning that was requested and at what depth.
- 04 VLM output
The answer the model produced.
- 05 Event publication
The structured event sent to downstream systems, with timestamp and source frames.
- 06 Evidence frames
The specific frames that informed the decision, retained for audit on demand.
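The six artifacts above can be read as one record per Workflow invocation. The sketch below makes that concrete as a single data structure; the class and field names are hypothetical, chosen to mirror the list, not taken from Alquimia's implementation.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical inspectable record: one instance per Workflow invocation,
# carrying the six artifacts (prompt and level form one artifact,
# split into two fields for clarity).

@dataclass
class InvocationRecord:
    entity_identity: str                 # 01: who or what was detected
    workflow_trigger: str                # 02: the condition that fired
    vlm_prompt: str                      # 03: the reasoning requested...
    vlm_level: str                       # ...and at what depth (A, B, or C)
    vlm_output: str                      # 04: the answer the model produced
    event_publication: str               # 05: the structured event sent downstream
    evidence_frames: list = field(default_factory=list)  # 06: frames kept for audit

record = InvocationRecord(
    entity_identity="person-0042",
    workflow_trigger="tracked visitor left the scene",
    vlm_prompt="Was this person reviewed by a guard?",
    vlm_level="C",
    vlm_output="Reviewed by guard-07 at 14:03:21",
    event_publication="event-9d1f",
    evidence_frames=["cam1/f17755", "cam2/f18231"],
)
```

Serializing such a record (e.g. via `asdict`) is what makes an invocation reproducible end to end: every input to the decision is in one place.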
Because Vision is a local-compute architecture, sovereignty is a property of the design. Video does not leave the perimeter. Inference happens on hardware your organization owns. Events flow to systems your organization operates. The audit committee's question — where is this video processed, and who has access to it? — has a single, unambiguous answer: here, on our own infrastructure.
The observability surface composes the platform-level traces (entity identity, Workflow trigger, VLM call, event publication) with the runtime-level metrics (GPU utilization, queue depths, inference latency). When a customer interaction or an incident requires investigation, the team can trace from the structured event, down through the Workflow that produced it, into the frames that informed it, onto the GPU node that ran the inference.
For reproducible behavioral metrics on vision models, the architecture integrates with Gaussia — the open evaluation suite crafted by Alquimia.
Two Workflows. A single frame, end to end.
// Hypothetical illustration
Take the security checkpoint use case — the canonical example shown on the Vision Home and expanded as a full deep page in our use case archive.
Guards in orange vests review visitors who arrive without vests. The protocol: every visitor must be reviewed by a guard before continuing through the checkpoint. The goal is a real-time event stream that confirms compliance for every visitor. Two Workflows configured in plain language are enough.
- WHEN: A new entity is detected (class: person)
- WHERE: Main entry, anywhere in frame
- ANALYZE: Use VLM Level A on a crop of the person. Prompt: "Is this person a guard wearing an orange vest, or a visitor without a vest?"
- RESULT: Each person is tagged with their role, once, the moment they appear. The role travels with the entity through the rest of its time in the scene.
- WHEN: A tracked entity leaves the scene (class: person, role: visitor, duration > 60 seconds)
- WHERE: Anywhere in the camera grid
- ANALYZE: Use VLM Level C on five crops sampled across the entity's time on scene. Prompt: "Was this person reviewed by a guard? At what moment, and by whom?"
- RESULT: A structured event is published with the visitor's identity, the answer, the timestamp, and a reference to the guard who reviewed them.
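Written out as plain configuration, the two Workflows are short. The keys below mirror the WHEN / WHERE / ANALYZE cards; the structure is an illustrative sketch, not Alquimia's actual config format.

```python
# The two checkpoint Workflows as hypothetical plain-data configuration.

WORKFLOWS = [
    {  # Workflow 1: classify role on first appearance
        "when": "new entity detected (class: person)",
        "where": "main entry, anywhere in frame",
        "analyze": {
            "vlm_level": "A",  # single object crop
            "prompt": "Is this person a guard wearing an orange vest, "
                      "or a visitor without a vest?",
        },
    },
    {  # Workflow 2: confirm review when the visitor leaves
        "when": "tracked entity leaves the scene "
                "(class: person, role: visitor, duration > 60 seconds)",
        "where": "anywhere in the camera grid",
        "analyze": {
            "vlm_level": "C",  # temporal sequence, five sampled crops
            "prompt": "Was this person reviewed by a guard? "
                      "At what moment, and by whom?",
        },
    },
]
```

When the protocol changes — say, vests become yellow — only the prompt strings change; the pipeline underneath is untouched.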
Frame ingest to event publication, end to end.
- 01 Frame arrives
A frame arrives at the local node, processed by the detector. A new entity of class person is detected.
- 02 Identity assigned
The tracker assigns a persistent identity, maintained across cameras through cross-camera embeddings.
- 03 Workflow 1 fires
On first appearance, the VLM at Level A classifies the role — guard or visitor — using a crop of the person. The role attaches to the tracked entity.
- 04 Temporal sequence accrues
Throughout the visitor's time in the scene, every frame contributes to the temporal sequence the pipeline retains for that entity.
- 05 Workflow 2 fires
When the visitor leaves the scene, the VLM at Level C samples five crops across the entity's time on scene and answers the protocol question — was this person reviewed by a guard, when, and by whom.
- 06 Event published
A structured event is published. It flows in parallel to the security operations dashboard, the audit pipeline, and the ticketing system. Each downstream system subscribes to the event stream and receives the same record.
- 07 Evidence retained
The evidence frames are retained for the period defined in the governance policy.
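The steps above can be sketched as a toy control loop, with the detector, tracker, and VLM replaced by trivial stand-ins — a hypothetical illustration of the flow, not the real pipeline. Only `published` represents what flows outward; everything else stays on the local node.

```python
# Stub walk-through of the end-to-end flow. Frames are dicts; the
# "detector", "tracker", and "VLM" are one-line stand-ins.

published = []

def detect(frame):                 # step 1: specialized detector (stub)
    return frame.get("person")

def classify_role(entity):         # step 3: VLM Level A on a crop (stub)
    return "guard" if entity.startswith("guard") else "visitor"

def review_check(history):         # step 5: VLM Level C on the sequence (stub)
    return any(classify_role(e) == "guard" for e in history)

tracked: dict[str, list] = {}      # steps 2 and 4: persistent identity + sequence
seen_last = None

frames = [{"person": "guard-07"}, {"person": "visitor-42"},
          {"person": "visitor-42"}, {}]  # empty frame: the visitor has left

for frame in frames:
    entity = detect(frame)
    if entity is not None:
        tracked.setdefault(entity, []).append(entity)   # sequence accrues
        seen_last = entity
    elif seen_last and classify_role(seen_last) == "visitor":
        # Workflow 2 fires on exit: was the visitor reviewed by a guard?
        reviewed = review_check([e for h in tracked.values() for e in h])
        published.append({"entity": seen_last, "reviewed": reviewed})  # step 6
        seen_last = None
```

In the real pipeline each stand-in is a model call on the GPU node, and `published` is the event stream crossing to the systems your organization owns.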
Every step happens on local compute. No frame leaves the perimeter. The event stream is the only thing that flows outward, and it flows to systems your organization owns.
Bring your camera grid. We'll walk you through it.
We work with organizations running real-time vision on their own infrastructure — security operations, government, manufacturing, public safety, and compliance teams. A short call is enough to see if Alquimia Vision is the right fit for your case.
Get in touch