002 · How it works

How Vision works.

Real-time vision demands a local-compute architecture. This page is the architectural reference — the pipeline, the integration surface, the local deployment topology, and the sovereignty model that follows from it.

Real-time vision requires local compute. Video is too large to send to a remote endpoint and too sensitive to leave the perimeter. Vision is built for the architectural reality that follows from those two facts. The pipeline runs on GPU-equipped nodes within your organization's perimeter, is configured by prompt in plain language, and publishes structured events to the systems your operations team already runs.

003 · The architecture

A pipeline of specialized models plus a configurable reasoning layer.

The architecture is a pipeline of specialized models plus a configurable reasoning layer. Specialized models do what they are best at — fast, precise detection and tracking. The VLM does what only a VLM can do — zero-shot semantic reasoning, configured by prompt. The two layers work together so the pipeline stays both fast and flexible.

Architecture · pipeline view
Frame → Event
01
Detector
Per-frame entity detection
02
Tracker
Persistent identity within a camera
03
Cross-camera embeddings
Identity across the camera grid
04
VLM reasoning
Invoked on demand
Level A — single crop
Level B — full frame
Level C — temporal sequence
05
Event stream
Structured events to downstream systems, with audit trail.
Pipeline · Reasoning · Output
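
A minimal sketch of the frame-to-event flow in the pipeline view above, written as Python-style pseudocode. Every class, function, and parameter name here (`Detector`, `Tracker`, `embedder`, `publish_event`, and so on) is a hypothetical stand-in, not Vision's actual API; the point is only the control flow — specialized models run on every frame, and the VLM is invoked on demand.

```python
# Hypothetical sketch of the frame-to-event flow; all names are illustrative only.

def process_frame(frame, camera_id, detector, tracker, embedder, workflows, publish_event):
    # 01 Detector: per-frame entity detection (fast, specialized model)
    detections = detector.detect(frame)

    # 02 Tracker: persistent identity within this camera
    entities = tracker.update(camera_id, detections)

    # 03 Cross-camera embeddings: identity across the camera grid
    for entity in entities:
        entity.global_id = embedder.resolve(entity, camera_id)

    # 04 VLM reasoning: invoked only when a Workflow trigger fires
    for workflow in workflows:
        for entity in entities:
            if workflow.should_fire(entity, camera_id):
                answer = workflow.analyze(frame, entity)  # Level A, B, or C

                # 05 Event stream: structured event to downstream systems
                publish_event(workflow.build_event(entity, answer, frame))
```
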
003.1 · The five components
  • Component 01

    Workflows

    No-code configuration in plain language — when a condition is met, where in the camera grid, analyze with this prompt. Triggers and prompts are the source of truth. When the rule changes, the prompt changes; the pipeline stays the same.

  • Component 02

    Real-time pipeline

    Specialized detector, tracker, and cross-camera embeddings running at frame speed. Persistent identity within and across cameras — an entity is the same entity from the moment it appears to the moment it leaves the scene.

  • Component 03

    VLM reasoning, three levels

    Vision-language models invoked only when the pipeline requires semantic reasoning: Level A on a single object crop (attribute classification), Level B on a full frame (spatial relations), Level C on a temporal sequence (what happened over time).

  • Component 04

    Plugins

    Opt-in vertical capabilities for domain-specific tasks — license plate OCR, PPE detection, human pose, face identity, person re-identification. Activated per use case.

  • Component 05

    Event stream

    Structured events flow through a pub/sub broker to your downstream systems — security operations dashboards, ticketing, observability stack, audit pipelines — with OpenTelemetry traces and full audit history.
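
To make component 05 concrete, here is a minimal sketch of publishing one structured event to a pub/sub broker. Kafka and the `kafka-python` client are shown only because they are a common compatible choice; the broker address, topic name, and every field in the payload are assumptions for illustration, not Vision's actual schema.

```python
import json

from kafka import KafkaProducer  # any compatible broker works; Kafka is an assumption here

producer = KafkaProducer(
    bootstrap_servers="broker.ops.internal:9092",             # stays inside your perimeter
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Minimal, hypothetical event body; the real schema is defined by the product, not here.
producer.send("vision.events", {
    "entity_id": "person-7f3a",          # persistent identity from the tracker
    "workflow": "ppe-check",             # which Workflow produced the event
    "vlm_answer": "no helmet detected",  # the VLM's answer to the configured prompt
    "evidence_frames": 3,                # frames retained for the audit trail
})
producer.flush()
```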

004 · Integration surface

Every integration is a standard interface.

Vision integrates with the systems your operations team already runs. The pattern is the same across the board: every integration is a standard interface, not a vendor-specific binding. The day a camera is replaced, an event broker is migrated, or a VLM is swapped, the rest of the architecture keeps working.

Surface · What it integrates with
  • 01 · Camera ingestion

    Standard streaming protocols (RTSP, ONVIF) — any IP camera, any network video recorder.

  • 02 · Event bus

    Pub/sub messaging — any compatible broker.

  • 03 · Observability

    OpenTelemetry — any OTel-compatible observability stack.

  • 04 · Downstream systems

    Webhook + structured events — security operations dashboards, ticketing, incident management, audit pipelines.

  • 05 · VLM providers

    Standard inference APIs + custom adapters — any vision-language model, open weights you self-host or hosted endpoints. A minimal adapter sketch follows this list.

  • 06 · Specialized models

    Standard model serving primitives — any detector, tracker, embedding model.

  • 07 · Plugins

    Plugin SDK — vertical capabilities (OCR, PPE, pose, face, re-identification).

  • 08 · Container orchestration

    Kubernetes API — full distributions in the datacenter, lightweight distributions at the edge.
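
As one illustration of surface 05, the sketch below shows what a custom VLM adapter could look like against a self-hosted model that exposes an OpenAI-compatible chat endpoint. The `VLMAdapter` protocol, the method name, and the endpoint choice are assumptions made for illustration; they are not Vision's actual adapter contract.

```python
import base64
from typing import Protocol

import requests


class VLMAdapter(Protocol):
    """Hypothetical adapter contract: any vision-language model behind one method."""
    def analyze(self, prompt: str, images: list[bytes]) -> str: ...


class OpenAICompatibleAdapter:
    """Adapter for a self-hosted model behind an OpenAI-compatible endpoint (an assumption)."""

    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    def analyze(self, prompt: str, images: list[bytes]) -> str:
        # Send the prompt plus base64-encoded frames as one chat completion request.
        content = [{"type": "text", "text": prompt}] + [
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + base64.b64encode(img).decode()}}
            for img in images
        ]
        resp = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": self.model,
                  "messages": [{"role": "user", "content": content}]},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```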

005 · Where it runs

A local-compute architecture by design.

Vision runs on GPU-equipped nodes within your organization's perimeter — in your datacenter or at the edge near the cameras. The real-time pipeline requires local compute. Three deployment topologies are in production use.

Topology T1

Datacenter mode

Full Kubernetes distribution

GPU-equipped nodes in your organization's datacenter. Suitable when cameras and operations both feed into a central facility.

Topology T2

Edge mode

Lightweight Kubernetes distribution

GPU-equipped nodes near the cameras — factory floor, government building, security perimeter, retail location. Suitable when video must be processed close to the feed and the datacenter is too distant for frame-speed inference.

Topology T3

Hybrid mode

Edge + core, events flow through the stream

Edge nodes for real-time inference on the feed, datacenter nodes for cross-camera aggregation and querying. Suitable for distributed organizations with feeds across multiple sites.

005.1 · Agnostic deployment

Four layers. Standard interfaces at every layer.

Every layer in the deployment stack uses standard interfaces, so the choice of model, runtime distribution, or observability stack remains the organization's — not Alquimia's.

Deployment · agnostic stack
L01 → L04
  • L01
    Vision models

    Any VLM and any specialized detection / tracking model. The pipeline does not require any specific model or vendor.

  • L02
    Runtime layer

    Kubernetes-native — full distributions in the datacenter, lightweight distributions at the edge.

  • L03
    Compute proximity

    Local to the cameras. Nodes with their own GPUs, in your datacenter or at the edge near the feed. The real-time pipeline requires local compute.

  • L04
    Data sovereignty

    Video, inference, and events stay inside your perimeter by architectural design. The pipeline does not allow a data path through Alquimia or any third party.

Models · Runtime · Proximity · Sovereignty
005.2 · The contrast with Agentic Platform

Alquimia Agentic Platform deploys on any conformant Kubernetes — including major public clouds — because language-model inference can live wherever your organization runs Kubernetes. Vision is different.

The pipeline runs on GPU-equipped nodes within your perimeter, in your datacenter or at the edge near the cameras. This is a consequence of what real-time vision requires: video is too large to send to a remote endpoint and too sensitive to leave the perimeter.

Specific deployment choices — which Kubernetes distribution, which detection or vision-language model, which observability stack — are decisions each organization makes against its own constraints. Where we have customers using Vision in production, we have shared the patterns in our use cases. Architectural choices on the broader sovereign AI stack are covered in our Insights track.

006 · Governance & observability

Every Workflow invocation produces an inspectable record.

The pipeline captures the six artifacts that make any Workflow invocation reproducible end to end.

  • 01
    Entity identity

    Who or what was detected, tracked persistently across the camera grid.

  • 02
    Workflow trigger

    The condition that fired the analysis.

  • 03
    VLM prompt and level

    The reasoning that was requested and at what depth.

  • 04
    VLM output

    The answer the model produced.

  • 05
    Event publication

    The structured event sent to downstream systems, with timestamp and source frames.

  • 06
    Evidence frames

    The specific frames that informed the decision, retained for audit on demand.

Because Vision is a local-compute architecture, sovereignty is a property of the design. Video does not leave the perimeter. Inference happens on hardware your organization owns. Events flow to systems your organization operates. The audit committee's question — where is this video processed, and who has access to it? — has a single, unambiguous answer: here, on our own infrastructure.

The observability surface composes the platform-level traces (entity identity, Workflow trigger, VLM call, event publication) with the runtime-level metrics (GPU utilization, queue depths, inference latency). When a customer interaction or an incident requires investigation, the team can trace from the structured event, down through the Workflow that produced it, into the frames that informed it, onto the GPU node that ran the inference.
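
A hedged sketch of what that composed trace could look like with the OpenTelemetry Python SDK. Span names, attribute keys, and values are illustrative assumptions, not the actual trace schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("vision.example")  # illustrative tracer name

# Hypothetical span hierarchy: one Workflow invocation, traced end to end.
with tracer.start_as_current_span("workflow.invocation") as workflow_span:
    workflow_span.set_attribute("entity.id", "person-7f3a")               # entity identity
    workflow_span.set_attribute("workflow.trigger", "entity_left_scene")  # what fired

    with tracer.start_as_current_span("vlm.call") as vlm_span:
        vlm_span.set_attribute("vlm.level", "C")                           # reasoning depth
        vlm_span.set_attribute("vlm.prompt", "Was this person reviewed by a guard?")
        vlm_span.set_attribute("gpu.node", "edge-node-03")                 # runtime-level context

    with tracer.start_as_current_span("event.publish") as publish_span:
        publish_span.set_attribute("event.topic", "vision.events")
        publish_span.set_attribute("evidence.frame_count", 5)
```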

For reproducible behavioral metrics on vision models, the architecture integrates with Gaussia — the open evaluation suite crafted by Alquimia.

007 · A walk-through

Two Workflows. A single frame, end to end.

// Hypothetical illustration

Take the security checkpoint use case — the canonical example shown on the Vision Home and expanded as a full deep page in our use case archive.

Guards in orange vests review visitors who arrive without vests. The protocol: every visitor must be reviewed by a guard before continuing through the checkpoint. The goal is a real-time event stream that confirms compliance for every visitor. Two Workflows configured in plain language are enough.

Workflow 01
Classify role on entry
WHEN
A new entity is detected (class: person)
WHERE
Main entry, anywhere in frame
ANALYZE
Use VLM Level A on a crop of the person. Prompt: "Is this person a guard wearing an orange vest, or a visitor without a vest?"
RESULT
Each person is tagged with their role, once, the moment they appear. The role travels with the entity through the rest of its time in the scene.
Workflow 02
Verify review on exit
WHEN
A tracked entity leaves the scene (class: person, role: visitor, duration > 60 seconds)
WHERE
Anywhere in the camera grid
ANALYZE
Use VLM Level C on five crops sampled across the entity's time on scene. Prompt: "Was this person reviewed by a guard? At what moment, and by whom?"
RESULT
A structured event is published with the visitor's identity, the answer, the timestamp, and a reference to the guard who reviewed them.
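
Stated as data, those two Workflows might look like the structure below. This is a hypothetical representation: in the product the configuration is written in plain language, and every field name here is an assumption made for illustration.

```python
# Hypothetical structured form of the two plain-language Workflows above.
WORKFLOWS = [
    {
        "name": "classify-role-on-entry",
        "when": {"event": "entity_detected", "class": "person"},
        "where": {"camera": "main-entry", "zone": "anywhere"},
        "analyze": {
            "vlm_level": "A",  # single object crop
            "input": "crop of the person",
            "prompt": "Is this person a guard wearing an orange vest, "
                      "or a visitor without a vest?",
        },
        "result": "tag the entity with its role; the role persists for the entity's lifetime",
    },
    {
        "name": "verify-review-on-exit",
        "when": {"event": "entity_left_scene", "class": "person",
                 "role": "visitor", "min_duration_s": 60},
        "where": {"camera": "any"},
        "analyze": {
            "vlm_level": "C",  # temporal sequence
            "input": "five crops sampled across the entity's time on scene",
            "prompt": "Was this person reviewed by a guard? "
                      "At what moment, and by whom?",
        },
        "result": "publish event with identity, answer, timestamp, and reviewing guard",
    },
]
```
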
007.1 · The trace of a single visitor

Frame ingest to event publication, end to end.

  1. 01
    Frame arrives

    A frame arrives at the local node and is processed by the detector. A new entity of class person is detected.

  2. 02
    Identity assigned

    The tracker assigns a persistent identity, maintained across cameras through cross-camera embeddings.

  3. 03
    Workflow 1 fires

    On first appearance, the VLM at Level A classifies the role — guard or visitor — using a crop of the person. The role attaches to the tracked entity.

  4. 04
    Temporal sequence accrues

    Throughout the visitor's time in the scene, every frame contributes to the temporal sequence the pipeline retains for that entity.

  5. 05
    Workflow 2 fires

    When the visitor leaves the scene, the VLM at Level C samples five crops across the entity's time on scene and answers the protocol question — was this person reviewed by a guard, when, and by whom.

  6. 06
    Event published

    A structured event is published. It flows in parallel to the security operations dashboard, the audit pipeline, and the ticketing system. Each downstream system subscribes to the event stream and receives the same record.

  7. 07
    Evidence retained

    The evidence frames are retained for the period defined in the governance policy.

Every step happens on local compute. No frame leaves the perimeter. The event stream is the only thing that flows outward, and it flows to systems your organization owns.
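
For the visitor in this trace, the record that fans out at step 06 could look like the following. Field names and values are hypothetical, assembled from the walk-through above rather than taken from the product's actual schema.

```python
# Hypothetical published event for the visitor in the walk-through above.
visitor_event = {
    "entity_id": "visitor-00412",             # persistent identity across the camera grid
    "workflow": "verify-review-on-exit",      # the Workflow that produced the event
    "answer": {
        "reviewed": True,
        "reviewed_by": "guard-00087",         # reference to the reviewing guard
        "reviewed_at": "2025-06-12T09:41:18Z",
    },
    "evidence_frames": [                      # retained per the governance policy
        "cam-main-entry/frame-018233",
        "cam-checkpoint/frame-018961",
    ],
    "published_at": "2025-06-12T09:47:02Z",
}
# The same record reaches the security operations dashboard, the audit pipeline,
# and the ticketing system through their subscriptions to the event stream.
```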

008 · Get in touch

Bring your camera grid. We'll walk you through it.

We work with organizations running real-time vision on their own infrastructure — security operations, government, manufacturing, public safety, and compliance teams. A short call is enough to see if Alquimia Vision is the right fit for your case.

Get in touch