Laryaa
04 · Technical Architecture · November 22, 2025 · 6 min read

On-Device Vision Is Not a Hack — It's the Only Universal Interface

Screens are the last stable API.

Visual automation — reading screens and clicking elements — is often dismissed as a crude workaround for proper integration. This dismissal reflects a misunderstanding of enterprise software reality. Vision isn't the fallback interface. It's the only interface that works everywhere.

Why DOM and API Approaches Don't Generalize

Web automation relies on DOM access — inspecting HTML structure to locate elements. This works for web apps but fails for desktop applications, remote desktops, virtualized environments, and embedded systems.

API automation requires... APIs, which much enterprise software, especially legacy software, doesn't offer. Even when APIs exist, they often expose only part of what the user interface can do.

Both approaches require per-application integration work. Different selectors for each app. Different API calls for each system. Each integration is a custom project.

There is no universal DOM. There is no universal API. But there is a universal interface: the screen.

Why Vision Scales Across Legacy Software

Every application that runs on a computer produces visual output. Windows from 1995, Citrix sessions, Java Swing apps, mainframe terminals, proprietary kiosks — they all draw to screens.

Visual automation doesn't care about underlying technology. It sees what humans see. A button is a button whether it's HTML, WPF, Qt, or custom-rendered.

This universality eliminates per-application integration. One visual system can potentially automate thousands of applications without custom code for each.

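The math is compelling: the effort for integration-based automation grows linearly with the number of applications, while the effort for visual automation grows with the capabilities the vision system must support, largely independent of how many applications it is pointed at.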
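A toy cost model makes the scaling argument concrete. This is only an illustrative sketch: the function names and every number in it are invented for the example, not measurements, and it ignores any per-application tuning a real visual system might still need.

```python
def integration_cost(num_apps: int, cost_per_app: float) -> float:
    """Integration-based automation: every application needs its own connector."""
    return num_apps * cost_per_app


def vision_cost(num_capabilities: int, cost_per_capability: float) -> float:
    """Visual automation: cost tracks the perception capabilities that are built,
    not the number of applications those capabilities are later applied to."""
    return num_capabilities * cost_per_capability


if __name__ == "__main__":
    # Hypothetical effort units, chosen only to show the shape of the curves.
    for apps in (10, 100, 1000):
        print(apps, integration_cost(apps, 5.0), vision_cost(8, 40.0))
```

Under these assumed numbers the integration column grows with every added application, while the vision column stays flat until a new capability is required.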

Why This Is Harder Than It Looks

Vision-based automation is not screenshot + OCR + click coordinates. That naive approach is brittle in the same way as traditional RPA: it breaks as soon as a layout, resolution, or label changes.

Robust visual automation requires semantic understanding. Not "click pixel 340, 220" but "click the Submit button." Not "read text at region" but "find the account balance field."
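The difference between coordinate targeting and semantic targeting can be sketched in a few lines. The sketch assumes some perception step has already turned a screenshot into a list of detected elements. The `UIElement` type, its role names, and `find_element` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    role: str                        # e.g. "button" (hypothetical role vocabulary)
    label: str                       # visible text or inferred name
    box: Tuple[int, int, int, int]   # (left, top, width, height) in pixels

    def center(self) -> Tuple[int, int]:
        left, top, width, height = self.box
        return (left + width // 2, top + height // 2)


def find_element(elements: List[UIElement], role: str, label: str) -> Optional[UIElement]:
    """Return the first element with the given role whose label matches, ignoring case."""
    wanted = label.strip().lower()
    for element in elements:
        if element.role == role and element.label.strip().lower() == wanted:
            return element
    return None


# The same dialog rendered at two sizes: the Submit button moves,
# but the instruction "click the Submit button" stays the same.
layout_small = [
    UIElement("button", "Cancel", (200, 205, 80, 30)),
    UIElement("button", "Submit", (300, 205, 80, 30)),
]
layout_large = [
    UIElement("button", "Cancel", (300, 308, 120, 44)),
    UIElement("button", "Submit", (450, 308, 120, 44)),
]

if __name__ == "__main__":
    for layout in (layout_small, layout_large):
        target = find_element(layout, "button", "Submit")
        print(target.center() if target else "not found")
```

A script hard-coded to pixel (340, 220) would only work on the first layout. Resolving the target by role and label works on both, as long as the perception step detects the button.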

The system must handle resolution changes, DPI scaling, theme variations, font differences, and partial occlusion. It must distinguish interactive elements from decorative ones. It must understand spatial relationships and reading order.
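Two of these requirements, DPI scaling and reading order, are simple enough to sketch. The helpers below are illustrative assumptions rather than a description of any particular product: `scale_box` maps a box from logical to physical pixels, and `reading_order` uses a basic row-grouping heuristic that assumes a left-to-right, top-to-bottom layout.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def scale_box(box: Box, scale: float) -> Box:
    """Convert a box from logical pixels to physical pixels for a DPI scale factor."""
    left, top, width, height = box
    return (round(left * scale), round(top * scale),
            round(width * scale), round(height * scale))


def reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Order boxes top-to-bottom, then left-to-right within each row.

    A box joins the current row when its top edge is within row_tolerance
    pixels of the top edge of the first box in that row.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


if __name__ == "__main__":
    print(scale_box((100, 40, 80, 32), 1.25))
    print(reading_order([(200, 12, 50, 20), (10, 10, 50, 20),
                         (10, 60, 50, 20), (120, 58, 50, 20)]))
```

The tolerance absorbs small vertical misalignments between elements on the same visual row. Real interfaces with multi-column layouts or right-to-left text would need a more careful ordering than this heuristic.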

Building reliable visual automation requires solving multiple computer vision problems simultaneously — while running fast enough for interactive use on constrained hardware.

The Visual Sovereignty Principle

If vision is the universal interface, then robust on-device vision is the foundation of universal automation. Not vision as a cloud service: sending screen contents off-device reintroduces compliance problems. Vision running locally, processing sensitive screens without external transmission.

This is what we call Visual Sovereignty: the ability to perceive and understand any visual interface, entirely on-device, without external dependencies.

Visual Sovereignty isn't about avoiding cloud costs. It's about enabling automation where cloud access is prohibited.

Key Takeaway

Vision-based automation isn't a workaround for missing APIs — it's the only approach that generalizes across the full diversity of enterprise software. Building reliable, on-device visual automation is hard, but it's the correct hard problem to solve.

Topics covered

Visual Automation · API Limitations · Cross-Platform
