Developer for Vision AI Desktop Automation Agent

FreelanceJobs
  • Canada

About

Project Overview
We are building an AI-powered desktop automation agent that interacts with a Windows pharmacy management application. The agent uses a local vision-language model (Qwen3-VL 8B served via Ollama) to read the application's GUI through screenshots, then uses PyAutoGUI to execute mouse clicks and keyboard inputs based on what the model sees. All AI inference runs locally on an NVIDIA RTX 5080 GPU — no cloud APIs are used.
The agent automates a repetitive data entry workflow: it reads incoming electronic prescription data from a queue, processes each prescription by populating form fields, performing drug lookups, calculating values from text directions, and saving the completed entry. The application is a legacy Windows desktop program (not a web app) with no API — all interaction must happen through the GUI.
We have the hardware, AI models, and target application installed and running. We have a detailed architecture document and the vision model is already communicating with the application. We need an experienced developer to build the reliable screenshot-to-action pipeline and complete the automation workflow.
The Core Technical Challenge
Many areas of the target application do not support keyboard shortcuts or programmatic access. The agent must visually identify UI elements (buttons, form fields, table rows, checkboxes, dialog boxes) from screenshots, calculate their exact pixel coordinates, and click them precisely using PyAutoGUI. This requires:
• Sending screenshots (base64-encoded PNG) to the Qwen3-VL 8B model via Ollama's local REST API and parsing structured JSON responses
• Extracting element coordinates from the vision model's output and translating them to screen-space click targets
• Handling DPI scaling, window positioning, and resolution consistency on Windows 11 (fixed 1920x1080)
• Building verify-after-action loops: take screenshot, act, take screenshot again, confirm success
• Robust error recovery when clicks miss, dialogs appear unexpectedly, or the application state is unrecognized
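As a minimal sketch of the screenshot-to-model round trip: capture the screen as a base64 PNG, POST it to Ollama's local `/api/generate` endpoint, and parse a JSON reply. The response schema (`{"x": ..., "y": ..., "found": ...}`) and the model tag are assumptions; the production prompt templates will define the real contract.

```python
import base64
import io
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # local Ollama REST API

def build_payload(description: str, image_b64: str) -> dict:
    """Build an Ollama /api/generate request that asks for strict JSON output."""
    return {
        "model": "qwen3-vl:8b",  # model tag is an assumption
        "prompt": (f'Locate "{description}" in the screenshot. Reply with '
                   'JSON only: {"x": <int>, "y": <int>, "found": <bool>}'),
        "images": [image_b64],   # Ollama accepts base64-encoded images here
        "stream": False,
        "format": "json",        # constrain the model to valid JSON
    }

def parse_location(raw_response: str) -> dict:
    """Parse and sanity-check the model's JSON reply before anything clicks."""
    loc = json.loads(raw_response)
    if not loc.get("found"):
        raise LookupError("element not found on screen")
    if not (0 <= loc["x"] < 1920 and 0 <= loc["y"] < 1080):
        raise ValueError(f"coordinates out of bounds: {loc}")
    return loc

def locate_element(description: str) -> dict:
    """Capture the screen, query the vision model, return pixel coordinates."""
    import requests              # third-party deps imported at call time
    from PIL import ImageGrab
    img = ImageGrab.grab()       # full 1920x1080 capture
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    resp = requests.post(OLLAMA_URL, json=build_payload(description, b64),
                         timeout=120)
    resp.raise_for_status()
    return parse_location(resp.json()["response"])
```

Bounds-checking the reply before clicking is cheap insurance: a hallucinated coordinate should raise, not click.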
What You Will Build
1. Screenshot-to-Coordinate Pipeline (Priority 1)
The foundational system that all automation depends on. Capture the application window, send to the vision model with a structured prompt, receive JSON with element locations, and convert to reliable PyAutoGUI click coordinates. This must work consistently across all application screens.
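One way to sketch the coordinate-conversion and verify-after-action steps, assuming the display is fixed at 1920x1080 with 100% DPI scaling; `verify` is a hypothetical callable that re-captures the screen and confirms the expected state change:

```python
import time

def to_screen(x: int, y: int, img_w: int, img_h: int,
              screen_w: int = 1920, screen_h: int = 1080) -> tuple[int, int]:
    """Rescale coordinates from the screenshot the model saw to screen space.

    At native resolution this is an identity mapping, but keeping the
    rescale makes the pipeline robust if capture size ever diverges
    from display size (e.g. DPI scaling or window-only captures).
    """
    return round(x * screen_w / img_w), round(y * screen_h / img_h)

def click_and_verify(x: int, y: int, verify, retries: int = 3) -> bool:
    """Click, then confirm the result via a screenshot-based check.

    Retries when `verify()` reports no change, which covers clicks
    that land slightly off or arrive before the UI is ready.
    """
    import pyautogui  # imported here; PyAutoGUI needs an active display session
    pyautogui.FAILSAFE = True  # slam the mouse into a corner to abort
    for _ in range(retries):
        pyautogui.click(x, y)
        time.sleep(0.5)  # let the legacy app repaint before re-checking
        if verify():
            return True
    return False
```

Treating every click as unverified until a follow-up screenshot confirms it is what makes the pipeline reliable on a legacy app with no programmatic feedback.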
2. Screen State Detection & Routing
The agent must identify which of approximately 7 different application screens is currently displayed (login, main menu, data queue, processing form, search results, modal dialog, error). Based on the detected state, it routes to the correct action handler.
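The routing layer can be a simple dispatch table keyed on the detected state. The state names below mirror the seven screens listed; the handlers are hypothetical placeholders, and an explicit UNKNOWN fallback keeps misclassifications from silently clicking the wrong thing:

```python
from enum import Enum, auto

class Screen(Enum):
    LOGIN = auto()
    MAIN_MENU = auto()
    DATA_QUEUE = auto()
    PROCESSING_FORM = auto()
    SEARCH_RESULTS = auto()
    MODAL_DIALOG = auto()
    ERROR = auto()
    UNKNOWN = auto()  # anything the vision model cannot classify

def handle_login(): ...       # placeholder: credential entry sequence
def handle_main_menu(): ...   # placeholder: navigate to the data queue

def handle_unknown():
    """Safe fallback: stop and surface the problem instead of guessing."""
    raise RuntimeError("unrecognized screen state")

HANDLERS = {
    Screen.LOGIN: handle_login,
    Screen.MAIN_MENU: handle_main_menu,
    # ... one handler per remaining screen ...
}

def route(state: Screen):
    """Dispatch to the handler for the detected screen."""
    return HANDLERS.get(state, handle_unknown)()
```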
3. Complete Workflow Automation
A multi-step automation sequence that handles: logging in with credentials, navigating to a data queue, filtering and selecting items from a table, processing each item through a multi-field form, performing database searches within the application, populating calculated values, saving entries, and looping until the queue is empty.
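At the top level this reduces to a queue-draining loop. Every name below is a hypothetical stub standing in for a vision-verified action sequence; only the loop shape is the point:

```python
# Stand-in queue so the loop shape is runnable; the real queue lives
# inside the target application and is read via screenshots.
_queue = list(range(3))

def queue_is_empty() -> bool:
    return not _queue

def select_next_item():
    _queue.pop(0)             # placeholder: click the top row of the queue table

def fill_form_fields(): ...       # placeholder: populate the multi-field form
def run_drug_lookup(): ...        # placeholder: in-app database search
def apply_calculated_values(): ...  # placeholder: values derived from directions text
def save_entry(): ...             # placeholder: save and confirm via screenshot

def process_queue(max_items: int = 500) -> int:
    """Drain the queue; max_items caps runaway loops if state detection fails."""
    processed = 0
    while processed < max_items and not queue_is_empty():
        select_next_item()
        fill_form_fields()
        run_drug_lookup()
        apply_calculated_values()
        save_entry()
        processed += 1
    return processed
```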
4. Orchestrator / Watchdog Process
A parent process that monitors the agent via heartbeat, auto-restarts on crash with exponential backoff, and logs all events. Standard watchdog pattern using Python subprocess.
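The pattern referred to here can be sketched as follows; the 60-second "healthy run" threshold and delay parameters are illustrative choices, not requirements:

```python
import logging
import subprocess
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def next_delay(delay: float, uptime: float, base: float = 1.0,
               cap: float = 60.0, healthy_after: float = 60.0) -> float:
    """Exponential backoff that resets after a sufficiently long run,
    so a healthy agent that crashes hours later restarts immediately."""
    if uptime >= healthy_after:
        return base
    return min(delay * 2, cap)

def supervise(agent_cmd: list[str]) -> None:
    """Restart the agent whenever it exits, logging every event.

    Runs forever; in production this is where heartbeat monitoring
    would also live (e.g. kill the child if it stops touching a
    heartbeat file).
    """
    delay = 1.0
    while True:
        started = time.monotonic()
        logging.info("starting agent: %s", agent_cmd)
        code = subprocess.call(agent_cmd)
        uptime = time.monotonic() - started
        logging.info("agent exited with %d after %.0fs", code, uptime)
        time.sleep(delay)
        delay = next_delay(delay, uptime)

# supervise(["python", "agent.py"])  # blocks forever; Ctrl+C to stop
```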
5. Screen Recording for Audit Trail
Lightweight screen recording (ffmpeg with gdigrab on Windows) that captures all agent sessions for review.
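For reference, a typical gdigrab invocation; the frame rate and encoder settings below are suggestions to keep audit files small, not requirements:

```shell
# -f gdigrab is ffmpeg's Windows desktop-capture input device;
# "desktop" grabs the whole screen (a window title can be given instead).
# 5 fps is plenty for an audit trail and keeps files small.
ffmpeg -f gdigrab -framerate 5 -i desktop \
       -c:v libx264 -preset ultrafast -pix_fmt yuv420p \
       audit_session.mp4
```

A specific window can be captured with `-i title="Window Title"` instead of `desktop`, which avoids recording unrelated screen activity.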
Technical Stack (Already In Place)
• Hardware: NVIDIA RTX 5080 GPU, AMD Ryzen 9 9900X, 32GB DDR5-6000
• OS: Windows 11 (native, no WSL/Linux)
• Vision Model: Qwen3-VL 8B via Ollama (localhost:11434)
• Text/Reasoning Model: Qwen3 8B via Ollama (localhost:11434)
• Automation: PyAutoGUI + Python 3.12+
• Screenshot Capture: PIL ImageGrab / mss
• Screen Recording: ffmpeg (gdigrab)
• Target Application: Legacy Windows desktop software at 1920x1080
• IDE: Claude Code (AI coding assistant, already in use)
Required Skills
• Must Have:
• Strong Python experience (3+ years)
• PyAutoGUI or equivalent desktop GUI automation on Windows
• Experience with screenshot-based automation or computer vision coordinate extraction
• PIL/Pillow image processing
• REST API integration (calling local LLM endpoints, parsing JSON responses)
• Windows 11 development experience (no Linux/Mac workarounds)
• subprocess management and process monitoring in Python
• Strongly Preferred:
• Experience with Ollama, vLLM, or other local LLM serving frameworks
• Experience with vision-language models (Qwen, LLaVA, UI-TARS, or similar)
• Prior RPA (Robotic Process Automation) development
• OpenCV or advanced image processing for element detection
• Experience building watchdog/orchestrator processes
• Nice to Have:
• Healthcare or pharmacy software experience
• Experience with pharmacy management systems
• Prompt engineering for vision-language models
How We Will Work Together
I am the domain expert (pharmacist) and will provide:
• Complete workflow documentation with screenshots of every application screen
• Detailed business logic for all data calculations and decision rules
• Real-time feedback and testing against the target application
• Prompt templates and domain-specific instructions for the vision model
• Access to a demo/training environment of the target application (no real patient data)
You will focus on the engineering:
• Building the reliable screenshot-to-action pipeline
• PyAutoGUI automation sequences with proper timing and verification
• Structuring the codebase into clean, maintainable modules
• Error recovery and production reliability
• Orchestrator process and logging infrastructure
We will have regular check-ins (daily or every other day), and development will take place in a monitored environment.
To Apply, Please Include
1. A brief description of your most relevant desktop automation or RPA project
2. Your experience with local LLMs (Ollama, vLLM) or vision-language models, if any
3. Your approach to the core challenge: extracting reliable click coordinates from a vision model's analysis of a screenshot
4. Your availability and estimated timeline
5. Any questions about the project
Tags / Skills for Upwork
Python, PyAutoGUI, Desktop Automation, RPA, Computer Vision, Ollama, LLM Integration, Windows Automation, PIL/Pillow, REST API, JSON Parsing, Screen Automation, AI Agent Development
Contract duration: 1 to 3 months, at 30 hours per week.
Mandatory skills: Python, Artificial Intelligence, Machine Learning, PyAutoGUI, Ollama, REST API, PIL/Pillow, Vision-Language Model, Windows App Development, Robotic Process Automation

Languages

  • English
Note for Users

This job posting comes from a TieTalent partner platform. Click "Apply Now" to submit your application directly on their website.