Automation
Intelligent Document Processing
RPA
Transformer

How to fine-tune GPT-4o for industry-specific document processing and robotic process automation

How Automat is using OpenAI's vision fine-tuning to enhance IDP & RPA.

Lucas Ochoa

10.1.2024

Share:

Thanks to OpenAI's new vision fine-tuning, we are now able to demonstrate promising quantitative results that increase accuracy for both: intelligent document processing (IDP) and robotic process automation (RPA).

Read more about the launch and OpenAI's article featuring Automat here

Automat builds enterprise software automations using AI. We use agents to autonomously create and manage complex workflows, making them 10x faster to build and maintain than traditional processes. Here’s how we to use OpenAI’s vision fine-tuning to supplement and enhance our software:

  1. Industry-specific document parsers: Custom intelligent document processing (IDP) models trained with proprietary customer data improve accuracy and reduce development time.
  2. UI-based automation building agents: New agent-driven workflows for Robotic Process Automation (RPA) on Web and Windows native apps accelerate the build process on legacy systems.

UI element localization from text descriptions allows planning agents to perform actions using screen coordinates only

Use Cases:

At Automat, we leverage vision models to build bots that streamline our customers' operations. Our bots perform actions by navigating GUI software systems, as well as understanding and extracting information from unstructured documents. Many of our enterprise customers process millions of physical documents each year, and go on to build UI-based automations.

End-to-end vision document extraction with fine-tuned vision model and JSON schema

For our case study, we used an anonymized dataset of unstructured documents to benefit one of our customers, a Fortune-100 insurance company. We trained a combination of traditional and transformer-based IDP models to analyze and extract information from medical documents such as prescriptions. Below, we highlight the differences and advantages of end-to-end systems, enhanced by GPT-4 vision fine-tuning.

Once data is extracted from the relevant documents, RPA automations begin to reconcile and input the information into legacy internal systems and third-party websites that lack accessible APIs. Traditional RPA methods have several limitations, detailed in the examples below. We conclude that vision fine-tuning, even on a small dataset of interfaces can significantly accelerate the RPA build process.

Intelligent Document Processing (IDP)

We used a mix of complex digital and handwritten health insurance documents (>15 schema fields with recursion). The evaluation is based on similarity between our ground truth dataset of manually labeled documents. Looking forward, we are certain that GPT vision models can surpass the performance of traditional IDP systems due to their end-to-end design. This allows the model to capture the relationships that exist within industry-specific data.

Robotic Process Automation (RPA) Software Agents:

Historically, RPA software development has been slow to build and challenging to maintain. The improved accuracy and reliability of models like GPT-4o, with advancements in multi-modality and structured outputs, have enabled new agentic applications. In many cases, customers need to run automations on a virtual desktop, restricted to video streams and mouse/keyboard input. This type of automation involves targeting UI-elements by their screen position, which can be tedious and unreliable. Additionally, using fixed coordinates is not robust as the position can vary due to factors like software updates, changing UI, and monitor resolution.

Example: Bots interacting with a variety of UI elements across one website

Vision models like GPT-4o struggle with tasks that require precise spatial localization. We evaluated the performance of GPT-4o (zero-shot) to estimate the center coordinates of a UI element from a text description. Then, using a dataset of website screenshots, we fine-tuned the base model improving the success hit rate by 272%.

Looking ahead

We are confident that a high-quality dataset can achieve human-level performance. We are excited about the potential of text-to-coordinate agents and reasoning/planning agents like GPT-o1. This opens the door to end-to-end RPA automation, significantly improving efficiency of our company and paving the way for new opportunities with our existing and future customers.

Authors: Perdo Martinez Lopez, Aaron Bannin, Gautam Bose, Shareef El-Sayed

[button] Get in touch [button]

From the blog

Enterprise
RPA
AI
Intelligent Automation
Automation

Why Companies Are Leaving UiPath in 2026

The $13B RPA market is fracturing. Here's what's driving the exodus from legacy automation - and what the winners are doing differently.

Lucas Ochoa

4.10.2026

Read
AI
Automation
Enterprise
Intelligent Automation

Zapier and N8N Are Great. They Won't Replace Your Operations Team.

Self-service automation tools solve a real problem. But it's a different problem than the one enterprise operations teams face.

Pablo Lleras

4.2.2026

Read
AI
RPA
Enterprise
Computer Vision
Automation

Automat vs Sola: Two AI-Native Approaches to RPA

Both platforms use AI to replace legacy RPA. The difference is in how they deliver: self-serve vs. fully managed.

Lucas Ochoa

3.19.2026

Read
AI
Intelligent Document Processing
Enterprise
Automation

Automat vs ABBYY: IDP Alone Isn't Enough Anymore

ABBYY does document extraction well. But documents are only one piece of the automation puzzle. Here's where the two diverge.

Gautam Bose

3.5.2026

Read
Enterprise
RPA
AI
Intelligent Automation

Automat vs Blue Prism: What's Different in 2026

Blue Prism pioneered enterprise RPA. But the enterprise has moved on. Here's where the two platforms diverge.

Pablo Lleras

2.26.2026

Read
Previous
Next