University of Minnesota NLP Class Project Proposal - Spring 2026

How Much Labeled Data Does a GUI Agent Need?

Data Scaling Analysis for Inverse Dynamics Models on macOS Workflows

Desi Kings

University of Minnesota

NLP, Spring 2026

Minnesota NLP Lab, University of Minnesota


Dataset Goal

20 Hours

Action Space

8 Classes

Training Splits

1h / 3h / 5h

Backbone

Frozen CLIP

Research Question

How much labeled demonstration data does an Inverse Dynamics Model require to reliably recognize user actions in macOS GUI workflows, and can automatic capture-time labeling replace expensive manual annotation?

Abstract

Building agents that operate GUIs remains a central challenge in computer use. Prior work on inverse dynamics models (IDMs) has shown that capture-time instrumentation can replace expensive manual annotation in game environments, but the desktop setting is less explored and structurally different. This project studies whether a narrower, more consistent domain, macOS Finder workflows, allows IDMs to learn reliable action recognition from substantially less labeled data.

We build a Python-based recording pipeline that captures screen frames at 10 fps together with OS-level input events, aligning the two streams automatically at collection time. Using that pipeline, we assemble 20 hours of Finder demonstrations and train three IDMs on 1 hour, 3 hours, and 5 hours of labeled data. The central output is a performance-versus-data curve on a shared held-out test set.

Motivation and Contributions

Motivation

GUI workflows differ from game environments in ways that may reduce sample complexity: the visual state space is smaller, layouts are more consistent, actions are discrete and low-dimensional, and tasks have clearer success criteria. That makes this setting a clean testbed for measuring how much labeled data an IDM really needs.

If performance saturates early, capture-time labeling becomes a practical recipe for open desktop datasets. If performance keeps climbing, that is still a useful result because it quantifies the role of scale in structured GUI action understanding.

Core Contributions

Automatic data collection pipeline. A Python system that records screen frames and OS-level input events simultaneously, producing labels at capture time with no post-hoc annotation.
macOS GUI workflow dataset. Twenty hours of Finder file-management demonstrations labeled automatically and prepared for public release.
IDM data scaling study. Three controlled training runs at 1 hour, 3 hours, and 5 hours, evaluated on the same held-out session to map the scaling curve.

Dataset and Action Space

Collection Protocol

A single operator performs macOS Finder tasks across multiple sessions with varied folder layouts and file types. Tasks include moving, copying, drag-and-drop, bulk selection, renaming, deleting, creating folders, and navigating nested directories.

Folder structures are randomized between sessions to increase visual diversity while keeping the workflow family fixed. Coordinates and key values are stored as metadata for future policy work, but the prediction target in this project is the action class only.

Dataset Scale

Total Data	Train Splits	Test Set	FPS	Resolution
20 hours	1 hour, 3 hours, 5 hours	30 min held-out session	10	Native Retina
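The table above implies a frame budget that is easy to work out. The following is a minimal sketch of that arithmetic, assuming continuous recording at exactly 10 fps with no dropped frames; the helper name `frame_count` is illustrative and not part of the project code.

```python
FPS = 10  # capture rate stated in the proposal


def frame_count(hours: float, fps: int = FPS) -> int:
    """Frames in `hours` of recording, assuming no dropped frames."""
    return round(hours * 3600 * fps)


budget = {
    "total (20 h)": frame_count(20),
    "train 1 h": frame_count(1),
    "train 3 h": frame_count(3),
    "train 5 h": frame_count(5),
    "test (30 min)": frame_count(0.5),
}

for name, n in budget.items():
    print(f"{name:>14}: {n:,} frames")
```

Because the model consumes adjacent frame pairs, the number of training examples per split would be slightly lower: one fewer than the frame count for each recorded session.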

ID	Action	Metadata	Workflow Role	Visual Signal in s_t+1
0	NONE	-	Idle	Embeddings nearly identical
1	LEFT CLICK	(x, y), ts	Select, navigate	Selection highlight appears
2	DOUBLE CLICK	(x, y), ts	Open item	New window opens
3	RIGHT CLICK	(x, y), ts	Context menu	Context menu appears
4	DRAG START	(x, y), ts	Pick up file	Icon becomes displaced
5	DRAG END	(x, y), ts	Drop file	Icon jumps to new location
6	SCROLL	(dx, dy), ts	Navigate directory	Folder list shifts
7	KEY PRESS	key char, ts	Rename, delete	Filename changes or file disappears

Keyboard actions are collapsed into one class because adjacent frames are often visually identical regardless of which key was pressed.
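One way to encode the action table is an integer enum plus a small label record. This is a sketch only: the class IDs and names come from the table above, while `ActionClass`, `make_label`, and the record layout are hypothetical choices made for illustration.

```python
from enum import IntEnum


class ActionClass(IntEnum):
    """The eight prediction targets, with IDs matching the action table."""
    NONE = 0
    LEFT_CLICK = 1
    DOUBLE_CLICK = 2
    RIGHT_CLICK = 3
    DRAG_START = 4
    DRAG_END = 5
    SCROLL = 6
    KEY_PRESS = 7


def make_label(action: ActionClass, ts=None, **metadata) -> dict:
    """Build a label record (hypothetical layout): class as target, rest as metadata."""
    return {
        "class_id": int(action),
        "class_name": action.name,
        "ts": ts,
        "metadata": metadata,
    }


# Every key maps to the same class; the key itself survives only as metadata.
rename = make_label(ActionClass.KEY_PRESS, ts=12.3, key="a")
delete = make_label(ActionClass.KEY_PRESS, ts=15.0, key="backspace")
```

Keeping coordinates and key values in a separate `metadata` field matches the proposal's split between the classification target and the information retained for future policy work.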

Capture-Time Labeling Pipeline

Pipeline Components

Image worker

`mss` captures screenshots at 10 fps and stores frame timestamps.

Input worker

`pynput` listeners capture clicks, scrolls, drags, and key presses as structured OS-level events.

Orchestrator

A sliding-window drain assigns to each frame timestamp t every input event that has not yet been consumed and satisfies event.ts ≤ t.

Export

PNG frames are converted to H.264 MP4 using PyAV/libav, and JSON metadata is written alongside frame labels.
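As an illustration of the input worker described above, the sketch below pushes raw events onto a thread-safe queue. The callback signatures follow `pynput`'s mouse and keyboard listeners; the `InputEvent` and `InputWorker` names, the event fields, and the injectable clock are assumptions for this example, not the project's actual code. Turning raw presses and releases into classes such as DOUBLE CLICK or DRAG START would need extra logic that is not shown here.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class InputEvent:
    kind: str   # "click", "scroll", or "key"
    ts: float   # capture timestamp in seconds
    data: dict = field(default_factory=dict)


class InputWorker:
    """Collects raw input events into a thread-safe queue (illustrative)."""

    def __init__(self, clock=time.time):
        self.events = queue.Queue()
        self._clock = clock

    # The three callbacks below use the argument order pynput passes in.
    def on_click(self, x, y, button, pressed):
        payload = {"x": x, "y": y, "button": str(button), "pressed": pressed}
        self.events.put(InputEvent("click", self._clock(), payload))

    def on_scroll(self, x, y, dx, dy):
        payload = {"x": x, "y": y, "dx": dx, "dy": dy}
        self.events.put(InputEvent("scroll", self._clock(), payload))

    def on_press(self, key):
        self.events.put(InputEvent("key", self._clock(), {"key": str(key)}))

    def start(self):
        # Imported lazily so the class can be exercised without pynput.
        from pynput import keyboard, mouse

        self._mouse = mouse.Listener(on_click=self.on_click,
                                     on_scroll=self.on_scroll)
        self._keyboard = keyboard.Listener(on_press=self.on_press)
        self._mouse.start()
        self._keyboard.start()
```

Timestamps are taken from one clock inside the worker so that they can be compared directly against frame timestamps, provided the image worker reads the same clock.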

Why This Matters

Labels are generated during recording rather than through costly post-hoc annotation.
The collection process stays faithful to real user interaction because it uses native OS-level instrumentation.
The resulting data is suitable for controlled scaling experiments because all 20 hours are labeled the same way.
The pipeline also preserves richer metadata for future extensions such as coordinate prediction or policy learning.

Alignment rule: for each frame timestamp t, consume all pending input events where event.ts ≤ t.
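The drain rule above can be sketched in a few lines. This is a minimal version under stated assumptions: events are `(ts, label)` tuples, frame timestamps arrive in increasing order, and the function name `align_events_to_frames` is hypothetical. How several events in one bucket are reduced to a single class is left open, since the proposal does not specify it.

```python
from collections import deque


def align_events_to_frames(frame_ts, events):
    """Assign each pending event to the first frame with event.ts <= t.

    frame_ts: increasing frame timestamps.
    events:   iterable of (ts, label) tuples, in any order.
    Returns a list of (t, bucket) pairs; an empty bucket means no input
    arrived since the previous frame.
    """
    pending = deque(sorted(events, key=lambda e: e[0]))
    aligned = []
    for t in frame_ts:
        bucket = []
        # Drain everything that happened at or before this frame.
        while pending and pending[0][0] <= t:
            bucket.append(pending.popleft())
        aligned.append((t, bucket))
    return aligned


frames = [0.0, 0.1, 0.2, 0.3]
events = [(0.05, "LEFT_CLICK"), (0.10, "KEY_PRESS"), (0.25, "SCROLL")]
aligned = align_events_to_frames(frames, events)
```

Each event is consumed exactly once, so an event never labels two frames, and frames with an empty bucket would naturally map to the NONE class.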

Inverse Dynamics Model

Architecture

IDM(s_t, s_t+1) -> a_t

Component	Detail	Params
Backbone	CLIP ViT-B/32, fully frozen	~150M frozen (full CLIP; image encoder alone ~88M)
Input	[e_t || e_t+1] ∈ R^1024	-
MLP Layer 1	Linear(1024->512), ReLU, Dropout(0.3)	524K
MLP Layer 2	Linear(512->256), ReLU, Dropout(0.3)	131K
MLP Layer 3	Linear(256->8), softmax	2K
Total Trainable	MLP head only	~660K
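The trainable parameter counts in the table can be checked with a short calculation. The sketch assumes each linear layer carries a bias term, as is the default in common deep learning libraries; the helper name `linear_params` is illustrative.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias for one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Layer shapes from the architecture table. The 1024-d input is two
# 512-d CLIP ViT-B/32 image embeddings concatenated.
layers = [(1024, 512), (512, 256), (256, 8)]
counts = [linear_params(n_in, n_out) for n_in, n_out in layers]
total = sum(counts)

for (n_in, n_out), n in zip(layers, counts):
    print(f"Linear({n_in}->{n_out}): {n:,}")
print(f"Total trainable: {total:,}")
```

The exact total is 658,184 parameters, which agrees with the rounded ~660K figure in the table.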

Why Frozen CLIP + MLP

CLIP already captures high-level GUI semantics such as menus, cursors, icons, and window changes.
The classification problem is small: only eight action classes with a narrow workflow domain.
Keeping the encoder frozen reduces overfitting risk on session-specific Finder layouts.
The main challenge is not classifier size but encoding the visual difference between adjacent frames well enough to separate subtle actions.

Alternative Considered

A small temporal transformer was considered, but concat + MLP is the cleaner first experiment. It is cheaper to train across multiple splits, easier to interpret, and likely sufficient given CLIP's representational strength in this structured setting.

Experiments and Evaluation

Training Setup

Model	Training Data	Evaluation
IDM1hr	1 hour labeled	30 min held-out session
IDM3hr	3 hours labeled	30 min held-out session
IDM5hr	5 hours labeled	30 min held-out session

All 20 hours are labeled by the same capture-time pipeline. The smaller training splits deliberately withhold labels that already exist; they are not partially annotated data.
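One possible way to build the three training conditions is to take whole sessions in a fixed order until each hour budget is filled. The proposal does not say whether the splits are nested or how sessions are chosen, so nesting, session-level selection, the function name `build_nested_splits`, and the toy session list are all assumptions of this sketch.

```python
def build_nested_splits(sessions, budgets_hours, held_out=()):
    """Pick whole sessions, in order, up to each hour budget.

    sessions:      list of (session_id, duration_hours) in a fixed order.
    budgets_hours: training budgets, e.g. [1, 3, 5].
    held_out:      session ids reserved for testing, never used in training.
    Returns {budget: [session_id, ...]}; smaller budgets are prefixes of
    larger ones because the iteration order is shared.
    """
    usable = [(sid, dur) for sid, dur in sessions if sid not in held_out]
    splits = {}
    for budget in sorted(budgets_hours):
        chosen, total = [], 0.0
        for sid, dur in usable:
            if total + dur > budget + 1e-9:  # tolerance for float sums
                break
            chosen.append(sid)
            total += dur
        splits[budget] = chosen
    return splits


# Toy example: twelve half-hour sessions, the last one held out for testing.
sessions = [(f"s{i:02d}", 0.5) for i in range(12)]
splits = build_nested_splits(sessions, [1, 3, 5], held_out={"s11"})
```

Nested splits mean the 3-hour model sees everything the 1-hour model saw, so differences along the curve reflect added data rather than a different sample.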

Metrics

Macro F1 is the primary metric because the NONE class dominates the frame distribution.
Per-class accuracy highlights which actions remain difficult, especially visually similar drag transitions.
Confusion matrices show where the model collapses nearby interaction classes.
Performance-vs-data curve is the main scientific output of the project.

If the curve flattens at 1 hour, a small amount of automatically labeled data is enough for this domain, and capture-time labeling is highly efficient. If it continues improving through 5 hours, data scaling itself becomes the main result.
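To make the choice of macro F1 as the primary metric concrete, the sketch below computes it from a confusion matrix and applies it to a toy two-class case. The numbers are invented for illustration and are not project results; scoring a class with no true or predicted frames as 0 is a convention assumed here.

```python
def macro_f1(confusion):
    """Return (macro F1, per-class F1 list).

    confusion[i][j] = number of frames with true class i predicted as j.
    """
    n = len(confusion)
    per_class = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        # Convention (assumption): undefined F1 counts as 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / n, per_class


# Toy case: 90 NONE frames, 10 action frames, model always predicts NONE.
toy = [[90, 0],
       [10, 0]]
score, per_class = macro_f1(toy)
accuracy = (toy[0][0] + toy[1][1]) / 100
```

In this toy case accuracy is 0.90 while macro F1 is below 0.5, because the missed action class contributes an F1 of 0 with the same weight as the dominant NONE class.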

Timeline and Scope

Timeline

Period	Milestone	Deliverable
March 2026	Finish DataCollector	Working pipeline: frames + JSON
Early April	Data collection	20 hours of Finder sessions
Mid April	Train IDM1hr, IDM3hr, IDM5hr	Training curves and checkpoints
Late April	Evaluation	F1 tables and confusion matrices
May	Paper writeup	Final submission

Limitations

One operator only.
One task family: Finder file management.
One operating system: macOS.
Action-class prediction only, not coordinate regression or closed-loop control.

These constraints are intentional. They keep the experiment manageable, isolate the data-scaling question, and make the results interpretable within one semester.

BibTeX

@misc{desikings2026guiidm,
  title={How Much Labeled Data Does a GUI Agent Need? Data Scaling Analysis for Inverse Dynamics Models on macOS Workflows},
  author={Desi Kings},
  year={2026},
  note={University of Minnesota NLP Class Project Proposal, Spring 2026}
}