University of Minnesota NLP Class Project Proposal - Spring 2026

How Much Labeled Data Does a GUI Agent Need?

Data Scaling Analysis for Inverse Dynamics Models on macOS Workflows

Desi Kings
University of Minnesota
NLP, Spring 2026

Minnesota NLP Lab, University of Minnesota
Dataset Goal: 20 Hours
Action Space: 8 Classes
Training Splits: 1h / 3h / 5h
Backbone: Frozen CLIP

Research Question

How much labeled demonstration data does an Inverse Dynamics Model require to reliably recognize user actions in macOS GUI workflows, and can automatic capture-time labeling replace expensive manual annotation?

Abstract

Building agents that operate GUIs remains a central challenge in computer use. Prior inverse dynamics work has shown that capture-time instrumentation can replace expensive manual annotation in game environments, but the desktop setting is less explored and structurally different. This project studies whether a narrower, more consistent macOS Finder workflow domain allows IDMs to learn reliable action recognition from substantially less labeled data.

We build a Python-based recording pipeline that captures screen frames at 10 fps together with OS-level input events, aligning the two streams automatically at collection time. Using that pipeline, we assemble 20 hours of Finder demonstrations and train three IDMs on 1 hour, 3 hours, and 5 hours of labeled data. The central output is a performance-versus-data curve on a shared held-out test set.

Motivation and Contributions

Motivation

GUI workflows differ from game environments in ways that may reduce sample complexity: the visual state space is smaller, layouts are more consistent, actions are discrete and low-dimensional, and tasks have clearer success criteria. That makes this setting a clean testbed for measuring how much labeled data an IDM really needs.

If performance saturates early, capture-time labeling becomes a practical recipe for open desktop datasets. If performance keeps climbing, that is still a useful result because it quantifies the role of scale in structured GUI action understanding.

Core Contributions

  • Automatic data collection pipeline. A Python system that records screen frames and OS-level input events simultaneously, producing labels at capture time with no post-hoc annotation.
  • macOS GUI workflow dataset. Twenty hours of Finder file-management demonstrations labeled automatically and prepared for public release.
  • IDM data scaling study. Three controlled training runs at 1 hour, 3 hours, and 5 hours, evaluated on the same held-out session to map the scaling curve.

Dataset and Action Space

Collection Protocol

A single operator performs macOS Finder tasks across multiple sessions with varied folder layouts and file types. Tasks include moving, copying, drag-and-drop, bulk selection, renaming, deleting, creating folders, and navigating nested directories.

Folder structures are randomized between sessions to increase visual diversity while keeping the workflow family fixed. Coordinates and key values are stored as metadata for future policy work, but the prediction target in this project is the action class only.

Dataset Scale

| Total Data | Train Splits | Test Set | FPS | Resolution |
| --- | --- | --- | --- | --- |
| 20 hours | 1 hour, 3 hours, 5 hours | 30 min held-out session | 10 | Native Retina |

| ID | Action | Metadata | Workflow Role | Visual Signal in $s_{t+1}$ |
| --- | --- | --- | --- | --- |
| 0 | NONE | - | Idle | Embeddings nearly identical |
| 1 | LEFT CLICK | (x, y), ts | Select, navigate | Selection highlight appears |
| 2 | DOUBLE CLICK | (x, y), ts | Open item | New window opens |
| 3 | RIGHT CLICK | (x, y), ts | Context menu | Context menu appears |
| 4 | DRAG START | (x, y), ts | Pick up file | Icon becomes displaced |
| 5 | DRAG END | (x, y), ts | Drop file | Icon jumps to new location |
| 6 | SCROLL | (dx, dy), ts | Navigate directory | Folder list shifts |
| 7 | KEY PRESS | key char, ts | Rename, delete | Filename changes or file disappears |

Keyboard actions are collapsed into one class because adjacent frames are often visually identical regardless of which key was pressed.
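
As a minimal sketch of how the eight-class action space and its metadata could be represented, the snippet below encodes the class IDs from the action table as an enum. The record layout and the `make_label` helper are hypothetical illustrations, not the project's actual schema; the point is only that different keys share one class while the key itself is kept as metadata.

```python
from enum import IntEnum


class Action(IntEnum):
    """The eight action classes, keyed by the class IDs in the action table."""
    NONE = 0
    LEFT_CLICK = 1
    DOUBLE_CLICK = 2
    RIGHT_CLICK = 3
    DRAG_START = 4
    DRAG_END = 5
    SCROLL = 6
    KEY_PRESS = 7


def make_label(action, **metadata):
    """Build one label record (hypothetical layout).

    The class ID is the prediction target; everything in `meta`
    is stored only for future policy work.
    """
    return {"action_id": int(action), "action": action.name, "meta": metadata}


# Two different keys collapse into the same class; the key survives as metadata.
rename = make_label(Action.KEY_PRESS, key="a", ts=12.4)
delete = make_label(Action.KEY_PRESS, key="backspace", ts=15.0)
```

Both records carry `action_id` 7, so a classifier trained on them never has to distinguish keys that leave adjacent frames visually identical.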

Capture-Time Labeling Pipeline

Pipeline Components

  1. Image worker. `mss` captures screenshots at 10 fps and stores frame timestamps.
  2. Input worker. `pynput` listeners capture clicks, scrolls, drags, and key presses as structured OS-level events.
  3. Orchestrator. A sliding-window drain assigns to each frame timestamp t every not-yet-consumed input event with event.ts ≤ t.
  4. Export. PNG frames are converted to H.264 MP4 using PyAV/libav, and JSON metadata is written alongside frame labels.

Why This Matters

  • Labels are generated during recording rather than through costly post-hoc annotation.
  • The collection process stays faithful to real user interaction because it uses native OS-level instrumentation.
  • The resulting data is suitable for controlled scaling experiments because all 20 hours are labeled the same way.
  • The pipeline also preserves richer metadata for future extensions such as coordinate prediction or policy learning.

Alignment rule used by the orchestrator: for each frame timestamp t, consume all pending input events where event.ts ≤ t.
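
The alignment rule can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions, not the project's orchestrator: the function name `align_events`, the `(ts, name)` event tuples, and the toy timestamps are all hypothetical, and both streams are assumed to be sorted by time.

```python
from collections import deque


def align_events(frame_timestamps, events):
    """Assign each input event to the first frame with timestamp >= event ts.

    frame_timestamps: ascending list of floats (seconds).
    events: list of (ts, name) tuples sorted by ts.
    Returns one list of event names per frame.
    """
    pending = deque(events)
    per_frame = []
    for t in frame_timestamps:
        drained = []
        # Drain every not-yet-consumed event with event.ts <= t.
        while pending and pending[0][0] <= t:
            drained.append(pending.popleft()[1])
        per_frame.append(drained)
    return per_frame


# Toy example at 10 fps: one click, then two key presses within one frame gap.
frames = [0.0, 0.1, 0.2, 0.3]
events = [(0.05, "LEFT_CLICK"), (0.12, "KEY_PRESS"), (0.19, "KEY_PRESS")]
aligned = align_events(frames, events)
print(aligned)  # [[], ['LEFT_CLICK'], ['KEY_PRESS', 'KEY_PRESS'], []]
```

Because each event is popped exactly once, no event is attached to two frames, and events that arrive after the last frame simply stay pending. How a frame with several drained events is reduced to a single class label is a separate design decision that this sketch leaves open.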

Inverse Dynamics Model

Architecture

$\mathrm{IDM}(s_t, s_{t+1}) \rightarrow a_t$

| Component | Detail | Params |
| --- | --- | --- |
| Backbone | CLIP ViT-B/32, fully frozen | ~150M frozen |
| Input | $[e_t \,\Vert\, e_{t+1}] \in \mathbb{R}^{1024}$ | - |
| MLP Layer 1 | Linear(1024->512), ReLU, Dropout(0.3) | 524K |
| MLP Layer 2 | Linear(512->256), ReLU, Dropout(0.3) | 131K |
| MLP Layer 3 | Linear(256->8), softmax | 2K |
| Total Trainable | MLP head only | ~660K |
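
The trainable parameter counts can be checked with a short calculation. The sketch below assumes each Linear layer has a bias term and that the two concatenated frame embeddings are 512-dimensional each (the CLIP ViT-B/32 image embedding size); the helper name `linear_params` is illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


EMBED_DIM = 512  # CLIP ViT-B/32 image embedding size
# Concatenated input -> hidden -> hidden -> 8 action classes.
layer_dims = [2 * EMBED_DIM, 512, 256, 8]

per_layer = [linear_params(a, b) for a, b in zip(layer_dims, layer_dims[1:])]
total = sum(per_layer)

print(per_layer)  # [524800, 131328, 2056]
print(total)      # 658184
```

The total of 658,184 matches the ~660K figure for the MLP head; ReLU, Dropout, and softmax add no parameters.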

Why Frozen CLIP + MLP

  • CLIP already captures high-level GUI semantics such as menus, cursors, icons, and window changes.
  • The classification problem is small: only eight action classes with a narrow workflow domain.
  • Keeping the encoder frozen reduces overfitting risk on session-specific Finder layouts.
  • The main challenge is not classifier size but encoding the visual difference between adjacent frames well enough to separate subtle actions.

Alternative Considered

A small temporal transformer was considered, but concat + MLP is the cleaner first experiment. It is cheaper to train across multiple splits, easier to interpret, and likely sufficient given CLIP's representational strength in this structured setting.

Experiments and Evaluation

Training Setup

| Model | Training Data | Evaluation |
| --- | --- | --- |
| IDM_1hr | 1 hour labeled | 30 min held-out session |
| IDM_3hr | 3 hours labeled | 30 min held-out session |
| IDM_5hr | 5 hours labeled | 30 min held-out session |

All 20 hours are labeled by the same capture-time pipeline. The smaller training splits are deliberate label-hiding conditions, not partially annotated data.
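
For a sense of scale, the frame budgets implied by the 10 fps capture rate can be worked out directly. The sketch below treats each split as one continuous recording, which is a simplifying assumption: with multiple sessions, the number of adjacent-frame pairs is slightly lower because pairs do not span session boundaries. The split names are illustrative.

```python
FPS = 10
FRAMES_PER_HOUR = FPS * 3600  # 36,000 frames per recorded hour


def frame_budget(hours):
    """Frames and adjacent-frame pairs in one continuous recording of this length."""
    frames = int(hours * FRAMES_PER_HOUR)
    return frames, frames - 1


splits = [("train_1h", 1), ("train_3h", 3), ("train_5h", 5),
          ("test_30min", 0.5), ("all_20h", 20)]
budgets = {name: frame_budget(hours) for name, hours in splits}

for name, (frames, pairs) in budgets.items():
    print(f"{name}: {frames} frames, {pairs} frame pairs")
```

Under these assumptions the 1-hour split holds 36,000 frames, the 5-hour split 180,000, the held-out session 18,000, and the full collection 720,000.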

Metrics

  • Macro F1 is the primary metric because the NONE class dominates the frame distribution.
  • Per-class accuracy highlights which actions remain difficult, especially visually similar drag transitions.
  • Confusion matrices show where the model collapses nearby interaction classes.
  • Performance-vs-data curve is the main scientific output of the project.

If the curve flattens at 1 hour, automatic labeling is highly efficient. If it continues improving through 5 hours, data scaling itself becomes the main result.
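
To illustrate why macro F1 is preferred when NONE dominates, the following sketch computes it in plain Python on a toy label stream. It is not the project's evaluation code: the function name, the toy data, and the convention of scoring a class as 0 when it has neither true nor predicted instances are assumptions made for this example.

```python
def macro_f1(y_true, y_pred, num_classes=8):
    """Unweighted mean of per-class F1 scores.

    A class with no true and no predicted instances scores 0 here
    (an assumed convention for this sketch).
    """
    f1s = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / num_classes


# Toy stream dominated by NONE (class 0): 18 idle frames, 2 left clicks.
y_true = [0] * 18 + [1] * 2
always_none = [0] * 20  # a degenerate model that never predicts an action

accuracy = sum(t == p for t, p in zip(y_true, always_none)) / len(y_true)
score = macro_f1(y_true, always_none, num_classes=2)
```

On this toy stream the degenerate model reaches 90% accuracy but a macro F1 of about 0.47, because the missed click class contributes an F1 of 0 with the same weight as NONE.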

Timeline and Scope

Timeline

| Period | Milestone | Deliverable |
| --- | --- | --- |
| March 2026 | Finish DataCollector | Working pipeline: frames + JSON |
| Early April | Data collection | 20 hours of Finder sessions |
| Mid April | Train IDM_1hr, IDM_3hr, IDM_5hr | Training curves and checkpoints |
| Late April | Evaluation | F1 tables and confusion matrices |
| May | Paper writeup | Final submission |

Limitations

  • One operator only.
  • One task family: Finder file management.
  • One operating system: macOS.
  • Action-class prediction only, not coordinate regression or closed-loop control.

These constraints are intentional. They keep the experiment manageable, isolate the data-scaling question, and make the results interpretable within one semester.

BibTeX

@misc{desikings2026guiidm,
  title={How Much Labeled Data Does a GUI Agent Need? Data Scaling Analysis for Inverse Dynamics Models on macOS Workflows},
  author={Desi Kings},
  year={2026},
  note={University of Minnesota NLP Class Project Proposal, Spring 2026}
}