Research Question
How much labeled demonstration data does an Inverse Dynamics Model require to reliably recognize user actions in macOS GUI workflows, and can automatic capture-time labeling replace expensive manual annotation?
University of Minnesota NLP Class Project Proposal - Spring 2026
GUI workflows differ from game environments in ways that may reduce sample complexity: the visual state space is smaller, layouts are more consistent, actions are discrete and low-dimensional, and tasks have clearer success criteria. That makes this setting a clean testbed for measuring how much labeled data an IDM really needs.
If performance saturates early, capture-time labeling becomes a practical recipe for open desktop datasets. If performance keeps climbing, that is still a useful result because it quantifies the role of scale in structured GUI action understanding.
A single operator performs macOS Finder tasks across multiple sessions with varied folder layouts and file types. Tasks include moving, copying, drag-and-drop, bulk selection, renaming, deleting, creating folders, and navigating nested directories.
Folder structures are randomized between sessions to increase visual diversity while keeping the workflow family fixed. Coordinates and key values are stored as metadata for future policy work, but the prediction target in this project is the action class only.
| Total Data | Train Splits | Test Set | FPS | Resolution |
|---|---|---|---|---|
| 20 hours | 1 hour, 3 hours, 5 hours | 30 min held-out session | 10 | Native Retina |
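
To make the split sizes concrete, the sketch below converts recording time into consecutive-frame pairs at the stated 10 fps. It assumes each split is one continuous recording; the proposal describes multiple sessions, so real counts would be slightly lower because no pair spans a session boundary. The helper name `frame_pairs` is illustrative, not part of the project code.

```python
FPS = 10  # capture rate from the dataset table


def frame_pairs(hours: float, fps: int = FPS) -> int:
    """Consecutive-frame pairs in one continuous recording of the given length."""
    frames = int(hours * 3600 * fps)
    # n frames yield n - 1 adjacent (s_t, s_{t+1}) pairs
    return max(frames - 1, 0)


if __name__ == "__main__":
    splits = [("1 hour", 1), ("3 hours", 3), ("5 hours", 5), ("30 min test", 0.5)]
    for name, hours in splits:
        print(f"{name}: {frame_pairs(hours):,} pairs")
```

Under these assumptions, the smallest training split already holds roughly 36,000 pairs, though how many of them carry a non-idle label depends on the recorded workflows.
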
| ID | Action | Metadata | Workflow Role | Visual Signal in s_{t+1} |
|---|---|---|---|---|
| 0 | NONE | - | Idle | Embeddings nearly identical |
| 1 | LEFT CLICK | (x, y), ts | Select, navigate | Selection highlight appears |
| 2 | DOUBLE CLICK | (x, y), ts | Open item | New window opens |
| 3 | RIGHT CLICK | (x, y), ts | Context menu | Context menu appears |
| 4 | DRAG START | (x, y), ts | Pick up file | Icon becomes displaced |
| 5 | DRAG END | (x, y), ts | Drop file | Icon jumps to new location |
| 6 | SCROLL | (dx, dy), ts | Navigate directory | Folder list shifts |
| 7 | KEY PRESS | key char, ts | Rename, delete | Filename changes or file disappears |
Keyboard actions are collapsed into one class because adjacent frames are often visually identical regardless of which key was pressed.
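
One possible encoding of the eight classes is sketched below. The IDs and names follow the table; the event record layout (`kind`, `char`, and so on) and the function `to_class` are hypothetical, chosen only to show how every key press collapses into a single class while its metadata stays in the record.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action classes, numbered as in the table above."""
    NONE = 0
    LEFT_CLICK = 1
    DOUBLE_CLICK = 2
    RIGHT_CLICK = 3
    DRAG_START = 4
    DRAG_END = 5
    SCROLL = 6
    KEY_PRESS = 7


def to_class(event: dict) -> Action:
    """Map a raw event record (hypothetical layout) to its prediction target."""
    if event["kind"] == "key":
        # Which key was pressed is kept as metadata, not as a class.
        return Action.KEY_PRESS
    # Other kinds are assumed to be spelled like the enum names, e.g. "left_click".
    return Action[event["kind"].upper()]
```

For example, `to_class({"kind": "key", "char": "a"})` and `to_class({"kind": "key", "char": "b"})` both return `Action.KEY_PRESS`.
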
`mss` captures screenshots at 10 fps and stores frame timestamps.
`pynput` listeners capture clicks, scrolls, drags, and key presses as structured OS-level events.
A sliding-window drain assigns to each frame timestamp t every input event that has not yet been consumed and satisfies event.ts ≤ t.
PNG frames are converted to H.264 MP4 using PyAV/libav, and JSON metadata is written alongside frame labels.
for each frame timestamp t: consume all input events where event.ts ≤ t
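
The drain rule can be sketched in a few lines of Python. This is a minimal illustration, not the DataCollector implementation: the function name `drain_events` and the dict-based event records are assumptions, and frame timestamps are assumed to be sorted in ascending order.

```python
from collections import deque


def drain_events(frame_ts, events):
    """For each frame timestamp t, consume all pending events with ts <= t."""
    # Sort once so the oldest unconsumed event is always at the left end.
    pending = deque(sorted(events, key=lambda e: e["ts"]))
    per_frame = []
    for t in frame_ts:
        consumed = []
        while pending and pending[0]["ts"] <= t:
            consumed.append(pending.popleft())
        per_frame.append(consumed)
    # Events later than the last frame stay in `pending` and are not labeled.
    return per_frame


if __name__ == "__main__":
    frames = [0.0, 0.1, 0.2, 0.3]
    events = [
        {"ts": 0.12, "kind": "key"},
        {"ts": 0.05, "kind": "left_click"},
        {"ts": 0.19, "kind": "scroll"},
    ]
    for t, evs in zip(frames, drain_events(frames, events)):
        print(t, [e["kind"] for e in evs])
```

Each event is consumed exactly once, by the first frame at or after it. A frame can receive several events (here the frame at 0.2 gets both the key press and the scroll); how such a frame is reduced to a single class label is a design decision this sketch leaves open.
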
IDM(s_t, s_{t+1}) -> a_t
| Component | Detail | Params |
|---|---|---|
| Backbone | CLIP ViT-B/32, fully frozen | ~150M frozen (full CLIP; the image encoder alone is ~88M) |
| Input | [e_t || e_{t+1}] in R^1024 | - |
| MLP Layer 1 | Linear(1024->512), ReLU, Dropout(0.3) | 524K |
| MLP Layer 2 | Linear(512->256), ReLU, Dropout(0.3) | 131K |
| MLP Layer 3 | Linear(256->8), softmax | 2K |
| Total Trainable | MLP head only | ~660K |
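
The parameter column can be checked with a short calculation. The sketch assumes every `Linear` layer includes a bias vector, which the table does not state; ReLU, Dropout, and softmax contribute no trainable parameters.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# (input width, output width) of the three MLP layers in the table
LAYERS = [(1024, 512), (512, 256), (256, 8)]

counts = [linear_params(n_in, n_out) for n_in, n_out in LAYERS]
total = sum(counts)

if __name__ == "__main__":
    print(counts)  # [524800, 131328, 2056]
    print(total)   # 658184
```

The total of 658,184 is consistent with the table's rounded figure of ~660K.
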
A small temporal transformer was considered, but concat + MLP is the cleaner first experiment. It is cheaper to train across multiple splits, easier to interpret, and likely sufficient given CLIP's representational strength in this structured setting.
| Model | Training Data | Evaluation |
|---|---|---|
| IDM1hr | 1 hour labeled | 30 min held-out session |
| IDM3hr | 3 hours labeled | 30 min held-out session |
| IDM5hr | 5 hours labeled | 30 min held-out session |
All 20 hours are labeled by the same capture-time pipeline. The smaller training splits are deliberate label-hiding conditions, not partially annotated data.
If the curve flattens at 1 hour, automatic labeling is highly efficient. If it continues improving through 5 hours, data scaling itself becomes the main result.
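
Comparing the three models requires one score per model. The timeline lists F1 tables as the evaluation deliverable; the sketch below computes per-class F1 and its unweighted mean. Macro averaging is an assumption here, chosen because it keeps a frequent class such as NONE from dominating the score, and the function names are illustrative.

```python
def per_class_f1(y_true, y_pred, num_classes=8):
    """F1 for each class ID, computed from true and predicted label lists."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class absent from both labels and predictions scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true, y_pred, num_classes=8):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / len(scores)
```

Evaluating each model on the same held-out session and plotting `macro_f1` against training hours gives the scaling curve described above.
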
| Period | Milestone | Deliverable |
|---|---|---|
| March 2026 | Finish DataCollector | Working pipeline: frames + JSON |
| Early April | Data collection | 20 hours of Finder sessions |
| Mid April | Train IDM1hr, IDM3hr, IDM5hr | Training curves and checkpoints |
| Late April | Evaluation | F1 tables and confusion matrices |
| May | Paper writeup | Final submission |
These constraints are intentional. They keep the experiment manageable, isolate the data-scaling question, and make the results interpretable within one semester.
@misc{desikings2026guiidm,
title={How Much Labeled Data Does a GUI Agent Need? Data Scaling Analysis for Inverse Dynamics Models on macOS Workflows},
author={Desi Kings},
year={2026},
note={University of Minnesota NLP Class Project Proposal, Spring 2026}
}