OpenAI is taking a new approach to evaluating its next generation of artificial intelligence systems. Instead of relying only on synthetic benchmarks or test prompts, the company is now asking third-party contractors to submit real assignments they have completed in actual jobs. The goal is to better understand how AI systems compare with human professionals when it comes to real, complex work.
According to internal documents and presentations reviewed by WIRED, this initiative involves collaboration with the training data firm Handshake AI. At the center of the effort is a simple idea: measure AI performance against authentic human output rather than idealized or simplified tasks.
In short, OpenAI is asking contractors to submit samples from previous projects to help assess how well its AI agents perform.
Building a Human Benchmark for AI Models
This project appears to be part of OpenAI’s broader push to establish a “human baseline” across different professions. By collecting examples of how real people complete real tasks, OpenAI can directly compare those outcomes with AI-generated results.
Back in September, the company introduced a new evaluation framework designed to test its AI models against human experts from multiple industries. OpenAI has repeatedly stated that outperforming humans in economically valuable work is one of its core indicators of progress toward artificial general intelligence (AGI).
Internal materials emphasize that the company wants tasks that mirror everyday professional responsibilities—not artificial test cases created solely for training purposes.
What Contractors Are Being Asked to Submit
Contractors involved in the program are instructed to describe tasks they have completed either in their current roles or in previous jobs. More importantly, they are asked to upload the actual work product, not a summary or explanation.
Examples of acceptable uploads include Word documents, PDFs, spreadsheets, slide decks, images, or even full software repositories. Each submission must represent a concrete deliverable that was created in response to a real request from a manager, colleague, or client.
While OpenAI prefers authentic work, contractors are also allowed to submit carefully fabricated examples—provided they realistically demonstrate how the person would respond in a genuine professional scenario.
Breaking Down a “Real-World Task”
According to OpenAI’s internal guidance, every real-world task has two core components:
- The task request – what the worker was asked to do
- The task deliverable – the finished output produced in response
OpenAI repeatedly stresses that submissions should reflect work the contractor has actually done on the job. The emphasis on realism suggests the company is trying to capture the messy, nuanced nature of professional work—something AI systems often struggle with.
A Real Example from the Program
One illustrative example featured in OpenAI’s presentation involves a senior lifestyle manager at a luxury concierge firm catering to ultra-high-net-worth clients. The task request was to create a short, two-page PDF outlining a seven-day yacht itinerary in the Bahamas for a family visiting the region for the first time.
The contractor’s submission, labeled as the “experienced human deliverable,” consisted of a real itinerary previously created for an actual client. This type of example allows OpenAI to directly compare human planning, creativity, and attention to detail with AI-generated alternatives.
Safeguards Around Confidential Information
OpenAI instructs contractors to remove or anonymize sensitive material before submitting any files. This includes personal data, proprietary information, and confidential business details. The company explicitly warns against sharing internal strategies, unreleased products, or nonpublic corporate data.
One internal document references a ChatGPT-based tool called “Superstar Scrubbing,” which provides guidance on how to strip files of sensitive information before upload.
Despite these safeguards, concerns remain.
Legal Risks and Expert Concerns
Evan Brown, an intellectual property attorney at Neal & McDevitt, told WIRED that AI companies collecting large volumes of contractor-supplied data could face serious legal exposure. Even when documents are scrubbed, contractors may still risk violating nondisclosure agreements or exposing trade secrets from previous employers.
According to Brown, the burden placed on contractors to decide what qualifies as confidential is risky. If sensitive information slips through, AI labs could find themselves facing trade secret misappropriation claims.
In his view, the process requires an enormous amount of trust—trust that may not always be justified.
The Growing Industry Behind AI Training Data
These documents also highlight a broader trend across the AI sector. Companies like Anthropic and Google are increasingly relying on skilled contractors to produce high-quality training data for AI agents designed to automate professional work.
For years, AI labs have partnered with contracting firms such as Surge, Mercor, and Scale AI. However, as models become more advanced, the demand for higher-quality, domain-specific data has surged—along with the price of acquiring it.
Handshake AI was reportedly valued at $3.5 billion in 2022, while Surge has been linked to a $25 billion valuation during fundraising discussions, underscoring how lucrative this niche has become.
Exploring Other Data Sources
OpenAI has also explored alternative ways to access real company data. One individual involved in liquidating assets from defunct companies told WIRED that OpenAI inquired about purchasing internal documents and emails—provided personal information could be removed.
Ultimately, the source declined to proceed, citing concerns about whether sensitive data could ever be fully scrubbed.
Final Thoughts
By grounding AI evaluation in authentic human work, OpenAI is signaling a shift toward more realistic and demanding benchmarks. While this approach may lead to more capable AI agents, it also raises serious ethical and legal questions—especially around data ownership, confidentiality, and contractor responsibility.