The insider's benchmark of the data annotation providers that actually train frontier models in 2026: who leads, who collapsed, and what each one really costs.
Written by Yuma Heymans (@yumahey), founder of HeroHunt.ai. He built an AI Recruiter that sources expert talent across more than a billion profiles, and the single hardest problem in this market, recruiting scarce PhDs and domain specialists fast enough, is the same problem he solves for a living.
A single $14.3 billion check rearranged the entire AI training-data industry in under a year. When Meta took a 49% stake in Scale AI in June 2025 and pulled founder Alexandr Wang into its superintelligence lab, the company that had defined data annotation for a decade lost Google, OpenAI, Microsoft and xAI as customers almost overnight, because no frontier lab wants to hand its unreleased data to a vendor half-owned by a direct competitor - CNBC. The throne did not stay empty. A bootstrapped company with roughly 110 employees, Surge AI, had already quietly passed Scale in revenue, and a 22-year-old's startup, Mercor, would hit a $10 billion valuation within months.
Here is the problem with almost every "top data annotators" list you will find: it is recycled vendor marketing, ranked by SEO budget rather than by what AI labs actually buy. The real market is opaque on purpose, its public rankings are inverted, and choosing the wrong partner does not just waste money. It can leak your model roadmap to a rival, bury your training run in data that labelers faked with ChatGPT, or attach your brand to a labor scandal in Nairobi. The stakes moved from procurement to strategy, and the names that matter in 2026 are not the names that mattered in 2023.
This guide is the benchmark that cuts through it. It maps how the data-annotation market actually works in 2026, the criteria labs really use to judge a vendor, the full story of the Scale AI collapse, a ranked breakdown of the top 10 providers for frontier model builders, the cheap bulk layer and its ethics reckoning, the specialists in medicine and physical AI, the new battleground of RL environments, and a practical framework for choosing. It draws on funding filings, lawsuits, leaked documents, breach disclosures and the rates contractors actually earn, not the glossy claims on vendor homepages.
Contents
- The 2026 Data Annotation Market: From Cheap Labels to Frontier Data
- The Benchmark: How AI Labs Actually Judge a Data Provider
- The Scale AI Earthquake: The Deal That Broke Neutrality
- The Top 10 Data Annotation Providers for AI Labs, Ranked
- The Bulk Layer and the Global South Reckoning
- The Specialists: Medical, Physical AI, and the Crypto Experiment
- RL Environments: The New Battleground for Human Data
- How to Choose: A Buyer's Framework for AI Teams
- The 2026 to 2027 Outlook: Agents in the Loop and the Expert Bottleneck
1. The 2026 Data Annotation Market: From Cheap Labels to Frontier Data
The most important thing to understand about data annotation in 2026 is that it split into two completely different businesses that happen to share a name. At the bottom sits commodity labeling, the repetitive drawing of bounding boxes and tagging of images that built the industry, work now worth a few dollars an hour and increasingly automated away. At the top sits frontier data, the work of getting PhD physicists, practicing lawyers and senior engineers to write reasoning traces, design evaluation rubrics and grade model outputs at $50 to $200 an hour. These two layers are converging in the same vendor pitch decks but they are different markets, with different economics, different geographies and different winners.
This split is recent. As late as 2022, data annotation mostly meant computer-vision work: drawing boxes around pedestrians for self-driving cars, the business that made Scale AI a multi-billion-dollar company and Appen a stock-market darling. The launch of ChatGPT inverted the field almost overnight, because large language models are trained and aligned not by labeling images but by humans writing, ranking and critiquing text, the work known as reinforcement learning from human feedback. Within two years the most valuable annotation task on earth went from "is this a stop sign" to "which of these two legal arguments is more sound, and why," and the providers that could not make that leap were left selling a commodity into a shrinking market.
The shift matters because the money has moved decisively upmarket. The pure training-dataset market is still modest in absolute terms, estimated at roughly $3.2 billion in 2025 and projected to reach about $16.3 billion by 2033 at a 22.6% compound rate - Grand View Research. But that figure dramatically understates reality, because the largest spend is now bespoke human-feedback work that does not show up in tidy market reports. Labelbox's CEO has said each leading lab now spends more than $1 billion a year on data, a claim echoed across the sector - Cognitive Revolution. When you add post-training, evaluations and reinforcement-learning data, the human-data economy is an order of magnitude larger than the "dataset" line item suggests.
This is the financial bar chart of the market, the one that explains why bootstrapped founders are becoming billionaires labeling data for chatbots.
AI Training Dataset Market Size (Reported Segment Only)
The reason this layer became so valuable is structural: data, not compute or algorithms, is now widely seen as the binding constraint on model quality. The old joke that data scientists spend 80% of their time preparing data rather than modeling turned out to be the whole story of frontier AI, and the labs that win are the ones with the best human data, not just the biggest clusters - CloudFactory. As public internet text runs thin, the marginal token of progress increasingly comes from a human expert producing something that does not exist anywhere online, which is exactly the work that commands premium rates. This is the sense in which expert human data has become the new oil of AI: a scarce, extractable resource whose owners can name their price, and whose supply, unlike crude, has to be recruited one specialist at a time - TIME.
The clearest evidence that this top layer carries software-like margins is the people getting rich from it. The data business minted a cluster of new billionaires in 2025 alone, from a bootstrapped founder who never raised a dollar to three former debate-team friends barely out of their teens, which only happens when revenue per employee runs into the millions and gross margins clear 50% - Inc.. That is also why the reported revenue figures in this market are so slippery: many vendors quote gross marketplace volume, the entire sum a customer pays before the contractor's share is removed, so a headline number can overstate true net revenue by a factor of three. Treat every billion-dollar claim in this guide as a direction of travel rather than an audited fact, and read the contractor-payout share alongside it.
It helps to picture the market as a stack, because each layer concentrates in different vendors, different countries and different price points. A team that maps each task to the right layer will out-buy one that sends everything to a single vendor and hopes for the best.
The practical takeaway from this opening map is that "data annotation provider" is no longer a useful category on its own. The vendor that labels your autonomous-vehicle LiDAR frames is almost never the right vendor to design a reinforcement-learning environment that teaches an agent to use a spreadsheet, and the pricing differs by a factor of fifty. The rest of this guide treats the top of the stack, the frontier-data and RLHF work that AI labs fight over, as the main event, while giving the bulk layer and the specialists their due. Before ranking anyone, though, you need to know what "good" even means here, because the benchmark is not what the vendors advertise.
2. The Benchmark: How AI Labs Actually Judge a Data Provider
The criteria that AI labs use to choose a data partner in 2026 bear almost no resemblance to the feature lists on vendor websites. Nobody at a frontier lab is comparing annotation-tool UIs. They are asking a much harder set of questions about trust, talent and security, and the answers separate the survivors from the casualties in this guide. Understanding these criteria is what lets you read past the marketing, because every controversy and every collapse in the sections that follow maps directly onto a vendor failing one of them.
The first and now dominant criterion is neutrality. A data vendor sees a lab's unreleased capabilities, its failure modes and the exact shape of its training priorities. That visibility is only acceptable if the vendor is not owned by, and not feeding, a competitor. This single criterion is what detonated Scale AI's business in 2025, and it is why "independent, not lab-owned" became the most repeated phrase in every rival's sales pitch. The second criterion is expert sourcing speed, the ability to recruit and vet credentialed specialists fast. When a lab needs 500 oncologists or derivatives traders this quarter, the vendor that can find, verify and onboard them in days wins, and the one that takes months loses the contract regardless of platform quality.
How vendors win on sourcing speed is itself a competitive frontier worth understanding. Mercor screens candidates through 20-minute AI video interviews, Micro1 accepts only the top 1% of applicants and runs an anti-cheat system that monitors gaze and browser tabs to enforce a minimum integrity score, and Handshake leans on a pre-verified academic credential graph that took twelve years to build - Sacra. These are not gimmicks. When a lab needs a specific cohort of derivatives traders or trauma surgeons within weeks, the vendor with the fastest, most trustworthy vetting funnel wins, which is why several of the leading data companies were literally founded as AI recruiters before they ever sold a labeled dataset.
The third criterion is raw data quality, which sounds obvious but is the dirty secret of the industry. Labs have learned that headcount and quality are inversely correlated above a certain scale, because mass crowds invite cheating. The fourth is security and confidentiality, which graduated from a checkbox to a board-level concern after a string of breaches. The fifth is throughput and reliability, the unglamorous ability to deliver large, consistent volumes on deadline without quality drift.
Quality, in particular, is where the romance of "human data" meets reality. Researchers estimated that between 33% and 46% of Amazon Mechanical Turk crowdworkers used large language models to complete a text task, meaning a meaningful share of supposedly human data was quietly machine-generated - arXiv. Leaked Scale AI documents from its Google Bard work described one program flooded with spam, gibberish and GPT-generated reasoning, with 800 spammers dumped into a single group and managers resorting to a public AI-detector to catch fakes - Futurism. This is why labs increasingly pay a 20-to-40x premium for verified experts: not because experts are smarter, but because a credentialed, well-paid, accountable contractor is far less likely to cheat, and at the frontier a single contaminated batch can poison an entire training run.
Security became the fifth criterion the hard way. In June 2025, Scale AI was found to have left more than 85 Google Docs publicly accessible, exposing Bard training manuals, Meta audio projects, xAI's confidentially-named "Project Xylophone" and contractor personal data - TechRepublic. Less than a year later, Mercor disclosed a supply-chain breach in which an estimated four terabytes of data, including contractor identity documents and the video interviews used to vet them, were allegedly exfiltrated, triggering five federal lawsuits within a week - Kaizen AI Lab. For a lab, a vendor breach is not an IT problem, it is the potential leak of an unreleased model's training methodology. A sixth criterion has quietly hardened into a procurement requirement: labor ethics and transparency. After the Kenya content-moderation scandals and a string of worker-misclassification lawsuits against Surge, Mercor and Scale, a lab's choice of data vendor is now a reputational and legal exposure, not just a quality decision. Enterprise and government buyers increasingly ask where the work is done, what the workers are paid, and whether the vendor has been sued over it, because the answer can surface in a headline or a courtroom with the lab's name attached - Independent Contractor Compliance. The vendors that treat this as a feature, paying well, disclosing conditions and avoiding the most harmful work, are turning it into a competitive advantage.
When you read the rankings below, score each vendor silently against these criteria, because that is exactly how their actual customers do it.
3. The Scale AI Earthquake: The Deal That Broke Neutrality
The defining event of the modern data-annotation market was self-inflicted, and it is worth telling in full because everything else in 2026 is a consequence of it. On June 12, 2025, Meta announced it would pay roughly $14.3 billion for a 49% non-voting stake in Scale AI, valuing the company at about $29 billion and bringing founder Alexandr Wang into Meta's new superintelligence lab - CNBC. On paper it looked like a triumph, the richest validation any data company had ever received. In practice it severed the one thing Scale could not afford to lose, which was the trust of every lab that competed with Meta.
The logic of the exodus was immediate and unforgiving. Scale's value to a customer depended on it being a neutral middleman that could see a lab's data without that data informing a rival. With Wang sitting on Scale's board and simultaneously running Meta's lab, that neutrality evaporated. Google, which had been Scale's largest customer at roughly $150 million in 2024, began planning its exit within days of the announcement - CNBC. OpenAI confirmed it was winding down work it had already been diversifying away, and Microsoft and xAI pulled back over the same confidentiality fears - Fortune.
This is Alexandr Wang, the founder whose move to Meta triggered the realignment, and whose deal is now studied as a cautionary tale in vendor neutrality.
The $14.3 billion man

The damage was not only about neutrality, and this is the part Scale's competitors most enjoy repeating. Reporting indicated that Meta's own researchers viewed Scale's data as lower quality and preferred working with Surge and Mercor, which means the company lost customers over neutrality and lost the quality argument at the same time - TechCrunch. In July 2025 Scale cut about 200 full-time employees, roughly 14% of staff, and ended work with around 500 contractors, while new CEO Jason Droege admitted in an internal memo that the company had ramped up its generative-AI capacity too quickly and built excessive bureaucracy - TechRepublic.
The redistribution was not subtle. Within months, Surge AI was reported in talks to raise at $25 billion or more, Mercor's run-rate multiplied, Handshake tripled its lab contracts after launching its AI unit, and Labelbox's CEO publicly told defecting customers his door was open - Computerworld. The turbulence reached even the labs themselves: in September 2025 xAI laid off about 500 workers, roughly a third of its data-annotation team, as it pivoted from generalist tutors to a specialist program ten times larger, a sign that the whole industry was simultaneously moving upmarket toward expert data - TechCrunch.
The durable lesson, and the reason this episode reshaped how every lab now thinks about data, is that vertical integration of a neutral supplier destroys the very neutrality that made it valuable. Analysts framed the core fear precisely: with Wang on Scale's board and inside Meta's lab, rivals worried their training data, labeling logic and product roadmaps could indirectly inform a competitor, and some responded by moving annotation in-house entirely - Everest Group. No amount of reassurance from Scale's new management could fully undo that structural conflict, because the conflict was baked into the cap table rather than the contract.
Scale's response was to pivot toward the one category insulated from the neutrality problem: government and defense, where being aligned with American interests is a feature rather than a liability. It won a $99 million Army contract, became prime on the Pentagon's "Thunderforge" agentic-warfare prototype, and saw the ceiling on a production contract raised five-fold to $500 million in 2026 - GovConWire. Management insists 2025 was its strongest financial year ever, with revenue reaching around $2 billion and more than $1 billion in new business signed, and the CFO has publicly fought the "zombie company" label - Sacra. Whether that pivot works long-term is an open question, but the immediate effect on the commercial market was a multi-billion-dollar redistribution to independents, which is precisely what the next section ranks.
4. The Top 10 Data Annotation Providers for AI Labs, Ranked
This ranking judges providers on the five criteria from Section 2, weighted for what frontier and applied AI labs actually need in 2026: neutrality, expert sourcing, quality, security and reliability. It deliberately favors the providers building the high-value frontier-data and RLHF layer over pure bulk labelers, because that is where lab spend has migrated. Pricing here reflects what the work actually costs, drawn from funding disclosures and reported contractor rates rather than published rate cards, most of which do not exist. Treat the order as a considered argument, not gospel, because the right vendor depends heavily on the task.
One caveat on the revenue figures that recur below, because they are routinely misreported. Several of these companies cite "ARR" that is actually gross marketplace volume, the total a customer pays before the contractor's cut is removed. Mercor's CEO has confirmed this gross-accounting practice is standard across the sector, which means a headline like "$1.5 billion" can correspond to net revenue closer to a few hundred million once contractor payouts are subtracted - Sacra. The chart below compares the leaders on reported revenue or run-rate, with that gross-versus-net distinction flagged in the prose rather than hidden in the bars.
Reported Revenue or Run-Rate of Leading Providers (2025-2026)
A note on what is not in the top ten. Micro1, a fast-rising Scale competitor that crossed $100 million in annualized revenue in 2025 and counts Microsoft as a client, narrowly misses the list but is a credible challenger built on the same expert-vetting model as Mercor - TechCrunch. So do the platform-first players like SuperAnnotate and the vertical specialists like Centaur and iMerit, all covered in later sections. The ten below are the providers a frontier or applied AI lab is most likely to shortlist for high-value human data in 2026, ranked for that use case specifically rather than for raw size or breadth.
1. Surge AI
Surge AI is the new center of gravity in frontier data, and the most quietly remarkable company in this guide. Founded in 2020 by Edwin Chen, an MIT-trained engineer who previously built machine-learning systems at Google, Facebook and Twitter, it bootstrapped to more than $1 billion in revenue with around 110 employees and no outside capital for five years, reportedly overtaking Scale's $870 million in 2024 to become the highest-grossing data labeler in the world - Inc.. Its model is the opposite of mass crowd labeling: hours-long tasks performed by credentialed experts, with real-time dashboards tracking inter-annotator agreement and per-worker trust scores.
The clearest signal of its position is its customer list. Surge is widely reported to be Anthropic's core human-feedback partner for training Claude, and it counts OpenAI, Google, Microsoft and Meta among its roughly twelve frontier-lab accounts - Sacra. In mid-2025 it opened its first external raise, seeking around $1 billion at a valuation of at least $25 billion, with Andreessen Horowitz, Warburg Pincus and TPG reported in talks - Bloomberg. The single best primary-source explanation of why it won comes from Chen himself, who argues that quality human data, not headcount or leaderboard scores, is what actually advances models.
The $1B AI company training ChatGPT, Claude & Gemini | Edwin Chen
The honest critique is about labor and opacity, not capability. A May 2025 class action by the Clarkson Law Firm alleges Surge deliberately misclassified its annotators as contractors, denying overtime and minimum wage to "tens of thousands of Californians," and a separate July 2025 incident saw an internal AI-safety guidelines document left publicly exposed - Clarkson Law Firm. Its consumer recruitment brand, DataAnnotation.tech, draws steady complaints about sudden account deactivations and erased balances. Best for: frontier labs that need the highest-quality RLHF and evaluation data and will pay for it, and that value a vendor with no Big Tech ownership.
2. Mercor
Mercor is the fastest-scaling company in the entire sector and the purest expression of the expert-marketplace model. Founded in 2023 by three Thiel Fellows in their early twenties, Brendan Foody, Adarsh Hiremath and Surya Midha, it began as an AI recruiting platform that scored candidates through 20-minute video interviews, then pivoted that vetting engine into a labor supplier for AI labs - TechCrunch. Its October 2025 Series C raised $350 million at a $10 billion valuation, a five-fold jump from its $2 billion round eight months earlier, making its founders among the youngest self-made billionaires on record - TechCrunch.
The scale of the operation is genuinely staggering for a three-year-old company. Mercor pays out more than $2 million per day to a network of over 30,000 weekly-active experts drawn from a 300,000-person pool, with average pay around $85 an hour and senior specialists like ex-bankers and physicians earning $200 or more - Sacra. Its pitch is precise: labs hire former employees of the industries they want to automate, because the companies that own that data will never hand it over directly. Clients include OpenAI, Anthropic and Meta.
Mercor's founding team

The risks are real and recent. In April 2026 Mercor disclosed a supply-chain breach via a compromised open-source package, with an extortion group claiming roughly four terabytes of data including contractor IDs and vetting videos, after which Meta paused its work with the company and at least five class actions were filed - Fortune. Mercor also faces worker-misclassification suits and was sued by Scale AI in 2025 over alleged trade-secret theft. Buyers should note that its headline run-rate is gross marketplace volume, so true net revenue after a roughly 30% to 40% take rate is materially smaller than the billion-dollar figure suggests - TechCrunch. Foody's combative public style, including a June 2026 broadside against Sequoia over valuation tactics, is part of the package. Best for: labs that need to spin up large, specialized expert teams quickly across many domains, and can tolerate a young vendor still hardening its security and compliance.
3. Handshake AI
Handshake is the most elegant pivot in the market, because it already owned the asset everyone else was renting. Founded in 2013 as the dominant US college recruiting network, with 20 million students and more than 500,000 verified PhDs across 1,500 universities, it launched Handshake AI in January 2025 to sell that credentialed network directly to labs as expert annotators - Sacra. The insight was sharp: Scale and Mercor had been sourcing PhD annotators from Handshake, so Handshake simply cut out the middleman and kept the margin.
The growth has been explosive. Handshake AI reportedly serves eight top AI labs including OpenAI, scaled from zero to roughly $300 million in annualized revenue by the end of 2025, and Sacra estimates gross annualized revenue passed $1 billion by April 2026, with experts paid $100 to $125 an hour - TechCrunch. Its moat is the verified-credential layer: it owns the academic identity graph that rivals must reconstruct from scratch, and in January 2026 it acqui-hired the data-quality startup Cleanlab to deepen that advantage. To fund the pivot, it cut about 96 roles, roughly 15% of its workforce.
The blemish is a contractor-pay scandal that surfaced in early 2026, when several workers on OpenAI projects had their accounts suspended without appeal or payment, with Handshake citing credential discrepancies and abnormal task times - AOL. It also carries the structural strain of running two very different businesses, a stalling legacy career-software product and a hypergrowth data marketplace, at once. Best for: labs that need verified, credentialed academic experts at scale and value provenance over the lowest possible rate.
4. Turing
Turing is the coding-and-reasoning data specialist, the vendor most associated with teaching models to write software. Founded in 2018 by Jonathan Siddharth and Vijay Krishnan as a remote-developer hiring marketplace, it repurposed its network of more than four million vetted engineers into a supplier of code, reasoning traces and agentic data for frontier labs - Sacra. In March 2025 it raised $111 million at a $2.2 billion valuation, led by Malaysia's sovereign wealth fund, and is described as a key coding-data provider for OpenAI - TechCrunch.
Turing's revenue reached roughly $350 million by the end of 2025, and unlike most peers it is profitable, having repositioned its entire brand around frontier-lab data under the banner "Turing AGI Advancement" - The Twenty Minute VC. In April 2026 it launched Turing Frontier, a platform connecting labs with elite US-based experts across software, finance, law and medicine. The vendor's coding heritage gives it genuine depth in the most commercially valuable model capability of 2026, which is agentic software engineering.
The critique is twofold. The legacy staffing business left a trail of Glassdoor and Blind complaints about repeated stealth layoffs and thin contractor benefits, and structurally Sacra warns of a "platform squeeze" as labs partner directly with consultancies and AI compresses staffing economics - Sacra. Lab concentration is the standing risk: a heavy dependence on a few large accounts means one churned contract reshapes the financials. Best for: labs prioritizing code, reasoning and agentic software data, especially those wanting elite US-based engineering talent.
5. Scale AI
Scale AI remains one of the largest and most capable data operations on earth, which is why it ranks here despite the 2025 collapse described in Section 3. It still runs the deepest contractor supply infrastructure in the industry through its Outlier and Remotasks platforms, operates the respected SEAL evaluation leaderboards that ran more than 450 evaluations across 50-plus models in 2025, and co-created the "Humanity's Last Exam" benchmark - Sacra. For raw scale, government work and computer-vision and autonomous-vehicle data, it is still formidable.
Its annotation platform, Scale Data Engine, remains a genuinely strong product for sensor and multimodal data, the kind of LiDAR and 3D labeling that built the company's autonomous-vehicle business.
Scale's Data Engine

The reason it is not ranked higher is the neutrality problem, which is permanent rather than temporary for any lab that competes with Meta. The Meta stake makes Scale a difficult choice for OpenAI, Google or Anthropic regardless of how good the data is, and its pivot toward defense, while shrewd, signals where it can still win rather than where it is preferred - GovConWire. Its legacy of labor controversies, including a US Department of Labor investigation and lawsuits over pay and psychological harm, adds reputational weight - TechCrunch. Best for: government and defense programs, autonomous-vehicle and computer-vision data, and Meta-aligned work where neutrality is not a concern.
6. Labelbox
Labelbox repositioned itself as the neutral "data factory," and the timing was perfect. Founded in 2018 by Manu Sharma, it built a respected annotation and model-evaluation platform, then bundled it with Alignerr, its managed network of expert annotators, to sell end-to-end reinforcement-learning data directly to labs - Cognitive Revolution. When the Scale exodus hit, Sharma publicly courted the defectors and told Reuters he expected hundreds of millions in new revenue, positioning Labelbox explicitly as the independent alternative with no Big Tech ownership - Nasdaq.
Its differentiator is vertical integration plus AI-powered expert sourcing, including a claim of more than 2,000 AI-conducted candidate interviews per day, and in February 2026 it acquired the sales-automation startup Upcraft to scale that recruiting further - PR Newswire. The platform-plus-network model appeals to labs that want both tooling and managed delivery from one vendor.
The reality check is that Alignerr has one of the most divided worker reputations in the sector, scoring 2.1 out of 5 on Glassdoor, with widespread reports of workers ejected from the platform without warning, withheld pay despite strong quality scores and long unpaid evaluation phases - Breaking Even. Its disclosed revenue, in the tens of millions, is small relative to Surge, and its specific lab relationships are unconfirmed, so the "hundreds of millions" guidance remains unverified. Best for: teams wanting an integrated platform and managed expert network from a vendor with no lab ownership, especially mid-sized labs and enterprises.
7. Invisible Technologies
Invisible Technologies has the deepest pedigree of any independent here: it helped OpenAI fine-tune the model behind ChatGPT's original launch. Founded in 2015 by Francis Pedraza, it blends RLHF and model-evaluation services with an "operations-as-a-service" automation business, and in September 2025 it raised $100 million at a valuation above $2 billion - SiliconAngle. Its standout trait is profitability: it reported $134 million in 2024 revenue, up 123% year over year, with roughly 11% EBITDA margins, a rare combination in a sector that mostly burns cash - Sacra.
The dual model is the strategic bet. Invisible sells both "train the model" and "run the operation," giving it an enterprise-automation market far larger than pure labeling, and its OpenAI provenance is a credibility moat it has parlayed into work with Amazon, Microsoft and Cohere. Its distinctive partnership culture, where workers hold equity and pay is linked to company profit, doubles as a recruiting tool and a labor model.
That same profit-linked pay model is the main critique, because it shifts business risk onto the 3,000-plus distributed workers, who only receive full compensation when the company's profits grow. Leadership churn at the top, with a new CEO installed in January 2025 as founder Pedraza moved to chairman, signals an operating model still evolving under investor pressure - Sacra. Best for: labs and enterprises that want RLHF data plus broader process automation from a profitable, established independent.
8. Toloka
Toloka is the independent that solved its own provenance problem and attracted a marquee backer. Founded in 2014 as Yandex's internal crowdsourcing platform, it spun out, sold its Russian operations to local investors in 2024 and re-domiciled in Amsterdam, then in May 2025 raised $72 million in a round led by Jeff Bezos's personal investment firm, with Shopify's Mikhail Parakhin joining as executive chairman - BusinessWire. That clean break from its Russian roots is what made it usable by Western labs like Anthropic.
Its genuine differentiator is range: it runs a mature, large-scale crowdsourcing engine inherited from Yandex alongside a vetted expert layer, so it can flex from cheap-and-broad microtasks to expensive-and-deep RLHF and agentic-task work. Named clients include Amazon, Microsoft, Anthropic and the coding-model lab poolside, and its independence from any single lab is a deliberate selling point in the post-Scale environment - Wikipedia.
The honest framing is scale and hype. Toloka's audited revenue was a modest $26.4 million in 2024, growing 138%, which makes it roughly an order of magnitude smaller than Surge or Mercor, and the multi-billion-dollar valuations floating around aggregator sites are unverified scraper artifacts that contradict its parent's filings - BusinessWire. Its crowdsourcing roots also carry the usual microtask quality and pay concerns. Best for: labs wanting a credible independent that spans both bulk crowdsourcing and expert RLHF, with strong backing and no lab ownership.
9. Snorkel AI
Snorkel AI is the technically distinctive entry, the one vendor whose core is software rather than labor. Spun out of the Stanford AI Lab in 2019 by Alexander Ratner and Christopher Re, it pioneered programmatic labeling, where code and expert-written heuristics generate and de-noise labels at scale instead of brute-force human clicking, and in May 2025 it raised $100 million at a $1.3 billion valuation - BusinessWire. It has since layered an expert network on top, branded Expert Data-as-a-Service, drawing on MS and PhD specialists in fields like oncology, aerospace and law.
Its frontier credibility comes from named partners Google and Anthropic, the latter having publicly endorsed working with Snorkel on alignment data, plus enterprise and government users including the Bank of New York and the US Air Force - Benzatine. The programmatic approach is a real moat for situations where labeling logic can be encoded once and applied to millions of examples, which is far cheaper than per-item human labor.
The wrinkle is that the original software thesis appears to have underperformed. In September 2025 Snorkel laid off 13% of staff, concentrated in software engineering, explicitly to redirect toward the lower-margin, services-heavy expert-data model, and its $1.3 billion mark is widely read as flat-to-down versus its 2021 peak - AIM Media House. It now competes directly with Surge and Mercor on a field they dominate. Best for: labs and enterprises with structured, encodable labeling problems, or those wanting programmatic efficiency combined with specialized expert data.
10. Prolific
Prolific earns the final spot as the quality-and-evaluation specialist, the vendor labs use when they need verified, representative humans rather than the cheapest possible labelers. Founded in 2014 at Oxford as an academic-research participant platform, it repositioned around AI human data, model evaluation, red-teaming and safety testing, and reached an estimated $350 million in annualized revenue by April 2026 on only about $33 million ever raised, an unusual capital efficiency - Sacra. Its pool of more than 200,000 vetted participants across 40-plus countries is built for demographic representativeness, which matters for bias and safety work.
Its 2025 flagship is HUMAINE, a human-centered model leaderboard that used roughly 27,000 evaluators and over 100,000 multi-turn comparisons to rank 29 frontier models, with Gemini 3 Pro on top - Prolific. The differentiator is data integrity: aggressive anti-bot and anti-AI safeguards and a "100% Human Guarantee," a direct answer to the MTurk-cheating problem that plagues the rest of the sector. CEO Phelim Bradley has been publicly vocal against AI agents polluting research data, which is the platform's existential fight.
The limitation is that Prolific's participant rewards, historically modest hourly equivalents, are far below the $100-plus rates that Mercor, Surge and Handshake pay, which limits its access to the most elite domain experts for the deepest technical RLHF - Sacra. Its frontier-lab positioning also outpaces its named-client disclosures. Best for: model evaluation, red-teaming, safety and bias testing, and any work where verified, representative real humans matter more than deep domain seniority.
5. The Bulk Layer and the Global South Reckoning
Beneath the expert marketplaces sits the layer that built this industry and now faces an ethics and economics reckoning: the bulk annotation workforce, concentrated in the Global South, that labels the repetitive data the frontier players no longer touch. This layer matters to AI labs for two reasons. First, plenty of training still requires it, especially in computer vision and multilingual data. Second, the labor conditions in this layer have become a genuine brand and governance risk, the kind that lands a lab's name in a human-rights lawsuit. Understanding it is not optional even for buyers who only purchase at the top of the stack.
The defining fact of this layer is its wage structure, which is the clearest illustration of how AI value is distributed around the world. The canonical example came from OpenAI's content-moderation work through the vendor Sama, where OpenAI paid roughly $12.50 an hour while Kenyan workers took home between $1.32 and $2.00 after the outsourcing firm's cut, for the traumatic work of labeling depictions of abuse and violence to make ChatGPT safer - TIME. Commodity labeling wages have roughly halved since 2022 as supply expanded and automation advanced, leaving the bottom of the market exposed to both price erosion and synthetic-data substitution.
Typical Hourly Pay by Annotation Tier (2026)
The human cost of this layer is documented and severe, and it is why the location of data work has become a strategic decision rather than a procurement detail. Of 144 Kenyan content moderators assessed in litigation against Meta and Samasource, 81% were diagnosed with severe PTSD, and a Kenyan court ruled in 2024 that the cases could proceed to trial - CNN. Scale AI's Remotasks platform abruptly deactivated workers in Kenya and Nigeria without explanation, with one Nigerian worker reportedly earning under a dollar for more than twenty hours, and an Oxford Fairwork audit scored Remotasks one out of ten - Rest of World. The short DW News report below puts faces to these numbers and is the counterpoint every buyer at the top of the stack should watch.
How big AI companies exploit data workers in Kenya | DW News
The legal reckoning is escalating in parallel. More than 180 former Meta and Samasource moderators are pursuing claims in Kenya seeking around $1.6 billion, a Kenyan appeals court ruled that Meta can be sued in the country, settlement talks collapsed, and the workers organized the African Content Moderators Union - Business & Human Rights Resource Centre. For AI labs, the significance is that liability is climbing the supply chain: courts are increasingly willing to look past the outsourcing firm to the brand that commissioned the work, which turns a cheap labeling contract into a contingent legal exposure that can outlast the dataset itself.
The legacy incumbents of this layer are being squeezed from both ends, and Appen is the textbook case. The 30-year-old Australian giant lost its anchor Google contract in 2024, worth around $82.8 million a year and roughly 30% of revenue, and its stock fell more than 94% from its 2020 peak - CNBC. It is staging a fragile turnaround in which generative-AI work has grown to roughly a third of revenue and China sales jumped 75% to about $103 million, but it remains loss-making and its recovery now leans on a geopolitically exposed Chinese business, a reminder that scale in commodity labeling is no defense when labs move upmarket and insource - Stocks Down Under. Sama, the original "ethical AI" brand, fared no better: after a Ray-Ban Meta-glasses scandal in which annotators reviewed intimate footage without subjects' consent, Meta terminated its contract and Sama laid off 1,108 Nairobi workers in April 2026 - Tech-ish. Even the impact-sourcing pioneers were not spared: CloudFactory, which built its brand on dignified digital work in Nepal and Kenya, ran two rounds of Nepal layoffs plus a roughly 12% global cut, leaving former staff who said they felt betrayed by a mission-driven employer - Rest of World. The lesson for any AI team is that the bulk layer is cheap on the invoice and expensive on the conscience, and that buyers are increasingly held accountable for who labels their data, where, and under what conditions.
6. The Specialists: Medical, Physical AI, and the Crypto Experiment
Outside the frontier-data scramble sits a band of specialists that win not on scale but on domain depth, and for the right task they outperform every generalist on this list. These vendors matter because the hardest data problems are vertical: a radiology model needs labels from people who can read scans, a robot needs sensor data annotated by people who understand physical space, and a medical-device algorithm needs annotations that will survive regulatory scrutiny. AI labs and applied-AI companies routinely pair a generalist for breadth with a specialist for the parts that actually require expertise.
In medicine, Centaur AI, formerly Centaur Labs, is the standout. Born out of MIT's Center for Collective Intelligence in 2017, it sources labels from a network of more than 50,000 vetted physicians and clinical professionals who compete in gamified accuracy contests through its DiagnosUs app, then aggregates their opinions into high-confidence medical labels - SignalFire. It works with about half of the top ten pharmaceutical companies and has contributed to multiple FDA-cleared algorithms, a regulatory track record that generalist crowds cannot easily replicate. iMerit plays a complementary role: founded in 2012 and profitable without raising since 2020, it focuses on high-context computer vision in medical imaging and autonomous vehicles, and in 2025 launched its Scholars network of advanced-degree experts, betting the same upmarket move as Surge and Mercor but from a regulated-vertical heritage - TechCrunch.
The fastest-rising specialty is physical AI, the data behind robots, drones and autonomous systems, and Encord is the clearest pure-play. Founded in 2020, it raised a $60 million Series C in February 2026 at roughly a $550 million valuation, repositioning from a computer-vision labeling tool into "the data layer for physical AI," with native handling of LiDAR, video and medical DICOM and customers including Woven by Toyota, Skydio and Zipline - PR Newswire. Its honest limitation is that it was built computer-vision-first, so its language-model and RLHF capabilities lag the GenAI-native platforms, and its cloud-only deployment is a non-starter for some sovereign buyers - Label Studio. Riding the same robotics wave is a slice of the bulk-services world too, including Centific, the NVIDIA-partnered "data foundry" rebuilt from the former Pactera EDGE, which raised a $60 million round in 2025 to serve big-tech model labs with multilingual, multimodal data at services scale - GeekWire.
A quieter option for labs that want to keep annotation in-house is to buy the tooling rather than the workforce. SuperAnnotate, consistently among the highest-rated labeling platforms on review sites, sells a seat-and-compute software model with a strong multimodal editor and lets customers bring and manage their own annotators, an approach that appeals to teams determined to keep data and labeler management under their own roof - Tracxn. The trade-off is that a platform captures far less of the exploding managed-data-services spend than a Surge or a Mercor, and the buyer inherits the hard part, recruiting and quality-controlling the experts, which loops straight back to the sourcing bottleneck that defines this entire market.
A more dubious experiment deserves a skeptical mention, because it shows up on aggregator lists and should be discounted. Sapien is a crypto-native "data foundry" on Coinbase's Base chain that pays a decentralized contributor network in its own token using a "Proof of Quality" staking mechanism - CoinDesk. The trouble is substance: its enterprise-client claims naming the likes of Alibaba and Toyota are unverified and read as marketing, no frontier lab uses it, and its token fell roughly 86% from its November 2025 peak, which directly undermines a pay model denominated in that token - Bitget. For AI labs evaluating real data partners, the practical takeaway across this whole band is that specialists earn their place by depth and regulatory rigor, and that a credible vertical track record matters far more than a buzzword-heavy positioning or a token chart.
7. RL Environments: The New Battleground for Human Data
The frontier of data work in 2026 is no longer labeling at all, it is building reinforcement-learning environments, and this is where the smartest vendors and the most money are now flowing. An environment is an interactive sandbox, often a clone of a real software application, where an AI agent attempts a task, gets graded against an expert-written rubric, and learns from the reward signal. As models saturate static benchmarks, environments are how labs teach agents to actually do multi-step work like using a spreadsheet, navigating a website or completing a coding task end to end. The crucial point for this guide is that environments are almost entirely human data: experts design the tasks, write the gold-standard solutions and set the reward criteria.
The spending behind this shift is enormous and concentrated. The Information reported that Anthropic discussed spending more than $1 billion on RL environments over a single year and works with more than a dozen vendors, while OpenAI bought hundreds of replicated user-interface "gyms" to train agents on - TechCrunch. The pricing is steep and revealing: a single UI gym can run around $20,000 per website, a complex application clone around $300,000, individual expert-authored tasks from $200 to $2,000, and exclusive deals command four to five times the non-exclusive rate - SemiAnalysis. This is human expertise priced like enterprise software, because that is effectively what it is.
The genuinely hard part of an environment is not the sandbox but the grading, and this is where human expertise concentrates. An agent will happily exploit a sloppy reward function, passing a test without doing the real task, so experts must write rubrics that are robust to this reward-hacking and then verify the model's work, which is precisely the skilled labor that does not automate. Surge's own CoreCraft benchmark, released in early 2026, found that frontier models including GPT-5.2 and Claude Opus 4.6 solved under 30% of its full-rubric tasks, a reminder that designing evaluations hard enough to still challenge the best models is itself expert work - Wikipedia. The field is also splitting into proprietary and open camps, with Prime Intellect feeding open environments into its community-trained INTELLECT-3 model while the labs guard their best environments as competitive secrets.
A new crop of specialists has emerged specifically for this work, and they are recruiting talent at frontier-lab salaries. Mechanize, founded by former Epoch AI researchers and backed by figures like Nat Friedman and Google's Jeff Dean, builds coding environments, works with Anthropic, pays its engineers $500,000 and reportedly reached a $750 million valuation - The Information. On the open-source side, Prime Intellect launched an Environments Hub in August 2025, backed by Andrej Karpathy and Founders Fund, positioning itself as the GitHub for RL environments - Prime Intellect. Meanwhile the established providers are all racing in: Surge, Mercor, Scale and Turing are each building environment and verifier capabilities, because it is the highest-value human-data work in existence.
The reason this matters for vendor selection is that it redefines what a "data annotation provider" must be able to do. Building a good environment requires sourcing software engineers who can clone an application, domain experts who can define correct behavior, and quality processes that catch subtle reward-hacking, which is a far more demanding talent problem than crowd labeling ever was. It also explains why the providers winning this layer, Surge and Mercor above all, are the same ones with the strongest expert-sourcing engines. The companies that can recruit the right experts fastest are the companies that will own the most valuable category of AI data, which leads directly to how you should choose a partner.
8. How to Choose: A Buyer's Framework for AI Teams
Choosing a data partner in 2026 comes down to matching the task to the right layer of the stack, then filtering the candidates through the five benchmark criteria. The single most common and expensive mistake is treating "data annotation" as one purchase, sending everything to one vendor, and either overpaying for bulk work or underpowering frontier work. The right approach is closer to building a portfolio: a different provider, or tier, for each kind of data, chosen deliberately. Start by classifying the work, because that determines the geography, the price and the shortlist more than anything else.
For frontier RLHF, reasoning data, evaluations and RL environments, the shortlist is short and the price is high. This is Surge, Mercor, Handshake and Turing territory, where you pay $85 to $200-plus per expert hour and you are buying access to credentialed specialists you could not recruit yourself in time. For neutral, platform-plus-network delivery you add Labelbox, Toloka and Invisible. For specialized verticals you reach for Centaur, iMerit or Encord, and for evaluation and red-teaming with representative humans you reach for Prolific. For genuine bulk volume in vision and multilingual data, the legacy and services players still have a role, with the ethics caveats from Section 5 firmly in mind. Run each candidate through the five questions: is it neutral, can it source experts fast, is the quality verified, is it secure, and can it deliver reliably at your volume.
The deeper truth running through every section of this guide is that the binding constraint is recruiting the experts, not labeling the data. Expert annotators cost twenty to forty times what crowd workers do, and the vendors winning this market, Surge with its trust-scored network, Mercor with its AI interviews, Handshake with its credential graph, all win on sourcing speed and verification rather than tooling - TIME. That reframes the build-versus-buy decision. Many labs are concluding that the capability they actually need is the ability to find, vet and assemble expert teams on demand, which is increasingly a talent-sourcing problem dressed up as a data problem.
This is also where the market is quietly converging with autonomous recruiting technology, because both are ultimately about finding scarce specialists at scale. Vendors like Mercor and Micro1 literally began as AI recruiters before pivoting to data, and Labelbox bought a recruiting-automation startup to feed its expert pipeline. Teams that want to build expert annotation or AI-tutor pools in-house, rather than rent them at a 30% to 40% marketplace markup, increasingly lean on autonomous sourcing tools to do it. HeroHunt.ai, for instance, runs an AI Recruiter that sources across more than a billion profiles and reaches out automatically, which is one way a smaller lab assembles a vetted pool of physicians or engineers without paying a marketplace's take rate on every hour. The honest trade-off is real: a marketplace gives you speed and pre-vetting at a premium, while sourcing your own pool gives you margin and control at the cost of building the recruiting muscle yourself.
A concrete portfolio makes this tangible. A mid-sized lab training a coding-and-reasoning model might retain Surge or Turing for its core RLHF and agentic-coding data, use Prolific to run independent safety and bias evaluations so the grader is not the same vendor as the labeler, lean on Encord for any vision or sensor data, and stand up its own small pool of domain experts for a proprietary vertical like finance or medicine where it does not want a marketplace to see the task at all. That last piece, the in-house pool, is where the marketplace markup bites hardest: paying a 30% to 40% take rate on every expert hour for years is far more expensive than recruiting and managing a stable specialist bench directly, which is why labs with steady, sensitive workloads increasingly build rather than rent - Startup Riders.
Two contract details separate sophisticated buyers from naive ones. The first is exclusivity: data and environments built exclusively for you cost several times more than shared work, but shared work can mean a competitor trains on the same examples, so the premium is really a question of how proprietary the capability needs to be - SemiAnalysis. The second is quality SLAs with teeth: inter-annotator agreement thresholds, expert-credential verification and audit rights, written into the contract rather than assumed, because the gap between a vendor's marketing and its delivered quality is exactly what the leaked Scale documents exposed. Buyers who specify and measure quality get it; buyers who trust the pitch deck get spam.
Whichever route you choose, model the fully loaded cost rather than the sticker rate. A marketplace's $85 hour already embeds its 30%-plus margin, an in-house expert carries recruiting, vetting and compliance overhead, and the cheapest bulk vendor can carry the highest reputational and quality cost of all. A provider that looks cheapest on a spreadsheet can quietly become the most expensive once a quality failure, a security breach or a labor lawsuit is priced in, which is the recurring theme of this entire benchmark. The teams that buy data well in 2026 are the ones that have stopped asking "who is the best data annotator" and started asking "which layer is this task, and who is the best at that layer." That question, applied task by task, is the entire framework.
9. The 2026 to 2027 Outlook: Agents in the Loop and the Expert Bottleneck
The forward-looking thesis is that the data-annotation market keeps splitting, with the bottom automated away and the top growing more valuable and more scarce. AI is increasingly in the labeling loop itself: foundation models now correctly pre-label 60% to 90% of common cases in many tasks, leaving humans to handle the hard edges, and model-assisted labeling, LLM-as-judge grading and synthetic data generation are compressing the cost of routine work toward zero - Shaip. The clear consequence is that commodity labeling jobs shrink while the premium on genuine human expertise rises, the same two-speed pattern that runs through every section of this guide.
Automation has hard limits that keep humans essential at the top, which is the source of the enduring expert bottleneck. LLM-as-judge systems remain unreliable, with frontier judge models exceeding 50% error on some bias tests and breaking on superficial formatting changes, so human calibration stays necessary for anything that matters - Adaline. Model feedback is roughly a hundred times cheaper than human preference data, on the order of a cent per prompt against one to ten dollars for a human, but using only synthetic data risks model collapse, which forces hybrid pipelines that always need a human expert layer - RLHF Book. The supply pressure is structural: Epoch AI projects that labs will exhaust high-quality public human text somewhere between 2028 and 2032, which is exactly why the marginal valuable token now comes from a paid expert rather than a scraped web page - Epoch AI. Ilya Sutskever captured the mood at NeurIPS 2024 when he declared that pre-training as we know it will end because we have but one internet, calling data the fossil fuel of AI - The Decoder.
That scarcity reframes the competitive game as a recruiting war, and it is already visible in the numbers. xAI laid off 500 generalist data annotators in September 2025 specifically to build a specialist tutor program ten times larger, a clean illustration of the migration from cheap-and-broad to expensive-and-deep - TechCrunch. The vendors that win the next two years will be the ones with the best engine for finding, verifying and retaining scarce experts, which is why so many of them are, at their core, recruiting companies. It is a point that founders building autonomous sourcing technology, including HeroHunt.ai's Yuma Heymans, have made for years: the future of this work is less about clicking labels and more about assembling the right humans, fast, wherever they are.
Consolidation is the other near-certainty. A market this hot, with dozens of vendors chasing the same dozen frontier-lab budgets, does not sustain all of them, and the acqui-hire of Cleanlab by Handshake and Labelbox's purchase of a recruiting-automation startup are early signs of the shakeout to come - TechCrunch. Expect the strongest sourcing engines to absorb the best tooling and the best niche expert networks, while the vendors that cannot clear the neutrality, security and quality bar get bought for their talent or fade. The category will likely settle into a handful of frontier-data leaders, a band of vertical specialists, and a commoditized bulk tail, mirroring the three-layer stack this guide opened with.
The practical forecast for anyone buying or building in this space is therefore straightforward. Expect bulk labeling to keep commoditizing and consolidating, with legacy incumbents either pivoting upmarket or fading. Expect the frontier-data and RL-environment layer to keep growing, keep paying professional rates, and keep being dominated by whoever recruits experts best. And expect neutrality, security and labor ethics to remain the criteria that make or break a vendor, because the Scale AI earthquake proved that a single structural mistake can vaporize a decade of dominance in under a year. The map of who labels AI's data was redrawn in 2025. In 2026, the winners are the ones who understand that data annotation became a talent business, and bought accordingly.
Conclusion: How to Read This Ranking for Your Own Stack
The decision framework that falls out of this benchmark is simpler than the turbulence suggests. Classify the work first, because the task determines the layer and the layer determines the shortlist. For frontier RLHF, reasoning and RL environments, the answer is Surge AI if you want the highest quality and a neutral, no-Big-Tech-ownership partner, Mercor if you need to scale specialized expert teams fastest, and Handshake or Turing if you need verified academics or coding-and-agentic data specifically. There is no cheap version of this layer, and pretending otherwise wastes a training cycle.
For everything below the frontier, match the vendor to the need rather than the brand. Labelbox, Toloka and Invisible give you neutral platform-plus-network delivery, Snorkel gives you programmatic efficiency for encodable problems, and Prolific gives you verified humans for evaluation and red-teaming. The specialists, Centaur in medicine, iMerit in regulated vision, Encord in physical AI, earn their place where domain depth and regulatory rigor are non-negotiable. And the bulk layer, where Appen, Sama and the services firms still operate, should be bought with eyes open to both its falling prices and its rising reputational cost, because in 2026 buyers are accountable for who labels their data and under what conditions.
The through-line is that data annotation stopped being a procurement line item and became a strategic capability built on recruiting scarce experts. The vendors winning are the ones with the best sourcing engines, the labs winning are the ones that match each task to the right layer, and the teams that build their own expert pools, increasingly with autonomous sourcing tools like HeroHunt.ai alongside the marketplaces, are the ones reclaiming margin and control. The Scale AI era is over. The expert-data era has begun, and the benchmark that matters now is not who can label the most, but who can find the right humans fastest.
This guide reflects the AI data-annotation landscape as of July 2026, drawing on funding disclosures, lawsuits, breach reports, leaked documents and reported contractor rates. Valuations, revenue figures and vendor relationships in this market shift quickly and several headline numbers are gross marketplace volume rather than net revenue, so verify the most current details before signing a contract.








