MIT Tested 41 AI Models on 11,000 Real Tasks: The 'Good Enough' Problem Is Real, and I See It Every Day

I use AI every single day. Not casually – deeply. I run over 100 WordPress plugins, manage multiple websites, handle customer support triage, write code, draft content, and coordinate a team across time zones. Claude Code is my co-pilot for development. ChatGPT handles research. Gemini jumps in for specific tasks. AI is embedded into every layer of how I work.

So when MIT published a massive study testing 41 AI models across 11,000 real workplace tasks, I didn’t just skim the headline and move on. I read the data. I cross-referenced it with my own experience. And honestly? It confirmed something I’ve been feeling for months but couldn’t quite put into words.

AI is “good enough” – and that’s exactly the problem.

Not because “good enough” is bad. Sometimes good enough is exactly what you need. But the danger is in not knowing when good enough is genuinely sufficient and when it’s quietly eroding the quality of what you ship.

Let me walk you through what MIT actually found, what it means for people like us who build products and run businesses with AI, and the practical framework I use to make sure the “good enough” trap doesn’t eat my business alive.

What MIT Actually Found

The study is called “Crashing Waves vs Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks.” That’s a mouthful, but the methodology is what makes it remarkable.

They tested 41 different large language models – including versions of Claude, Gemini, and ChatGPT – on more than 11,000 primarily text-based tasks. These weren’t synthetic benchmarks or toy problems. They were pulled from actual Labor Department job listings. Real tasks that real people do in real jobs.

And here’s the part that matters most: the outputs weren’t graded by AI researchers in a lab. They were scored by humans with real on-the-job experience in those specific fields. A marketing manager scored marketing outputs. A legal professional scored legal drafts. An IT specialist evaluated technical writing. Real people, real standards, real stakes.

They used a 1-9 scoring scale where:

Score 7 = “Minimally sufficient” – the work is usable as-is, requires no edits to be acceptable
Score 9 = “Superior” – genuinely impressive, expert-level quality that exceeds expectations

The headline finding?

65% of all tasks scored “minimally sufficient” or higher across 41 AI models.

That sounds decent, right? Two-thirds of the time, AI produces work that’s good enough to use without editing. For anyone running a business, that seems like a massive productivity win.

Until you look deeper.

The 35% Nobody Talks About

Here’s where it gets uncomfortable. When the bar was raised to “superior” quality – a score of 9 – AI’s success rate told a very different story.

Never above 50%

No AI model exceeded 50% for “superior” quality (score 9) – across any task category

Not once. Across 41 models and 11,000 tasks. The ceiling for truly excellent AI output is stubbornly below half.

And for tasks requiring multiple steps, creativity, or precision? AI was more likely to fail than succeed.

Let me say that again because it’s important: for the work that actually matters – the complex, nuanced, multi-layered stuff that differentiates good businesses from mediocre ones – AI fails more often than it succeeds.

The researchers put it plainly:

“Widespread automation, particularly in domains with low tolerance for errors, may still be some distance away.”

This isn’t anti-AI fear-mongering. This is MIT with hard data saying what many of us feel daily: AI delivers usable work on roughly 65% of tasks. The other 35% is where your expertise, judgment, and reputation live.

Legal and IT tasks? Lower success rates. Construction and maintenance text tasks? Higher success. The pattern is clear: the more domain-specific knowledge and contextual judgment a task requires, the worse AI performs.

I See This Every Day – Here’s the Real Picture

I’m not sharing this study as an outside observer. I ship code with AI assistance every single day. I’ve been doing it long enough to know exactly where the 65/35 split plays out in real work. Let me give you specific examples.

Where AI Genuinely Crushes It (The 65%)

Scaffolding code. When I need a new REST API endpoint, a custom post type registration, a WooCommerce hook implementation, or boilerplate PHP structure, Claude Code generates it in seconds. Clean, functional, follows WordPress coding standards. Minimally sufficient? Often better than that. The time savings here are enormous – what used to take 30-45 minutes of typing out boilerplate now takes under 2 minutes.

Support triage. I process customer support tickets across dozens of products. AI can categorize issues, identify patterns, match tickets to known bugs, and draft initial responses faster than any human could manage. When you’re dealing with volume – 50, 100, 200 tickets – this is the difference between drowning and staying afloat.

Content first drafts. Blog post outlines, social media drafts, documentation structure, changelog summaries – AI gets the skeleton right almost every time. It saves hours of staring at blank pages. The bones are solid. You’re editing and adding voice, not starting from scratch.

Code reviews. AI catches obvious issues reliably – missing sanitization, incorrect hook usage, potential SQL injection, deprecated function calls, inconsistent naming conventions. It’s like having a tireless junior reviewer who never gets bored of checking the same patterns.

Data transformation. Reformatting data, converting between formats, generating SQL queries, parsing logs – AI handles this kind of structured-input-to-structured-output work beautifully. No creativity required, just accurate transformation.

Where AI Falls Flat (The 35%)

Architectural decisions. Should this be a custom database table or postmeta? Should we use the REST API or admin-ajax? Is this a custom block or a block pattern? These decisions require understanding the full context of the product, the user base, performance implications at scale, backwards compatibility requirements, and the three-year roadmap. AI gives you reasonable-sounding answers that can be catastrophically wrong. I’ve seen AI confidently recommend custom tables for data that should obviously be postmeta, and vice versa. The reasoning sounds perfect. The conclusion is wrong.

Client-facing copy. Support replies that AI drafts often sound technically correct but emotionally tone-deaf. They solve the problem while making the customer feel like they’re talking to a machine. That kills trust faster than a slow response time ever would. The customer doesn’t care if the solution is technically perfect if the message feels like it was generated by a bot.

Security reviews. AI finds the easy stuff – unsanitized inputs, missing nonce checks, direct database queries without preparation. But the subtle vulnerabilities? The business logic flaws, the race conditions, the privilege escalation paths that emerge from how multiple plugins interact with each other? Those require human intuition built from years of getting burned. I’ve seen AI mark code as “secure” that had a time-of-check-to-time-of-use vulnerability hiding in plain sight.

Creative strategy. “Write a marketing email for our new plugin” produces something generic every single time. The angle, the hook, the understanding of what will actually resonate with our specific audience versus every other WordPress plugin audience – that’s still entirely human. AI produces competent copy. Competent copy doesn’t sell.

Debugging complex interactions. When a bug spans three plugins, a theme, a hosting configuration, and a specific PHP version – AI can suggest possibilities, but it can’t replicate the detective work of tracing data through a live system. It doesn’t have the instinct that says “this smells like an object cache issue” before you’ve even looked at the cache.
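The time-of-check-to-time-of-use flaw mentioned under security reviews is easier to see in code. Here is a minimal Python sketch using a hypothetical coupon-redemption scenario (the `CouponStore` class and its methods are invented for illustration and are not taken from any real plugin). The unsafe method checks and updates in two separate steps. The safe one does both under a lock.

```python
import threading


class CouponStore:
    """In-memory stand-in for shared state such as a database row (hypothetical)."""

    def __init__(self, max_redemptions: int) -> None:
        self.max_redemptions = max_redemptions
        self.redemptions = 0
        self._lock = threading.Lock()

    def redeem_unsafe(self) -> bool:
        # The check and the update are separate steps, so two concurrent
        # requests can both pass the check before either records a redemption.
        if self.redemptions < self.max_redemptions:  # time of check
            self.redemptions += 1                    # time of use
            return True
        return False

    def redeem_safe(self) -> bool:
        # Holding the lock turns check-and-update into one atomic step.
        with self._lock:
            if self.redemptions < self.max_redemptions:
                self.redemptions += 1
                return True
            return False


def count_safe_successes(store: CouponStore, attempts: int = 50) -> int:
    """Fire concurrent redemption attempts and count how many succeeded."""
    results: list[bool] = []

    def worker() -> None:
        results.append(store.redeem_safe())

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)
```

Both methods look equally reasonable in a line-by-line review, which is why this class of bug slips through. In a real plugin the shared state lives in the database, so an in-process lock like this one would not be enough. The sketch only shows the shape of the problem.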

The “Good Enough” Trap for Founders and Agencies

Here’s why MIT’s findings matter beyond academia: “good enough” is a trap for anyone running a business.

When you’re moving fast – shipping features, responding to support tickets, publishing content, fixing bugs, managing a team – there’s an enormous temptation to accept AI output at face value. It looks right. It sounds right. It probably is right… 65% of the time.

But we’re not in a business where 65% is acceptable. If 35% of your customer support replies are slightly off in tone, you’re quietly building a reputation for mediocre service. If 35% of your code has subtle architectural issues, you’re accumulating technical debt that will crush you in 18 months. If 35% of your content feels generic, your audience will stop reading long before they tell you why.

This isn’t hypothetical. The data backs it up from multiple independent sources.

AI Code Creates More Issues Than Human Code

1.7x more issues

AI-generated code introduces 1.7x more issues than human-written code in production

A comprehensive CodeRabbit report on AI vs. human code generation found that AI-generated code introduces 1.7 times more issues than human-written code in production environments. AI-authored pull requests had 75% more logic and correctness errors than their human-written counterparts.

Think about what that means at scale. If you’re shipping 50 PRs a month and half of them are AI-assisted, that mix works out to roughly 35% more issues than you would get with all-human code, assuming the 1.7x rate applies evenly across PRs. The speed gains are real – but so are the quality costs.
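A quick way to sanity-check that kind of estimate is to blend the two rates. The sketch below assumes that the 1.7x figure applies uniformly per pull request and that human-written PRs are the baseline of 1.0. That is my simplification, not something the CodeRabbit report states. The function name is illustrative.

```python
def blended_issue_multiplier(ai_share: float, ai_multiplier: float = 1.7) -> float:
    """Issue rate of a mixed set of PRs relative to an all-human baseline of 1.0."""
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    human_share = 1.0 - ai_share
    # Human PRs count at the baseline rate, AI-assisted PRs at the higher one.
    return human_share * 1.0 + ai_share * ai_multiplier


if __name__ == "__main__":
    multiplier = blended_issue_multiplier(ai_share=0.5)
    print(f"Relative issue rate with half the PRs AI-assisted: {multiplier:.2f}x")
```

With half the PRs AI-assisted, the blended rate is 0.5 × 1.0 + 0.5 × 1.7 = 1.35, or about 35% more issues than the all-human baseline.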

Organizations are also reporting technical debt increases of 30-41% within just six months of widespread AI coding tool adoption. The code passes tests and ships. But the maintenance cost compounds silently. Six months later, your codebase is significantly harder to work with, and nobody can point to a single moment it went wrong.

IEEE Spectrum reported that AI coding quality has been showing signs of decline – tasks that might have taken five hours with AI assistance a year ago are now more commonly taking seven or eight hours. The tools are getting faster at generating code, but the time to verify, debug, and integrate that code is growing.

The Productivity Illusion

A 39-44% gap between perceived and actual productivity when using AI coding tools.

This one hit me hard. Studies have identified a 39-44% gap between perceived and actual productivity when developers use AI tools. Developers using AI tools felt approximately 20% faster. But measured task completion time was actually 19% slower.

Read that again. People feel faster while being slower.

Why? Because the act of generating code feels productive. You’re typing less. Code appears on screen faster. But the time you save generating is consumed – and then some – by reviewing, understanding, debugging, and integrating AI output that you didn’t fully write yourself.

I’ve caught myself in this exact trap. I’ll use Claude Code to generate a complex function, feel great about saving 20 minutes, and then spend 35 minutes debugging an edge case the AI didn’t consider. Net result: I’m 15 minutes behind where I would have been writing it myself. But I felt faster the whole time.
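The arithmetic behind both numbers is simple enough to write down. This is a minimal sketch with illustrative function names. It treats the "20% faster" and "19% slower" figures as plain percentage changes, which is a simplification on my part.

```python
def net_minutes_saved(generation_minutes_saved: float, extra_debug_minutes: float) -> float:
    """Positive means AI came out ahead; negative means it cost time overall."""
    return generation_minutes_saved - extra_debug_minutes


def perception_gap_points(felt_change_pct: float, measured_change_pct: float) -> float:
    """Gap, in percentage points, between how fast it felt and what was measured."""
    return felt_change_pct - measured_change_pct


if __name__ == "__main__":
    # The example from the text: 20 minutes saved generating, 35 spent debugging.
    print(net_minutes_saved(20, 35))
    # Felt about 20% faster, measured 19% slower (written here as -19).
    print(perception_gap_points(20, -19))
```

The first call gives -15 minutes, matching the example above. The second gives 39 points, the low end of the reported gap.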

The solution isn’t to stop using AI – it’s to be brutally honest about where the time actually goes.

The Trust Deficit

Only 17% of people who use AI at work trust it to run without human oversight.

According to the Connext Global 2026 AI Oversight Report, only 17% of people who actually use AI at work say it can run on its own with minimal human involvement.

The other 83% say reliability requires either light review (35%), dedicated human oversight (35%), or they simply aren’t sure (13%).

This matches what Reddit is actively debating – the original thread discussing MIT’s findings has 300+ upvotes and nearly 100 comments, with developers, founders, and practitioners all sharing variations of the same experience: AI is useful, but full autonomy is a fantasy right now.

One Reddit thread that hit 507 upvotes perfectly captures this tension: “I spent a week reading through AI-generated code that’s been in production for 8 months. It was fine. That was the problem.”

The code worked. It shipped. Customers used it. But “fine” code at scale becomes a maintenance nightmare that no one fully understands because no human fully wrote it. When something breaks – and something always breaks – you’re debugging code that you don’t have mental ownership of. You’re reading someone else’s logic, except that “someone else” is a statistical model that can’t explain its reasoning.

The Chef vs. The Monkey

I’ve always believed AI is like a kitchen tool. A professional-grade blender, a sous vide machine, a set of incredibly sharp knives. Give those tools to a trained chef, and you get remarkable dishes faster than ever. The chef knows which tool to reach for, when to use it, and – critically – when to put it down and use their hands instead.

Give those same tools to someone who can’t cook, and you get… well, something that looks like food but tastes like nothing. They’ll follow the recipe exactly as the tool suggests. The output will be technically correct. It just won’t be good.

MIT’s study validates this completely. The tool isn’t the differentiator. The human wielding it is.

When I use Claude Code to scaffold a new feature, I’m not blindly accepting the output. I know WordPress internals. I know where the hooks need to fire, how the data should flow through the system, what the edge cases are for multisite environments, how WooCommerce will interact with this at scale. AI accelerates my existing knowledge. It doesn’t replace it.

But I’ve also seen what happens when someone without that foundation uses the same tools. The output looks identical on the surface. It passes basic tests. The code is syntactically correct, properly indented, follows naming conventions. But the architectural choices are wrong. The performance implications are missed. The security holes are invisible until they’re exploited.

65% “good enough” in the hands of an expert becomes 90% with quick refinement. 65% “good enough” in the hands of a novice stays at 65% – and the 35% that’s wrong becomes invisible technical debt that compounds silently until something catastrophic breaks.

This is why the “AI will replace developers” narrative misses the point entirely. AI doesn’t replace expertise. It amplifies whatever level of expertise you already have. If you’re a chef, AI makes you faster. If you’re not, AI makes you dangerously overconfident.

Where AI Actually Shines – And Why That Matters

I don’t want to paint a doom-and-gloom picture because that’s not how I feel about AI at all. AI is genuinely the most important tool I’ve adopted in a decade. Let me be specific about where it delivers real, measurable value.

Volume work. When I need to process 200 support tickets, categorize 500 GitHub issues, or review 30 pull requests in a day, AI is the difference between possible and impossible. The per-item quality might be “minimally sufficient,” but at volume, that’s genuinely sufficient. Nobody needs a Pulitzer-winning commit message.

First drafts of everything. Blog posts, documentation, changelogs, release notes, social media – getting from zero to a working draft is where AI saves the most time. The blank page problem is real, and AI eliminates it entirely. I’m editing and refining instead of creating from nothing. That’s a fundamentally different (and faster) workflow.

Pattern-based coding. WordPress development involves enormous amounts of pattern repetition. Register a post type, register a taxonomy, add a settings page, create an admin menu, build a REST endpoint. AI handles these patterns flawlessly because they are patterns. No creativity required, just accurate reproduction with project-specific details filled in.

Learning acceleration. When I need to understand a new API, a library I haven’t used, or a WordPress function I’m unfamiliar with, AI explains it in context. Not generic documentation – explanation tailored to what I’m building. This is faster than StackOverflow, more contextual than official docs, and usually good enough to get me started.

Rubber duck debugging. Sometimes I just need to explain a problem to something that responds. AI is remarkably good at being a sounding board – asking clarifying questions, suggesting angles I haven’t considered, and occasionally pointing out the obvious thing I’m too close to see. The solution often comes from my own thinking during the conversation, not from the AI’s suggestion.

What I Actually Do Differently – A Practical Framework

After more than two years of daily AI use across every aspect of my business, here’s my practical framework. This isn’t theory – this is what I do every day.

1. Trust AI for Volume, Not Judgment

Let AI handle the sheer quantity of work – drafting, scaffolding, categorizing, summarizing, reformatting, transforming. These are the tasks where “minimally sufficient” is genuinely sufficient. The output doesn’t need to be brilliant. It needs to be correct and fast.

But the moment a task requires judgment – “should we do this?” rather than “how do we do this?” – I take over. AI can generate five approaches to solving a problem. Choosing which approach fits our specific situation, user base, and long-term strategy? That’s my job.

2. Never Skip Human Review for Client-Facing Anything

Support replies, marketing copy, documentation, UI text, error messages – if a human is going to read it, a human needs to review it before it ships. AI drafts are starting points, not finished products.

This is where the 35% lives. A support reply that’s technically correct but misses the customer’s emotional state. A marketing email that sounds like every other marketing email. Documentation that explains the “what” but not the “why.” These aren’t errors in the traditional sense – they pass every automated check. But they erode trust, engagement, and brand perception in ways that are hard to measure and harder to reverse.

3. Be Brutally Honest About Where You’re Cutting Corners

Every founder using AI is making implicit trade-offs. The question isn’t whether you’re cutting corners – it’s whether you know which corners you’re cutting and whether those are corners you can afford to cut.

I keep a running mental model: AI handles the first pass on code, I handle the architecture. AI drafts the support reply, I check the tone. AI outlines the blog post, I bring the experience and the angle. AI generates the test cases, I verify they actually cover the edge cases that matter.

The trade-offs I’m not willing to make: AI doesn’t make architectural decisions. AI doesn’t set product strategy. AI doesn’t handle sensitive customer conversations. AI doesn’t merge code without human review. These are my non-negotiables.

4. Invest in Review Infrastructure, Not Just Generation Speed

The industry’s obsession with generation speed is misguided. Yes, AI generates code 10x faster. But if it takes 3x longer to review, debug, and maintain, your net gain is much smaller than the demo video suggested.
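How much smaller depends on how your time splits between writing and reviewing. Here is a rough sketch. The 10x and 3x multipliers come from the paragraph above, while the minute values in the usage example are hypothetical numbers I picked for illustration.

```python
def net_speedup(
    write_minutes: float,
    review_minutes: float,
    generation_speedup: float = 10.0,
    review_slowdown: float = 3.0,
) -> float:
    """Overall speedup versus the no-AI baseline; values below 1.0 mean slower."""
    baseline = write_minutes + review_minutes
    # Writing shrinks by the generation speedup, review grows by the slowdown.
    with_ai = write_minutes / generation_speedup + review_minutes * review_slowdown
    return baseline / with_ai


if __name__ == "__main__":
    # Hypothetical task: 60 minutes of writing, 20 minutes of review without AI.
    print(f"{net_speedup(60, 20):.2f}x")
    # Same task, but review was already 30 minutes before AI.
    print(f"{net_speedup(60, 30):.2f}x")
```

In the first case the task drops from 80 minutes to 66, a net gain of about 1.2x rather than 10x. In the second it goes from 90 minutes to 96, so the AI-assisted path is slower overall.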

I’ve built my workflow around review – automated WordPress Coding Standards checks, PHPStan static analysis, browser testing with Playwright for every UI change, structured code review processes with clear checklists. The generation is essentially free now. The quality assurance is where the actual work happens.

Think of it like a factory: making the production line faster only helps if the quality inspection line can keep up. Otherwise, you’re just producing defects faster.

5. Track the Real Metric: Outcomes, Not Output

Remember that 39-44% productivity perception gap? The antidote is measuring outcomes instead of output.

I don’t track how many lines of code I generate per day. I track how many features ship without bugs. I don’t track how many support replies I draft. I track customer satisfaction scores. I don’t track how many blog posts I outline. I track which posts actually drive traffic and engagement.
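As a sketch of what that looks like in practice, here is one way to compute an outcome metric next to an output metric. The `ShippedFeature` record, its fields, and the sample data are all hypothetical. Any issue tracker export with a bug count per feature would do.

```python
from dataclasses import dataclass


@dataclass
class ShippedFeature:
    name: str
    lines_generated: int  # output: how much code was produced
    bugs_reported: int    # outcome: what happened after release


def total_lines(features: list[ShippedFeature]) -> int:
    """Output metric: raw volume of generated code."""
    return sum(feature.lines_generated for feature in features)


def clean_ship_rate(features: list[ShippedFeature]) -> float:
    """Outcome metric: share of features that shipped with no reported bugs."""
    if not features:
        return 0.0
    clean = sum(1 for feature in features if feature.bugs_reported == 0)
    return clean / len(features)


if __name__ == "__main__":
    month = [
        ShippedFeature("export-csv", lines_generated=1200, bugs_reported=0),
        ShippedFeature("bulk-edit", lines_generated=3400, bugs_reported=2),
        ShippedFeature("rest-endpoint", lines_generated=400, bugs_reported=0),
        ShippedFeature("settings-page", lines_generated=900, bugs_reported=0),
    ]
    print(total_lines(month), clean_ship_rate(month))
```

The largest feature by line count is also the one that shipped with bugs, which is exactly what a lines-per-day number hides.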

Output is vanity. Outcomes are sanity. AI dramatically increases output. Whether it increases outcomes depends entirely on how you use it.

6. Use AI’s Trajectory, Not Its Current State

This is where it gets exciting. MIT estimates that AI success rates are rising by about 11 percentage points a year.

80-95% by 2029

MIT projects AI will hit “minimally sufficient” on 80-95% of text tasks within 3 years

That trajectory matters enormously. The founders who will win aren’t the ones who either reject AI entirely or trust it blindly today. They’re the ones building systems that get better as AI gets better – with human oversight that can gradually loosen as the 35% gap shrinks.

My review processes today are designed to be dial-able. As AI improves, I can reduce the review intensity for categories where it’s proven reliable. Boilerplate code reviews? Already minimal. Architectural reviews? Still full human oversight. The dial turns slowly, based on evidence, not hype.
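One way to picture a dial like that is as a small policy table keyed by work category. The sketch below is an illustration, not my actual tooling. The level names, the sample-size threshold, and the defect-rate threshold are all made-up assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class ReviewLevel(IntEnum):
    SPOT_CHECK = 1
    STANDARD = 2
    FULL = 3


@dataclass
class CategoryRecord:
    level: ReviewLevel
    reviewed: int = 0        # AI-generated items reviewed so far
    defects_found: int = 0   # items where review caught a real problem
    locked: bool = False     # True for categories that never loosen


def next_level(
    record: CategoryRecord,
    min_samples: int = 50,
    max_defect_rate: float = 0.02,
) -> ReviewLevel:
    """Loosen review by one step only when there is enough clean evidence."""
    if record.locked or record.reviewed < min_samples:
        return record.level
    defect_rate = record.defects_found / record.reviewed
    if defect_rate > max_defect_rate:
        return record.level
    return ReviewLevel(max(record.level - 1, ReviewLevel.SPOT_CHECK))


if __name__ == "__main__":
    boilerplate = CategoryRecord(ReviewLevel.STANDARD, reviewed=100, defects_found=1)
    architecture = CategoryRecord(ReviewLevel.FULL, reviewed=100, locked=True)
    print(next_level(boilerplate).name, next_level(architecture).name)
```

The dial moves one step at a time and only on evidence, and locked categories such as architecture stay at full review no matter how clean the record looks.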

What MIT’s Study Means for the AI Hype Cycle

Let’s zoom out for a moment. The broader AI conversation right now is stuck between two extremes: “AI will replace everyone” and “AI is just a toy.” MIT’s data shows that both camps are wrong.

AI won’t replace everyone because roughly 35% of tasks still fall below “minimally sufficient,” and the hardest of them demand judgment, context, and domain expertise that current models can’t reliably provide. But AI is far from a toy – a 65% success rate across 11,000 real-world tasks is significant, and the trajectory suggests rapid improvement.

The MIT Sloan analysis frames it well: we’re in a transition period where the most important skill isn’t using AI or avoiding AI – it’s knowing which tasks to delegate and which to own.

For founders and agency owners, this means:

Stop chasing AI autonomy. Full autopilot isn’t coming this year. Build workflows that assume human oversight.
Start building expertise moats. If AI can handle 65% of the work in your industry, the remaining 35% becomes your competitive advantage. That’s where you need to be better than everyone else.
Invest in integration, not adoption. Most businesses have adopted AI tools. Few have integrated them into genuine workflows with feedback loops, quality checks, and measurable outcomes. Integration is where the value lives.
Be wary of the competence illusion. AI output looks competent. It often is competent. But competent and excellent are different things, and your customers, clients, and users can tell the difference even if they can’t articulate it.

A Personal Note on the “48% Vulnerability” Stat

One data point I haven’t mentioned yet: 48% of AI-generated code contains security vulnerabilities. Nearly half.

As someone who runs over 100 WordPress plugins that thousands of people trust with their websites, this number keeps me up at night. WordPress is the most targeted CMS on the internet. Every plugin is a potential attack surface. And if half the AI-generated code has security issues that aren’t immediately obvious, the math gets scary fast.

This is why I run multiple layers of security analysis on every piece of code – automated WPCS checks that flag security patterns, PHPStan for type safety and logic errors, manual review for business logic vulnerabilities, and periodic security audits of the full codebase.

AI helps me write security checks. But AI doesn’t replace security checks. If anything, the proliferation of AI-generated code makes security review more important than ever, not less.

The Honest Assessment

Here’s my honest take after more than two years of deep AI integration into every part of my business:

AI is the most important tool I’ve adopted in a decade. It genuinely makes me more productive in ways that matter. I ship more features, respond to customers faster, produce more content, and handle a workload that would have required twice the team five years ago.

But.

MIT’s study puts hard numbers on something every honest practitioner already knows: we’re in an awkward middle period. AI is too good to ignore and too unreliable to trust completely. This is the “minimally sufficient” era.

The founders who will thrive are the ones who:

Use AI aggressively for the 65% where it genuinely excels
Maintain deep expertise in the 35% where it fails
Build review systems that reliably catch the gap between sufficient and excellent
Stay honest about what they’re actually shipping versus what they think they’re shipping
Prepare for the trajectory – 80-95% by 2029 – without getting ahead of where we are today

The MIT researchers estimate we’ll hit 80-95% “good enough” within three years. That’s when the real transformation happens – when AI can handle not just the volume work but the nuanced, multi-step, creative work that currently defines the 35% gap.

Until then? AI is a force multiplier, not a replacement. And the difference between the two is where your business either thrives or quietly falls apart.

The question isn’t whether AI is good enough. It’s whether you’re good enough to know where it isn’t.

That’s the real skill of this era. Not prompt engineering. Not tool selection. Not adoption speed. The real skill is judgment – knowing when to trust the machine and when to trust yourself.

And that’s a skill that, at least for now, no AI model can teach you.
