12 comments

  • simonw 1 hour ago
    I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task you can set an agent that justifies working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the 9,200 html5lib-tests test suite: https://simonwillison.net/2025/Dec/15/porting-justhtml/

    It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.

    • ehnto 43 minutes ago
      I think you might be misunderstanding the article, actually: this is about AI solving tasks as measured by how long it takes a human to solve them. The AI could potentially solve a task much quicker; the use of "human time to solve" is an attempt to create a metric that reveals long-horizon complexity (as I understand it, anyway).

      It's interesting because like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as benchmarks suggest it should be, is captured in this metric.

    • dwohnitmok 41 minutes ago
      To be clear, this doesn't mean that it takes the AI > 4 hours to do the task. METR is measuring the difficulty of tasks by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours (at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure). It is silent on how long it takes AI systems to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter).

      This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.

      Of course the big questions are going to arise if/when we start passing lines like 8 hours (a whole work day) or 40 hours (a whole work week).

    • twotwotwo 37 minutes ago
      METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work.

      "Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: probably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is very different from the hours any particular person like you or I would spend.

      In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.

    • tacitusarc 31 minutes ago
      My problem with the OpenAI models (GPT5.2 in particular) recently is an extreme aversion to doing more than the smallest step in a task before asking for user input. Even if I explicitly instruct it to continue without input until the task is complete, it ignores the instruction.

      I cannot imagine GPT5.2 working on a task for more than 2 minutes, let alone 4 hours. I’m curious if you’ve run into this and figured out a way around it?

      • BoiledCabbage 19 minutes ago
        What agent framework are you using? It can differ from one to the next on the same model.
  • twotwotwo 8 minutes ago
    I'm conflicted about opining on models: everyone's speaking based on a tiny sample of real-world tasks, but I kinda think we should share our dubiously-informed opinions anyway because benchmarks don't predict real-world utility that well.

    Opus 4.5 seemed notably better than Sonnet 4.5, more of a difference than I'd noticed from, for example, the last couple Sonnet bumps. Objectively, at 1.66x Sonnet's price instead of the 5x, it's much more often practical to consider reaching for. Anthropic's basic monthly thing covers a fair amount of futzing with it in CC.

    At the other extreme, another surprise of this family is that Haiku 4.5 with reasoning on is usable: better than Sonnet with thinking off according to some benchmarks, and in any case subjectively decent for point edits, single-page thingies, and small tools.

  • subdavis 1 hour ago
    I recently asked Opus to just “Add vector search” to my current hobby project, a topic I know very little about. It set up manticore, pulled an embedding model, wrote a migration tool for my old keyword indices, and built the front end. I’m not exaggerating much either: the prompt was the length of a tweet.

    I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.
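
    (If you've never touched vector search, the general pattern is just: embed the documents, store the vectors, embed the query, and rank by similarity. A toy sketch of that idea with a stand-in embedding function, and nothing like the Manticore setup Opus actually wrote:)

      # Toy sketch of the general vector-search pattern (not the code Opus wrote).
      # A real setup swaps the stand-in embed() for a real embedding model and the
      # in-memory arrays for an engine like Manticore.
      import numpy as np

      def embed(text: str) -> np.ndarray:
          # Stand-in: a letter-frequency vector. A real version calls an embedding model.
          vec = np.zeros(26)
          for ch in text.lower():
              if ch.isalpha():
                  vec[ord(ch) - ord("a")] += 1
          return vec

      docs = ["notes about hiking trails", "recipe for lentil soup", "gpu driver bug report"]
      doc_vecs = np.stack([embed(d) for d in docs])  # one vector per document

      def search(query: str, k: int = 2):
          q = embed(query)
          # cosine similarity between the query vector and every stored document vector
          sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
          return [docs[i] for i in np.argsort(-sims)[:k]]

      print(search("soup recipes"))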

    Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.

    • simonw 1 hour ago
      I don't think building it the long way is necessarily a more effective way to learn.

      You could spend 4 hours (that you don't have) building that feature. Or... you could have the coding agent build it in the background for you in 15 minutes, then spend 30 minutes reading through what it did, tweaking it yourself and peppering it with questions about how it all works.

      My hunch is that the 30 minutes of focused learning spent with a custom-built version that solves your exact problem is as effective as (or even more effective than) four hours spent mostly struggling to get something up and running and going down various rabbit holes of unrelated problem-solving.

      Especially if realistically you were never going to carve out those four hours anyway.

      • aabhay 1 hour ago
        This feels like the exactly wrong way to think about it IMO. For me “knowledge” is not the explicit recitation of the correct solution, it’s all the implicit working knowledge I gain from trying different things, having initial assumptions fail, seeing what was off, dealing with deployment headaches, etc. As I work, I carefully pay attention to the outputs of all tools and try to mentally document what paths I didn’t take. That makes dealing with bugs and issues later on a lot easier, but it also expands my awareness of the domain, and checks my hubris on thinking I know something, and makes it possible to reason about the system when doing things later on.

        Of course, this kind of interactive deep engagement with a topic is fast becoming obsolete. But the essence to me of “knowing” is about doing and experiencing things, updating my bayesian priors dialectically (to put it fancily)

        • simonw 51 minutes ago
          I agree that the only reliable way to learn is to put knowledge into practice.

          I don't think that's incompatible with getting help from LLMs. I find that LLMs let me try so much more stuff, and at a much faster rate, that my learning pace has accelerated in a material way.

          • gflarity 41 minutes ago
            Consider, ever so briefly, that people don't all learn the same. You do you.
            • simonw 24 minutes ago
              That's fair.

              Something I'm really interested in right now is the balance in terms of the struggle required to learn something.

              I firmly believe that there are things where the only way to learn how to do them is to go through the struggle. Writing essays for example - I don't think you can shortcut learning to write well by having an LLM do that for you, even though actually learning to write is a painful and tiresome process.

              But programming... I've seen so many people who quit learning to program because the struggle was too much. Those first six months of struggling with missing semicolons are absolutely miserable!

              I've spoken to a ton of people over the past year who always wanted to learn to program but never managed to carve out that miserable six months... and now they're building software, because LLMs have shaved down that learning curve.

        • johnfn 15 minutes ago
          But how much of that time is truly spent on learning relevant knowledge, and how much of it is just (now) useless minutiae? Take vector search as an example. Pre-GPT, I would spend like an hour chasing down a typo, like specifying 1023 instead of 1024 or something. This sort of problem is now trivially solved in minutes by an LLM that fully understands the API surface area. So what exactly do I lose by not spending that hour chasing it down? It has nothing to do with learning vector search better, and an LLM can do it better and faster than I can.
      • weitendorf 41 minutes ago
        Generally I agree with your takes and find them very reasonable but in this case I think your deep experience might be coloring your views a bit.

        LLMs can hurt less experienced engineers by keeping them from building an intuition for why things work a certain way, or why an alternative won't work (or conversely, why an unconventional approach might not only be possible, but very useful and valuable!).

        I think problem solving is optimization in the face of constraints. Generally using LLMs IME, the more you're able to articulate and understand your constraints, and prescriptively guide the LLM towards something it's capable of doing, the more effective they are and the more maintainable their output is for you. So it really helps to know when to break the rules or to create/do something unconventional.

        Another way to put it is that LLMs have commodified conventional software, so learning when to break or challenge convention is going to be where most of the valuable work is going forward. And I think it's hard to actually do that unless you get into the weeds and battle/try things because you don't understand why they won't work. Sometimes they do.

        • simonw 22 minutes ago
          I think it's very easy to harm your learning by leaning into LLMs.

          What I don't believe is that it HAS to be like this. Maybe it's my natural optimism showing through here, but I'm confident it's possible to accelerate rather than slow down your learning progress with LLMs, if you're thoughtful about how you apply them.

          An open question for me is how feasible it is to teach people how to teach themselves effectively using this new technology.

          I have a core belief that everything is learnable, if people are motivated to learn. I have no idea how to help instill that motivation in people who don't yet have it though!

    • vachina 1 hour ago
      Yeah, and then it becomes an unmaintainable monolith because at some point the AI has also lost track of what code does what.

      Great for Opus because you’re now a captive customer.

      • tokioyoyo 1 hour ago
        The point of eventual “all-code-is-written-by-AI” is that it really does not matter whether your code is maintainable or not. In the end, most products are written to accomplish some sort of goal or serve a need within a given set of restrictions (cost, speed, etc.). If the goal is achieved within those restrictions, the codebase can be thrown away; when the next need comes along, everything can be created from scratch again, if needed.
        • weitendorf 0 minutes ago
          It does matter because the code needs to still be legible and discoverable and semantic enough for other AI to find it and use it without it being so confusing or painful that they prefer to just write it themselves.

          The reason software is so valuable is that it's capital/up-front investment in figuring something out that can continuously deliver value with low or no marginal cost. Rewriting/maintenance/difficulty/figuring out software is marginal cost.

        • simonw 1 hour ago
          I don't buy it.

          I think that could work, but it can work in the same way that plenty of big companies have codebases that are a giant ball of mud and yet they somehow manage to stay in business and occasionally ship a new feature.

          Meanwhile their rivals with well constructed codebases who can promptly ship features that work are able to run rings around them.

          I expect that we'll learn over time that LLM-managed big ball of mud codebases are less valuable than LLM-managed high quality well architected long-term maintained codebases.

          • tokioyoyo 1 hour ago
            Fair enough. In my imagination, I can see people writing AI-first framework/architectures and a general trend for people to “migrate to such frameworks”, just like the push towards the microservices architectures in 2010s. A part of these frameworks would be “re-constructibility” by changing contracts in parts where it matters, and somehow the framework would make it easy for the LLM to discover such “parts”.

            Honestly, I’m making stuff up, as I don’t think it’s feasible right now because of the context sizes. But given how fast things develop, maybe in a couple of years things might change.

            • weitendorf 4 minutes ago
              No, you're not making it up; this is exactly what some people are working on. Agent frameworks are starting to move towards "dynamic" service discovery/runtime introspection and composition-with-guardrails. Some keywords are "agent mesh", the general marketing from AI companies about AI "inventors", and agent-driven interfaces like Google's a2ui (which is just a spec).

              We recently started working on https://github.com/accretional/collector to serve as a dynamic proto ORM+CRUD server with search and discovery, and features for operating as a node in an "agent/service mesh". The idea is that you can create a uniform interface for data retrieval/search/APIs that lets agents dynamically register, invoke, or discover any data type or service, or write it themselves, then register it locally or share it.

              It is feasible to do this stuff now actually, just a bit tricky because most LLMs aren't trained to operate this way without very explicit instructions for how to do so, and for collector specifically the API surface is probably too big. But I am pretty sure neither would take long to fix if enough people were adopting this kind of pattern.
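
              To make the pattern concrete, here's a toy sketch of the register/discover/invoke loop. This is just the shape of the idea, not collector's actual API:

                # Toy sketch of dynamic service registration and discovery for agents.
                # Not collector's actual API; just the general register/discover/invoke shape.
                from typing import Callable, Dict

                class Registry:
                    def __init__(self) -> None:
                        self.services: Dict[str, dict] = {}

                    def register(self, name: str, description: str, handler: Callable) -> None:
                        # A service (or an agent that just wrote one) adds a capability at runtime.
                        self.services[name] = {"description": description, "handler": handler}

                    def discover(self, query: str) -> list:
                        # Naive keyword match; a real mesh would search schemas or embeddings.
                        return [n for n, s in self.services.items()
                                if query.lower() in s["description"].lower()]

                    def invoke(self, name: str, **kwargs):
                        return self.services[name]["handler"](**kwargs)

                registry = Registry()
                registry.register("lookup_order", "fetch an order record by id",
                                  lambda order_id: {"order_id": order_id, "status": "shipped"})

                # An agent with no prior knowledge of this service finds it and calls it.
                name = registry.discover("order")[0]
                print(registry.invoke(name, order_id=42))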

          • cornel_io 10 minutes ago
            And at the end of the day it's not really a tradeoff we'll need to make, anyways: my experience with e.g. Claude Code is that every model iteration gets much better at avoiding balls of mud, even without tons of manual guidance and pleading.

            I get that even now it's very easy to let stuff get out of hand if you aren't paying close attention yourself to the actual code, so people assume that it's some fundamental limitation of all LLMs. But it's not, much like six-fingered hands were just a temporary state, not anything deep or necessary that was enforced by the diffusion architecture.

        • Aperocky 1 hour ago
          Recreating everything from scratch gets harder, and the previous requirements will eventually stop being met once enough of them have accumulated. AI has no solution to this unless it iterates on the same code base, but since I've not seen evidence of architectural maintainability from AI, a project that is fully handed to AI is bound to fail.

          AI is still incredibly useful when used in tandem, but having it implement a full feature from one sentence usually leads to doom.

    • Avicebron 1 hour ago
      > I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature

      Opus/Anthropic is hands down the best in my experience. But using it feels like intellectual fast food (they all are). I hate the fact that I can build something like a neatly presentable one-off SPA tool (ty Simon) when I'm barely paying attention. It feels unsatisfying to use.

      EDIT: because I'm rambling, I like "AI" as much as the next guy, probably more because I was there before it turned into LLMs"R"US, but I also like(d) the practice of sitting around listening to music solving problems with Scala. I don't know why we've decided to make work less fun..

      • pastel8739 56 minutes ago
        “We” didn’t decide to make work less fun, others decided for us.
      • fluidcruft 32 minutes ago
        I sort of disagree. It's somewhat like having HyperCard again. You can build fun UI things and make machines do what you want them to do. You can care about the parts you want to care about and not sweat the parts you don't want to learn in detail (yet). And Claude and Codex make great guides/Sherpas.

        There are just too many parts involved to do anything. For example, today I built a simple data collection app to use on my phone that involves inventories with photos, for a tedious workflow I have to do. I knew what I wanted but didn't know how to even choose which tools to bother learning. And just being able to try things to see if an approach works or not, without spending hours learning one thing or another or wading through the hell of web search, is really great.

        Things I learned today that I figure everyone else must know: if you want to take a photo from a webapp, I guess you need https. So I decided to try mTLS (knew it existed but never had the time) and asked Claude to write me a short tutorial about setting it up, creating keys, and importing them (including a cool one-liner of spinning up a Python server and downloading the keys on my phone rather than finding a USB stick or whatever). Then it helped me figure out a path out of the suffering of Chrome and Firefox hating self-signed CAs. I at least figured out how to make Firefox happy, but it would insist on prompting me for the certificate for every htmx request.

        Chatting with Claude, I learn caddy is pretty cool, and it's Go. Claude suggests an auth boxcar when I balk at adding auth and user management to my app, because I think the webserver should handle all this shit (wtf is a boxcar? Claude clues me in). I tell Claude to use Go or Rust to build the boxcar because, Jesus Christ, "yay", build another service just to get a goddamn customized CRUD app on my phone that can take a picture. Claude picks Go, which is fine by me. (Incidentally, I can't write Go, but I can read it, it's on my "to be learned" agenda, and Go seems safer than a pile of Python for this simple thing.)

        The boxcar was fine, but Claude was struggling with getting headers to work in the caddy config. So while Claude works on that, I do a quick Google about whether caddy can have extensions, because there has to be a better way to "if someone has authenticated successfully, give them a cookie that lasts an hour so they don't have to mash the confirm about using the certificate for every goddamn htmx request" than spinning up a web service. I interrupt Claude and suggest an extension instead of a boxcar. Claude's on board, so we ditch the boxcar. I have Claude and codex evaluate the extension for security; they find important issues about things a jerk might do and fix them.

        So successful mTLS connections transition to session cookies, and my dumb CRUD tool doesn't have to worry about auth. Which it didn't have to do anyway, except browsers say so, etc., because my phone is literally only able to access the server via VPN anyway.

        Other things I have learned today that only wasted 5min of Claude's time rather than hours of mine: Firefox camera access can't control flash, focus or zoom. So call out to the native app instead.

        This is all quite fun and the tool I'm building is going to really make my own life better.

        Is there a better way to do this: probably.

        • Avicebron 17 minutes ago
          >only wasted 5min of Claude's time rather than hours of mine

          I mean will you (we) retain all that it did after a few months go by? You may say we don't need to, but that sounds a little shallow given we're both on HN. Do you remember Gatsby's criticism of "Summer People"?

    • ModernMech 1 hour ago
      The result of you having worked 4 hours to implement the thing is not just that you have the thing, it's that you have the thing and you understand the thing. Having the thing is next to useless if you don't understand it.

      At best it plods along as you keep badgering Claude to fix it, until inevitably Claude reaches a point where it can't help. At which time you'll be forced to spend at least the 4 hours you would have originally spent trying to understand it so you can fix it yourself.

      At worst the thing will actively break other things you do understand in ways you don't understand, and you'll have to spend at least 4 hours cleaning up the mess.

      Either way it's not clear you've saved any time at all.

      • weitendorf 24 minutes ago
        You do learn how to control claude code and architect/orient things around getting it to deliver what you want. That's a skill that is both new and possibly going to be part of how we work for a long time (but also overlaps with the work tech leads and managers do).

        My proto+sqlite+mesh project recently hit the point where it's too big for Claude to maintain a consistent "mental model" of how, e.g., search and the db schemas are supposed to be structured. It kept taking hacky workarounds, like going directly to the db at the storage layer instead of the API layer, so I hit an insane amount of churn trying to get it to implement some of the features needed to get it production ready.

        Here's the whackamole/insanity documented in git commit history: https://github.com/accretional/collector/compare/main...feat...

        But now I know some new tricks and intuition for avoiding this situation going forward. Because I do understand the mental model behind what this is supposed to look like at its core, and I need to maintain some kind of human-friendly guard rails, I'm adding integration tests in a different repo and a README/project "constitution" that claude can't change but is accountable for maintaining, and configuring it to keep them in context while working on my project.

        Kind of a microcosm of startups' reluctance to institute employee handbooks/KPIs/PRDs, followed by resignation that they might truly be useful coordination tools.

      • subdavis 1 hour ago
        Respectfully, I think I’m in a better position to decide a) what value this has to me and b) what I choose to learn vs just letting Opus deal with. You don’t have enough information to say if I’ve saved time because you don’t know what I’m doing or what my goals are.
      • OxfordOutlander 1 hour ago
        > inevitably Claude reaches a point where it can't help.

        Perhaps not. If LLMs keep getting better, more competent models can help him stay on top of it lol.

        • evklein 1 hour ago
          You're still captive to a product. Which means that when CloudCo. increases their monthly GenAI price from $50/mo. to $500/mo., you're losing your service or you're paying. By participating in the build process you're giving yourself a fighting chance.
          • pillefitz 36 minutes ago
            I will quickly forget the details about any given code base within a few months anyway. Having used AI to build a project at least leaves me with very concise and actionable documentation and, as the prompter, I will have a deep understanding of the high-level vision, requirements and functionality.
  • pugio 1 hour ago
    Opus looks like a big jump from the previous leader (GPT 5.1), but when you switch from "50%" to "80%", GPT 5.1 still leads by a good margin. I'm not sure if you can take much from this - perhaps "5.1 is more reliable at slightly shorter stuff, choose Opus if you're trying to push the frontier in task length".
  • Aperocky 1 hour ago
    I think the problem here is that an LLM eventually pollutes its context window with so much of the current task that the larger picture or architectural sanity is forgotten in favor of the task at hand.

    And rarely is software one-and-done; after a few rounds like this, the software architecture becomes schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more closely limiting what the AI is trying to do as they happen. The success of one "long task" is not necessarily a good thing!

  • karimQuant 1 hour ago
    The big issue is the 50%: switch to the 80% reliability threshold and the task lengths are much shorter. And if you land on the wrong side of that 50% on a 4-hour task, how much extra time do you need on top of the 4 hours? If you just keep retrying, the chance you're still failing is 50% * 50% = 25% after two attempts and 50%^4 = 6.25% after four. The cost of bad luck is very high.
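
    A quick way to see it (assuming independent attempts at a fixed per-attempt success rate):

      # Chance the task is *still* failing after n independent attempts,
      # at 50% vs 80% per-attempt reliability:
      for n in range(1, 5):
          print(f"{n} attempts: {0.5 ** n:.4f} vs {0.2 ** n:.4f}")
      # 1 attempts: 0.5000 vs 0.2000
      # 2 attempts: 0.2500 vs 0.0400
      # 3 attempts: 0.1250 vs 0.0080
      # 4 attempts: 0.0625 vs 0.0016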
  • nrhrjrjrjtntbt 1 hour ago
    Why measure in minutes and not tokens? Seems like you could cheat by slowing the AI down.
    • wmf 53 minutes ago
      They measure the time it takes a human to complete the task. They don't care how long the AI takes (although in practice it's much faster than a human). Measuring tokens isn't a good idea because newer models can complete tasks using fewer tokens.
  • yismail 1 hour ago
    Would be interesting to see Gemini 3.0 Pro benchmarked as well.
  • bentobean 36 minutes ago
    > We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months.

    If true, how much of this is a result of:

    1. Genuine technical advancement

    or:

    2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?

    In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will eventually disappear when the Ponzi scheme eventually collapses?

    • mediaman 12 minutes ago
      Much of this is due to vastly better post-training RL, not models that are much bigger. The idea that most of these gains come from training really big models, or throwing immensely larger amounts of compute at it, is not really true.
    • emp17344 17 minutes ago
      I wonder how much of this stuff is attributable to true model advancement, and how much is an improvement in the agentic harness. It's impossible to separate strict model improvement from improvement in the associated tools.
    • dghost-dev 35 minutes ago
      Good point.
  • grim_io 1 hour ago
    This seems like a good way to measure LLM improvement.

    It matches my personal feeling when using progressively better models over time.

  • alexgotoi 1 hour ago
    [dead]
  • Dwedit 1 hour ago
    Opus is already the name of an audio codec.
    • pants2 1 hour ago
      Gemini is already the name of a Greek god, a constellation, a space mission, a crypto exchange, an astrological sign, a car, and a comic villain! How will we ever figure out which one someone is talking about?
    • p1esk 1 hour ago
      Have you been living under a rock?
    • GaggiX 1 hour ago
      Opus: "an artistic work, especially one on a large scale."

      The names Haiku, Sonnet, and Opus have not been chosen randomly.