Performance per dollar is getting faster and cheaper

(wafer.ai)

261 points | by latchkey 13 hours ago

21 comments

minraws 12 hours ago
Can you folks add performance per watt as a metric to these comparisons? I honestly want to understand where AMD fits in the stack in terms of actual performance per dollar. I have had talks with companies wanting to build data centers outside of the US that find it hard to source anything Nvidia in sufficient capacity and scale.
The question is whether AMD is competitive on performance per watt and roughly reliable in terms of software support, which is what most folks outside of the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.
Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside of the US wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.
I have never seen a company use AMD outside of wafer and a couple others mostly in US.
Genuinely intriguing or maybe not really (could be this stuff is common knowledge) and I am just stuck in my Nvidia bubble here.
[-]
- kingstnap 10 hours ago
  A DGX B200 costs like ~$0.5 M and uses around 14 kW.
  If you plan to run it straight for 8 years at 100% max usage, that's around 1 GWh.
  A gigawatt hour is a lot of energy, but it's not that much compared to the price of the actual machine. In Germany, for example, with its expensive energy, that's about €100k worth, which spread over 8 years is pretty minor compared to the up-front half mill.
  The real issue with high power consumption is not really the cost of energy but the limited power supply you can get for a datacenter. A more efficient setup is highly desirable because it means you can fit more into the limited power hookup.
  [-]
  - dannyw 7 hours ago
    It’s more than power supply. Cooling and ventilation become a MUCH bigger deal at rack scale, and that costs electricity too.
    [-]
    - thereisnospork 5 hours ago
      Cooling demand is only a fraction of the load: cooling 1 MW of heat will only cost a few tens to low hundreds of kW, depending on the specifics. 10-20% overhead for cooling is probably a close enough estimate for napkin math.
    - psychoslave 5 hours ago
      And datacenters have an impact on everything around them. If at the end of the day the result is a few more yachts and jets, and a lot more miserable humans starving in ruined ecosystems, maybe that’s not the best direction to go in.
      [-]
      - butvacuum 4 hours ago
        You say they have a large impact, but having lived somewhere with some of the largest data centers, they very much don't. At least not more so than any other structure that paves over greenery.
        Love to debate actual discussion points. Pull up "datacenter dfw" on Google Maps for mine.
        [-]
        ffsm8 4 hours ago
        The people having glass literally break from the vibrations would probably disagree with your opinion
        https://youtu.be/_bP80DEAbuo?is=sg09k66iutKFIFSo
        Yet here we are, discussing "data centers" as if they're standardized and have similar (noise) isolation.
        There are no meaningful regulations on building them, and they can be incredibly polluting. So your experience with a potentially well-isolated one is sadly not the norm going forward. And we don't even know how close you lived; if you're talking about, e.g., "within 5 km/3 miles", then your experience would also have little value in this discussion in general.
        [-]
        jml7c5 3 hours ago
        >The people having glass literally break from the vibrations would probably disagree with your opinion
        Can you cite a source for this? It's not in the video, as far as I can tell.
        I would be wary of Benn Jordan's videos. They are full of mistakes and misrepresentations, as Andy Masley has convincingly demonstrated: https://blog.andymasley.com/p/contra-benn-jordan-data-center...
        I recall seeing Benn Jordan's responses on Bluesky and thinking they were quite poor. He was unwilling to admit to mistakes, and kept trying to grasp at newly searched papers that didn't actually support his arguments.
        [-]
        hypfer 2 hours ago
        Benn unfortunately is one of those people that actually feel stuff, which is a trait that easily gets exploited by bad actors.
        Indeed, he shot himself in the foot there pretty bad, but I would argue that that was just the result of successful Agitation.
        I would personally strongly prefer being in the same room with Benn compared with Andy, because one of them is authentic, while the other is calculating. Though, arguably, Benn has been catching up on that lately too.
        But yeah, taking stuff with a grain of salt should be the default regardless of the person speaking.
        redsocksfan45 1 hour ago
        [dead]
        apublicfrog 3 hours ago
        The fact that people have lived and worked near data centres for decades and didn't even know what the term meant, let alone been adversely impacted by them, probably indicates they're broadly a non-issue. All of a sudden, out of nowhere, AI and data centres got intermingled by the media and now people seem to have big issues with them.
        [-]
        Liftyee 3 hours ago
        Though, the new data centers are not entirely the same. Increasing use of onsite gas turbines to generate power instead of using grid power changes their noise+air pollution profile.
        [-]
        fragmede 1 hour ago
        The problem these days is lack of nuance. It should seem entirely reasonable to be pro-datacenters-if-they're-done-right, but it feels like there are only two sides to any issue. Gas turbine whine noise isn't coming from the data center, it's being used to power the data center, but the camp is either pro data center or not, and fuck any nuance.
        lnsru 3 hours ago
        Sounds exactly like the stories about 5G cell towers. Almost no problems with GSM and then suddenly 5G is a big issue.
  - heisenbit 2 hours ago
    Plus the power needed for cooling adding maybe 50%.
  - jwpapi 2 hours ago
    Interesting so it’s supply chain and then you need to calculate how long it can be utilized and for how much you can sell it.
    Would love more calculations on that
- embedding-shape 2 hours ago
  > I have never seen a company use AMD outside of wafer and a couple others mostly in US.
  Worth remembering that AMD has basically "owned" (not literally) the hardware side of things in video game consoles for a good many years now, with no end in sight.
  [-]
  - ekianjo 1 hour ago
    Because they have x86 CPU licenses.
    [-]
    - embedding-shape 1 hour ago
      Every single video game console of the last generation (and probably further back) is using AMD Radeon for graphics too, FWIW. I think the Switch might be the only recent outlier, using Nvidia graphics.
- Twirrim 11 hours ago
  > I have never seen a company use AMD outside of wafer and a couple others mostly in US.
  There's a few using them, and even more starting to experiment with them. AMD has long been a source of disappointment around this side of things, so I'm hesitant to feel optimistic we'll finally get some competition. The market really needs viable competition to Nvidia, especially performance/watt.
- craftkiller 11 hours ago
  > I have never seen a company use AMD
  Meta is using AMD: https://www.amd.com/en/newsroom/press-releases/2026-2-24-amd...
  And OpenAI: https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd...
  [-]
  - Schiendelman 10 hours ago
    It's not clear when this will be - AMD has slipped these dates likely to 2027.
- latchkey 9 hours ago
  > I have never seen a company use AMD outside of wafer and a couple others mostly in US.
  Just because you haven't seen it doesn't mean it doesn't exist.
  We've serviced over 700 customers on our MI300x.
  [-]
- 7thpower 5 hours ago
  Typically any company that can’t get Nvidia to fill their orders will have at least some AMD.
  [-]
  - embedding-shape 1 hour ago
    What type of company are you talking about here? Granted, nowadays I mostly interact with ML-adjacent companies, but almost none would go "Hmm, hard to get Nvidia hardware today, let's dump all the expertise and knowledge of CUDA et al. we have and start using AMD hardware until we can get Nvidia". Everyone would just wait or rent in the meantime.
- jingpostmedia 1 hour ago
  [flagged]
- technoabsurdist 11 hours ago
  AMD MI355X uses 1,400W per GPU and NVIDIA B200 uses 1,200W. So AMD uses about 16% more power.
  [-]
  - vlovich123 10 hours ago
    That's not how you measure performance per watt, but generally it’s 20-60% worse at tok/s/watt, not 16%. It does have ~50% more memory (~100 GB more), which complicates the comparison.
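A rough sketch of the napkin math in this subthread, for anyone who wants to plug in their own numbers. The 14 kW, 8 years, 1,400 W and 1,200 W figures are the ones quoted above; the €0.10/kWh price is only what the ~€100k estimate implies, the 15% cooling overhead is the midpoint of the 10-20% range mentioned, and the equal-throughput case is a placeholder assumption, not a measurement. Function names are made up for illustration.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 h, ignoring leap days


def lifetime_energy_kwh(power_kw: float, years: float, utilization: float = 1.0) -> float:
    """Energy drawn over the machine's life, in kWh."""
    return power_kw * HOURS_PER_YEAR * years * utilization


def energy_cost(energy_kwh: float, price_per_kwh: float, cooling_overhead: float = 0.0) -> float:
    """Electricity cost, with cooling added as a fraction of the IT load."""
    return energy_kwh * (1.0 + cooling_overhead) * price_per_kwh


def relative_perf_per_watt(tok_s_a: float, watts_a: float,
                           tok_s_b: float, watts_b: float) -> float:
    """Ratio of A's tok/s/W to B's tok/s/W (1.0 means parity)."""
    return (tok_s_a / watts_a) / (tok_s_b / watts_b)


if __name__ == "__main__":
    energy = lifetime_energy_kwh(power_kw=14, years=8)
    print(f"lifetime energy: {energy / 1e6:.2f} GWh")

    # Assumed price: roughly what the ~100k EUR estimate implies.
    print(f"energy only:     {energy_cost(energy, 0.10):,.0f} EUR")
    print(f"with cooling:    {energy_cost(energy, 0.10, cooling_overhead=0.15):,.0f} EUR")

    # Placeholder: identical throughput on both GPUs, so only power differs.
    ratio = relative_perf_per_watt(1.0, 1400, 1.0, 1200)
    print(f"perf/W ratio at equal throughput: {ratio:.3f}")
```

The point of the last function is that a power ratio alone says nothing about performance per watt until you put measured throughput in the numerator for each side.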
hassaanr 7 hours ago
While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
[-]
- zozbot234 4 hours ago
  Kimi uses INT4 as its native format, so there's no such thing as "better than 4-bit precision" for that model. This is in contrast with GLM, for which 16-bit precision is native and 8-bit is in common use.
  [-]
  - hassaanr 1 hour ago
    You’re right, but this poses a separate issue as the providers then do FP4 PTQ, which is quite lossy. Reduces the model size and optimizes for Blackwells at the (imo severe) cost of performance.
- unrvl22 5 hours ago
  MI355X can perform FP6 operations with the same speed as their FP4 (unique to AMD) - people should be making MXFP6 quants which would be pretty much lossless, and much closer to FP4 performance than FP8
- google234123 6 hours ago
  First thing I noticed as well
- tw1984 6 hours ago
  from memory, it is like 96-98% of the accuracy.
  [-]
  - lgessler 6 hours ago
    Accuracy isn't a meaningful metric here without reference to a specific task.
    [-]
    - flawn 2 hours ago
      Additionally, I'd imagine quantization has more side effects than just slightly lower performance (on whatever task). You are basically removing information, and that information could by chance be what the model needs to complete the task exactly the way you'd want, even though it's still fully capable otherwise. I am not sure if this is really different from "lower performance", but I'm open to hearing your opinions.
  - EduardoBautista 5 hours ago
    And that 2%-4% makes all the difference.
    [-]
    - fpaf 3 hours ago
      Yes, it's like saying "we took off a big chunk of his brain but look! He can still breathe autonomously, swallow food and walk almost straight, which is like 95% of what he did before!"
nxtfari 8 hours ago
I think we should make it illegal to not specify the quantization in the headline for these types of posts.
[-]
- IshKebab 54 minutes ago
  And to use the heading "Why this matters".
- ahmadyan 8 hours ago
  It's MXFP4
- ozgrakkurt 1 hour ago
  A nice filter is checking for the `.ai` at the end. It is very likely slop if you see that. Slop meaning low-effort/clickbait/shallow/useless/scam etc.
p1esk 10 hours ago
There was noticeable accuracy degradation when they switched from fp8 to mxfp4
[-]
- greyb 7 hours ago
  Wafer discontinued their own "Wafer Pass" flagship coding plan within weeks of launch and had to issue prorated refunds. Now they're bragging about squeezing costs down even further via quantization, even though their implementation is clearly lacking.
  [1] https://www.ycombinator.com/launches/Q9i-wafer-pass-flat-rat...
- throwdbaaway 10 hours ago
  And somehow they claimed that it is "lossless".
adammarples 34 minutes ago
Slight criticism of the headline there, you can't get cheaper per dollar.
Schiendelman 10 hours ago
I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference - Blackwell is the last generation Nvidia didn't optimize specifically for inference.
If I'm missing something, please let me know!
[-]
- boroboro4 7 hours ago
  It's very unclear what's special in Rubin that makes it optimized for inference. I can see the disaggregated bit (having separate prefill and decoding nodes), but what else?
  [-]
  - villgax 6 hours ago
    Lot more SMs & Tensor Cores for NVFP4 going by the looks of it.
- nullc 10 hours ago
  how do you get 5x faster at inference when inference is memory bandwidth limited? getting 5x the memory bandwidth of an h100 seems physically difficult.
  [-]
  - Schiendelman 9 hours ago
    Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.
    (Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)
    [-]
    - boredatoms 9 hours ago
      Moving to lower bits is not a slam dunk, the model itself might degrade too much
      [-]
      - Schiendelman 8 hours ago
        Of course, but for most workflows it's fine.
    - zackangelo 8 hours ago
      Blackwell supports nvfp4 natively.
      [-]
      - Schiendelman 8 hours ago
        You're right - Rubin is better at NVFP4 training, not inference, thank you for catching me!
        [-]
        boroboro4 7 hours ago
        What does it mean it's better at nvfp4 training? What's different between training and inference to make this true?
        [-]
        Schiendelman 7 hours ago
        We're getting to the limit of my understanding, but I believe most Blackwell users still usually run FP8 passes through the transformer engine - they'll just store weights at NVFP4. Nvidia has model-specific stabilization recipes for NVFP4 end to end, but they're taking fixes all the time.
        Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.
        But yeah, grain of salt - we haven't seen this in practice.
        fc417fc802 7 hours ago
        I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.
        Of course there are techniques such as quantization aware training but I don't understand why a datatype would work for inference but not for that.
        You can also abandon backprop entirely but that comes with a whole host of tradeoffs and again why would it work for inference but not for whatever alternative training regime you selected?
  - unrvl22 4 hours ago
    inference is only memory bandwidth limited when targeting higher tps / high single stream tps. the weights only need to be moved across once per forward pass, so when you batch say 100 streams per forward pass (which is what most inference services do / care about) it's compute bottlenecked.
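The batching argument above can be sketched with a simple roofline-style check. This is a simplified model, not a description of any real serving stack: it counts only weight reads (no KV cache or attention traffic), assumes 2 FLOPs per parameter per stream, and the peak compute figure is a made-up round number. Only the 8 TB/s bandwidth comes from this thread.

```python
def arithmetic_intensity(batch: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read in one decode step.

    Each parameter does one multiply-add (2 FLOPs) per stream in the batch,
    but is read from memory only once per forward pass.
    """
    return 2.0 * batch / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) above which the chip is compute bound."""
    return peak_flops / mem_bandwidth


def is_compute_bound(batch: int, bytes_per_param: float,
                     peak_flops: float, mem_bandwidth: float) -> bool:
    """True when the batch is large enough that compute, not memory, limits speed."""
    return arithmetic_intensity(batch, bytes_per_param) > ridge_point(peak_flops, mem_bandwidth)


if __name__ == "__main__":
    PEAK_FLOPS = 1e15      # hypothetical: 1 PFLOP/s of usable low-precision compute
    BANDWIDTH = 8e12       # 8 TB/s, the Blackwell figure quoted in this thread
    BYTES_PER_PARAM = 1.0  # assumed 8-bit weights

    for batch in (1, 16, 100):
        bound = "compute" if is_compute_bound(batch, BYTES_PER_PARAM, PEAK_FLOPS, BANDWIDTH) else "memory"
        print(f"batch={batch:>3}: {arithmetic_intensity(batch, BYTES_PER_PARAM):6.1f} FLOP/B -> {bound} bound")
```

Under these assumptions a single stream sits far below the ridge point, while a batch of 100 crosses it, which is the distinction being made between single-stream latency and aggregate serving throughput.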
johanvts 2 hours ago
That sounds literally impossible.
AussieWog93 11 hours ago
The 2600 tok/s is an "aggregate", not the actual throughput.
[-]
- technoabsurdist 11 hours ago
  yes it is 213 tok/s single stream (so per user)
  [-]
  - unrvl22 5 hours ago
    that 213 wasn't achieved when saturated though. was probably more like 30 tps per stream when doing 2.6k tps.
  - 3836293648 11 hours ago
    So per subagent*.
    [-]
    - alienbaby 9 hours ago
      *per stream, I guess is more accurate than either?
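For what it's worth, the relationship between the aggregate and per-stream numbers in this subthread is just a division. A tiny sketch using only figures quoted here (2,600 tok/s aggregate, 213 tok/s single stream, and the guessed ~30 tok/s per stream under saturation); the function name is made up, and the real concurrency is not stated anywhere in the thread.

```python
def implied_streams(aggregate_tok_s: float, per_stream_tok_s: float) -> float:
    """Number of concurrent streams implied by an aggregate and a per-stream rate."""
    if per_stream_tok_s <= 0:
        raise ValueError("per-stream rate must be positive")
    return aggregate_tok_s / per_stream_tok_s


if __name__ == "__main__":
    aggregate = 2600.0
    for label, per_stream in (("single-stream rate", 213.0), ("guessed saturated rate", 30.0)):
        streams = implied_streams(aggregate, per_stream)
        print(f"{label}: {per_stream:.0f} tok/s -> ~{streams:.0f} streams")
```

The two readings differ by roughly 7x in implied concurrency, which is why quoting tok/s per stream at the saturated operating point would remove the ambiguity.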
hahahaa 2 hours ago
What is a knee, in performance talk?
[-]
- kgwgk 2 hours ago
  A place where the slope/derivative/incremental-performance-per-price changes.
- nnevatie 2 hours ago
  I used to be high-performance like you, then I took an arrow to the knee?
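To make the "slope changes" definition concrete, here is a minimal sketch that picks the knee as the interior point where the slope between neighbouring samples changes the most. The throughput-vs-concurrency data is invented purely for illustration, and this is one simple heuristic among several, not the method used in the linked post.

```python
def find_knee(xs: list[float], ys: list[float]) -> float:
    """Return the x value where the slope of the curve changes the most.

    xs must be strictly increasing and the same length as ys (at least 3 points).
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least 3 (x, y) points of equal length")

    # Slope of each segment between consecutive samples.
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    # Change in slope at each interior point.
    changes = [abs(slopes[i + 1] - slopes[i]) for i in range(len(slopes) - 1)]
    k = max(range(len(changes)), key=changes.__getitem__)
    return xs[k + 1]  # interior point shared by segments k and k+1


if __name__ == "__main__":
    # Hypothetical data: concurrent streams vs. aggregate tok/s.
    concurrency = [1, 2, 4, 8, 16, 32, 64]
    throughput = [200, 390, 760, 1450, 2400, 2600, 2650]
    print("knee at concurrency =", find_knee(concurrency, throughput))
```

On this made-up curve, throughput scales almost linearly up to 16 streams and then flattens, so that is where the heuristic places the knee.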
oDot 12 hours ago
Do these providers have 80+% gross margins or is something eating into them? Maybe utilization?
[-]
- technoabsurdist 11 hours ago
  hi i work at wafer. no the margins are lower averaging at about ~40%. utilization is one of the highest order bits in determining margins here, yes.
  [-]
  - keynha 8 hours ago
    [dead]
gowthamsaiyadav 1 hour ago
world is not limited by Nvidia, AMD can be used
alienbaby 9 hours ago
I'm interested if anyone knows how much legwork the assumed 60% cache hit, plus running a quantised model is doing? Esp. compared to what the headline half implies is a full fat GLM5.2
killingtime74 8 hours ago
No word on what this actually means for a consumer. What's the price? Is it lower than NVIDIA serving?
[-]
- mixtureoftakes 6 hours ago
  They seem to be serving it at 3x the price while also struggling to maintain uptime on openrouter, while the vercel router advertises even higher speeds but has no clear uptime stats
  I guess you really do have to try it at least for some time to actually know
yieldcrv 11 hours ago
Agentic coding drivers for different architectures is a massive unlock for the world
So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired and find the right prompts
[-]
- technoabsurdist 11 hours ago
  this is exactly our thesis at wafer :) thank you for the support
  [-]
  - yieldcrv 7 hours ago
    well done
- yogthos 11 hours ago
  Personally, I can't wait till something like this starts getting to consumer level. https://www.anuragk.com/blog/posts/Taalas.html
  [-]
  - yieldcrv 10 hours ago
    That’s pretty fascinating. Apple has some innocuous LLMs and transformers baked into its devices, leveraging their neural chipset
    So I could see something like this, where the neural chipset has an LLM baked into it that can't be so easily updated until you get a new device
- innis226 6 hours ago
  [dead]
calin2k 4 hours ago
then why is token per dollar getting more expensive?
[-]
- FeepingCreature 1 hour ago
  Because lots of people are willing to pay more dollar for smarter token.
- AtlasBarfed 4 hours ago
  Because they are dumping/subsidizing token processing to try and get companies to fire as many people as possible, so that those companies will be dependent on the providers when they have to jack up the rates
beffjezos 7 hours ago
This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic which is not viable to serve in a production setting. It's only interesting to hobbyists that want to run the model locally.
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
[-]
- wmf 7 hours ago
  You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
  [-]
  - beffjezos 7 hours ago
    Yeah that makes sense. I'm more familiar with seeing tok/s/user + TTFT rather than the total node throughput.
- technoabsurdist 7 hours ago
  hi yes it’s not optimized for single stream it’s optimized for total node throughput
  [-]
  - beffjezos 7 hours ago
    Oh, that's much better then. A good metric to share is the tokens per second per user for the node rather than the total throughput of the node. It disambiguates what's being optimized for much better than your blog post currently does.
    [-]
    - technoabsurdist 5 hours ago
      sounds good feedback taken, thanks beffjezos
villgax 6 hours ago
They fail to mention non speculative numbers & whether baseline was nvfp4 as well. So much for erosion against an older gen
shevy-java 3 hours ago
But RAM prices skyrocketed!
The AI companies owe us money. As does e.g. NVIDIA for becoming a cartel.
zuzululu 5 hours ago
yeah but we are still far far away from being able to run the frontier model equivalents locally without significant quantization
even having something like opus 4.8 locally would completely change the landscape
bitwize 3 hours ago
(in a high-pitched, pathetic regency-era British orphan voice) Please sir, may I have some compute as well?