I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
As a mobile dev at YouTube I'd periodically scroll through crash reports associated with code I owned and the long tail/non-clustered stuff usually just made absolutely no sense and I always assumed at least some of it was random bit flips, dodgy hardware, etc.
GW1 was my childhood. The MMO with no monthly fees appealed to my Mom and I met friends for years. The 8 skill build system was genius, as was the cut scenes featuring your player character. If there's ever a 3rd game I would love to see something allowing for more expression through build creation though I could see how that's hard to balance.
I still remember summoning flesh golems as a necromancer! Too much of my life sunk into GW1. Beat all 4(?) expansions. Logged in years later after I finally put it down to find someone had guessed my weak password, stole everything, then deleted all my characters. C'est la vie.
Thanks to asrock motherboards for AMD’s threadripper 1950x working with ECC memory, that’s what I learned to overclock on.
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn’t overlock ECC memory to be a bit confused. It’s the only memory you should be overlocking, because it’s the only memory you can prove you don’t have errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
DDR4 and 5 both have similar heat sensitivity curves which call for increased refresh timings past 45C.
Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.
On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.
I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.
If you look around you'll see people already putting the new, chinese made DDR4 through its paces, it's holding up far better than anyone expected.
Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.
> From then on I considered people who think you shouldn’t overlock ECC memory to be a bit confused. It’s the only memory you should be overlocking, because it’s the only memory you can prove you don’t have errors.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically supposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
>[0] In practice, if they didn't, they'd all just flock to AMD.
only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers that went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC ram, because ECC ram also tend to be clocked lower. Back in Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.
At the beginning of your comment I was wondering if the "attitude" that was corporate serving was the anti-ECC stance or the pro-ECC stance (based on the full chunk that you quoted). I'm glad that by the end of the comment you were clearly pro ECC.
Any workstation where you are getting serious work done should use ECC
Every interesting bug report I've read about Guild Wars is Dwarf Fortress tier. A very hardcore, longtime player who was recounting some of the better ones to me shared a most excellent one wrt spirits or ghosts, some sort of player summoned thing that were sticking around endlessly and causing OOM errors?
As a community alpha tester of GW1, this was a fun read! Such an educational journey and what a well organized and fruitful one too. We could see the game taking shape before our eyes! As a European, I 100% relied on being young and single with those American time zones. :D Tests could end in my group at like 3 am, lol.
There's a famous Raymond Chen post about how a non-trivial percentage of the blue screen of death reports they were getting appeared to be caused by overclocking, sometimes from users who didn't realize they had been ripped off by the person who sold them the computer: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35.... Must've been really frustrating.
This was a design choice by AMD at the time for their Athlon Slot A cpus. Use the same slot A board which you could set the cpu speed by bridging a connections. Since the Slot A came in a package, you couldn't see the actual cpu etching. So shady cpu sellers would pull the cover off high speed cpus, and put them on slow speed cpus after overclocking them to unstable levels.
Some multiplayer real-time strategy (RTS) games used deterministic fixed-point maths and incremental updates to keep the players in sync. Despite this, there would be the occasional random de-sync kicking someone out of a game, more than likely because of bit flips.
I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.
The Turion64 chipset was the worst CPU I've ever bought. Even 10 years old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such a weird behavior, because it happened always around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or the other.
The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).
The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.
After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.
Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?
I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?
I think the most reasonable take would be to just tell the users hardware is borked, they're going to have a bad outside the game too, and point them to one of the many guides around this topic.
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
I put engineering effort into handling bad hardware all the time because safety critical, :)
It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.
80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.
I think I sit in another camp. A lot of my engineering efforts are in working around bad hardware.
Better the user sees some lag due to state rebuild versus a crash.
Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.
That's an interesting idea. How might you implement that? Like RAID but on the level of variables? Maybe the one valid use case for getters/setters? :)
As another user fairly pointed out, ECC. But a compiler level flag would probably achieve the redundancy, sourcing stuff from disk etc would probably still need to happen twice to ensure that bit flips do not occur, etc.
It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.
I've been honestly amazed people actually buy stuff that's not "workstation" gear given IME how much more reliably and consistently it works, but I guess even a generation or two used can be expensive.
Very few applications scale with cores. For the vast majority of people single core performance is all they care about, it's also cheaper. They don't need or want workstation gear.
ECC are traditionally slower, quite more complex, and they dont completely eliminate the problem (most memories correct 1 bit per word and detect 2 bits per word). They make sense when environmental factors such as flaky power, temperature or RF interference can be easily discarded - such as a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.
Thing is, every reported bug can be a bit flip. You can actually in some cases have successful execution, but bitflips in the instrumentation reporting errors that dont exist.
ECC are "slower" because they are bought by smart people who expect their memory to load the stored value, rather than children who demand racing stripes on the DIMMs.
ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.
However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.
On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.
It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.
Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
Interesting reading - I've occasionally seen some odd crashes in an iOS app that I'm partly responsible for. It's running some ancient version of New Relic that doesn't give stack traces but it does give line numbers and it's always on something that should never fail (decoding JSON that successfully decoded thousands of times per day).
I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.
> In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
Thats the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation may say corresponds to the MOV; or it may have been a legit memory operation, but instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips - the output is useless, including whatever telemetry you use (because it is code executed from ram and will touch ram).
You might consider adding the CPU temperature to the report, if there's a reasonable way to get it (haven't tried inside a VM). Then you could at least filter out extremely hot hardware.
CPU model / stepping / microcode versions are probably at least as useful as temperature. I'd also try to get things like the actual DRAM timing + voltage vs. what the XMP extensions (or similar) advertise the manufacturer tested the memory at.
I have at least one motherboard that just re-auto-overclocks itself into a flaky configuration if boot fails a few times in a row (which can happen due to loose power cords, or whatever).
This is quite surprising to me, since I thought the percentage would be a lot lesser.
But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.
I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).
Granted that I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I’m too loathe to troubleshoot with disabling all extensions and then trying it one by one. Why should an extension even cause a crash, especially when its a WebExtension (unlike the older XUL extensions that had a deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.
I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).
I would sincerely appreciate Firefox not crashing as often for me.
I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability is better handing of the other 90% of crashes.
If 50% of chrome crashes were due to bit flips, and bit flips effect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.
It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (
I think they claim that if your computer has bad hardware, you're probably sending a lot of _additional_ crashes to their telemetry system. Your hardware might be working just fine, but the guy next to you might be sending 30% more crashes.
Same, been using it for over 20 years and probably only a handful of crashes in that time. But I mostly look at dead simple web stuff (like hn) and run aggressive ad blocking so I might not be representative of the average user
I can also go months and don't see crashes (though occasionally I'll hit a memory leak where closing tabs doesn't release it so I'll restart firefox then), but unless ThinkPads come with ECC I don't have it.
Its pretty stable for me, except it has some memory leaks. Generally I gotta leave heavy pages open for days at a time to notice, but if I don't close it entirely for over a week or two it will start to chug and crash.
I've had zero crashes in safari, ff or chrome in recent memory (except maybe OOMs). (Though I don't use Windows, so maybe that's part of the reason stuff just works?)
Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.
I can't recall a single Firefox crash in at least a decade. What are people doing? I run ublock origin, nothing else. I do sometimes have Firefox mobile misbehave where it stops loading new pages and I jave to restart it, but open pages work normally as do all other operations, so not a crash exactly. Happens maybe once a month
Edit: more context, I power cycle at least once a week on desktop and the version is typically a bit behind new. I also don't have more tabs open than will fit in the row. All these habits seem likely to decrease crashes.
firefox crashes... decently often for me, but it's usually pretty clear what the cause is [having a bunch of other programs open]. every time i can recall my computer bluescreening [in the last year~, since that's how long ive had it] it was because of firefox tho.
this may have something to do with the fact that my laptop is from 2017, however.
> Bold claim. From my gut feeling this must be incorrect
RAM flips are common. This kind of thing is old and has likely gotten worse.
IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.
>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!
> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.
A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?
The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.
Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.
That would tell you if there's a bitflip in your test, but not if there's a bitflip in normal program code causing a crash, no? IIUC GP's questions was how do they actually tell after a crash that that crash was caused by a bitflip.
The example I gave in there is of adding sentinel values in your data, so you can check the constants in your data structures later and go "oh, this is overwritten with garbage" versus "oh, this is one or two bits off". I would imagine plumbing things like that through most common structures is what was done there, though I haven't done the archaeology to find out, because Firefox is an enormous codebase to try and find one person's commits from several years ago in.
That, and 50% of the machines where their heuristics say it is a hardware error fail basic memory tests.
I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).
(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)
My take is that their 10% estimate is a lower bound.
A common case is a pointer that points to unallocated address space triggers a segfault and when you look at the pointer you can see that it's valid except for one bit.
>> DDR5 technology comes with an exclusive data-checking feature that serves to improve memory cell reliability and increase memory yield for memory manufacturers. This inclusion doesn't make it full ECC memory though.
"Proper" ECC has a wider memory buss, so the CPU emits checksum bits that are saved alongside every word of memory, and checked again by the CPU when memory is read. Eg. a 64 bit machine would actually have 72 bit memory.
DDR5 "ECC" uses error correction only within the memory stick. It's there to reduce the error rate, so otherwise unacceptable memory is usable - individual cells have become so small that they are not longer acceptably reliable by themselves!
The net error rate is lower with the internal ECC.
DDR4 is not fully reliable memory either.
This is common for many high speed electrical engineering challenges: Running a slightly higher error rate option with ECC on top can have an overall lower error rate at higher throughput than the alternative of running it slow enough to push the error rate down below some threshold.
It makes some people nervous because they don’t like the idea of errors being corrected, but the system designers are looking at overall error rates. The ECC is included in the system’s operation so it isn’t something that is worthwhile to separate out.
Similar to CPUs, where many arrays have spare yield capacity, even whole cores can get disabled (and possibly sold in a different bin). DRAM stores redundant electrons in capacitors to patch it up and boost yields. Everything in reliability is a spectrum.
"ECC" does not give you fully reliable RAM. UEs are still be observed.
What's the chance of fail? If you have one device that achieves equal performance with less reliable cells and redundancy to another device that uses more reliable cells without redundancy, it's not really any different.
NAND is horribly flaky, cell errors are a matter of course. You could buy boutique NOR or SLC NAND or something if you want really good cells. You wouldn't though, because it would be ruinously expensive, but also it would not really give you a result that an SSD with ECC can't achieve.
When debugging something, I often remember the the quote, often misattributed to Einstein: "Insanity is doing the same thing over and over again and expecting different results". Then I remember about bitflips, and run a second, maybe a third time, just expecting the next bit to flip to not be in the routine I'm trying to debug.
It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."
One of the points Linus Torvalds made a few years back was that enthusiasts/PC gamers should be pissed that consumer product availability/support for ECC is spotty because as mentioned up-thread they're the kind of user that will push their system, and if memory is the cause of instability there will be a smoking gun (and they can then set the speed within its stable capacity). Diagnosing bad RAM is a pain in the rear even if you're actively looking for a cause, never mind trying to get a general user to go further than blaming software or gremlins in the system for weirdness on whatever frequency it's occurring at.
It's true that in the very early days Google used cheap computers without ECC memory, and this explains the desire for checksums in older storage formats such as RecordIO and SSTable, but our production machines have used ECC RAM for a long time now.
One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.
I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.
> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.
Going to be downvoted, but I call bullshit on this. Bitflips are frequent (and yes ECC is an improvement but does not solve the problem), but not that frequent. One can either assume users that enabled telemetry are an odd bunch with flaky hardware, or the implementation isnt actually detecting bitflips (potentially, as the messages indicate), but a plathora of problems. Having a 1/10 probability a given struct is either processed wrong, parsed wrong or saved wrong would have pretty severe effects in many, many scenarios - from image editing to cad. Also, bitflips on flaky hardware dont choose protection rings - it would also affect the OS routines such as reading/writing to devices and everything else that touches memory. Yup, i've seen plenty of faulty ram systems (many WinME crashes were actually caused by defective ram sticks that would run fine with W98), it doesnt choose browsers or applications.
How can you possibly be this confident if you don't know the number of times Firefox was run and number of bug reports submitted? Say it's run 100,000,000 times, 1000 reports are submitted, and 10 are bit flips. Seems reasonable. You're misinterpreting what they are saying.
You should look at about:crashes and see if there's any commonality in the causes, or bugs associated with them (though often bugs won't be associated with the crash if it isn't filed from crash-stats or have the crash signature in the bug)
Maybe you should check your memory? I recently started to get quite a lot of Firefox crashes, and definitely contributed to this statistic. In the end, the problem was indeed memory - crashes stopped after I tuned down some of the timings. And I used this RAM for a few years with my original settings (XMP profile) without issue.
Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5
This is a pretty big claim which seems to imply this is much more common than expected, but there's no real information here and the numbers don't even stack up:
> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?
The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.
The easy solution for that is to just do that analysis locally...
Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.
The memory issue may not necessarily be from bad ram, it can also be due to configuration issues. Or rather it may be fixable with configuration changes.
I had memory issues with my PC build which I fixed by reducing the speed to 2800MHZ, which is much lower than its advertised speed of 5600MHZ. Actually looking back at this it might've configured its speed incorrectly in the first place, reducing it to 2800 just happened to hit a multiple of 2 of its base clock speed.
>That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job.
CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?
In just about every way. CPU caches are made from SRAM and live on the CPU itself. Main system RAM is made from DRAM and live on separate chips even if they are soldered into the same physical package (system in package or SiP). The RAM still isn't on the SoC.
For one thing, static vs dynamic RAM. Static RAM (which is what's used for your typical CPU cache) is implemented with flip-flops and doesn't need to be refreshed, reads aren't destructive like DRAM, etc.
Caches and registers are also subject to bitflips. In many CPUs the caches use ECC so it's less of a problem. Intel did a study showing that many bits in registers are unused so flipping them doesn't cause problems.
At that level, they are not different. They could suffer from UE due to defect, marginal system (voltage, temperature, frequency), or radiation upset, suffer electromigration/aging, etc. And you can't replace them either.
CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures whereas the DRAM on a Macbook probably does not. But there is no objective standard for these things, and redundancy is not foolproof it is just another lever to move reliability equation with.
This matches what I have long said, which is that adding ECC memory to consumer devices will not result in any incredible stability improvement. It will barely be a blip really.
As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, of which a good proportion of which could be weeded out by running a memory tester for a while. So if you do that you're probably looking a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority software.
How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.
ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent with 10% more memory than ECC.
People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.
Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if they weren't responsible for most crashes by an order of magnitude...
Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.
Definitely going to hard disagree with Gabriele Svelto's take. I could point to the comments, however, let me bring up my own experiences across personal devices and organizational devices. In particular, note where he says this:
"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."
You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine have firefox crash a few hundred thousand times a day and he'll use my data to support his position. Further, see below:
First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.
Second: On both corporate and home devices, Firefox has shown to crash more often than Chrome/Chromium/Electron powered stuff. Only Safari on Windows beats it out in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are chromium based browsers such as edge and Chrome so much more reliable?
Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)
Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.
force-kills like sudo reboot will show UI on restart indicating it didn't shut down cleanly, but that isn't reported as a crash. You can see how often you actually crash via about:crashes (and also see what happened)
People I think are overindexing on this being about "Bad hardware".
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion
device hours per Mbit and more than 8% of DIMMs affected
by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There is DRAM which is mildly defective but got past QC.
There are power suppliers that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
Every so often when I'm doing refactoring work and my list of worries has decreased to the point I can start thinking of new things to worry about, I worry about how as we reduce the accidental complexity of code and condense the critical bytes of the working memory tighter and tighter, how we are leaning very hard on very few bytes and hoping none of them ever bitflip.
I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.
Might be worth doing the kind of "manual ECC" you're describing for a small amount of high-importance data (e.g., the top few levels of a DB's B+ tree stored in memory), but I suspect the biggest win is just to use as little memory as possible, since the probability of being affected by memory corruption is roughly proportional to the amount you use.
I've used it for many years. It only fixes physical hardware faults, not timing errors. For example if a RAM cell is damaged by radiation, not if you're overclocking your RAM.
However if the third chip on your memory stick is properly broken, then the third bit out of every word of memory may get stuck high or low, and then the whole chip is absolutely worthless.
The most expensive memory failure I had was of this sort, and frustratingly came from accidentally unplugging the wrong computer.
After this I did buy some used memory from a recycling center that had the sorts of problems you described and was able to employ them by masking off the bad regions.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
I was running my PC with bad memory for a few weeks last year. Firefox crashed a LOT, way more than any other application I used during that time, so I've probably contributed a decent amount to these numbers...
>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Everyone who has put serious effort into analyzing crash reports en mass has made similar discoveries that some portion of their crashes are best explained by faulty hardware. What percent that is mostly comes down to how stable your software is. The more bugs you have, the lower the portion that come from hardware. Firefox being at 10% from bad RAM just means that crashes due to FF bugs are somewhat uncommon but not nonexistent, which lines up with my experience with using FF.
IME, random bitflips is the engineer's way of saying "I'm sick and tired of root cause analysis" or "I have no fucking clue what the bug is." I, like others, remain skeptical about the claim.
We're not talking about unexplained bugs here. We're talking about a pointer that obviously has one bit flipped and it would be correct if you flipped that one bit back.
Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
In my case it doesn't seem to be related to system load. I have an issue where (mainly) using FF can trigger random system freezes on Linux, often with the browser going down first. But running CPU/memory stress tests, compiling things etc don't cause any errors and the cooler is downright bored.
Computers today have many GB of RAM, and programs that use it.
The more RAM you have, the higher the probabilty that there will be some bad bits. And the more RAM a program uses, the more likely it will be using some that is bad.
For my part I'm not sure I recall a crash having daily driven firefox in quite some time. I'd suspect that the large number of bit errors might be driven by a small number of poor hardware clients.
Based on what data?
According to their reporting they have around 200 Million monthly users, which seems compatible with 470k crashes a week?
See <https://data.firefox.com/dashboard/user-activity>
The nuance here is of cause that there are a bunch of people using multiple browsers.
Also I mean there are a lot of people using browsers on the world
If 10% of firefox users are also iOS users, which is not unlikely, then those people get double-counted. In my case I probably use my phone and tablet for at least 50% of my web traffic, not counting youtube, which also skews things.
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
https://wiki.guildwars.com/wiki/Guild_Wars_Reforged
I also was introduced to programming through Roblox.
I've read this decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn’t overlock ECC memory to be a bit confused. It’s the only memory you should be overlocking, because it’s the only memory you can prove you don’t have errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.
On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.
I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.
If we had a time series graph of this data, it might be revealing.
Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically supposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers that went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC ram, because ECC ram also tend to be clocked lower. Back in Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.
Any workstation where you are getting serious work done should use ECC
Funny you say this, because for a good while I was running OC'd RAM
I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)
The Turion64 chipset was the worst CPU I've ever bought. Even 10 years old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such a weird behavior, because it happened always around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or the other.
The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).
The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.
After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.
I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.
80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.
Better the user sees some lag due to state rebuild versus a crash.
Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.
It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.
I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.
However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.
On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.
It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.
Thats the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation may say corresponds to the MOV; or it may have been a legit memory operation, but instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips - the output is useless, including whatever telemetry you use (because it is code executed from ram and will touch ram).
I have at least one motherboard that just re-auto-overclocks itself into a flaky configuration if boot fails a few times in a row (which can happen due to loose power cords, or whatever).
But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.
I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).
Granted that I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I’m too loathe to troubleshoot with disabling all extensions and then trying it one by one. Why should an extension even cause a crash, especially when its a WebExtension (unlike the older XUL extensions that had a deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.
I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).
I would sincerely appreciate Firefox not crashing as often for me.
Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability is better handing of the other 90% of crashes.
If 50% of chrome crashes were due to bit flips, and bit flips effect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.
It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (
Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.
Edit: more context, I power cycle at least once a week on desktop and the version is typically a bit behind new. I also don't have more tabs open than will fit in the row. All these habits seem likely to decrease crashes.
this may have something to do with the fact that my laptop is from 2017, however.
The hardware bugs are there. They're just handled.
RAM flips are common. This kind of thing is old and has likely gotten worse.
IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.
ZFS is really good at spotting RAM flips.
I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!
> And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
The actual measurement is 5%. The 10% figure is entirely made up, with zero evidence or reasoned argument except a hand-wavy "conservative".
Edit: actually, the claim is even less supported:
> out of these ~25000 crashes have been detected as having a potential bit-flip. That's one crash every twenty potentially caused by bad/flaky memory
"Potential" is a weasel word here. We don't see any of the actual methodology. For all we know, the real value could be 0.1% or 0.01%.
> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.
Hardware problems are just as good a potential explanation for those as anything else.
Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.
[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568
[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...
[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...
(see: https://github.com/mozilla-firefox/firefox/blob/main/toolkit..., which points to a specific commit in that repo - turns out to be tip of main)
He doesn't explain anything indeed but presumably that code is available somewhere.
I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).
(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)
My take is that their 10% estimate is a lower bound.
[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...
>> DDR5 technology comes with an exclusive data-checking feature that serves to improve memory cell reliability and increase memory yield for memory manufacturers. This inclusion doesn't make it full ECC memory though.
"Proper" ECC has a wider memory buss, so the CPU emits checksum bits that are saved alongside every word of memory, and checked again by the CPU when memory is read. Eg. a 64 bit machine would actually have 72 bit memory.
DDR5 "ECC" uses error correction only within the memory stick. It's there to reduce the error rate, so otherwise unacceptable memory is usable - individual cells have become so small that they are not longer acceptably reliable by themselves!
DDR4 is not fully reliable memory either.
This is common for many high speed electrical engineering challenges: Running a slightly higher error rate option with ECC on top can have an overall lower error rate at higher throughput than the alternative of running it slow enough to push the error rate down below some threshold.
It makes some people nervous because they don’t like the idea of errors being corrected, but the system designers are looking at overall error rates. The ECC is included in the system’s operation so it isn’t something that is worthwhile to separate out.
"ECC" does not give you fully reliable RAM. UEs are still be observed.
What's the chance of fail? If you have one device that achieves equal performance with less reliable cells and redundancy to another device that uses more reliable cells without redundancy, it's not really any different.
NAND is horribly flaky, cell errors are a matter of course. You could buy boutique NOR or SLC NAND or something if you want really good cells. You wouldn't though, because it would be ruinously expensive, but also it would not really give you a result that an SSD with ECC can't achieve.
Has to be normalized, and outliers eliminated in some consistent manner.
https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.
If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.
I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.
This will bloat the code a bit.
> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?
This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.
I had memory issues with my PC build which I fixed by reducing the speed to 2800MHZ, which is much lower than its advertised speed of 5600MHZ. Actually looking back at this it might've configured its speed incorrectly in the first place, reducing it to 2800 just happened to hit a multiple of 2 of its base clock speed.
CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?
CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures whereas the DRAM on a Macbook probably does not. But there is no objective standard for these things, and redundancy is not foolproof it is just another lever to move reliability equation with.
As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, of which a good proportion of which could be weeded out by running a memory tester for a while. So if you do that you're probably looking a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority software.
How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.
ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent with 10% more memory than ECC.
People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.
Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if they weren't responsible for most crashes by an order of magnitude...
https://stackoverflow.com/questions/2580933/cosmic-rays-what...
"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."
You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine have firefox crash a few hundred thousand times a day and he'll use my data to support his position. Further, see below:
First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.
Second: On both corporate and home devices, Firefox has shown to crash more often than Chrome/Chromium/Electron powered stuff. Only Safari on Windows beats it out in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are chromium based browsers such as edge and Chrome so much more reliable?
Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)
Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.
Just my thoughts.
Also, the latest version of Safari for Windows was released in 2012. How old is your Firefox?
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There are power suppliers that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.
Unfortunately, not that many consumer platforms make this possible or affordable.
https://github.com/prsyahmi/BadMemory
I've used it for many years. It only fixes physical hardware faults, not timing errors. For example if a RAM cell is damaged by radiation, not if you're overclocking your RAM.
https://www.memtest86.com/blacklist-ram-badram-badmemorylist...
The most expensive memory failure I had was of this sort, and frustratingly came from accidentally unplugging the wrong computer.
After this I did buy some used memory from a recycling center that had the sorts of problems you described and was able to employ them by masking off the bad regions.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
These are potential bitflips.
I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.
My guess is that the software is riddled with edge-case bugs.
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
The more RAM you have, the higher the probabilty that there will be some bad bits. And the more RAM a program uses, the more likely it will be using some that is bad.
Same phenomenon with huge hard drives.
A bit flip actually needs to be pretty "lucky" to result in a crash.
Granted, they're probably just as accurate as netcraft. /shrug
67k crashes / day
claim: "Given # of installs is X; every install must be crashing several times a day"
We'll translate that to: "every install crashes 5 times a day"
67k crashes day / 5 crashes / install
12k installs
Your claim is there's 12k firefox users? Lol