2015-11-01 [danluu]
Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

1. Google didn't use ECC when they built their servers in 1999
2. Most RAM errors are hard errors, not soft errors
3. RAM errors are rare because modern commodity RAM is reliable
4. If ECC were actually important, it would be used everywhere and not just in servers

Let's take a look at these arguments one by one:
Not too long after Google put these non-ECC machines into production, they realized this was a serious error and not worth the cost savings. If you think cargo culting what Google does is a good idea because it's Google, here are some things you might do:
Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google's experiments don't always succeed. In fact, their propensity for “moonshots” in the early days meant that they had more failed experiments than most companies. Copying their failed experiments isn't a particularly good strategy.
Part of the post talks about how awesome these servers are:
Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today's internet. I felt right at home when I saw this server; it's exactly what I would have done in the same circumstances.
The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth issue that caused a non-zero number of fires. BTW, if you click through to Jeff's post and look at the photo that the quote refers to, you'll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems as well. Jeff's argument here appears to be that, if he were there at the time, he would've seen the exact same opportunities that early Google engineers did, and since Google did this, it must've been the right thing even if it doesn't look like it. But, a number of things that make it look like not the right thing actually made it not the right thing.
One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.
From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.

Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you're going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not something that was done in 1999.

When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it's ok to ignore ECC on these machines because it's ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you're exposing yourself to total corruption, unless you've done a very careful analysis to make sure that a single error can only contaminate a single result. In research that's been done on filesystems, it's been repeatedly shown that, despite valiant attempts at creating systems that are robust against a single error, it's extremely hard to do so, and basically every heavily tested filesystem can have a massive failure from a single error (see the output of Andrea and Remzi's research group at Wisconsin if you're curious about this). I'm not knocking filesystem developers here. They're better at that kind of analysis than 99.9% of programmers. It's just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a push-button process.
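The "single error contaminating everything" failure mode is easy to reproduce in miniature. Here's a hypothetical sketch (obviously not Google's actual index): one bit flip in one entry of a sorted index breaks the sortedness invariant that binary search relies on, silently corrupting lookups for half the keys, not just the one that was flipped.

```python
import bisect

def contains(sorted_list, key):
    # Standard binary search; silently assumes sorted_list is actually sorted.
    i = bisect.bisect_left(sorted_list, key)
    return i < len(sorted_list) and sorted_list[i] == key

index = list(range(0, 1000, 2))   # a tiny, sorted "search index" of 500 keys
corrupted = index.copy()
corrupted[250] ^= 1 << 30         # one bit flip, in one entry

# The single flipped entry breaks lookups for every key that binary search
# routes past it: 250 of the 500 keys are now reported as missing.
missing = sum(1 for key in index if not contains(corrupted, key))
print(missing)  # → 250
```

The point isn't this particular toy structure; it's that proving a single error can only contaminate a single result is exactly the kind of analysis the filesystem research above shows humans are bad at.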
In their book on warehouse scale computing, Google discusses error correction and detection and ECC is cited as their slam dunk case for when it's obvious that you should use hardware error correction. Google has great infrastructure. From what I've heard of the infra at other large tech companies, Google's sounds like the best in the world. But that doesn't mean that you should copy everything they do. Even if you look at their good ideas, it doesn't make sense for most companies to copy them. They created a replacement for Linux's work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel's server processors that lets you dynamically partition caches between cores. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you've already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.
The case against ECC quotes this section of a study on DRAM errors (the bolding is Jeff's):
Our study has several main findings. First, we find that approximately 70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults. Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.
This seems to betray a lack of understanding of the implications of this study, as this quote doesn't sound like an argument against ECC; it sounds like an argument for "chipkill", a particular class of ECC. Putting that aside, Jeff's post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn't so large that you can just ignore soft errors. Second, the post implies that Jeff believes that hard errors are basically immutable and can't surface after some time, which is incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you'll find they often use the same families of distributions to model failures. Moreover, if hard errors were immutable, they would generally get caught in testing by the manufacturer, who can catch errors much more easily than consumers can because they have hooks into circuits that let them test memory much more efficiently than you can in your server or home computer. Third, Jeff's line of reasoning implies that ECC can't help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.

So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. No one's going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won't catch as many errors as ECC will.
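To make the ECC/chipkill distinction in the quote concrete: ordinary ECC DIMMs use a single-error-correct, double-error-detect (SECDED) code over each memory word, while chipkill-class codes can survive the failure of an entire DRAM chip. Here's a toy Hamming(7,4) code, the textbook ancestor of SECDED, showing how any single flipped bit is detected and corrected on every read, with no memtest run required. This is an illustrative sketch, not the actual code real DIMMs use (those operate on 64 data bits with 8 check bits):

```python
def hamming74_encode(data):
    # data: four bits [d1, d2, d3, d4]; parity bits go at positions 1, 2, 4
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4        # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(codeword):
    # XOR together the positions (1-indexed) of all set bits; for a valid
    # codeword this syndrome is 0, and a single flipped bit makes the
    # syndrome equal to that bit's position.
    c = codeword.copy()
    syndrome = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            syndrome ^= pos
    if syndrome:
        c[syndrome - 1] ^= 1              # correct the flipped bit in place
    return [c[2], c[4], c[5], c[6]]       # extract [d1, d2, d3, d4]
```

Because correction happens on every access, a hard error shows up as a stream of logged corrected errors, which is exactly the "signal so hardware can be replaced" property discussed above.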
When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn't corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there's a good chance you'll perform a checksum operation on the already corrupted data and compute a valid checksum of bad data unless you're doing some really fancy stuff with calculations that carry their own checksums (and if you're that serious about error correction, you're probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn't detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don't have the kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.
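The "checksum computed separately, at a different time" point is the crux of why naive in-memory checksumming doesn't protect you, and it's worth spelling out. A toy sketch (hypothetical, not our actual system): if the bit flip happens before the checksum is taken, you get a perfectly valid checksum of bad data.

```python
import zlib

def flip_bit(data: bytes, bit: int) -> bytes:
    # Simulate a single-bit memory error.
    b = bytearray(data)
    b[bit // 8] ^= 1 << (bit % 8)
    return bytes(b)

record = b"account=42,balance=1000"

# Checksum taken early (ideally by a separate process on another machine):
good_crc = zlib.crc32(record)
corrupted_later = flip_bit(record, 3)
assert zlib.crc32(corrupted_later) != good_crc   # corruption is detected

# Checksum taken after the data was already corrupted in memory:
corrupted_first = flip_bit(record, 3)
bad_crc = zlib.crc32(corrupted_first)
assert zlib.crc32(corrupted_first) == bad_crc    # "valid" checksum of bad data
```

The second case is indistinguishable from a clean write: the corrupt record verifies forever, which is how corruption silently reaches persistent storage and backups.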
Jeff says
I do seriously question whether ECC is as operationally critical as we have been led to believe [for servers], and I think the data shows modern, non-ECC RAM is already extremely reliable ... Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives ...
and quotes a study. The data in the post isn't sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed above more common. For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next-generation DRAM as things continue to shrink. The 2012 study that Atwood quoted has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):
We're talking between 10 and 10k errors for a typical node that has a failure, and that's a cherry-picked study from a post that's arguing that you don't need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than the nodes in use now. For anyone who's used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood's assertion, is not a shockingly low number). If you take the most optimistic FIT rate, .057, and do the calculation for a server without much RAM (here, I'm using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of .057 * 1000 * 1000 * 8760 / 1000000000 = .5 faults per year per server. Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don't have errors at the start of the study but develop errors later on. So, in fact, the cherry-picked study that Jeff links contradicts Jeff's claim about reliability.

Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips.
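The back-of-the-envelope FIT arithmetic from earlier in this section can be written out explicitly. FIT here means faults per billion device-hours, quoted per Mbit of DRAM; the 0.057 rate and 128GB figure are the ones used above:

```python
fit_per_mbit = 0.057        # most optimistic fault rate from the study
ram_mbit = 128 * 1024 * 8   # 128 GB of RAM expressed in Mbit
hours_per_year = 24 * 365   # 8760 hours

# FIT is faults per 10^9 device-hours, so:
faults_per_year = fit_per_mbit * ram_mbit * hours_per_year / 1e9
print(round(faults_per_year, 2))  # → 0.52
```

So even at the most optimistic rate in the study, a modestly sized server expects roughly one memory fault every two years, and each fault can produce hundreds or thousands of errors per month.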
Of course there's no way to cover up this sort of thing forever, and when it came out, Sun's reputation for producing reliable servers took a permanent hit, much like the time they tried to cover up poor performance results by introducing a clause into their terms of service disallowing benchmarking.

Another thing to note here is that when you're paying for ECC, you're not just paying for ECC, you're paying for parts (CPUs, boards) that have been qual'd more thoroughly. You can easily see this with disk failure rates, and I've seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi's group had a SIGMETRICS paper a few years back showing that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. There's no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it's not about the interface. It's about buying a high-reliability server part vs. a consumer part. Maybe you don't care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.

[2024 update, almost a decade later]: looking at this retrospectively, we can see that Jeff's assertion that commodity parts are reliable, "modern commodity computer parts from reputable vendors are amazingly reliable", is still not true. Looking at real-world user data from Firefox, Gabriele Svelto estimated that approximately 10% to 20% of all Firefox crashes were due to memory corruption. Various game companies that track this kind of thing also report that a significant fraction of user crashes appear to be due to data corruption, although I don't have an estimate from any of those companies handy.
A more direct argument is that if you talk to folks at big companies that run a lot of ECC memory and look at the rate of ECC errors, there are quite a few errors detected by ECC memory despite ECC memory typically having a lower error rate than random non-ECC memory. This kind of argument is frequently made (it was detailed above a decade ago, and when I looked at this fairly recently while working at Twitter, there had been no revolution in memory technology that reduced the need for ECC relative to the rates discussed in papers a decade ago), but it often doesn't resonate with folks who say things like "well, those bits probably didn't matter anyway", "most memory ends up not getting read", etc. Looking at real-world crashes and noting that the amount of silent data corruption should be expected to be much higher than the rate of crashes seems to resonate with people who aren't excited by looking at raw FIT rates in datacenters.
One way to rephrase this is as a kind of cocktail party efficient markets hypothesis: this can't be important, because if it were, we would have it. Of course this is incorrect, and there are many things that would be beneficial to consumers that we don't have, such as cars that are designed to be safe instead of just getting the maximum score in crash tests. Looking at this with respect to the server and consumer markets, this argument can be rephrased as “If this feature were actually important for servers, it would be used in non-servers”, which is incorrect. A primary driver of what's available in servers vs. non-servers is what can be added that buyers of servers will pay a lot for, to allow for price discrimination between server and non-server parts. This is actually one of the more obnoxious problems facing large cloud vendors: hardware vendors are able to jack up the price on parts that have server features because the features are much more valuable in server applications than in desktop applications. Most home users don't mind, giving hardware vendors a mechanism to extract more money out of people who buy servers while still providing cheap parts for consumers. Cloud vendors often have enough negotiating leverage to get parts at cost, but that only works where there's more than one viable vendor. Some of the few areas where there aren't any viable competitors include CPUs and GPUs. There have been a number of attempts by CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that's a lot of time to spend on a doomed project).
The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm, they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first chip that has any hope of being competitive. I have high hopes for Qualcomm, as well as for other ARM efforts to build good server parts, but those efforts are still a ways away from bearing fruit. The near-total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple's impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I'll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.
Should you buy ECC RAM? That depends. For servers, it's probably a good bet considering the cost, although it's hard to really do a cost/benefit analysis because it's really hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.

For normal desktop use, I'm pro-ECC, but if you don't have regular backups set up, doing backups probably has a better ROI than ECC. But once you have the absolute basics set up, there's a fairly strong case for ECC for consumer machines. For example, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into backup. Speaking more generally, big companies running datacenters are probably better set up to detect data corruption, and more likely to have error correction at higher levels that allows them to recover from data corruption, than consumers are, so the case for consumers is arguably stronger than it is for servers, where the case is strong enough that it's generally considered a no brainer.

A major reason consumers don't generally use ECC isn't that it isn't worth it for them; it's that they have no idea how to attribute crashes and data corruption when they happen. Once you start doing this attribution, as Google and other large companies do, it's immediately obvious that ECC is worth the cost even when you have multiple levels of error correction operating at higher levels. This is analogous to what we see with files, where big tech companies write software for their datacenters that's much better at dealing with data corruption than the consumer software big tech companies write (and this is often true within the same company).
To the user, the cost of having their web app corrupt their data isn't all that different from the cost of having their desktop app corrupt their data; the difference is that when a web app corrupts data, it's clearer that it's the company's fault, which changes the incentives for companies.
If you allow any sort of code execution, even sandboxed execution, there are attacks like rowhammer which can allow users to cause data corruption, and there have been instances where this has allowed for privilege escalation. ECC doesn't completely mitigate the attack, but it makes it much harder.

Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off-the-cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven't explained terms, provided references, or checked facts in the level of detail that I normally do.
*[Jeff's argument here appears to be that, if he were there at the time, he would've seen the exact same opportunities that early Google engineers did, and since Google did this, it must've been the right thing even if it doesn't look like it. But, a number of things that make it look like not the right thing actually made it not the right thing.]: When someone looks in the answer key and says, 'I would've come up with that', that's often plausible when their answer is perfect. But when they say that after seeing a specific imperfect answer, it's a bit less plausible that they'd reproduce the exact same mistakes