Cloud users and data center managers began to understand the performance effects of the patches required by the Meltdown/Spectre chip vulnerabilities over the weekend. While it seems initial fears of a 30 percent hit were overblown, there are plenty of users wondering what to do next after realizing they’re not getting everything they expected out of their hardware.
The impact of the patches required to address vulnerabilities in just about every chip in a data center does seem to vary by workload, as Intel and cloud vendors predicted last week when rolling out fixes following the disclosure of the Meltdown and Spectre vulnerabilities. First discovered last summer, the two hardware vulnerabilities are the result of a design flaw in modern processors that could allow malicious hackers to read sensitive data or eavesdrop on other applications running on a shared server.
The mass of vendors we’ll call Big IT scrambled to assure its customers that the patches, which add some degree of overhead to operating systems and virtualization software depending on the type of workload, would not cause widespread performance problems. For the most part, that seems to be the case, as John Graham-Cumming, who oversees a huge network of servers as CTO of Cloudflare, reported:
We continue to test various patches for #meltdown and #spectre but impact on @Cloudflare infrastructure appears to be negligible.
Impact on SRE and engineering time is a different matter… lots of effort going into testing and validating.
— John Graham-Cumming (@jgrahamc) January 5, 2018
Yet there are definitely people having meetings this week to discuss what to do about the fact that their infrastructure strategy is now somewhere between one and 20 percent less effective.
Real information about tech performance can be surprisingly hard to come by, in part because performance can vary quite a bit across complex tech infrastructures, even ones designed to power the same industries.
Vendors only want to highlight the best parts of their products, of course, and tech organizations can be reluctant to advertise their own issues to customers or competitors, regardless of the source. And these days, the contractual relationships between suppliers like Intel and customers like the cloud vendors are draped in legalese and non-disclosure agreements that prevent companies involved from talking freely about the situation.
But those who are sharing information have identified several common and prominent workload scenarios that seem to be getting hit the hardest. Representatives for Amazon Web Services and Microsoft either didn’t respond to a request for comment or declined to comment beyond the statements they issued last week about the expected performance impact of their patches.
Google issued this statement, which seems to suggest any performance problems are your fault:
“The performance impact seen by a cloud customer depends on two things: the mitigations put in place by the cloud provider, and any additional mitigations the customer chooses to deploy. We designed our mitigations for Google Cloud to have minimal impact on performance, and tested them before rolling them out. Customer-deployed mitigations may have different implementation strategies or offer varying protection against potential attacks and thus vary in their performance impact.”
Red Hat compiled a list of the performance impacts that users running Red Hat Enterprise Linux 7 are seeing from the patches added last week, organized by the type of application.
The worst impact?
Measureable: 8-19% – Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted between 8-19%. Examples include OLTP Workloads (tpc), sysbench, pgbench, netperf…
One of those workloads, OLTP (online transaction processing), is a fairly basic type of database workload used to, well, process transactions online. Any application that needs to request data from the operating system kernel on a regular basis is going to be hit by the mitigations necessary to deal with Meltdown and Spectre, since until quite recently operating systems assumed the chip could handle those requests securely.
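That kernel-transition tax is easy to see with a toy microbenchmark. The sketch below (an illustrative example, not one of the Red Hat benchmarks) times a loop of tiny `read()` system calls against a loop that stays entirely in user space; with kernel page-table isolation enabled, every entry into and exit from the kernel gets more expensive, so only the syscall-heavy loop should slow down after the patch.

```python
import os
import time

def syscall_heavy(n):
    """Issue n one-byte reads from /dev/zero. Each read is a system call,
    so each iteration crosses the user/kernel boundary -- exactly the
    transition that page-table isolation makes more expensive."""
    fd = os.open("/dev/zero", os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n):
            os.read(fd, 1)  # one syscall per iteration
        return time.perf_counter() - start
    finally:
        os.close(fd)

def user_space_heavy(n):
    """A comparable loop that never enters the kernel, for contrast."""
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i & 0xFF  # pure user-space arithmetic
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 200_000
    print(f"syscall-heavy loop: {syscall_heavy(n):.3f}s")
    print(f"user-space loop:    {user_space_heavy(n):.3f}s")
```

Running it on the same machine before and after applying the patch (or with the mitigation toggled off via a kernel boot parameter) gives a rough feel for why syscall-bound databases land in Red Hat’s worst bucket while compute-bound code barely moves.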
Red Hat said Java virtual machines fell into a lesser category where users could expect a three-to-seven-percent impact on their workloads, but the impact on some Java users was more severe:
The fix for the #Intel CPU vulnerabilities has a #brutal effect on compile times.
Compiling the #syslog_ng package on #Fedora changed drastically: from 4 minutes to 21!
As far as I can see compiling #Java is affected most.#Meltdown #Spectre
— Peter Czanik (@PCzanik) January 5, 2018
The fixes, applied directly to the Linux kernel and other operating systems, appear to also be hitting performance for applications (Kafka data streams, in the example below) that run on hardware virtual machines, as an AWS engineer confirmed lower in that Twitter thread.
The #Meltdown patch (presumably) being applied to the underlying AWS EC2 hypervisor on some of our production Kafka brokers [d2.xlarge]. Ranges from 5-20% relative CPU increase. Ooof. pic.twitter.com/fXM0OzfdKx
— Ian Chan (@chanian) January 6, 2018
So now comes the hard part. Cloud vendors and operating system vendors have emphasized that the protections they rolled out last week and early this week are preliminary, and they expect to develop better methods to deal with these vulnerabilities over time.
Meltdown, the Intel-specific vulnerability, will likely fade into the background as companies apply the patch, which protects users from subsequent attacks, and adjust to the new level of performance. But the remedies for Spectre (other than the multiyear process now underway to redesign just about every chip in the world) only make exploiting that vulnerability more difficult, as opposed to closing the hole. And they only address known ways of exploiting Spectre; you can bet all your bitcoin that somebody, somewhere, is racing to find a new exploit for it.
That means cloud vendors and operating system companies will probably have to roll out additional fixes for Spectre-related issues until the installed base of data center chips turns over to new chips that lack these flaws, which will take forever given that Intel has roughly 95 percent of the data center chip market.
That also means the cost of operating tech infrastructure is going to be a bit higher for a lot of companies in 2018. They didn’t necessarily plan on that, and tech vendors are going to have to make decisions about whether or not they pass the cost of that additional infrastructure on to their customers.
We’ve seen CPU usage go from ~20% to ~40% (and now critical machines with redundancy upscale under loads that before didnt made them blink). Costs this month in AWS will go up 10%, I predict (very least, haven’t checked EMR effect yet, if similar, 20-30%) #spectre #meltdown #fb
— Ruben Berenguel, PhD (@berenguel) January 6, 2018
AWS, Microsoft, and Google either did not respond to a request for comment, or declined to comment, on their plans for customers facing increased costs or decreased performance due to factors those customers could not possibly have controlled.