AMD to fix slippery hypervisor-busting bug in its CPU microcode

AMD will release on Monday new processor microcode to crush an esoteric bug that can be potentially exploited by virtual machine guests to hijack host servers. Machines using AMD Piledriver CPUs, such as the Opteron 6300 family of server chips, and specifically CPU microcode versions 0x6000832 and 0x6000836 – the latest …


  1. edge_e
    Facepalm

    Learnt something new today

    you can alter cpu firmware from inside an operating system

    what could possibly go wrong

    1. bazza Silver badge

      Re: Learnt something new today

      It gets loaded at boot time. CPUs have been like this for a veeeery long time.

      The advantage is that bugs in the microcode (such as this) can be fixed. If the CPU didn't use microcode and suffered from this bug, the only fix would be a new CPU.

      Disadvantages - microcode means having instruction translation units in one's CPU, which require a bunch of transistors, which take power to run them. ARMs don't use microcode, which is one of the many architectural features that help them beat x86 CPUs on power consumption.

      1. Destroy All Monsters Silver badge

        Re: Learnt something new today

        And not only the CPU

        At Linux boot time, there may be several "firmware updates" of various stuff of the machine innards.

        This ain't your ZX81 anymore.

        1. bazza Silver badge

          Re: Learnt something new today

          "And not only the CPU At Linux boot time, there may be several "firmware updates" of various stuff of the machine innards."

          It's certainly quite astonishing to see just how many files are lurking in /lib/firmware. Makes one wonder if there's any such thing as a real piece of hardware these days. Seems like "hardware" is now firmware running on some sort of micro / cpu / etc. I guess the best I can hope for is that some of it is FPGA or CPLD images, but even that doesn't feel like a good old fashioned collection of dedicated transistors.
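For anyone who wants to see how many blobs their own machine carries, here is a rough Python sketch. It assumes the conventional Linux location `/lib/firmware`; the helper name `count_firmware_files` is made up for illustration, and the numbers will differ wildly between distros and installed packages.

```python
import os
from collections import Counter


def count_firmware_files(root="/lib/firmware"):
    """Return (total_files, Counter of file extensions) found under root.

    Hypothetical helper for illustration only. The default root is the
    usual Linux firmware directory, which may not exist on every system;
    a missing directory simply yields a count of zero.
    """
    total = 0
    by_ext = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            # Group files with no extension under a readable label.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_ext[ext] += 1
    return total, by_ext


if __name__ == "__main__":
    total, by_ext = count_firmware_files()
    print(f"{total} firmware files")
    for ext, n in by_ext.most_common(5):
        print(f"  {ext}: {n}")
```

Grouping by extension is only a crude way to see what kinds of images are in there; it says nothing about which device each blob belongs to.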

          1. Doctor Syntax Silver badge

            Re: Learnt something new today

            "It's certainly quite astonishing to see just how many files are lurking in /lib/firmware."

            Including new_code.bin and new_code_fix.bin in one directory. Nice to see explicit file naming practices being followed.

            1. fajensen
              Trollface

              Re: Learnt something new today

              Including new_code.bin and new_code_fix.bin ...

              If you need to ask what a blob is *for* then you are not smart enough to run it ....

    2. bazza Silver badge

      Re: Learnt something new today

      AFAIK no one has ever successfully tinkered with microcode. It's a security through obscurity thing on a very large scale.

      Pity though. It'd be cool to be able to load up microcode onto an x86 and make it execute, say, PowerPC instructions!

      1. Doctor Syntax Silver badge

        Re: Learnt something new today

        "AFAIK no one has ever successfully tinkered with microcode. It's a security through obscurity thing on a very large scale."

        My first reaction reading this was that someone who was able to get the old firmware loaded could then trigger the exploit. But I suppose anyone with that level of access wouldn't need to worry about finding exploits to use.

        1. Paul Shirley

          Re: Learnt something new today

          Nowadays flashing a BIOS from Windows is common. In theory you don't need physical access to hack the firmware; it's easier than trying to subvert the OS boot. Image checksumming/signing should still be a problem.

      2. Shaha Alam

        Re: Learnt something new today

        When you say "AFAIK no one has ever successfully tinkered with microcode." you may be interested to know that the NSA have been <redacted> ever since as long ago as <redacted> and can demonstrate the ability to completely <redacted> your <redacted> using nothing more than <redacted>, <redacted> and a cucumber.

    3. Anonymous Coward

      Re: Learnt something new today

      The problem is that a lot of so-called "sysadmins" don't bother to upgrade their systems' firmware - sometimes they learn the hard way, like the GitHub people; sometimes they never learn. So the OS tries to apply a patch.

      Once we were handed three almost-new big servers for our lab because "they don't work with the new OS". They just needed a firmware upgrade, as I quickly discovered. When the sysadmin who gave them to us saw them happily running the new OS, he asked for them back because he had to replace them with smaller ones for budget reasons... the answer was a big "f**********". When he realised he would have to admit he wasn't able to properly manage his systems and had spent more money because of his ineptitude, he preferred to keep quiet and leave us the servers...

  2. Snake Silver badge
    Thumb Up

    Excellent article

    Lovely analysis and, unlike so many tech blog posts including ones right here on El Reg, a useable explanation including the register dumps with an explanation of the suspected logic flow error thereby giving readers more than just "It went boom, now it's fixed" insight.

    Well done!

    1. GrumpenKraut
      Pint

      Re: Excellent article

      Excellent article indeed. Extra points for "A scrap of torn red silk left at the GDB process's murder scene.", lovely!

    2. Ian 55

      Re: Excellent article

      There's something Not Quite Right with the state of IT when an article here thinks it has to explain what a stack is, though.

      I don't expect everyone to have done assembly / Forth / PostScript / etc where using a stack is an essential part of the job description, but they're a basic data structure that anyone writing software should know about, aren't they?

      Next week: what an 'array' is...

      1. MrT

        "Next week: what an 'array' is..."

        ... it's what comes after "'ip! 'ip!", ain't it? Gawd blessya, sah!

      2. Nate Amsden

        Re: Excellent article

        And you are assuming all of the readers write code?

        For me at least the bulk of the article was WAY over my head. I do a bit of scripting here and there, but mostly server, network and storage management for the past 20 years (there is a big bright line between scripting and coding that I refuse to cross; knowing the ins and outs of what I know is a hell of a lot already). Sort of reminds me of what I used to browse over in BYTE magazine many years ago.

        I am aware of the stack terminology, though my understanding stops right about there. It's just not knowledge that is helpful for what I do.

  3. JeffyPoooh
    Pint

    "...execute data as software...

    "...execute data as software..."

    The Harvard Architecture really is better. :-)

    1. Destroy All Monsters Silver badge

      Re: "...execute data as software...

      No.

    2. 0x407ab506

      Re: "...execute data as software...

      Except all those extra pins.

      1. JeffyPoooh
        Pint

        Re: "...execute data as software...

        "...extra pins..."

        The ASCC Harvard Mark 1 was about 50 feet long, plenty of room for pins... ...as well as relays, switches, clutches, drive shafts...

        But yes, point taken. Of course, as history proves.

        Still, it is annoying how today's software generally scours any data file looking for something to execute (a huge oversimplification, but might as well be true). "Ooh look, code! RUN!"

        Why on Earth do we need to scan DATA files for malicious executable code?

        1. Anonymous Coward

          Re: "...execute data as software...

          "Why on Earth do we need to scan DATA files for malicious executable code?"

          Partly history, partly convenience (I know you know the answer, I'm just amplifying a bit.)

          In the ancient days of slow cpus, little memory and two-digit dates, some CPUs actually required self-modifying code. On the PDP-8, for instance, you put the return address (if I remember correctly) immediately before the start of a subroutine, so if the program is in ROM, no subroutines. Of course in those days we were taught that self modifying code was very bad but sometimes it was the only way to get the job done. (On the PIC, which is a degenerate Harvard architecture, program flexibility can be obtained by putting a new PC value into a register and branching to it, and there is an evil patch of program-writable ROM so you could in unlikely circumstances get a virus of sorts into an embedded microcontroller.)

          But the real rot set in with Microsoft's original idea of run everything everywhere. Microsoft programs had links all over the place - some undocumented of course - to enable unlikely things to be done. It provided lots of flexibility but had all the security of a safe made of gallium, resulting in much pain as Windows acquired security.

          As an offshoot of this, what happened if you wanted a document with a new document model and didn't want the inconvenience of having to install new code to handle it? In the days before widespread Internet, the answer was obvious: include the necessary extensions to use the new model in the document. And bingo, chocolate bank vault time again.

          Now we see the effect of trying to keep Moore's Law going as if it was a law of nature and not an enthusiastic guideline. CPU progress is so rapid that stuff enters the wild with microcode bugs. OK, borrow a trick from mainframes and make the microcode modifiable. If we have to release completely tested CPUs we'll never get anything out of the door ahead of the competition. Heck, cars get released nowadays and the bugs get fixed during servicing.

          One good thing about mobile phones is that their CPUs aren't a near monoculture with just a minor digression. We have Apple designs, Qualcomm, Kirin, Mediatek and Exynos, and probably others. If someone comes up with a truly terrible exploitable whoopsie, there's an alternative. So there's hope for the future, of a sort.

          1. Anonymous Coward

            Re: "...execute data as software...

            Actually, the protected mode of the x86 architecture allows for a clear distinction between code (executable) segments and data segments. You can also have read-only data segments. IIRC, you can have segments that are executable but not readable as "data", let alone writeable.

            But every OS preferred to get rid of the more complex (and somewhat slower) segmented model - and not only Windows, but Linux and others as well. All of them preferred simpler compilers and more portable kernels and applications, and thus preferred to load the segment registers once at the beginning - addressing the whole space - and then forget about them.

            Sure, there is a speed penalty when a segment register is loaded, exactly because of the added security. You see here that the later-added NX bit - required exactly because of the lack of "proper" segment usage - spotted the jump into data and stopped it.

            Sure, there is some software, i.e. scripting languages, and even p-code ones, that could need to execute "data" - but not all software needs it, and any software requiring it should be handled in a properly sandboxed environment.

  4. Anonymous Coward

    I'd have assumed that their test code suite would catch something like that...

    I think it's obviously a fairly trivial and straightforward exercise for the Test Dept boffins at AMD to semi-automatically create (even hundreds of MB of) self-test machine code. The sort of code that would exercise their CPU designs or prototypes and would report back such errors. Given how trivial and semi-automatic this exercise should be by now, the 'coverage' of their test code suite should be very nearly 100% by now. And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's.

    A blatant error such as popping the incorrect item off of a stack is the sort of thing that should be caught quite early and reliably in the process. I can't imagine why it wouldn't be caught, assuming that their approach to testing is as it really should be.

    My conclusion is that this is a double failure: 1) a bug, and 2) the failure of the AMD Test Dept to catch it.

    I'd almost be more worried about the latter.

    1. xenny

      Re: I'd have assumed that their test code suite would catch something like that...

      I'm unhappy about the latter, but then I consider the recent history of Intel CPU bugs that have been discovered, admittedly more in computational accuracy than basic stack operation, and I wonder about the test process in both cases.

      I do remember one of the P4 architects describing how they could no longer mentally anticipate how the CPU was going to behave in some circumstances though, so maybe this kind of thing is now just really really hard, and I don't have a good enough understanding of how one might go about designing a test suite.

      1. Anonymous Coward

        Re: I'd have assumed that their test code suite would catch something like that...

        xenny - how one might go about designing a test suite.

        For Functional (i.e. card edge, as opposed to In-Circuit with probes) testing of logic circuits, you (or your tool) build a set of input 'vectors' that can be propagated through the hardware to ensure that every point is exercised (0 & 1) *and* is propagated to the outputs. For a qualification of a CPU design, that would be the first infinitesimal % of the test.

        The book 'Fatal Defect' mentions that 'randomized' testing might be better than 'designed' testing, because it avoids making the same erroneous assumptions at Test Design as were made at UUT Design, so the randomized test might stumble across something unforeseen. Obviously it compares the UUT outputs against the 'Requirements', in an automated fashion. I'd say do both.
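The randomized approach described above can be sketched in a few lines of Python. This is a toy, not anything AMD is known to use: `ReferenceStack` plays the part of the 'Requirements', `BuggyStack` is a stand-in UUT with a deliberately planted defect, and all the names are invented for the example.

```python
import random


class ReferenceStack:
    """Obviously-correct model: the 'Requirements' side of the comparison."""

    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        # Popping an empty stack returns None rather than raising.
        return self.items.pop() if self.items else None


class BuggyStack(ReferenceStack):
    """Stand-in UUT with a planted bug: at depth 5 it pops the wrong item."""

    def pop(self):
        if len(self.items) == 5:
            return self.items.pop(-2)   # planted defect for the demo
        return super().pop()


def random_compare(uut_cls, steps=10_000, seed=0):
    """Drive the UUT and the reference with the same random operations.

    Returns the first step at which their outputs differ, or None if
    they agree for the whole run. The seed makes a failure reproducible.
    """
    rng = random.Random(seed)
    uut, ref = uut_cls(), ReferenceStack()
    for step in range(steps):
        if rng.random() < 0.5:
            value = rng.randrange(1000)
            uut.push(value)
            ref.push(value)
        elif uut.pop() != ref.pop():
            return step
    return None
```

Nobody had to think of "pop at depth 5" as a test case; the random walk gets there on its own, which is the point being made about avoiding shared assumptions.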

        Testing of hardware triggered interrupts would require specialized stimulation hardware to endlessly walk the timing of the interrupt back and forth relative to the clock phase, in tiny fractions of a picosecond (fun design that; perhaps mechanical!). This should be SOP due to the fundamentals of Setup and Hold timings for such asynchronous inputs; it must be checked.

        Since the interrupts are using the stack, then that area must be fully explored. As it's automated and GHz clock, that might need maybe an hour of run time.

        As an example, when one is testing memory, one doesn't test every possible combination of bits due to 'age of Universe' etc. (something I had to work out once, when asked how long it would take). One tests with Walking Ones, Walking Zeroes, etc. etc. Minutes, not 'age of Universe'.
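A minimal Python sketch of the walking-ones / walking-zeroes idea, assuming a plain mutable sequence stands in for the RAM under test. `check_memory` and `StuckBitMemory` are illustrative names, and a real memory test would add address-line and retention patterns on top of this.

```python
def walking_patterns(width=8):
    """Yield walking-ones, then walking-zeroes, patterns for a `width`-bit word."""
    mask = (1 << width) - 1
    for bit in range(width):
        yield 1 << bit                 # a single 1 walks through a field of 0s
    for bit in range(width):
        yield mask & ~(1 << bit)       # a single 0 walks through a field of 1s


def check_memory(mem, width=8):
    """Write and read back every pattern at every address.

    `mem` is any mutable sequence standing in for RAM (an assumption of
    this sketch). Returns a list of (address, written, read_back) failures.
    """
    failures = []
    for addr in range(len(mem)):
        for pattern in walking_patterns(width):
            mem[addr] = pattern
            got = mem[addr]
            if got != pattern:
                failures.append((addr, pattern, got))
    return failures


class StuckBitMemory(list):
    """Simulated faulty RAM: one data bit at one address is stuck at 0."""

    def __init__(self, size, bad_addr, bad_bit):
        super().__init__([0] * size)
        self.bad_addr, self.bad_bit = bad_addr, bad_bit

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value &= ~(1 << self.bad_bit)   # the stuck bit never takes a 1
        super().__setitem__(addr, value)
```

That is 2 x width patterns per word rather than 2^width, which is where the "minutes, not age of Universe" comes from.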

        As mentioned above, the test software executing on the UUT CPU should be massive by now (the year of our Lord, 2016). There's no reason it shouldn't be hundreds of MB of encoded tests, covering all the boundary conditions and lessons learned. They've been doing this for a while.

    2. Primus Secundus Tertius

      Re: I'd have assumed that their test code suite would catch something like that...

      In the 1980s I was taught how to develop microcode for a processor built by Norsk Data. It was hard. There were different objects within the CPU addressed by different fields within a very long instruction word. These objects had to be kept working together consistently, and with regard to their timing needs.

      It makes me wonder if things have evolved since then; whether perhaps one can do a software emulation of microcode; and whether such an emulation could be more rigorously tested.

    3. A Non e-mouse Silver badge

      Re: I'd have assumed that their test code suite would catch something like that...

      "I think it's obviously a fairly trivial and straightforward exercise for the Test Dept boffins at AMD to semi-automatically create"

      This isn't a bug that's caused by "Execute instruction X followed by instruction Y and get the wrong results".

      It's a very precise timing bug between an NMI and the processor being in a certain state. These corner cases are *very* hard to find and reproduce.

      As several other commentards have noted, CPUs nowadays are *very* complex and testing every state is almost impossible.

      1. Anonymous Coward

        Re: I'd have assumed that their test code suite would catch something like that...

        A Non e-mouse... ...timing bug between an NMI and the processor being in a certain state...

        Again!, "...And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's."

        I shouldn't have to explain this. The interrupts' timing can be (should be) walked back and forth. We did exactly this only weeks ago, admittedly at a coarser time scale appropriate to our purposes. This is Test 101, really basic.

        1. Anonymous Coward

          Re: I'd have assumed that their test code suite would catch something like that...

          Might want to watch the video linked at the end of the article. It's actually pretty cool. The guy even explains why leaving the CPU to test for weeks at a time still won't get you full coverage.

        2. Hugh McIntyre

          Re: I'd have assumed that their test code suite would catch something like that...

          Re: "...And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's. [...] I shouldn't have to explain this. The interrupts' timing can be (should be) walked back and forth."

          You're assuming that AMD (and the other CPU vendors) don't already do this - they do. Specifically random code running on a huge number of systems for weeks as well as all through the design process. And also directed tests where "the interrupt timing is walked backward and forward".

          But even with this it's not possible to cover every possible bug. While I've not seen the full details of this specific bug the article contains this hint for the VMware/ESXi bug report: "Under a highly specific and detailed set of internal timing conditions, the AMD Opteron Series 63xx processor may read an internal branch status register while the register is being updated, resulting in an incorrect RIP".

          So this is more complicated than just walking the NMI timing -- it only fails if the timing also hits "while the BSR is being updated", so you need other specific unlucky event(s) as well, and possibly requires other specifics such as a particular set of cache hits/misses or VM state to trigger the failing case. Put another way, the fact that Piledriver has been shipping for years with this bug only found now means that "running prototypes for weeks" does not cover everything, because there has been an enormous amount of random customer code running for a lot more than "weeks" on a lot more than prototypes before this bug was found.
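A toy Python model of why walking the timing alone is not enough. Everything here is invented for illustration (the slot number, the two boolean conditions, the function names) and makes no claim about the real Piledriver mechanism; it only shows how a sweep over one dimension can come back clean while the bug sits in the cross-product.

```python
from itertools import product


def bug_fires(nmi_slot, cache_miss, guest_mode, update_slot=37):
    """Toy failure condition, entirely made up for this sketch.

    The NMI must land on the exact slot where the register is being
    rewritten, and that rewrite only happens after a cache miss while
    the core is in guest mode.
    """
    return nmi_slot == update_slot and cache_miss and guest_mode


def sweep_timing_only(slots=1000):
    """Walk the NMI across every slot with the other state at its defaults."""
    return sum(bug_fires(t, cache_miss=False, guest_mode=False)
               for t in range(slots))


def sweep_everything(slots=1000):
    """Walk the NMI across every slot for every combination of other state.

    Returns (number of failing cases, number of cases tried).
    """
    cases = list(product(range(slots), (False, True), (False, True)))
    hits = sum(bug_fires(t, miss, guest) for t, miss, guest in cases)
    return hits, len(cases)
```

Here `sweep_timing_only()` finds nothing, while `sweep_everything()` finds the single failing case among 4,000. Two extra booleans only quadruple the work; a real core has vastly more internal state than that, so the cross-product cannot be enumerated.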

    4. Paul Shirley

      Re: I'd have assumed that their test code suite would catch something like that...

      "Test coverage should be a long string of '9's."

      You're assuming it isn't. And as long as any 9's are in the test, bugs can get through.

      The alternative is never shipping complex products.

    5. fajensen

      Re: I'd have assumed that their test code suite would catch something like that...

      Well, it works like this: You have good QA, the process works, you have good people and very few recalls. Then the bean-counters arrive. They quickly conclude that since there are very few recalls and other disasters, there is not an urgent need for QA so the budget is cut ... and cut ... and cut. Then there is a major recall and scores of consultants come in to set up a QA system ....

      1. Anonymous Coward

        Re: I'd have assumed that their test code suite would catch something like that...

        "They quickly conclude that since there are very few recalls and other disasters, there is not an urgent need for QA so the budget is cut ."

        Good job this kind of management behaviour is completely unacceptable in highly regulated safety critical industries such as avionics and automotive (can't speak for medical and other potential candidates).

        Now, where's my flying pig.

        :(

        1. asdf

          Re: I'd have assumed that their test code suite would catch something like that...

          >highly regulated safety critical industries

          Who never hire lobbyists to get rid of some of that regulation huh? Eventually the Boomers will be gone and the fresh out MBAs will fix us good.

  5. Anonymous Coward

    Slight Schadenfreude

    Recently I posted that a "firewall" running on the same host as other VMs wasn't really a firewall. A number of people were kind enough to upvote me: I got at least one response saying that what I described couldn't happen. One was from someone who claimed to have been involved in writing hypervisors.

    I hope that guy is reading this article.

    1. Anonymous Coward

      Re: Slight Schadenfreude

      Well, ideally it can't happen. Unfortunately, people design these CPUs and their "firmware", and humans do make mistakes, thus it becomes possible.

      1. Anonymous Coward

        Re: Slight Schadenfreude

        "Well, ideally it can't happen."

        Only in a[1] world in which Murphy's Law isn't correct.

        [1] pink unicorn fuelled fantasy

        1. Anonymous Coward

          Re: Slight Schadenfreude

          Ohh yes, and I've just about had it to HERE with Murphy and his bloody partner-in-crime, O'Toole!

          (In the middle of projects, in the middle of business negotiations, then one of our key people manages to crash his bicycle and kill himself. But I digress…)

          1. Anonymous Coward

            Re: Slight Schadenfreude

            SL: "...one of our key people manages to crash his bicycle and kill himself."

            Crikey. Condolences.

          2. Anonymous Coward

            Re: Slight Schadenfreude

            "Murphy and his bloody partner-in-crime, O'Toole!"

            Sorry to hear that.

            You failed to mention O'Reilly's Addendum:

            "and anything that cannot go wrong is malicious."

    2. Voland's right hand Silver badge

      Re: Slight Schadenfreude

      I am reading it :)

      You got correctly downvoted then and there. Let me explain why (as someone who has _WRITTEN_ hypervisor software in use for virtual routing and firewalls).

      This is no different from any firmware or CPU bug. You can break out of protected mode, exploit buggy network card firmware, etc. If anything, virtualization, when used correctly provides an _ADDITIONAL_ layer of protection.

      By the way, from that perspective, in the specific cases of virtual routing and firewalls you are better off considering forms of virtualization which use as few hardware acceleration features as possible. Sure, you pay in terms of absolute performance. You get it back in terms of maintainability and security. If you do it _THAT_ way, your virtual firewall is actually more secure than one running on bare metal, as you have one more layer of "defense in depth". It is more maintainable too. That is also exactly the use case I would advocate for (and what I used to do for a living). I would also not go schadenfreude-ing on every single firmware bug as a reason to invalidate the whole concept.

      This is no different from the line Cisco tried to drill into all of its indoctrinates ~10 years ago: that PIX is more secure than firewalls which use a combined kernelspace + userspace model, because it runs everything privileged in a monolithic system. That, as we all know, is bollocks. Sure, you get a bug from splitting things once in a while - that is still better than doing everything in one blob.

      By the way, looking at the bug, it offers a specific exploitation route in kvm. That does not mean that it does not have an exploitation route outside the virtualization domain. There are a gazillion ways to trigger an NMI on a NUMA system. In fact, I have some userspace, unprivileged code lying around somewhere which will kill any older (and probably newer) 2+ socket Xeon running Linux within 15 seconds by hard fault through an NMI storm. It is not that difficult.

      1. Anonymous Coward

        Re: Slight Schadenfreude

        I haven't looked that closely at the drivers available in KVM, but normally use the VirtIO drivers for performance.

        From a security standpoint, are there better options available in KVM or are they more-or-less all "accelerated"?

      2. asdf

        Re: Slight Schadenfreude

        >consider forms of virtualization which use as little as possible in terms of hardware accel features

        Not really related but I know the people working on fq_codel didn't have a lot of nice things to say about NIC offloads and what they did to latency in the name of throughput.

  6. Chairo
    Thumb Up

    The really incredible thing is...

    .. that they didn't just re-start the build and were happy when it finally passed, but they sat back, gathered all evidence and deep-dived into what went wrong, all the way to the microcode.

    I'd say that's impressive.

    1. Richard 12 Silver badge

      Re: The really incredible thing is...

      It's lucky that it was in a VM.

      A guest taking down the host is a big and clear WTF!? as it's supposed to be impossible.

      In an organisation that knows what it's doing, that's an immediate "We need to know why" - it's a serious bug!

      1. asdf

        Re: The really incredible thing is...

        >In an organisation that knows what it's doing

        As the original poster said, it was luck that it hit just such a rare organization; or, more likely, it has hit many others and this is just one of the first to pay attention.

  7. amanfromMars 1 Silver badge

    Out of the Darkness Comes Light ‽ .

    The other ingredient in this saga is virtualization: the OpenSUSE build server was compiling GDB and testing it in a QEMU-KVM virtual machine. That means an unprivileged user in a guest virtual machine merely building software was able to trigger an "oops" in the host server's kernel. That's not good.

    Is there a rapid escalation and elevation and expansion of the facility/utility/vulnerability/call it whatever you will, whenever an underprivileged user, not merely just building virtual kernel software with SsecuredDwares, realises the triggering potential and APT ACT portal is always present and a critical element/component in Intel designs?

    Is the logical solution, to mitigate and cover the risk whenever the problem is inherent and core value, to raise said underprivileged user privileges to build ….. well, to be more than just sure of future security in all manner of practically real and virtual machine operating systems, both guests and safer hardened kernels?

    Or would a real world fear of remote transfer of virtual command and control to unknown forces and anonymous sources cause an almighty all systems crash?

    Wow ….. that be so much more than just a right bugger of a bug in systems to debug, methinks. Wherever would one start? And that make it a very valuable, fortunate weapon too, methinks. Do you also think it so?

    Here's somebody else who realises the problems ahead ... http://www.zerohedge.com/news/2016-03-06/were-eye-storm-rothschild-fears-daunting-litany-problems-ahead

    1. Anonymous Coward

      Re: Out of the Darkness Comes Light ‽ .

      Hey man, can I have some of your microcode? I promise not to inject it.

      P.S. Obligatory xkcd reference

    2. Dan 55 Silver badge
      Mushroom

      Re: Out of the Darkness Comes Light ‽ .

      Today's breaking news from Zerohedge: The apocalypse is coming sometime soon.

      1. fajensen

        Re: Out of the Darkness Comes Light ‽ .

        ... and we should buy some Gold?
