I must say I’m encouraged by the flow of novel ideas here, some of which may well be viable. Even the CPU vendors will listen to reason, given an overwhelmingly convincing proof of their incompetence and a map out of hell.
First of all, the reason I advocate killing speculation entirely is: (1) it will kill Spectre; (2) while, yes, all cores will slow down, most day-to-day workstation problems become bounded by frontside-bus speed (or IO speed) when run at sufficient scale, so we have more and more time to spare as the core count increases; (3) as @vmedea pointed out, it would free up silicon real estate; and (4) doing so does not preclude an asymmetric architecture with a few brawny cores and lots of wimpy ones, since we could, for example, simply boost the cache sizes of the former to great effect.
However, I’m also sober about the CPU vendors’ assumption that, even when security matters, it’s likely to be overridden by the dodo birds in purchasing who want to impress the boss with the highest-performance tools. In game-theory terms, this is a tragedy of the commons: everyone benefits if everyone does the right thing (kill speculation, add cores, and let Moore’s Law wash our tears away); but the bad guys win if only one good guy disables it, because buyers are generally ignorant of the tradeoff; so in the end it’s likely that no one disables it. At best, as @kieran suggested, we might get a kill switch (which would require all sorts of yucky interprocessor synchronization, but might just work). Given the choice between that and nothing, I’ll take it. I will pay more for a CPU that can disable this clowning, even if the OS hasn’t quite figured out how to do it properly yet. (No doubt Intel is hanging on my every word!)
I do like @Fan’s suggestion that we reboot the entire industry (yes, it might take that) from secure primitives all the way back up to full stacks, but I don’t see it happening, because laziness is the way of the world. At best, perhaps we could force timing contracts on apps, e.g. “The OS agrees to tell you the wall-clock time if you agree that you will be terminated before the next task switch occurs.” But the devil is in the details. I for one don’t see how to provide real timing information (even local timing) to an app without risking that it learns something about neighboring processes. Even the idea of constant timeslices doesn’t work, because I can still count how many memory accesses I can complete within such a slice, and thereby glean the same information.
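To make that concrete, here’s a minimal sketch (mine, purely illustrative, POSIX threads assumed) of the classic workaround: if the OS won’t hand you a fine-grained clock, spin up a counting thread and use it as one. Every name and number here is made up for illustration.

```c
#include <inttypes.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Shared counter incremented as fast as possible by a helper thread.
 * Reading it before and after an operation gives a crude clock even
 * if the OS refuses to expose fine-grained wall time. */
static atomic_uint_fast64_t ticks;

static void *counter_thread(void *arg)
{
    (void)arg;
    for (;;)
        atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    static volatile uint8_t probe[64];  /* the memory whose latency we "time" */
    pthread_t t;

    pthread_create(&t, NULL, counter_thread, NULL);

    uint_fast64_t start = atomic_load_explicit(&ticks, memory_order_relaxed);
    (void)probe[0];                     /* the access under measurement */
    uint_fast64_t elapsed =
        atomic_load_explicit(&ticks, memory_order_relaxed) - start;

    printf("elapsed ticks: %" PRIuFAST64 "\n", elapsed);
    return 0;
}
```

The point is simply that any concurrently running code is itself a timer, so clamping the clock APIs mostly inconveniences legitimate apps.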
But let’s rewind a bit, because it’s important to remember what Spectre is: a family of security vulnerabilities which exploit timing information to leak data. The only difference between one Spectre sibling and another is the manner in which it induces that information to leak, and across which particular wall it seeks to leak it. The archetypal exploit goes like this: (1) using conditional branches at locally controlled addresses, prime the branch target buffer (BTB) so that the code on the other side of the wall will mispredict a branch (which is possible if the attacker knows how local addresses map to BTB entries, a mapping that is in general dangerously simple, so even address space layout randomization (ASLR) doesn’t help much), causing it to use a sensitive piece of data to generate a speculative memory address; (2) after many iterations of this experiment, and using timing information provided by the OS, determine which lines have been (un)loaded to or from the cache as a result of that misprediction, simply by measuring memory latency to the relevant addresses; (3) work backward from the cache tags corresponding to the affected lines (which are deterministically, if weakly pseudorandomly, derived from the target addresses) to determine what sensitive data had been misinterpreted as an address; and (4) repeat until sufficient data has been extracted, such as all the bytes of a password. This is best done with fine-grained timing measurements, but over longer periods of time, perhaps spread across multiple sessions, no amount of timer jitter is enough to hide the signal; even Javascript will suffice. Unfortunately, a lot of effort has been put into decreasing timing resolution in order to thwart Spectre. This slightly raises the bar but cripples a lot of productive apps which require finer timing. It’s a mistake that should be reversed.
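For the curious, here’s a stripped-down sketch of step (2) only, the flush-and-measure probe, with a plain load standing in for the victim’s misspeculated access. It assumes an x86-64 machine (clflush and rdtscp), the latency threshold is machine-dependent, and it deliberately omits the branch-training of step (1); it’s illustrative, not a working exploit.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               /* _mm_clflush, __rdtscp */

#define STRIDE 4096                  /* one page per possible byte value */
static uint8_t probe[256 * STRIDE];

/* Time a single load; a cached line comes back much faster. */
static uint64_t time_load(volatile uint8_t *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

int main(void)
{
    /* 1. Flush every candidate line so they all start uncached. */
    for (int v = 0; v < 256; v++)
        _mm_clflush(&probe[v * STRIDE]);

    /* 2. Stand-in for the victim: a (mis)speculated access indexed by a
     *    secret byte would pull exactly one of these lines into the cache. */
    int secret = 0x42;               /* hypothetical leaked value */
    *(volatile uint8_t *)&probe[secret * STRIDE];

    /* 3. Measure each candidate; the cached one stands out as fast. */
    for (int v = 0; v < 256; v++)
        if (time_load(&probe[v * STRIDE]) < 120)
            printf("likely secret byte: 0x%02x\n", v);

    return 0;
}
```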
The easiest mitigation would be to prevent the BTB from being primed in any useful way. One popular “solution” is to reset the BTB state when switching tasks, so that all priming will have been forgotten upon arrival at the other side of the wall. When the original task is resumed later, the BTB will need to retrain, which costs only a small performance penalty. Unfortunately, this doesn’t work either, because the attacker can simply choose those branch locations in the target binary which will be primed the wrong way by virtue of the BTB reset itself. (With no BTB history for a given branch instruction, speculation should be disabled the first time that branch is encountered. But AFAIK it’s just set to speculate (un)taken – whatever the default happens to be.) If making the first encounter of a branch nonspeculative would imply too many silicon changes, then a better approach would be to randomize (un)taken on that first iteration. So, when switching tasks, set all the BTB predictions pseudorandomly (in a trapdoor way which is not predictable to an observer at any privilege level). That way, an attacker couldn’t easily determine whether specific cache lines had been (un)loaded due to bona fide accesses or misused data. (This isn’t quite solid, which I’ll get to below.)
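To illustrate what I mean by a trapdoor randomization of the cold prediction, here’s a toy model in C. In reality this would live in silicon; the mixing function, constants, and names are mine and purely illustrative, and a real design would need a keyed function that an observer can’t invert even with many samples.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t task_secret;      /* re-drawn from a hardware RNG on task switch */

void on_task_switch(uint64_t fresh_entropy)
{
    task_secret = fresh_entropy;  /* old predictions are now meaningless */
}

/* splitmix64-style mixer, standing in for a proper keyed PRF. */
static uint64_t mix(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* First-encounter (cold) prediction: taken or not-taken, decided by a
 * secret-keyed hash of the branch address rather than a fixed default. */
bool cold_prediction(uint64_t branch_address)
{
    return mix(branch_address ^ task_secret) & 1;
}
```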
Another approach would be to create a per-task random mapping of memory-address hashes to cache lines – in other words, a per-task cache-tag scheme. This is a mess to implement, though, as it would require memory accesses in order to redirect other memory accesses, rather like a translation lookaside buffer (TLB) lookup – but that might create yet other Spectre vulnerabilities, so we would be stuck with an unwieldy register implementation. It’s theoretically possible, but much harder than the BTB route, and its only advantage is a lack of BTB disturbance.
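As a similar toy model, the per-task cache-index idea might look like this: the set index is no longer the usual low address bits but a keyed hash of the line address, with the key rotated per task, so one task’s priming of a set says nothing about where another task’s lines land. Again, the hash and constants are placeholders.

```c
#include <stdint.h>

#define NUM_SETS 1024U                 /* power of two, for the mask below */

static uint64_t cache_key;             /* per-task secret, set on task switch */

static uint64_t scramble(uint64_t x)   /* 64-bit finalizer-style mixing */
{
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

/* Map a physical address to a cache set via the keyed hash instead of
 * the plain address bits. */
unsigned set_index(uint64_t phys_addr)
{
    uint64_t line = phys_addr >> 6;    /* drop the 64-byte line offset */
    return (unsigned)(scramble(line ^ cache_key) & (NUM_SETS - 1));
}
```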
The BTB randomization approach sets a very high bar, but perhaps the code under attack has multiple potentially leaky branches, in which case the attacker might just have to look at a larger number of cache lines over more wall crossings in order to ascertain the sensitive data. This is why I think the page approach is the most robust, short of a speculation kill switch: it disallows cache-line (un)loading until the CPU knows what all preceding branches should actually have done. For that matter, speculation isn’t actually disabled; only memory accesses are, so register computations, including floating-point, could still happen speculatively, which still helps performance. It does, however, prevent data from being misused for memory access, so the cache state will provide no information after the fact, provided that the attacked code follows a data-agnostic execution path, as any competent crypto library surely does. (This means you don’t touch matrix row 0 if the next key bit is 0 and row 1 if it’s 1; instead, you touch both of them and just nullify one after loading into registers, so that total execution time is essentially constant and the memory access pattern is always the same, assuming no speculation.) Speculation could safely be re-enabled only for driver code that didn’t touch the data but merely pointed a DMA engine at it for IO transfer – or on machines without access to sensitive data, whatever that means to the user. It’s also possible under this scheme to have trusted and untrusted apps, which speculate or not, respectively, but that would require user intervention and considerable due diligence, above all because even trusted engineers are themselves still susceptible to compile-time malware injection.
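For concreteness, here’s the data-agnostic row selection in C: both candidate rows are always read, and the key bit only chooses between them by masking in registers, so the memory trace (and, absent speculation, the cache state) is independent of the key. This is my own illustration, not lifted from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

#define ROW_WORDS 8

void select_row(uint64_t out[ROW_WORDS],
                const uint64_t row0[ROW_WORDS],
                const uint64_t row1[ROW_WORDS],
                unsigned key_bit)                 /* 0 or 1 */
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(key_bit & 1);  /* 0 or all-ones */
    for (size_t i = 0; i < ROW_WORDS; i++) {
        uint64_t a = row0[i];   /* touch both rows unconditionally */
        uint64_t b = row1[i];
        out[i] = (a & ~mask) | (b & mask);        /* blend in registers */
    }
}
```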
OK, so where’s my speculation-disable bit?! Oh, I forgot, that was only in a dream. Maybe the next best thing is a really obtuse compiler switch (ahem, GCC people?) that inserts memory fences prior to all the branch instructions. (I seem to remember that even that doesn’t necessarily work, in which case we’d need bona fide serializing instructions like CPUID and friends. Gross. But maybe acceptable if we still use hardware acceleration for most of the crypto (AES) tasks. At least that would cut down on the most damaging attacks.)
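Short of such a switch, here’s the kind of hand-placed fence I have in mind, using the LFENCE intrinsic that GCC and Clang already expose; whether fencing every branch automatically works, or is even available as a flag, I’m less sure, so treat this as a sketch rather than a recommendation.

```c
#include <emmintrin.h>     /* _mm_lfence */
#include <stddef.h>
#include <stdint.h>

static uint8_t table[16];
static size_t table_len = sizeof table;

uint8_t safe_lookup(size_t i)
{
    if (i < table_len) {
        _mm_lfence();      /* speculation barrier: the load below cannot
                              issue until the bounds check has resolved */
        return table[i];
    }
    return 0;
}
```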