A lot has been learnt since Project Trustless TEE got going. As we start to formalise our initiative, it’s important that we have a sense of the problems we want to address, their time horizon, and how important they are. There’s definitely still a lot of work to do in building our understanding of the problem space, and the problems will certainly change over time, but I thought I’d provide an initial update on my view of what’s going on.
For anyone new who is reading this: I refer to some teams in this post. You can find more here:
What’s Happening Now
Currently, we - the community, loosely defined - are trying to solve 4 problems which we can divide into two categories:
Supply chain
Verification
How do we guarantee that a chip isn’t faulty and doesn’t contain any trojans?
We’ve gone into this topic in depth in ZTEE: Trustless Supply Chains - Trustless TEEs - The Flashbots Collective and we have some ideas for attaining a pretty high bar in this category. However, this will take a good deal of time, so I’m focusing on the first milestone here. The most easily attainable setup is one in which the RTL/netlist is public and a set of trusted entities (e.g. some university labs) are provided with the GDS and any other sensitive material. These entities would be expected to sample chips and attest to their correctness while the sensitive material remains non-public. Before we can be sure of this, we need to confirm that no advanced open PDKs are coming online soon, as some have reported may be the case.
On a technical level, there are two main ongoing efforts required to get us there:
- A useful IC with open RTL & netlist needs to be designed. This requires navigating patents and likely designing several components for which there are no good open implementations. The Fabric team is seriously considering this undertaking and is well-positioned to do this given the size of their team.
- Currently, we do have some options for imaging technology, so there isn’t a strict necessity to make advances here. Nonetheless, more effective, non-destructive and cheaper imaging technology would allow us to provide better guarantees by imaging more chips and by incorporating a wider set of actors into the set of verifiers. Our destructive options are SEM and FIB, but there are non-destructive alternatives like the $1M Hamamatsu iPhemos FA. Our best bet on improving over these technologies is Bunnie’s work on IRIS. General improvements to IRIS are useful (especially since it’s not an off-the-shelf technology yet), but we also need to verify how good IRIS is in its current form. With this aim, the folks at UC Louvain have sent some chips to Bunnie for imaging to compare with what their SEM produced.
The feasibility of different approaches will depend on the designs we are trying to image, but the goal of more accuracy at low cost is a good goal independent of the chosen design.
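To make the "imaging more chips" point a little more concrete, here is a minimal sketch of how the chance of catching a compromised batch grows with the number of chips sampled. It rests on assumptions that are mine and not part of the discussion above: a fixed number of chips in a batch are compromised, verifiers sample uniformly at random without replacement, and imaging a compromised chip always reveals the trojan. The function names are hypothetical.

```python
from math import comb


def detection_probability(batch_size: int, compromised: int, sampled: int) -> float:
    """Probability that a uniform random sample contains at least one compromised chip.

    Assumes sampling without replacement and perfect detection on any imaged chip.
    """
    if not 0 <= compromised <= batch_size:
        raise ValueError("compromised must be between 0 and batch_size")
    if not 0 <= sampled <= batch_size:
        raise ValueError("sampled must be between 0 and batch_size")
    clean = batch_size - compromised
    # comb(clean, sampled) is 0 when more chips are sampled than are clean,
    # in which case detection is certain.
    return 1.0 - comb(clean, sampled) / comb(batch_size, sampled)


def samples_needed(batch_size: int, compromised: int, target: float) -> int:
    """Smallest sample size whose detection probability reaches the target."""
    if compromised < 1:
        raise ValueError("no sample size can detect a trojan that is not there")
    for n in range(batch_size + 1):
        if detection_probability(batch_size, compromised, n) >= target:
            return n
    return batch_size
```

Under this simple model, destructive imaging is expensive precisely because each sampled chip is consumed, which is one way to see why cheaper, non-destructive imaging translates into stronger guarantees.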
In the long run, we will want, in some way, to help realise an advanced open PDK and process node.
Key Gen & Storage
How are hardware-bound keys generated and stored?
Again, the longer blog post covers this problem space in greater detail. Here we have two efforts:
- The UC Louvain PUF-based signer. An auditable and secure key gen and storage component is clearly needed. This will take up to 3 years by current estimates.
- Fabric’s design would not be able to rely on the UC Louvain signing oracle as this won’t be ready for several years (and may not succeed anyway). As such, they will need to find some kind of reasonable interim solution. TRNGs with different forms of NVM are one option; multiple closed-source PUFs from different vendors are another. The design space and its interaction with verification protocols have yet to be fully explored.
Physical Attacks
Invasive Attacks
The research into defense against invasive attacks is much less clear-cut or theoretically sound than for non-invasive attacks. There are a myriad of solution concepts to try. Many of these are known to provide defenses against some known attacks, but we don’t have the tools to make statements about possible future attacks yet. It’s worth noting that the world of tamper resistance technologies is particularly fraught with patents, which we will need to navigate.
The relevant efforts here are:
- The Simple Crypto signer is intended to be used such that the PUF provides tamper evidence to the whole device. Thus, the hardware key should be destroyed upon tampering.
- Again, due to a mismatch in timelines, Fabric will need to rely on other off-the-shelf solutions in their first iteration.
Non-invasive Attacks
This is where many of the recent discussions have been focused. One reason for this focus is that these attacks are the easiest to pull off and therefore require prioritised attention. A useful piece of context here is that while there is a sizeable body of literature on side channel and fault injection attacks, the analyses and defenses typically apply specifically to cryptographic operations and do not try to provide protection for arbitrary computation as we would ideally do. One reason for this is that the defenses (e.g. masking) have serious tradeoffs in terms of speed and size of the chip.[1]
Side channels
Defenses against side channel attacks like DPA can be loosely divided into hardware and software countermeasures. Hardware-based masking can be more efficient than software-based masking and removes the onus from those writing software. On the other hand, software-based masking is much more flexible, as different levels of security can be provided to different computations: for instance, one can scale down the security order for non-sensitive computations. It is possible to combine these techniques so that some operations are masked by hardware while others are masked by software, or even to combine software and hardware masking for the same operation (e.g.).
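For readers less familiar with masking, here is a minimal software sketch of Boolean masking with a configurable security order. It is an illustration only and not how any of the teams mentioned implement it: the function names and the 8-bit word size are my own choices, the AND gadget follows the well-known Ishai-Sahai-Wagner (ISW) construction, and Python itself gives no side channel guarantees.

```python
import secrets
from functools import reduce
from operator import xor


def mask(value: int, order: int, bits: int = 8) -> list[int]:
    """Split value into order + 1 Boolean shares whose XOR equals value."""
    if order < 0:
        raise ValueError("order must be non-negative")
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given word size")
    randoms = [secrets.randbits(bits) for _ in range(order)]
    # The last share is chosen so that all shares XOR back to the value.
    return randoms + [reduce(xor, randoms, value)]


def unmask(shares: list[int]) -> int:
    """Recombine shares into the underlying value."""
    return reduce(xor, shares, 0)


def masked_xor(a: list[int], b: list[int]) -> list[int]:
    """XOR is linear, so it can be applied to each pair of shares independently."""
    return [x ^ y for x, y in zip(a, b)]


def masked_and(a: list[int], b: list[int], bits: int = 8) -> list[int]:
    """ISW-style AND: cross terms are blinded with fresh randomness."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(bits)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c
```

The `order` argument is where the flexibility of software masking shows up: a key-dependent computation might call `mask(k, 2)` while a non-sensitive one uses `mask(x, 0)`, which is just the unmasked value and costs nothing extra. Note how the AND gadget needs fresh randomness and a quadratic number of operations in the number of shares, which hints at the speed tradeoff mentioned earlier.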
Masking is only one part of the picture and cannot work alone. The micro-architecture in general must be designed so that certain leakages do not happen. For example, if one share of a secret is used to overwrite another, an assumption about the independence of leakages that underpins masking is broken. Standard pipelines, translation tables, caches etc. are not designed to avoid such collisions so considerable work will be required in this direction.
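The share-overwrite example can be made concrete with a toy simulation. It assumes the commonly used Hamming-distance leakage model, in which overwriting a register leaks the number of bits that flip; that model, the 8-bit width and the function names are assumptions for illustration and not a claim about any particular chip.

```python
import secrets


def hamming_weight(x: int) -> int:
    """Number of set bits in x."""
    return bin(x).count("1")


def overwrite_leakage(old: int, new: int) -> int:
    """Hamming-distance model: leakage is the number of bits that flip."""
    return hamming_weight(old ^ new)


def leaks_when_share_overwrites_share(secret: int, trials: int = 1000, bits: int = 8) -> set[int]:
    """Distinct leakage values seen when share s1 overwrites share s0."""
    seen = set()
    for _ in range(trials):
        s0 = secrets.randbits(bits)
        s1 = s0 ^ secret  # first-order Boolean sharing of the secret
        seen.add(overwrite_leakage(s0, s1))
    return seen


def leaks_when_share_overwrites_zero(trials: int = 1000, bits: int = 8) -> set[int]:
    """Distinct leakage values seen when a share overwrites a cleared register."""
    return {overwrite_leakage(0, secrets.randbits(bits)) for _ in range(trials)}
```

In this model, the first function always returns a single value, the Hamming weight of the secret, because `s0 ^ s1` equals the secret no matter which mask was drawn. The second returns a spread of values that depends only on the random share. The masking is mathematically intact in both cases; it is the micro-architectural collision that undoes it.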
Fault Injection
The research community knows less about defenses against fault injection than against side channels. Particularly challenging is side channel resistance in the presence of faults. Increased redundancy and randomisation in instruction timing are common approaches, but are insufficient on their own (e.g.). At a higher level of abstraction, there are also attacks that target security features like memory isolation (e.g.).
Practically, it is worth mentioning that FI attacks are much easier to pull off on chips with low clock rates, so process nodes can make a big difference. As I understand it, SKY130 clock rates can be as low as 50 MHz, which makes timing a fault well much easier than at the multiple GHz we see on state of the art chips. We still need to do more work to verify what’s actually possible on open PDKs - 50 MHz is a conservative estimate.
Fabric will need to do more work to catch up with the state of the art, but the state of the art itself also needs to advance for us to be comfortable in the long term.
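As a rough software-level illustration of the two common approaches named above, the sketch below runs an operation twice with a random-length dummy delay before each run and compares the results. The names are hypothetical, and in practice these countermeasures are usually applied in hardware or low-level firmware rather than in Python.

```python
import secrets


class FaultDetected(Exception):
    """Raised when redundant executions disagree."""


def random_delay(max_iterations: int = 64) -> int:
    """Burn a random number of dummy iterations to desynchronise an attacker."""
    n = secrets.randbelow(max_iterations + 1)
    acc = 0
    for i in range(n):
        acc ^= i
    return n


def redundant_call(operation, *args):
    """Run the operation twice with random delays and compare the results."""
    random_delay()
    first = operation(*args)
    random_delay()
    second = operation(*args)
    if first != second:
        raise FaultDetected("redundant executions disagree")
    return first
```

The sketch also hints at why these measures fall short on their own: the final comparison is itself a single check that a well-timed fault could skip, and two identical faults would pass unnoticed.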
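A quick back-of-the-envelope calculation shows the size of the gap. The 50 MHz figure is the estimate quoted above; the 3 GHz figure is my own stand-in for "multiple GHz" and is only illustrative.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / frequency_hz


slow_hz = 50e6  # estimate quoted for SKY130
fast_hz = 3e9   # assumed value for a modern high-end chip

slow_period = clock_period_ns(slow_hz)  # 20 ns per cycle
fast_period = clock_period_ns(fast_hz)  # about 0.33 ns per cycle
ratio = fast_hz / slow_hz               # 60x wider window per cycle
```

With these numbers, an attacker aiming a glitch at one specific cycle has a target window roughly 60 times wider on the slower chip.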
What to build
Our secure chips need to be useful as well. To this end, we need to better understand how these chips will be used in 2-4 years’ time. This will allow us to select tradeoff points (e.g. what kind of masking) for the problems above, but also to optimise chip performance and functionality. For example, if we are confident (which no one is, apparently) that a certain post-quantum key-exchange protocol will be the standard in a few years, then specialised hardware-masked circuits can be included for the relevant operations. Similarly, if most computation is delegated externally using FHE, then perhaps appropriate circuitry for decryption would be helpful.
Understanding how TTEEs fit into security schemes a few years down the line is one of the most pressing questions given Fabric’s timelines.
Longer Term Problems
- The verification technology we have mentioned above is mostly imaging, but we will probably want to bolster this with other techniques, as these could be cheaper, allowing us to check more chips, and could find trojans which imaging may miss. One area of work here would be to derive a better theoretical understanding of what these techniques can provide (e.g. a lower bound on the size of a trojan that can circumvent a scan chain, or a lower bound on the deviation in P&R caused by RTL changes when timing closure must be preserved). How pressing this problem is depends on how heavily we rely on these techniques. From my view, approaches like IRIS will be essential in the long run as a means to make testing accessible. As IRIS realistically needs these kinds of techniques alongside it to be exhaustive, it makes sense to invest in them.
- We will need to navigate the tradeoffs between thwarting physical attackers and detecting trojans. For example, IRIS requires the chip backside to be exposed, which rules out some kinds of defenses like meshes. The biggest concern, though, is that the frontrunner for being IRIS’ complement is scan chains, which introduce a large attack surface for physical attackers. Similarly, it’s not clear that PUFs - our frontrunner for defense against invasive attacks - can easily be audited by IRIS.
Naturally there are many smaller problems within each of these categories, but it’s not clear which of those we will have to address.
[1] If we are producing chips on low process nodes, the surface area overhead is less impactful. If we decide to go for an open process node (which is not what Fabric is considering today) then the surface area tradeoff is more serious.