TEE Workloads Thread

In order to build better hardware, we need to understand the workloads and associated use cases that the hardware would need to support. Every other week I speak to someone who has a new TEE use case, so I’m starting this thread to crowdsource information and to make sure we can keep everyone in mind!

Please share:

  • A summary of your use case, explaining clearly why the TEE is necessary.
  • A link to the software you are currently running. If there are things you expect will definitely change or remain the same in coming years, please mention that too.
  • Benchmarks you care about, ideally with workloads.

Thanks!

BuilderNet uses TEEs to sequence public blockchains like Ethereum. In other words, we use TEEs to protect blockchain transactions as we find and finalize an ordering for them to be committed, after which they are made public.

Why necessary: TEEs are the only technology we’ve found that can meet our requirements in practice.

  • Privacy (need to keep pretrade data confidential to prevent frontrunning in financial markets)
  • Integrity (want apps and users to be able to verify what sequencing logic was applied to their trades)
  • Speed (need to build blocks in sub-100ms to support low latency use cases like trading)
  • Architectural decentralization (want to remove dependency on central infrastructure to avoid censorship or faults, so we need to involve other infra operators without degrading our security guarantees too much)

Technical requirements:

  • Currently our target is to build a 45Mgas block (<1000 txs) in <100ms.
  • One very latency-sensitive operation is updating bids. It is CPU-bound and involves changing one transaction in the block. The full operation should take 1-2ms, and every millisecond matters.
  • The building stage (in which we select and execute transactions and form a state root) is less strict; it can vary from 30ms to 200ms. It runs asynchronously from bidding and sometimes depends on IO (reading from disk). All else equal, we can usually accept an additional ~10ms.
  • Current hardware setup:
    • We read a lot from disk, randomly, from multiple threads (we care about both throughput and latency); a rough sketch of this kind of read microbenchmark is below.
    • State size is in the hundreds of GBs. We have a lot of RAM and rely on the OS page cache to cache files, but this kind of caching may not be perfect.
    • We use a lot of cores (32+) and keep them busy with CPU tasks all the time.
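
To make the disk access pattern concrete, here is a minimal sketch of the kind of multi-threaded random-read microbenchmark that matches it (Go; the file path, read size, worker count, and duration are placeholders, not production values):

```go
// Minimal multi-threaded random-read microbenchmark sketch.
// All parameters below are placeholders; point it at a large file on the
// disk you want to characterize.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"sort"
	"sync"
	"time"
)

func main() {
	const (
		path     = "/data/state.bin" // placeholder: any large file on the target disk
		readSize = 4096              // bytes per random read
		workers  = 32                // concurrent readers
		duration = 10 * time.Second
	)

	fi, err := os.Stat(path)
	if err != nil {
		panic(err)
	}
	size := fi.Size()

	var (
		mu         sync.Mutex
		latencies  []time.Duration
		totalReads int64
	)

	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			f, err := os.Open(path)
			if err != nil {
				panic(err)
			}
			defer f.Close()
			rng := rand.New(rand.NewSource(seed))
			buf := make([]byte, readSize)
			var local []time.Duration
			var reads int64
			for time.Now().Before(deadline) {
				off := rng.Int63n(size - readSize)
				start := time.Now()
				if _, err := f.ReadAt(buf, off); err != nil {
					panic(err)
				}
				local = append(local, time.Since(start))
				reads++
			}
			mu.Lock()
			latencies = append(latencies, local...)
			totalReads += reads
			mu.Unlock()
		}(int64(w))
	}
	wg.Wait()

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p99 := latencies[len(latencies)*99/100]
	fmt.Printf("reads/s: %.0f  throughput: %.1f MB/s  p99: %v\n",
		float64(totalReads)/duration.Seconds(),
		float64(totalReads)*readSize/1e6/duration.Seconds(),
		p99)
}
```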

A simple heuristic to test:

We experimented with MPC and FHE as well, but these are mostly long-term PoCs (too slow today).

References:

Use case: Run a ZIPNet DC‑net node (anytrust server + optional aggregator) in a TEE

What ZIPNet is

ZIPNet is a DC‑net–style anonymous broadcast system introduced in ZIPNet: Low-bandwidth anonymous broadcast from (dis)Trusted Execution Environments. It supports many anytrust servers while offloading inbound client aggregation to untrusted relays/aggregators. Server work is symmetric crypto (PRF via AES‑CTR + XOR) and simple signatures; no MPC on the hot path. In the paper’s eval ZIPNet cuts server runtime by 4.2×–7.6× vs prior art (CPU Blinder) and makes cover traffic cheap—≈84 bytes of extra server bandwidth per non‑talking client per round (with 160 B messages and 1,024 talkers). Privacy does not depend on TEEs; TEEs are used for DoS resistance and integrity (“falsifiable trust”).

Software / reproducibility.

  • Reference implementation (paper): Rust; client in Intel SGX (Teaclave SGX SDK); aggregator & servers in Rust using aes-ctr, hkdf, x25519-dalek, ed25519-dalek; servers use AES‑NI.
  • Community WIP Go implementation (interfaces + load generator): Ruteri/go-zipnet on GitHub (includes aggregator hierarchy knobs and an example config).
  • Baselines used by the paper: CPU Blinder (repo: cryptobiu/MPCAnonymousBloging) and OrgAn (repo: zhtluo/organ).

Why a TEE is needed for this workload

  • DoS & integrity enforcement on clients: client TEEs enforce per‑round participation limits & message format; violations are detectable. (Privacy holds even if the client TEE fails.)
  • Optional hardening for servers: running anytrust servers in a TEE adds attested integrity and key protection without changing the anytrust privacy model.

Parameters & notation

Let:

  • M = registered clients, Mᵣ = participants this round (talkers + cover),
  • N = talkers per round, |m| = message size (bytes),
  • B = N · |m| = broadcast payload per round (bytes),
  • S = size of the schedule vector (bytes), i.e., #sched_slots × footprint_bytes (e.g., the Go repo shows SchedulingSlots=4096, FootprintBits=64 ⇒ S≈32 KiB),
  • N_S = number of anytrust servers.

Message sizes used in the paper’s eval: ~400 B (Bitcoin), ~108 B (Ethereum), ~2 KB (Zcash), ~2.38 KB (Monero), and 160 B for microblog examples.
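
To anchor the notation, here is a tiny sketch that derives B and S for the 160 B / 1,024-talker example, using the Go repo's example scheduling config (illustrative arithmetic only):

```go
// Sketch: compute per-round vector sizes from the parameters above.
package main

import "fmt"

func main() {
	const (
		N             = 1024 // talkers per round
		msgSize       = 160  // |m| in bytes (160 B microblog example)
		schedSlots    = 4096 // SchedulingSlots from the Go repo's example config
		footprintBits = 64   // FootprintBits from the same config
	)

	B := N * msgSize                    // broadcast payload per round, bytes
	S := schedSlots * footprintBits / 8 // schedule vector, bytes

	fmt.Printf("B = %d bytes (%.0f KiB)\n", B, float64(B)/1024) // 160 KiB
	fmt.Printf("S = %d bytes (%.0f KiB)\n", S, float64(S)/1024) // 32 KiB
	fmt.Printf("per-participant pad bytes (S+B) = %d\n", S+B)   // what each server expands per participant
}
```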


Role A — Anytrust server

Hot path per round (Algorithm 3).

For each userPK that participated this round (the aggregator sends userPKs, a schedule aggregate, and a message aggregate), the server:

  1. derives (pad1, pad2) = KDF(shared_secret[userPK], round, publishedSchedule),
  2. XORs pad1 into the schedule aggregate and pad2 into the message aggregate, then
  3. signs the output.

Work per round scales with Mᵣ and (S + B).

Compute model (what to measure):

  • PRF expansion bytes/round ≈ Mᵣ · (S + B) bytes (AES‑CTR output).
  • XOR bytes/round ≈ Mᵣ · (S + B) (pads) + ~B (final combine).
  • KDF ops/round: Mᵣ HKDFs (or your PRF’s equivalent).
  • Sig verify/sign: 1 verify (aggregator) + 1 sign per round (e.g., Ed25519).

Measure AES‑CTR GB/s (in‑TEE), XOR GB/s, HKDF/s, and Ed25519 verify/s. (The paper’s reference server used AES‑NI to accelerate the PRNG/CTR expansion.) A sketch of this per‑round loop is below.
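
Here is a minimal sketch of that per‑round server loop, assuming an HKDF plus AES‑CTR pad expansion and Ed25519 signing as in the reference stack; the KDF labels, wire shapes, and function names are illustrative, not the reference implementation's:

```go
// Sketch of the per-round anytrust-server hot path: derive per-client pads,
// XOR them into the aggregates, sign the result. KDF labels, wire format,
// and types are illustrative assumptions, not the reference implementation.
package server

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
	"io"

	"golang.org/x/crypto/hkdf"
)

const (
	S = 32 * 1024  // schedule vector bytes
	B = 160 * 1024 // broadcast payload bytes
)

// derivePads expands (pad1, pad2) for one client from its shared secret,
// the round number, and the published schedule (HKDF -> AES-CTR keystream).
func derivePads(sharedSecret []byte, round uint64, publishedSchedule []byte) (pad1, pad2 []byte) {
	salt := make([]byte, 8)
	binary.BigEndian.PutUint64(salt, round)
	kdf := hkdf.New(sha256.New, sharedSecret, salt, publishedSchedule)

	keyIV := make([]byte, 32+aes.BlockSize)
	if _, err := io.ReadFull(kdf, keyIV); err != nil {
		panic(err)
	}
	block, err := aes.NewCipher(keyIV[:32])
	if err != nil {
		panic(err)
	}
	stream := cipher.NewCTR(block, keyIV[32:])

	pads := make([]byte, S+B)
	stream.XORKeyStream(pads, pads) // keystream over zeros
	return pads[:S], pads[S:]
}

func xorInto(dst, src []byte) {
	for i := range dst {
		dst[i] ^= src[i]
	}
}

// processRound removes this server's pads for every participating client and
// signs the resulting aggregates. schedAgg must be S bytes, msgAgg B bytes.
func processRound(priv ed25519.PrivateKey, round uint64, publishedSchedule []byte,
	secrets map[string][]byte, participants []string,
	schedAgg, msgAgg []byte) []byte {

	for _, userPK := range participants {
		pad1, pad2 := derivePads(secrets[userPK], round, publishedSchedule)
		xorInto(schedAgg, pad1)
		xorInto(msgAgg, pad2)
	}
	out := append(append([]byte{}, schedAgg...), msgAgg...)
	return ed25519.Sign(priv, out)
}
```

Microbenchmarking derivePads and xorInto on (S+B)-sized buffers gives the AES‑CTR, HKDF, and XOR numbers asked for above.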

Memory:

  • State: O(M) shared‑secret/ratchet entries (e.g., 32‑byte keys + small metadata) plus small sealed state if you run inside a TEE.
  • Buffers: O(B) for the broadcast vector(s) and O(S) for the schedule vector. (With N=1,024 and |m|=160 B, B≈160 KiB.)

Networking (empirically validated shape):

  • Ingress per server per round: B + S + IDs + signature. The paper reports ~535,607 B at N=1,024 talkers and 0 cover, plus ~84 B per additional non‑talking (cover) client at 160 B messages. So with 1,024 talkers and 8,000 total clients (i.e., 6,976 cover), a server ingests ≈1,123,952 B/round.
  • Egress: one signed final broadcast (≈B + small metadata); inter‑server traffic is small (“a single broadcast message”).

Latency (from the paper’s WAN runs): see "End‑to‑end latency target" below; server time rises roughly linearly with cover (Mᵣ↑, B fixed) and roughly quadratically with talkers (both Mᵣ and B = N·|m| grow with N).

What to publish (server microbench & round metrics):

  • AES‑CTR GB/s and XOR GB/s inside the TEE on (S+B)‑sized buffers.
  • Per‑round CPU% and P99 vs N, |m|, Mᵣ, N_S.
  • Bandwidth per round vs talkers (slope ≈|m|) and vs cover (≈84 B/non‑talker at 160 B, 1,024 talkers—re‑check for your S/encoding).

Role B — Aggregator (untrusted; optional TEE for ops assurance)

Hot path per round (Algorithm 2).

Validate signature, XOR client (or lower‑tier aggregate) payloads into a single schedule vector and a single B‑byte message vector, sign the running aggregate. It’s XOR + bookkeeping.
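
A minimal sketch of that aggregation step (Go; the submission format and per‑submission signature check are illustrative assumptions, and wire parsing and error handling are elided):

```go
// Sketch of the aggregator hot path: verify each submission, XOR it into the
// running schedule/message aggregates, then sign the combined result.
// Wire format and types are illustrative assumptions.
package aggregator

import "crypto/ed25519"

type Submission struct {
	UserPK ed25519.PublicKey
	Sched  []byte // S bytes
	Msg    []byte // B bytes
	Sig    []byte // signature over Sched||Msg (assumed format)
}

func xorInto(dst, src []byte) {
	for i := range dst {
		dst[i] ^= src[i]
	}
}

// Aggregate folds all valid submissions into one (schedAgg, msgAgg) pair and
// returns the aggregator's signature over the result plus the participant set.
func Aggregate(priv ed25519.PrivateKey, subs []Submission, sBytes, bBytes int) (schedAgg, msgAgg, sig []byte, users []ed25519.PublicKey) {
	schedAgg = make([]byte, sBytes)
	msgAgg = make([]byte, bBytes)
	for _, s := range subs {
		payload := append(append([]byte{}, s.Sched...), s.Msg...)
		if !ed25519.Verify(s.UserPK, payload, s.Sig) {
			continue // drop invalid submissions
		}
		xorInto(schedAgg, s.Sched)
		xorInto(msgAgg, s.Msg)
		users = append(users, s.UserPK)
	}
	out := append(append([]byte{}, schedAgg...), msgAgg...)
	sig = ed25519.Sign(priv, out)
	return schedAgg, msgAgg, sig, users
}
```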

Scaling & what to measure:

  • Runtime is linear in payload size B for fixed #clients; insensitive to the count of clients when B is fixed (it’s memory‑bandwidth bound). Publish XOR GiB/s vs B and P99 contribution under realistic WAN RTTs.

Networking:

  • Ingress: many small client packets (or aggregates).
  • Egress: per round, a single aggregate to each server; with one (root) aggregator this is ~N_S × (B + S + IDs + sig). (The Go implementation exposes hierarchy levels if you want a tree.)

Role C — Client TEE (mandatory TEE role)

Hot path per round (Algorithm 1 in‑enclave):

  • Rate‑limit/format: compute footprint slot + attestation tag;
  • Pad & inject: PRF to create slot pad; XOR message or zeros (cover);
  • Ratchet: HKDF update.

Client runtime scales with N_S (OTP length B per server) and |m|; in the paper’s evals it stayed sub‑second for the small message sizes considered. Measure AES‑CTR/HKDF GB/s, Ed25519 sign/s, and x25519 handshakes/s. A rough sketch of the per‑round client step is below.
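
A rough sketch of the per‑round client step, under the same illustrative HKDF/AES‑CTR assumptions as the server sketch above; in the reference design this runs inside the client enclave, and slot selection, footprint/attestation tags, the x25519 handshake, and the ratchet update are simplified or omitted here:

```go
// Rough sketch of the per-round client step: place the message (or zeros,
// for cover) in the chosen broadcast slot, then XOR in one pad per anytrust
// server derived from that server's shared secret. Footprint/attestation
// tags and the HKDF ratchet are omitted for brevity.
package client

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"encoding/binary"
	"io"

	"golang.org/x/crypto/hkdf"
)

const (
	S       = 32 * 1024  // schedule vector bytes
	B       = 160 * 1024 // broadcast payload bytes
	msgSize = 160        // |m|
)

// padFor derives this client's S+B pad for one server and round
// (HKDF -> AES-CTR keystream), mirroring the server-side derivation sketch.
func padFor(sharedSecret []byte, round uint64, publishedSchedule []byte) []byte {
	salt := make([]byte, 8)
	binary.BigEndian.PutUint64(salt, round)
	kdf := hkdf.New(sha256.New, sharedSecret, salt, publishedSchedule)
	keyIV := make([]byte, 32+aes.BlockSize)
	if _, err := io.ReadFull(kdf, keyIV); err != nil {
		panic(err)
	}
	block, err := aes.NewCipher(keyIV[:32])
	if err != nil {
		panic(err)
	}
	stream := cipher.NewCTR(block, keyIV[32:])
	pad := make([]byte, S+B)
	stream.XORKeyStream(pad, pad)
	return pad
}

// BuildSubmission returns the S+B blob the client sends to the aggregator.
// msg == nil means cover traffic (all-zero plaintext).
func BuildSubmission(serverSecrets [][]byte, round uint64, publishedSchedule []byte, slot int, msg []byte) []byte {
	out := make([]byte, S+B)
	if msg != nil {
		copy(out[S+slot*msgSize:], msg) // place message in its broadcast slot
	}
	for _, secret := range serverSecrets {
		pad := padFor(secret, round, publishedSchedule)
		for i := range out {
			out[i] ^= pad[i]
		}
	}
	// Ratchet omitted: the reference design HKDF-updates each shared secret per round.
	return out
}
```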

State model: sealed state only; no enclave monotonic counters or trusted timers required (fits “falsifiable trust”).


Concrete sizing examples (to anchor hardware targets)

Example 1 (paper’s table): N=1,024 talkers, |m|=160 B ⇒ B≈160 KiB.

Per server, bandwidth/round (ingress + small control) was:

  • 0 cover (Mᵣ=1,024): ~535,607 B
  • 8,000 total clients (6,976 cover): ~1,123,952 B (≈1.12 MB/round). At 1 s rounds, budget ≈1.12 MB/s per server for this channel.

Example 2 (aggregator egress sizing): same scenario, N_S=10 servers ⇒ root aggregator egress ≈ 10 × 1.12 MB ≈ 11.2 MB/round (≈ 90 Mb/s at 1 s rounds), plus client ingress. (Adjust if S or ID encoding changes.)
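
Examples 1 and 2 as a small script, so the constants can be swept (the 535,607 B base and the ~84 B/cover slope are the paper's reported values; because the slope is rounded, the computed total lands just under the exact 1,123,952 B):

```go
// Sketch: per-server ingress and root-aggregator egress per round, using the
// paper's reported constants for the 160 B / 1,024-talker configuration.
package main

import "fmt"

func main() {
	const (
		basePerRound  = 535_607 // bytes/round per server at 1,024 talkers, 0 cover (paper)
		perCoverBytes = 84      // extra bytes/round per non-talking client (paper, 160 B msgs)
		talkers       = 1024
		totalClients  = 8000
		numServers    = 10 // N_S for the aggregator-egress example
	)

	cover := totalClients - talkers
	serverIngress := basePerRound + cover*perCoverBytes
	aggEgress := numServers * serverIngress

	fmt.Printf("cover clients: %d\n", cover)
	fmt.Printf("per-server ingress: ~%.2f MB/round\n", float64(serverIngress)/1e6)
	fmt.Printf("root aggregator egress: ~%.1f MB/round (~%.0f Mb/s at 1 s rounds)\n",
		float64(aggEgress)/1e6, float64(aggEgress)*8/1e6)
}
```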

Message sizes to provision for (paper): ~400 B (Bitcoin), ~108 B (Ethereum), ~2 KB (Zcash), ~2.38 KB (Monero). Use these to sweep B = N·|m|.

Relative compute vs Blinder: ZIPNet shows 4.2×–7.6× lower server runtime than CPU Blinder on the same hardware (WAN), with 5–10 servers in their setup. Hardware should favor symmetric‑crypto throughput & memory bandwidth, not big‑integer/MPC accelerators.


End‑to‑end latency target (what to report)

Report total round time (aggregator → servers → final broadcast) under WAN and LAN placements, sweeping Mᵣ, N, |m|, N_S. The paper’s Figure 5 shows end‑to‑end round time rising with users and only slightly exceeding the sum of aggregator+server microbenches (due to network fan‑out). Publish your target round duration alongside microbench results.


What we ask from “trust‑minimized TEE” hardware for this workload

Crypto + memory:

  • High‑throughput AES‑CTR (or a fast PRF) and large XOR bandwidth inside the TEE; strong DRBG; HKDF; constant‑time primitives. (The reference uses AES‑NI; similar hardware offload on ARM/others is desirable.)

Enclave I/O:

Low‑overhead, high‑pps secure I/O so the server/aggregator can process bursty (B+S) payloads entirely inside enclave hot paths (avoid frequent exits).

Attestation:

Cheap remote attestation at client scale; ability to verify peer attestation in‑enclave (client→aggregator/server). Keep sealed storage simple (no monotonic counters).

NIC targets:

  • Aggregator: sustain ~N_S × (B+S+IDs) egress per round (plus client ingress).
  • Servers: sustain ≈(B+S+IDs) ingress and ≈B broadcast egress per round, plus ~84 B per non‑talker (for the 160 B/1,024‑talker configuration; re‑measure under your encoding).

Benchmarks we care about

  1. End‑to‑end round time (WAN/LAN): aggregator→servers→final broadcast; sweep Mᵣ, N, |m|, N_S. Compare measured round time to the sum of your server/aggregator microbenches to surface I/O overhead.
  2. TEE microbenches (server & client):
  • AES‑CTR GB/s on buffers of sizes B and S,
  • XOR GB/s on the same sizes,
  • HKDF keys/s, Ed25519 verify/s, x25519 handshakes/s. Structure loops to match Algorithm 3 (server) and Algorithm 1 (client); a benchmark‑harness sketch follows this list.
  3. Bandwidth scaling (server): re‑plot bytes/round vs cover set size to validate the ~84 B per extra non‑talker slope at your message size and encoding; re‑plot vs talkers to show linear growth with slope ≈ |m| and your constant terms. (Replicate Table 3 with your stack.)
  4. Aggregator scaling: XOR‑only runtime vs B and egress fan‑out cost vs N_S; report P99 contribution to round time under WAN RTTs.
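
For the microbenches in item 2, here is a sketch of how they could be structured with Go's testing harness (save as a _test.go file and run with `go test -bench .`; buffer sizes match the 160 KiB B / 32 KiB S example, and these loops are illustrative, not from the paper's artifact):

```go
// Sketch of the TEE microbenchmarks in item 2, using Go's testing harness.
// Buffer sizes match the 160 KiB B / 32 KiB S example.
package bench

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"io"
	"testing"

	"golang.org/x/crypto/hkdf"
)

const (
	S = 32 * 1024
	B = 160 * 1024
)

func BenchmarkAESCTR(b *testing.B) {
	key := make([]byte, 32)
	iv := make([]byte, aes.BlockSize)
	buf := make([]byte, S+B)
	block, _ := aes.NewCipher(key)
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		cipher.NewCTR(block, iv).XORKeyStream(buf, buf)
	}
}

func BenchmarkXOR(b *testing.B) {
	dst := make([]byte, S+B)
	src := make([]byte, S+B)
	b.SetBytes(int64(len(dst)))
	for i := 0; i < b.N; i++ {
		for j := range dst {
			dst[j] ^= src[j]
		}
	}
}

func BenchmarkHKDF(b *testing.B) {
	secret := make([]byte, 32)
	out := make([]byte, 64)
	for i := 0; i < b.N; i++ {
		kdf := hkdf.New(sha256.New, secret, nil, nil)
		if _, err := io.ReadFull(kdf, out); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkEd25519Verify(b *testing.B) {
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	msg := make([]byte, S+B) // verify over an aggregate-sized message
	sig := ed25519.Sign(priv, msg)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if !ed25519.Verify(pub, msg, sig) {
			b.Fatal("verify failed")
		}
	}
}
```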