What I Learned Building 8 Tbps of CDN

Five years, 8 Tbps of capacity, a two-engineer core team, and a streaming-only audience watching live sport at scale.

20th May 2026 · 6 min read

Spain won UEFA Euro 2024 at 7am on a Monday in Australia. A national audience watching on phones mid-commute, on transit networks, with nowhere else to go if the stream broke. For five years I ran content delivery at Optus Sport. Two engineers in the core team, 8 Tbps of capacity, four CDNs behaving as one, and a 1.5 million requests per minute load test we ran on ourselves on purpose. When the stream is the only way to watch, the stream is the product.

Two engineers, four CDNs, 8 Tbps

I focused on making four CDNs behave like one, not picking a winner. Australia has a structural capacity shortage for streaming delivery, and on a World Cup night the answer should never be one CDN doing everything. The instinct is to find the “best” CDN; at scale that question stops being the right one.

I built central traffic routing and blocklist systems, normalised across our in-house build (EPYC CDN, named after the AMD EPYC silicon it ran on, with HAProxy and Varnish under the hood), Fastly, CloudFront and Akamai. One config surface, consistent behaviour, capacity drawn from wherever it sat. The audience was Australia-only: every byte landed on an Australian eyeball. EPYC CDN carried the bulk of traffic. The commercial CDNs gave us three independent ways to deliver the same stream to the same country when we needed them.
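To make the "one config surface" idea concrete, here is a minimal sketch of weighted routing with a blocklist. It is not the production system: the `Cdn` class, the `pick_cdn` function and the weights in `PORTFOLIO` are all hypothetical, chosen only to show the shape of the decision.

```python
import random
from dataclasses import dataclass


@dataclass
class Cdn:
    name: str
    weight: float          # relative share of traffic (illustrative values only)
    healthy: bool = True   # flipped by health checks in a real system


def pick_cdn(cdns, blocklist=frozenset(), rng=random):
    """Weighted random choice among healthy CDNs that are not blocklisted."""
    candidates = [c for c in cdns if c.healthy and c.name not in blocklist]
    if not candidates:
        raise RuntimeError("no CDN available for this request")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical portfolio: the in-house CDN carries the bulk, the rest share the remainder.
PORTFOLIO = [
    Cdn("epyc", 70),
    Cdn("fastly", 10),
    Cdn("cloudfront", 10),
    Cdn("akamai", 10),
]

if __name__ == "__main__":
    # With one CDN blocklisted, its share is redistributed across the others.
    print(pick_cdn(PORTFOLIO, blocklist={"fastly"}).name)
```

Because the weights are renormalised over whatever survives the health and blocklist filters, taking a CDN out of rotation never needs a second config change elsewhere.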

The lesson: at scale you stop picking a CDN and start running a content delivery portfolio.

CAPEX beats OPEX when you own the pipe

Finance made the CAPEX over OPEX call. The build I shipped under it earned a CTO commendation for cost efficiency. Spend on hardware once, sit it on a network we already ran, and a recurring CDN bill turns into capacity we own. Optus is a telco. They own the network. As a Tier 1 carrier with presence at every major interconnect and exchange where traffic flows in Australia, peering inside the country changes what’s economic to build in-country. The target was quality of service first, and owning the input let us tune for it directly.

Quality of service (QoS) is what viewers feel, reported as stream starts, rebuffering, and bitrate held. Mux is how we measured it. EPYC CDN gave Optus Sport headroom on the numbers that matter at scale: faster starts, fewer rebuffers, steadier bitrate at the moments the entire audience is locked on the same frame. Sam Kerr’s goal in the World Cup semi. Italy winning the EURO 2020 final on penalties. Those are the seconds the platform is judged on.

What you build at that scale becomes brand-defining infrastructure. The FIFA Women’s World Cup 2023 was the catalyst for a new 400G Metro Core that the Networks team stood up behind the in-house EPYC CDN. The tournament drew a 1.2 million peak concurrent audience across the Seven and Optus co-broadcast and 11 million viewers for the semi across streaming and broadcast. The platform holding through every one of those moments is part of what those results rest on. The same network now sits under every workload that runs on top of it.

The lesson: return on investment isn’t only the balance sheet. Stream quality on the night and innovation across the org are returns still paying off today.

Building one and operating one are different disciplines

We were better at running Fastly, CloudFront, and Akamai because we built EPYC. Operating and building are different disciplines, and doing one sharpens the other. Designing the cache yourself (memory and NVMe tiers, sharding across nodes, and a mid-tier shield in front of origin) means you understand why architectures behave the way they do under load. You can debug by looking under the hood, not just at the surface.
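As an illustration of what "sharding across nodes" can look like, here is a short sketch using rendezvous (highest-random-weight) hashing. That technique is my choice for the example, not a description of how EPYC CDN actually sharded; the node names and the segment path are hypothetical.

```python
import hashlib


def shard_for(key, nodes):
    """Return the node that owns `key` under rendezvous hashing.

    Every node gets a score derived from hash(node, key); the highest score wins.
    Removing a node only moves the keys that node owned.
    """
    if not nodes:
        raise ValueError("at least one cache node is required")

    def score(node):
        digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cache nodes
    print(shard_for("/live/match/segment_00000042.ts", nodes))
```

The property worth noticing is stability: when one node leaves, objects owned by the other nodes stay where they are, so the surviving caches keep their hit ratio instead of all going cold at once.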

Building also means understanding NICs, IRQ affinity, TCP tuning and NUMA domains. The substrate decides what the cache can actually do under load, and once you have tuned it yourself the behaviour of any CDN stops being a black box.

Operating teaches breadth, building teaches depth. Knowing why a cache behaves the way it does under load is what lets you tell when a vendor is the right tool, when it isn’t, and how to hold them to it.

The lesson: build one and you stop taking the others on faith.

DDoS’d ourselves at 1.5M req/min on purpose

1.5 million requests a minute of sequential video segments, sustained against a single EPYC node, the kind of load that to a normal engineer reads as a DDoS in progress. I built the load test. We ran it on ourselves, on purpose. A platform that requires everything to be working in order to work is not a platform.
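For a sense of scale, the arithmetic behind that figure can be sketched in a few lines. The per-second rate follows directly from the number above; the segment path layout in `segment_paths` is a hypothetical stand-in, not the real URL scheme.

```python
def requests_per_second(requests_per_minute):
    """Convert a per-minute request rate to per-second."""
    return requests_per_minute / 60


def mean_gap_seconds(requests_per_minute):
    """Average time between consecutive requests at the given rate."""
    return 60 / requests_per_minute


def segment_paths(stream, first_index, count):
    """Yield sequential segment paths, the access pattern of a live viewer.

    The path layout is an assumption made for this example.
    """
    for index in range(first_index, first_index + count):
        yield f"/{stream}/segment_{index:08d}.ts"


if __name__ == "__main__":
    print(requests_per_second(1_500_000))   # 25000.0 requests per second
    print(mean_gap_seconds(1_500_000))      # about 40 microseconds between requests
    print(list(segment_paths("live/match", 41, 2)))
```

Sequential segments matter because they mimic real playback: each simulated viewer walks forward through the stream rather than hammering one hot object.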

Multi-CDN, redundant capacity, origin shields, fallback paths. None of it is glamorous and none of it shows up in a feature list. It’s the difference between a night where people talk about the football, and a night where they talk about the stream.

The network under the CDN was never static. Peering shifted, paths changed, capacity moved. We ran chaos-monkey-style engineering at the physical layer, pulling modules mid-stream while traffic flowed, because we didn’t want the audience to feel buffering anxiety when hardware went away. The infrastructure had to adapt around all of it and keep the stream steady. Designing for failure meant engineering for a substrate that kept moving underneath.

The point wasn’t proving the servers could take it. The point was that you only trust a system you’ve broken yourself. We kept breaking it on purpose, well before a real audience would.

The lesson: redundancy is the product. You earn it by breaking the system on purpose, before something else does.

Measure what correlates with churn

QoS metrics are easy to make look fine. You can chase startup time, rebuffer ratio and bitrate, and in isolation each of them can land in green while users still leave.
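A small sketch of how an aggregate can sit in green while some viewers have a bad night. The `Session` fields, the 2% threshold and the sample numbers are all made up for illustration, and the rebuffer ratio here uses one common definition (stall time divided by total watch time), not necessarily the one any particular vendor reports.

```python
from dataclasses import dataclass


@dataclass
class Session:
    watch_s: float      # total time in the player, stalls included
    rebuffer_s: float   # time spent stalled


def rebuffer_ratio(sessions):
    """Aggregate rebuffer ratio across all sessions."""
    total_watch = sum(s.watch_s for s in sessions)
    if total_watch == 0:
        return 0.0
    return sum(s.rebuffer_s for s in sessions) / total_watch


def share_of_bad_sessions(sessions, threshold=0.02):
    """Fraction of sessions whose own rebuffer ratio exceeds the threshold."""
    if not sessions:
        return 0.0
    bad = sum(
        1 for s in sessions
        if s.watch_s > 0 and s.rebuffer_s / s.watch_s > threshold
    )
    return bad / len(sessions)


if __name__ == "__main__":
    # Nine clean hour-long sessions and one short session that stalled for a minute.
    sessions = [Session(3600, 0) for _ in range(9)] + [Session(600, 60)]
    print(rebuffer_ratio(sessions))         # under 0.2% in aggregate
    print(share_of_bad_sessions(sessions))  # yet 10% of viewers had a bad session
```

The aggregate is dominated by long, healthy sessions, so the viewer who gave up after ten minutes barely moves it. Counting affected sessions, not just averaging seconds, is one way to keep that viewer visible.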

Observability was the foundation, and engineering with metrics was the mindset from day one, not something you get off the shelf from a vendor. You can’t improve a system that’s in the dark. Grafana and Prometheus gave us granularity at every layer of the stack: NIC counters, cache hit ratios, origin shield behaviour, every server metric exported. Kafka shipped every log from every CDN into the data lake. Without that pipeline, picking the right metric is just theory.

Mux was the gold standard. Its data came from the video players themselves, the closest signal we had to what the user actually felt. We tracked Mux KPI targets because they connect playback quality to retention, the numbers that correlate with a subscription not getting cancelled.

The lesson: metrics are engineering inputs, not reporting outputs.

Credit where it’s due. A small core engineering team did the late nights and on-call. The Networks group peered the traffic, scaled NBN CVC capacity ahead of every kickoff, and stood up the 400G Metro Core that made WWC 2023 land the way it did. The broader Optus Sport team backed every call that made a build of this shape possible.

Optus Sport wound down in August 2025 with rights moving to Stan. The discipline isn’t tied to CDNs. Portfolio over single vendor, own the substrate when you can, design for failure, measure what correlates with churn. The next platform I build won’t look like this one, but the thinking will.
