
Building a BGP Daemon in Rust

#rust#bgp#networking#rpki#aspa

I built a BGP daemon in Rust. It implements RFC 4271, dual-stack MP-BGP, RPKI origin validation, ASPA upstream verification, route reflection, graceful restart, Add-Path, FlowSpec, and a full gRPC control plane. It ships with 1166 tests, 22 automated interop suites against FRR, BIRD, and GoBGP, and a looking glass API that's drop-in compatible with birdwatcher.

This is the story of how it got there.

Why

The BGP daemon space has four real options: FRR, BIRD, OpenBGPd, and GoBGP.

FRR is C, monolithic, and CLI-first. It runs most of the internet's route servers and does it well. But programmatic control is an afterthought — you're screen-scraping vtysh or bolting on gRPC after the fact.

BIRD has its own DSL for policy. Powerful if you learn it, opaque if you don't. No API.

GoBGP got the model right: gRPC-first, route injection, policy via API. But it's Go. GC pauses at full-table scale (900k+ routes) are real, and memory usage is 15-29x higher than it needs to be.

OpenBGPd is solid but OpenBSD-focused and doesn't target the programmable use case.

The gap: a BGP daemon with GoBGP's operational model, written in a language that doesn't garbage collect in the middle of a convergence event.

The Wire Codec

Everything starts with parsing. BGP is a binary protocol from 1995 — fixed headers, variable-length path attributes, NLRI packed as prefix-length plus just enough bytes. It's not hard, but it's fiddly. An off-by-one in the prefix byte calculation and you eat the next prefix.
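
To make the fiddly part concrete, here's a minimal sketch of NLRI parsing for IPv4 unicast. The names (`Prefix`, `parse_nlri`) are illustrative, not the actual rustbgpd-wire API:

```rust
// Sketch of MP-BGP NLRI parsing: each entry is a prefix length in bits
// followed by just enough bytes to hold it. IPv4-only for brevity.

#[derive(Debug, PartialEq)]
struct Prefix {
    len_bits: u8,
    octets: Vec<u8>,
}

fn parse_nlri(mut buf: &[u8]) -> Result<Vec<Prefix>, &'static str> {
    let mut prefixes = Vec::new();
    while !buf.is_empty() {
        let len_bits = buf[0];
        if len_bits > 32 {
            return Err("prefix length too long for IPv4");
        }
        // The fiddly part: bits -> bytes, rounded up. Get this wrong and
        // the parser starts reading the next prefix mid-stream.
        let len_bytes = (len_bits as usize + 7) / 8;
        if buf.len() < 1 + len_bytes {
            return Err("truncated NLRI");
        }
        prefixes.push(Prefix {
            len_bits,
            octets: buf[1..1 + len_bytes].to_vec(),
        });
        buf = &buf[1 + len_bytes..];
    }
    Ok(prefixes)
}

fn main() {
    // 10.0.0.0/8 followed by 192.0.2.0/24, packed back to back.
    let wire = [8u8, 10, 24, 192, 0, 2];
    let parsed = parse_nlri(&wire).unwrap();
    assert_eq!(parsed.len(), 2);
    assert_eq!(parsed[0], Prefix { len_bits: 8, octets: vec![10] });
    assert_eq!(parsed[1], Prefix { len_bits: 24, octets: vec![192, 0, 2] });
    println!("ok");
}
```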

rustbgpd-wire is its own crate, published on crates.io. Zero-copy where possible, with no allocations in the read hot path. The codec handles:

  • All BGP message types (OPEN, UPDATE, KEEPALIVE, NOTIFICATION, ROUTE-REFRESH)
  • MP-BGP NLRI for IPv4 and IPv6 unicast
  • Every path attribute type including Extended Communities, Large Communities, and FlowSpec
  • Capability negotiation (Add-Path, Extended Next-Hop, Graceful Restart, Extended Messages)

Keeping the wire codec as a separate crate was deliberate. The Rust ecosystem had no standalone BGP parser. Now anyone building BGP tooling — route collectors, MRT analyzers, looking glasses — can use it without pulling in the whole daemon.

The FSM

The BGP finite state machine is the core of RFC 4271. Six states, dozens of transitions, and enough edge cases to fill 80 pages of RFC text.

The FSM is pure. No tokio, no I/O, no sockets. The entire thing is:

(State, Event) -> (State, Vec<Action>)

Feed it an event, get back the new state and a list of actions to execute. The transport layer handles the actual I/O — TCP connects, timer management, message sends. The FSM never touches a socket.

This made testing trivial. Property tests feed random event sequences and verify the FSM never enters an illegal state. No mocking TCP connections, no async test harnesses. Just pure functions.
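
The shape of that pure function can be sketched like this — a deliberately tiny slice of the machine, with illustrative names rather than rustbgpd's actual types, showing a session driven to Established without a socket in sight:

```rust
// Minimal sketch of a pure BGP FSM: (State, Event) -> (State, Vec<Action>).
// No sockets, no timers, no async; the transport layer executes the actions.

#[derive(Debug, Clone, Copy, PartialEq)]
enum State { Idle, Connect, Active, OpenSent, OpenConfirm, Established }

#[derive(Debug)]
enum Event { ManualStart, TcpConnected, OpenReceived, KeepaliveReceived, HoldTimerExpired }

#[derive(Debug, PartialEq)]
enum Action { StartTcpConnect, SendOpen, SendKeepalive, SendNotification, DropTcp }

fn step(state: State, event: Event) -> (State, Vec<Action>) {
    match (state, event) {
        (State::Idle, Event::ManualStart) => (State::Connect, vec![Action::StartTcpConnect]),
        (State::Connect, Event::TcpConnected) => (State::OpenSent, vec![Action::SendOpen]),
        (State::OpenSent, Event::OpenReceived) => (State::OpenConfirm, vec![Action::SendKeepalive]),
        (State::OpenConfirm, Event::KeepaliveReceived) => (State::Established, vec![]),
        (_, Event::HoldTimerExpired) => (State::Idle, vec![Action::SendNotification, Action::DropTcp]),
        // Everything else: stay put. The real FSM has dozens more arms.
        (s, _) => (s, vec![]),
    }
}

fn main() {
    let (s, actions) = step(State::Idle, Event::ManualStart);
    assert_eq!(s, State::Connect);
    assert_eq!(actions, vec![Action::StartTcpConnect]);
    // Drive a session to Established without touching a socket.
    let (s, _) = step(s, Event::TcpConnected);
    let (s, _) = step(s, Event::OpenReceived);
    let (s, _) = step(s, Event::KeepaliveReceived);
    assert_eq!(s, State::Established);
    println!("ok");
}
```

A property test over this shape is just a loop: generate random `Event` sequences, fold them through `step`, and assert the invariants hold at every intermediate state.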

Architecture Decisions That Stuck

Single tokio task per peer. Each BGP session runs in one task with tokio::select! over the TCP stream, timers, and a command channel from the RIB manager. No shared mutable state between peers. Peers communicate through channels.

Single-owner RIB. The RibManager runs as one tokio task. Adj-RIB-In, Loc-RIB, Adj-RIB-Out — all owned by one task, no Arc<RwLock>. Peers send route changes through bounded channels, the RIB manager processes them in order. This eliminates an entire class of concurrency bugs. The tradeoff is throughput under extreme fan-in, but for BGP's convergence patterns it's not the bottleneck.
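
The single-owner pattern can be sketched with std threads and channels — the real daemon uses tokio tasks and bounded async channels, but the ownership story is identical. All names here are illustrative:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Sketch of the single-owner RIB: one task owns the table outright,
// peers interact only through messages. No Arc, no RwLock.

enum RibCommand {
    Announce { prefix: String, next_hop: String },
    Withdraw { prefix: String },
    // Reads also go through the channel, answered on a reply sender.
    Lookup { prefix: String, reply: mpsc::Sender<Option<String>> },
}

fn main() {
    let (tx, rx) = mpsc::channel::<RibCommand>();

    let rib_task = thread::spawn(move || {
        let mut loc_rib: HashMap<String, String> = HashMap::new();
        // Commands are applied strictly in arrival order.
        for cmd in rx {
            match cmd {
                RibCommand::Announce { prefix, next_hop } => {
                    loc_rib.insert(prefix, next_hop);
                }
                RibCommand::Withdraw { prefix } => {
                    loc_rib.remove(&prefix);
                }
                RibCommand::Lookup { prefix, reply } => {
                    let _ = reply.send(loc_rib.get(&prefix).cloned());
                }
            }
        }
    });

    // "Peers" send changes; ordering guarantees withdraw lands after announce.
    tx.send(RibCommand::Announce { prefix: "192.0.2.0/24".into(), next_hop: "10.0.0.1".into() }).unwrap();
    tx.send(RibCommand::Withdraw { prefix: "192.0.2.0/24".into() }).unwrap();

    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(RibCommand::Lookup { prefix: "192.0.2.0/24".into(), reply: reply_tx }).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), None);

    drop(tx); // close the channel so the RIB task exits
    rib_task.join().unwrap();
    println!("ok");
}
```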

Arc<Vec<PathAttribute>> for routes. A route's path attributes are shared across Adj-RIB-In, Loc-RIB, and Adj-RIB-Out via reference counting. When export policy needs to modify attributes (prepend an AS, change MED), Arc::make_mut() gives copy-on-write semantics. Most routes pass through unmodified, so most routes are never cloned.
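
The copy-on-write behavior in miniature, with `PathAttribute` simplified to a string (the real type is richer):

```rust
use std::sync::Arc;

// Sketch of attribute sharing across RIBs via Arc, with
// Arc::make_mut providing copy-on-write on modification.
type PathAttribute = String;

fn export_with_prepend(attrs: &mut Arc<Vec<PathAttribute>>, asn: u32) {
    // make_mut clones the Vec only if another RIB still holds a reference.
    Arc::make_mut(attrs).push(format!("AS_PATH prepend {asn}"));
}

fn main() {
    let loc_rib_attrs: Arc<Vec<PathAttribute>> = Arc::new(vec!["ORIGIN igp".into()]);

    // Adj-RIB-Out shares the same allocation: refcount 2, zero copies.
    let mut adj_rib_out_attrs = Arc::clone(&loc_rib_attrs);
    assert_eq!(Arc::strong_count(&loc_rib_attrs), 2);

    // Export policy modifies one peer's view; make_mut forks the Vec,
    // leaving Loc-RIB untouched.
    export_with_prepend(&mut adj_rib_out_attrs, 64512);
    assert_eq!(loc_rib_attrs.len(), 1);
    assert_eq!(adj_rib_out_attrs.len(), 2);
    assert_eq!(Arc::strong_count(&loc_rib_attrs), 1); // the two Arcs diverged
    println!("ok");
}
```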

Transport intercepts UPDATEs. The transport layer parses and validates UPDATE messages before forwarding to the RIB. The FSM sees a payloadless "update received" event. This keeps the FSM pure and puts validation close to the wire.

The Performance Story

First benchmark against bgperf2 (2 peers, 100k prefixes each): 71 seconds to converge. GoBGP does the same in about 15. Not great.

The profiler pointed at AdjRibOut::path_ids_for_prefix(). For every prefix the RIB manager needed to distribute, it scanned the entire AdjRibOut HashMap to find matching entries. O(N) per prefix, called 200k times. The fix was a secondary index — HashMap<Prefix, SmallVec<[u32; 1]>> — turning each lookup into O(1).
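
The before-and-after can be sketched like this. Keys and types are simplified: the real code keys on a proper `Prefix` type and uses `SmallVec<[u32; 1]>` so the common single-path case stays inline with no heap allocation:

```rust
use std::collections::HashMap;

// Sketch of the secondary-index fix for AdjRibOut lookups.
struct AdjRibOut {
    // Primary store: (prefix, path_id) -> advertised route state.
    entries: HashMap<(String, u32), String>,
    // Secondary index: prefix -> path IDs currently advertised.
    by_prefix: HashMap<String, Vec<u32>>,
}

impl AdjRibOut {
    fn new() -> Self {
        Self { entries: HashMap::new(), by_prefix: HashMap::new() }
    }

    fn insert(&mut self, prefix: &str, path_id: u32, route: &str) {
        self.entries.insert((prefix.to_string(), path_id), route.to_string());
        self.by_prefix.entry(prefix.to_string()).or_default().push(path_id);
    }

    // Before: scan the whole map, O(N) per prefix, called 200k times.
    fn path_ids_scan(&self, prefix: &str) -> Vec<u32> {
        self.entries.keys().filter(|k| k.0 == prefix).map(|k| k.1).collect()
    }

    // After: one hash lookup, O(1).
    fn path_ids_indexed(&self, prefix: &str) -> Vec<u32> {
        self.by_prefix.get(prefix).cloned().unwrap_or_default()
    }
}

fn main() {
    let mut rib = AdjRibOut::new();
    rib.insert("192.0.2.0/24", 1, "via 10.0.0.1");
    rib.insert("192.0.2.0/24", 2, "via 10.0.0.2");
    rib.insert("198.51.100.0/24", 1, "via 10.0.0.1");

    let mut scanned = rib.path_ids_scan("192.0.2.0/24");
    scanned.sort();
    assert_eq!(scanned, rib.path_ids_indexed("192.0.2.0/24"));
    println!("ok");
}
```

The index costs memory and a second write on insert, but distribution is read-dominated, so the trade is one-sided.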

71 seconds became 11. A 6.5x improvement from one index.

Memory was next. The daemon was using 415 MB for 200k prefixes. Profiling with dhat showed Arc::make_mut() was being called unconditionally in the distribution path, even when export policy made no modifications. Every route got deep-cloned — the Vec<PathAttribute> copied even though nothing changed. A guard checking RouteModifications::is_empty() before calling make_mut() cut memory by 158 MB (38%).
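
The fix, sketched. `RouteModifications` here is an illustrative stand-in for the real type:

```rust
use std::sync::Arc;

// Sketch of the dhat-motivated guard: only fork the attribute Vec
// when export policy actually changed something.

#[derive(Default)]
struct RouteModifications {
    prepends: Vec<u32>,
    med: Option<u32>,
}

impl RouteModifications {
    fn is_empty(&self) -> bool {
        self.prepends.is_empty() && self.med.is_none()
    }
}

fn apply(attrs: &mut Arc<Vec<String>>, mods: &RouteModifications) {
    // Before the fix: Arc::make_mut ran unconditionally, deep-cloning the
    // Vec for every distributed route. The guard skips the clone for the
    // common pass-through case.
    if mods.is_empty() {
        return;
    }
    let owned = Arc::make_mut(attrs);
    for asn in &mods.prepends {
        owned.push(format!("AS_PATH prepend {asn}"));
    }
    if let Some(med) = mods.med {
        owned.push(format!("MED {med}"));
    }
}

fn main() {
    let shared = Arc::new(vec!["ORIGIN igp".to_string()]);
    let mut out = Arc::clone(&shared);

    apply(&mut out, &RouteModifications::default());
    // No modifications: still one allocation, refcount 2.
    assert_eq!(Arc::strong_count(&shared), 2);

    apply(&mut out, &RouteModifications { prepends: vec![64512], med: None });
    // Modified: forked, each Arc now unique.
    assert_eq!(Arc::strong_count(&shared), 1);
    assert_eq!(out.len(), 2);
    println!("ok");
}
```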

The remaining ~160 MB is structural: HashMap bucket overhead across multiple large tables. That's the cost of O(1) lookups. Compacting the RIB representation is on the roadmap but not blocking anything.

Current numbers at 200k prefixes: 11s convergence, 257 MB RSS. BIRD does it in 7 MB (decades of optimization). GoBGP uses 578 MB. The Rust version sits in the middle — not as tight as BIRD's custom allocator, but using half the memory of GoBGP with no GC pauses.

Interop: Where Theory Meets Reality

Unit tests prove your code does what you think it does. Interop tests prove it does what everyone else thinks it should do.

Every interop suite runs in containerlab — real Docker containers running real FRR, BIRD, or GoBGP instances, connected to rustbgpd over virtual links. The test scripts exercise the gRPC API, poll for convergence, and assert on route tables.

22 suites now. Some highlights:

Route Reflector (M14) — three iBGP nodes, client and non-client reflector topologies, verifying ORIGINATOR_ID and CLUSTER_LIST propagation. 14 assertions. The RFC is clear on the rules but the corner cases around loop detection took three iterations.

LLGR (M16) — Long-Lived Graceful Restart. Kill a peer, watch routes go stale with the LLGR community attached, bring the peer back, verify stale routes are cleared. The timer interaction between GR and LLGR is subtle — you need to promote routes from GR-stale to LLGR-stale at exactly the right moment.

Transparent Route Server (M19) — the IX use case. rustbgpd sits between two FRR peers, doesn't prepend its own ASN, preserves original NEXT_HOP. FRR 10.x requires per-neighbor no enforce-first-as for this to work, which isn't documented anywhere obvious. Found it by reading FRR source.

Add-Path (M17) — send multiple paths for the same prefix. The ranking algorithm assigns path IDs by best-path order. FRR receives all candidates and selects independently. 15 assertions verifying AS_PATH differentiation and path ID stability.

Every BIRD test has a footnote: BIRD sends an empty UPDATE on session establishment when configured with export none, and echoes GR capability back without actually completing Graceful Restart. Both are harmless but both surprised me.

The StayRTR Problem

ASPA — Autonomous System Provider Authorization — is the next layer of routing security after RPKI. Where RPKI validates that an origin AS is authorized to announce a prefix, ASPA validates the AS_PATH topology: is each hop in a legitimate customer-provider relationship?

It matters. Route leaks are the class of incident RPKI can't catch, and they happen regularly. RIPE and ARIN started accepting ASPA objects in January 2026. Cloudflare deployed verification globally. The IETF draft is close to RFC.

I implemented ASPA in rustbgpd: RTR v2 codec for receiving ASPA records from cache servers, an upstream verification algorithm per the draft, best-path preference at step 0.7 (Valid > Unknown > Invalid), and policy matching in both import and export.
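
A deliberately simplified sketch of upstream verification, loosely following the draft's logic: walk the path from origin toward the neighbor and check that every hop is an attested customer-to-provider edge. This omits the draft's peer and downstream cases, and all names are illustrative:

```rust
use std::collections::{HashMap, HashSet};

// Simplified ASPA upstream verification sketch.
#[derive(Debug, PartialEq)]
enum AspaState { Valid, Invalid, Unknown }

// customer ASN -> set of attested provider ASNs
type AspaDb = HashMap<u32, HashSet<u32>>;

fn verify_upstream(as_path: &[u32], db: &AspaDb) -> AspaState {
    let mut saw_unattested = false;
    // as_path is ordered origin-first here for clarity; wire order is neighbor-first.
    for pair in as_path.windows(2) {
        let (customer, provider) = (pair[0], pair[1]);
        match db.get(&customer) {
            // Attestation exists and this hop is not a listed provider: a leak.
            Some(providers) if !providers.contains(&provider) => return AspaState::Invalid,
            Some(_) => {}                  // attested customer-provider hop
            None => saw_unattested = true, // no ASPA record for this customer
        }
    }
    if saw_unattested { AspaState::Unknown } else { AspaState::Valid }
}

fn main() {
    let mut db = AspaDb::new();
    db.insert(64512, HashSet::from([64513])); // 64512's only provider is 64513
    db.insert(64513, HashSet::from([64514]));

    // Clean customer-provider chain: 64512 -> 64513 -> 64514.
    assert_eq!(verify_upstream(&[64512, 64513, 64514], &db), AspaState::Valid);
    // 64512 reached via 64999, which it never attested: invalid (a leak).
    assert_eq!(verify_upstream(&[64512, 64999, 64514], &db), AspaState::Invalid);
    // No record for 64600: unknown.
    assert_eq!(verify_upstream(&[64600, 64513, 64514], &db), AspaState::Unknown);
    println!("ok");
}
```

The three outcomes map directly onto the best-path preference at step 0.7: Valid > Unknown > Invalid.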

Then I needed to test it against a real RTR cache server.

StayRTR is the standard open-source RTR cache. It's what everyone uses for RPKI origin validation testing, and it's what my existing M21 RPKI test used. The plan was straightforward: configure StayRTR with ASPA records, point rustbgpd at it, verify the validation pipeline end-to-end.

StayRTR removed ASPA support in v0.6.1.

The feature existed, worked, and was removed because the maintainers decided the draft wasn't stable enough. Fair enough — that's their call. But it left me with an implementation I couldn't interop-test against any real cache server.

The options were: wait for StayRTR to re-add it (no timeline), wait for another cache server to implement it (Routinator and Fort haven't), or build something myself.

I wrote a Python RTR v2 mock server. About 200 lines. It speaks just enough of the RTR v2 protocol to:

  • Complete the version negotiation handshake
  • Serve ASPA PDUs (type 11) alongside ROA PDUs
  • Send Serial Notify and handle Cache Reset
  • Reject v1 downgrade attempts correctly

The M27 interop test spins up this mock server, two FRR peers, and rustbgpd in a containerlab topology. One FRR peer announces a route with a valid AS_PATH (customer-provider chain checks out), the other announces a route for the same prefix with an invalid path (includes a non-provider hop). rustbgpd receives ASPA records from the mock cache, validates both paths, and prefers the valid one at best-path step 0.7.

8 tests, 16 assertions. The test also verifies ROA and ASPA coexistence on a single RTR v2 session — because in production, cache servers send both record types over one connection.

While I was in the area, a separate StayRTR bug fell out.

rustbgpd's RTR client tries v2 first. If the server only speaks v1, it expects an Error Report PDU with code 4 ("Unsupported Protocol Version"), then downgrades and reconnects. Standard negotiation, RFC 8210 §7. I implemented the fallback, pointed it at StayRTR (which is v1-only), and watched it retry v2 forever. No error report ever arrived.
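
The client-side fallback boils down to one decision on the first PDU received. A pure-function sketch, with illustrative types rather than rustbgpd's actual client API:

```rust
// Sketch of RTR version fallback: offer v2, and on an Error Report
// (PDU type 10) with code 4, reconnect at the version the server
// advertised in that PDU's header.

const UNSUPPORTED_PROTOCOL_VERSION: u16 = 4;

#[derive(Debug, PartialEq)]
enum Negotiation {
    Proceed(u8), // continue the session at this version
    RetryAt(u8), // reconnect, offering this lower version
    Fatal(&'static str),
}

fn on_first_pdu(offered: u8, pdu_version: u8, pdu_type: u8, error_code: Option<u16>) -> Negotiation {
    match (pdu_type, error_code) {
        // Error Report with "Unsupported Protocol Version": downgrade to
        // the version carried in the PDU header.
        (10, Some(UNSUPPORTED_PROTOCOL_VERSION)) if pdu_version < offered => {
            Negotiation::RetryAt(pdu_version)
        }
        (10, Some(_)) => Negotiation::Fatal("cache rejected session"),
        // Any other PDU at the offered version: negotiation succeeded.
        _ if pdu_version == offered => Negotiation::Proceed(offered),
        _ => Negotiation::Fatal("version mismatch without error report"),
    }
}

fn main() {
    // A v1-only cache that correctly sends the error report:
    assert_eq!(on_first_pdu(2, 1, 10, Some(4)), Negotiation::RetryAt(1));
    // A v2-capable cache just answers at v2 (e.g. Cache Response, type 3):
    assert_eq!(on_first_pdu(2, 2, 3, None), Negotiation::Proceed(2));
    println!("ok");
}
```

The failure mode with StayRTR was that the `RetryAt` branch never fired, because the error report never arrived at all.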

Reading StayRTR's code: SendWrongVersionError() puts the Error Report on the async transmits channel, then Disconnect() immediately cancels the context. The send loop selects on both, so if cancellation wins the race, the error report never reaches the wire. It almost always won.

The fix was small. Send the error report version byte as PROTOCOL_VERSION_1 (the server's highest supported version, per §7) so the client knows what to retry with, and give the send loop a moment to flush before disconnect. Cleaner would be a synchronous flush on the channel; pragmatic was a sleep. PR went up on bgp/stayrtr, sat for a few weeks, merged this week. 16 lines. The bug report took longer to write than the patch.

The broader lesson: if you're implementing new protocol features, your test infrastructure is going to be part of the project. Standards lead implementations by months or years. The ASPA draft is nearly done, but the tooling ecosystem is still catching up. You either wait or you build the scaffolding yourself — and sometimes you fix the thing you were trying to test against on the way through.

What's There Now

v0.8.0, tagged March 2026. The feature list:

  • RFC 4271 FSM, dual-stack MP-BGP, 4-byte ASN
  • Graceful Restart + Long-Lived Graceful Restart
  • Route Reflection, Add-Path (send + receive), Extended Next-Hop
  • FlowSpec (IPv4 + IPv6, all 13 component types)
  • RPKI origin validation via RTR (v1 + v2), best-path step 0.5
  • ASPA upstream verification via RTR v2, best-path step 0.7
  • Policy engine with prefix lists, AS_PATH regex, community matching, chaining
  • Private AS removal (remove/all/replace modes)
  • Dynamic prefix-based neighbors with peer group inheritance
  • gRPC control plane (7 services), config persistence, SIGHUP reload
  • Birdwatcher-compatible looking glass REST API
  • BMP exporter, MRT TABLE_DUMP_V2, Prometheus metrics
  • CLI with route filtering, best-path explain, and a live TUI dashboard
  • 1166 tests, fuzz harnesses on the wire decoder, 22 interop suites

The target is IX route servers and SDN controllers — environments where API-driven route management matters more than running OSPF. The feature set covers that use case. What's missing for general-purpose routing (Confederation, TCP-AO, EVPN) is on the roadmap but not blocking the target market.

The code is a 14-crate Rust workspace. The wire codec is published separately on crates.io. Everything is on GitHub.