Principles of cloud-scale routing protocols: lessons learned from a GCP outage
On Wednesday, December 9, 2020, Google Cloud Platform experienced networking unavailability in zone europe-west2-a.
In this article I will share some thoughts derived from the CAP theorem and propose three principles for cloud-scale routing protocol design:
- Control Plane must keep Consistency and Partition tolerance (CP).
- Data Plane must keep Partition tolerance and Availability (AP).
- Isomorphism.
CAP theorem
In theoretical computer science, the CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
- Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Facts about existing designs
Destination-based routing is CP mode
Destination-based forwarding requires all nodes to have consistent forwarding tables. The inconsistent-table scenarios are well known as "micro loops" and "black holes". This means CP mode cannot support Availability (CP without A).
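A tiny simulation makes these failure modes concrete. The three-node topology, table contents, and hop limit below are invented purely for illustration:

```python
# Hypothetical topology A-B-C with destination D behind C. If B's table
# still points back to A after a topology change (stale state), a packet
# bounces between A and B: a "micro loop". A missing entry is a black hole.

def forward(tables, src, dst, max_hops=8):
    """Follow per-node next-hop tables; detect loops and black holes."""
    node, hops = src, 0
    while node != dst:
        nxt = tables[node].get(dst)
        if nxt is None:
            return "black hole"          # no entry: packet dropped
        node = nxt
        hops += 1
        if hops > max_hops:
            return "micro loop"          # hop limit exceeded: looping
    return "delivered"

# Consistent tables: everyone agrees D is behind C.
consistent = {"A": {"D": "B"}, "B": {"D": "C"}, "C": {"D": "D"}}
# Inconsistent tables: B has not converged and still points to A.
stale = {"A": {"D": "B"}, "B": {"D": "A"}, "C": {"D": "D"}}

print(forward(consistent, "A", "D"))  # delivered
print(forward(stale, "A", "D"))       # micro loop
```

With consistent tables the packet is delivered; a single stale table turns the same destination into a loop, which is exactly why destination-based forwarding demands consistency.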
This trade-off was made decades ago: the forwarding ASICs or processors of that era had limited capabilities, and links had limited bandwidth to carry the extra information needed for source routing.
Traditional Routing protocols and SDN are CA mode
Traditional routing protocols like BGP rely on a centralized Route Reflector (RR), which cannot be partitioned across multiple nodes; massive-scale datacenters that deploy BGP+EVPN must run EBGP between Spine and Leaf to avoid this issue.
In modern SDN controller design, we must keep the latency between control nodes below 50 milliseconds for database/flow-table synchronization. Any out-of-sync condition is well known as "split brain". This means CA mode cannot support Partition tolerance (CA without P).
Segment Routing is AP mode
When Clarence Filsfils designed Segment Routing, TI-LFA assumed inconsistency in the network and focused on Availability and Partition tolerance (AP mode). TI-LFA separates the topology into P space and Q space; within each space it assumes consistency and uses destination-based forwarding, but it inserts a label for the PQ node to stitch the two spaces together.
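The P/Q-space idea can be sketched with a few lines of graph code. The five-node ring topology and unit link costs below are assumptions for illustration; real TI-LFA runs on the IGP's weighted SPF:

```python
from collections import deque

# Sketch of TI-LFA P/Q-space computation on an invented five-node ring.
# P space: nodes the source still reaches at unchanged cost without the
# protected link; Q space: nodes that still reach the destination likewise.

def shortest_paths(adj, src):
    """BFS distances on a unit-cost graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def remove_link(adj, a, b):
    """Return a copy of the graph without the protected link a-b."""
    return {u: [v for v in nbrs if (u, v) not in {(a, b), (b, a)}]
            for u, nbrs in adj.items()}

def p_q_spaces(adj, src, dst, link):
    pruned = remove_link(adj, *link)
    d_src, d_src_p = shortest_paths(adj, src), shortest_paths(pruned, src)
    d_dst, d_dst_p = shortest_paths(adj, dst), shortest_paths(pruned, dst)
    p = {n for n in adj if n in d_src_p and d_src_p[n] == d_src.get(n)}
    q = {n for n in adj if n in d_dst_p and d_dst_p[n] == d_dst.get(n)}
    return p, q

# Ring: A-B-C-D-E-A; protect link A-B, source A, destination B.
adj = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
       "D": ["C", "E"], "E": ["A", "D"]}
p, q = p_q_spaces(adj, "A", "B", ("A", "B"))
print(sorted(p), sorted(q), sorted(p & q))  # ['A', 'D', 'E'] ['B', 'C', 'D'] ['D']
```

The repair path for the protected link A-B pushes a segment for the PQ node D, which sits in both spaces: destination-based forwarding stays consistent inside each space, and the inserted segment stitches them together.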

Design principles for cloud-scale routing protocols
Principle 1: Control Plane must be CP.
The control plane must provide Consistency and Partition tolerance (CP), but cannot guarantee Availability. A distributed key-value store such as etcd could serve as the next-generation control plane.
Principle 2: Data Plane must be AP -> BASE.
The data plane must provide Partition tolerance and Availability (AP); BASE (Basically Available, Soft state, Eventual consistency) is our design goal for the data plane. Segment Routing combined with a smart link-state protocol could be used to achieve this goal.
Principle 3: Isomorphism.
When we design an SD-WAN system, the whole system must work as one big router; when we design a cloud-scale SDN system, it must work as one big switch.
GCP outage root cause
Google’s underlying networking control plane consists of multiple distributed components that make up the Software Defined Networking (SDN) stack. These components run on multiple machines so that failure of a machine or even multiple machines does not impact network capacity. To achieve this, the control plane elects a leader from a pool of machines to provide configuration to the various infrastructure components. The leader election process depends on a local instance of Google’s internal lock service to read various configurations and files for determining the leader. The control plane is responsible for Border Gateway Protocol (BGP) peering sessions between physical routers connecting a cloud zone to the Google backbone.
Google’s internal lock service provides Access Control List (ACLs) mechanisms to control reading and writing of various files stored in the service. A change to the ACLs used by the network control plane caused the tasks responsible for leader election to no longer have access to the files required for the process. The production environment contained ACLs not present in the staging or canary environments due to those environments being rebuilt using updated processes during previous maintenance events. This meant that some of the ACLs removed in the change were in use in europe-west2-a, and the validation of the configuration change in testing and canary environments did not surface the issue.
Lessons from GCP's outage
From the RCA, it is clear that Google's internal routing control plane operates in CP mode (consistent but not available under network partitions); it likely uses the Chubby lock service to elect a leader in this massive-scale distributed system.
Although Google's RCA describes the trigger as a wrong ACL, the principle behind the scenes is that CP mode cannot guarantee availability. This is a trade-off: destination-based forwarding is efficient but must stay consistent, which means we cannot accept AP (available but not consistent under network partitions).
Another issue is that we cannot keep availability, which is very important and is the major difference between SDN and SD-WAN. In SDN (BGP-EVPN MSDC), the control link is reliable; in SD-WAN, the edge node frequently falls into "headless" mode, so we cannot guarantee availability. The only option for the control-plane protocol is CP mode.
When the control plane cannot guarantee availability, we must leverage the data plane itself to provide BASE (Basically Available, Soft state, Eventual consistency).
In summary, GCP's control plane design follows Principle 1 (Control Plane := CP mode), but the outage indicates a violation of Principle 2: the data plane must be BASE.
The best cloud-scale routing system
Based on these thoughts and the principles above, I built a prototype called "Ruta". It uses etcd as the control plane (draft-zartbot-srou-signalling) and Segment Routing over UDP as the data plane (draft-zartbot-sr-udp).

The control plane design follows Principle 1 and leverages the existing open-source project etcd to keep it in CP mode; the prefix format follows EVPN. Currently we implement only Type-2 and Type-5 routes.
It has a dedicated cache to address the availability (headless) issue during network partitions. There are two lease timers: the shorter one is used for node keepalive and link state, the longer one for resource allocation. During a control-plane outage, a node keeps this cache and leverages Segment Routing plus its own link-probe capability to keep the network basically available.
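A minimal sketch of such a two-lease cache, assuming invented TTL values and a pluggable clock so that expiry can be demonstrated without waiting:

```python
import time

# Hypothetical headless-mode cache with two lease timers, as described
# above: a short lease for node keepalive / link state, and a long lease
# for resource allocations. During a control-plane outage the long-lease
# entries survive, keeping forwarding basically available (soft state).

class TwoLeaseCache:
    KEEPALIVE_TTL = 10          # seconds (invented value): liveness / link state
    RESOURCE_TTL = 3600         # seconds (invented value): allocated resources

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}        # key -> (value, expiry)

    def put(self, key, value, resource=False):
        ttl = self.RESOURCE_TTL if resource else self.KEEPALIVE_TTL
        self._store[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expiry = item
        if self._clock() > expiry:
            del self._store[key]   # soft state: expired entries vanish
            return None
        return value

# Simulate a one-minute control-plane outage with a fake clock.
now = [0.0]
cache = TwoLeaseCache(clock=lambda: now[0])
cache.put("node1/linkstate", "up")
cache.put("node1/label", 1001, resource=True)
now[0] = 60.0
print(cache.get("node1/linkstate"))  # None: keepalive lease expired
print(cache.get("node1/label"))      # 1001: resource lease still valid
```

The short lease lets stale liveness data age out quickly, while the long lease preserves allocations long enough for the data plane to stay basically available until the control plane returns.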
It also leverages the etcd proxy function to enhance availability. This design allows devices to self-organize with zero configuration.

For the data plane we did not directly use SR-MPLS or SRv6, because we are targeting the next-generation SD-WAN case, and the intent is to leverage public cloud bandwidth as a backbone. We therefore had to make sure SR works in IPv4 environments and that NAT traversal is supported. Segment Routing over UDP (SRoU) was designed for this purpose.
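To make the idea concrete, here is a toy encoder/decoder for a segment list carried inside a UDP payload. The header layout, magic number, and field sizes are invented for illustration and are not the actual draft-zartbot-sr-udp wire format:

```python
import socket
import struct

# Illustrative SRoU-style framing: a segment list of IPv4 addresses packed
# after a small header inside a UDP payload. Because the outer header is
# plain UDP/IPv4, ordinary NAT devices can translate it, which is the
# NAT-traversal property the design requires.

MAGIC = 0x5201  # invented marker value

def build_srou(segments, inner_payload: bytes) -> bytes:
    """Pack: magic(2) | seg_count(1) | segments_left(1) | 4 bytes/segment."""
    hdr = struct.pack("!HBB", MAGIC, len(segments), len(segments) - 1)
    segs = b"".join(socket.inet_aton(s) for s in segments)
    return hdr + segs + inner_payload

def parse_srou(data: bytes):
    """Unpack header, segment list, and the remaining inner payload."""
    magic, count, left = struct.unpack("!HBB", data[:4])
    assert magic == MAGIC
    segs = [socket.inet_ntoa(data[4 + 4*i: 8 + 4*i]) for i in range(count)]
    return segs, left, data[4 + 4*count:]

pkt = build_srou(["192.0.2.1", "198.51.100.7"], b"hello")
segs, left, payload = parse_srou(pkt)
print(segs, left, payload)  # ['192.0.2.1', '198.51.100.7'] 1 b'hello'
```

Each transit node would decrement `segments_left` and forward toward the next segment address, mirroring how SR-MPLS pops labels, but entirely inside a NAT-friendly UDP packet.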


We deployed it with 20 nodes all over the world; from any of them I can reach any network within 200 ms latency and nearly zero packet loss.

The route lookup is recursive: the overlay route (Type-2/Type-5) points to an SR node, and the underlay SRoU route then resolves that node's private/public address.
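A minimal sketch of this recursive lookup, with invented route entries:

```python
# Overlay (EVPN Type-2 MAC / Type-5 IP prefix) routes resolve to an SR
# node ID; an underlay table maps the node ID to its private/public
# addresses. All entries below are illustrative, not real deployments.

overlay = {
    "10.1.0.0/16": "node-tokyo",      # Type-5 IP prefix route
    "02:42:ac:11:00:02": "node-nyc",  # Type-2 MAC route
}
underlay = {
    "node-tokyo": {"private": "172.16.0.5", "public": "203.0.113.5"},
    "node-nyc":   {"private": "172.16.0.9", "public": "198.51.100.9"},
}

def resolve(dest):
    """Two-step recursive lookup: overlay route -> SR node -> addresses."""
    node = overlay.get(dest)
    if node is None:
        return None                 # no overlay route
    return underlay.get(node)       # underlay SRoU route for the node

print(resolve("10.1.0.0/16"))  # {'private': '172.16.0.5', 'public': '203.0.113.5'}
```

Keeping the two tables separate means underlay churn (a node's public address changing behind a NAT) never forces an overlay route update.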

Underlay public-address reachability is measured with an active probe (TWAMP-like) to assess path performance.
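A minimal TWAMP-like probe can be sketched in a few lines; the local loopback reflector below stands in for a remote responder:

```python
import socket
import struct
import time

# Minimal TWAMP-like active probe sketch: the sender timestamps a probe,
# the reflector echoes it back, and the sender computes round-trip time.
# A loopback socket plays the reflector role here for illustration.

reflector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
reflector.bind(("127.0.0.1", 0))
reflector.settimeout(2.0)
refl_addr = reflector.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.settimeout(2.0)

def probe_rtt():
    t0 = time.monotonic()
    sender.sendto(struct.pack("!d", t0), refl_addr)   # timestamped probe
    data, addr = reflector.recvfrom(64)               # reflector receives...
    reflector.sendto(data, addr)                      # ...and echoes it back
    echoed, _ = sender.recvfrom(64)
    (sent,) = struct.unpack("!d", echoed)
    return time.monotonic() - sent                    # round-trip time

rtt = probe_rtt()
print(f"loopback RTT: {rtt * 1000:.3f} ms")
```

A real deployment would also carry reflector timestamps (as TWAMP does) to separate one-way delay and processing time, and would feed these samples into the link-state data described next.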

All of this link state can be uploaded to etcd, but each node downloads it selectively. Each linecard node randomly selects three fabric nodes: two with the minimum geographic distance and one for geo-redundancy. The node itself decides the best forwarding path; if the computation cannot finish or cannot meet the SLA, the node may subscribe to more nodes' link state.
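The selection rule can be sketched as follows. The fabric locations are invented, and picking the farthest node as the geo-redundant choice is our own assumption; the design only says one extra node is chosen for geo-redundancy:

```python
import math

# Sketch of the fabric-selection rule: each linecard node picks three
# fabric nodes, two with minimum great-circle distance plus one for
# geo-redundancy (here: the farthest, an illustrative assumption).

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def select_fabrics(me, fabrics):
    """Two nearest fabrics plus one distant fabric for redundancy."""
    ranked = sorted(fabrics, key=lambda f: haversine_km(me, fabrics[f]))
    return ranked[:2] + [ranked[-1]]

fabrics = {
    "tokyo":     (35.68, 139.69),
    "singapore": (1.35, 103.82),
    "frankfurt": (50.11, 8.68),
    "virginia":  (38.95, -77.45),
}
print(select_fabrics((31.23, 121.47), fabrics))  # linecard in Shanghai
```

For a linecard in Shanghai this yields Tokyo and Singapore as the two nearest fabrics plus Virginia for geo-redundancy, so a regional failure cannot take out all three subscriptions at once.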
This link-state data is also very useful for AI-based predictive routing.

We also have an AI-based passive monitoring algorithm to detect performance degradation, which was the winning project of the Cisco Innovation Everywhere Challenge.

CONCLUSION
Based on the CAP theorem, we analyzed the state-of-the-art routing protocols and SDN designs with their limitations, and proposed three principles for the next-generation cloud-scale routing protocol:
- Control Plane must keep Consistency and Partition tolerance (CP).
- Data Plane must keep Partition tolerance and Availability (AP).
- Isomorphism.
We implemented a working prototype based on these principles; it shows many benefits compared with existing systems in terms of reliability, scale, and performance. It could be used not only for SD-WAN, but also to replace BGP-EVPN+VXLAN-based cloud-scale datacenters. The data plane is simple to implement on any P4-like switch. We will publish some example code in the near future.