
Move Fast, Unbreak Things!

Network debugging at scale

Petr Lapukhov
Network Engineer
People who made this possible

Aijay Adams
Lance Dryden
Angelo Failla
Zaid Hammoudi
James Paussa
James Zeng
Basics of fault detection
How people fix broken networks
Data-center network

[Diagram: multi-stage Clos fabric with spine switches on top, cluster switches in the middle, and rack switches at the bottom]

- Multi-stage Clos topologies
- Lots of devices and links
- BGP only
- IPv6 >> IPv4
- Large ECMP fan-out
- L2 and L3 ECMP
Backbone network

[Diagram: data-centers attach via DR routers to BB routers across an MPLS core; ECMP over auto-bandwidth LSPs between BB routers]

- MPLS core
- BB = Backbone Router (LSR)
- DR = Datacenter Router (LER)
- Data-center attachment
- Auto-bandwidth LSPs
- ECMP over MPLS tunnels
Detecting packet loss

Standard counters vs. non-standard counters:
- Too slow
- Unreliable

Example of non-standard platform counters:

fsw001.p001.f01.atn1# show platform trident counters
Debug counters / Description:
T2Fabric19/0/1 RX - Non congestion discards
T2Fabric19/0/1 TX - IPV4 L3 unicast aged and dropped pkts
T2Fabric19/0/1 RX - Receive policy discard
T2Fabric19/0/1 TX - L2 multicast drop
T2Fabric19/0/1 RX - Tunnel error packets
T2Fabric19/0/1 TX - Invalid VLAN
T2Fabric19/0/1 RX - Receive VLAN drop
T2Fabric19/0/1 RX - Receive multicast drop
T2Fabric19/0/1 TX - Dropped because TTL counter
T2Fabric19/0/1 RX - Receive uRPF drop
T2Fabric19/0/1 TX - Packet dropped due to any condition
T2Fabric19/0/1 RX - IBP discard and CPB full
T2Fabric19/0/1 TX - Miss in VXLT table counter
How do humans debug it?

- Ping/hping/nping (TCP/ICMP/UDP probing)
- Change src port to try all ECMP paths (see the sketch below)
- Find a broken path, then run traceroute over it
- ping && traceroute are still important
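
A minimal Go sketch of the source-port trick: send one UDP probe from each of a range of source ports so that each 5-tuple hashes onto a different ECMP path, then note which source ports time out. The target address, port range, and payload are placeholders, not part of NetNORAD.

// ecmp_sweep.go: probe the same destination from many UDP source ports so the
// 5-tuple hash exercises different ECMP paths. Target/ports are placeholders.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	raddr, err := net.ResolveUDPAddr("udp", "target.example.com:31338") // hypothetical responder
	if err != nil {
		panic(err)
	}
	for srcPort := 32701; srcPort <= 32716; srcPort++ {
		conn, err := net.DialUDP("udp", &net.UDPAddr{Port: srcPort}, raddr)
		if err != nil {
			fmt.Printf("src port %d: %v\n", srcPort, err)
			continue
		}
		start := time.Now()
		conn.Write([]byte("probe"))
		conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
		buf := make([]byte, 1500)
		if _, err := conn.Read(buf); err != nil {
			fmt.Printf("src port %d: lost or timed out\n", srcPort) // this path is suspect
		} else {
			fmt.Printf("src port %d: rtt %v\n", srcPort, time.Since(start))
		}
		conn.Close()
	}
}

Sweeping source ports like this is why a single lossy path cannot hide behind the many clean ones a plain ping would likely take.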
NetNORAD
The network fault detector
Massive pinging FTW

[Diagram: pingers send probes across the network to responders]

- Run pingers on some machines
- Run responders on lots of machines
- Target count ~= 100x pinger count
- Collect packet loss and RTT
- Analyze and report
NetNORAD evolution

- 1st: run `ping` from a Python agent
- 2nd: raw sockets, fast TCP probes
- 3rd: raw sockets, fast ICMP probes
- Now: UDP probing + responder agent
Pinger and responders

Pingers                               Responders
Send UDP probes to target list        Receive/reply to UDP probes
Timestamp & log results               Timestamp
High ping rate (up to 1 Mpps)         Low load: thousands of pps
Set DSCP marking                      Reflect DSCP value back

Open sourced (C++): https://github.com/facebook/UdpPinger
A minimal responder sketch follows below.
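
A minimal Go sketch of the responder role described above: receive a UDP probe, stamp a receive time, and echo it back. This is not the open-sourced C++ UdpPinger; the port number and payload layout are assumptions. Reflecting the probe's DSCP would additionally require reading and re-setting the TOS/traffic-class byte (e.g. via golang.org/x/net/ipv4), which is omitted here.

// responder.go: minimal UDP responder sketch -- receive a probe, append a
// receive timestamp, and echo it back to the sender. Port is an assumption.
package main

import (
	"encoding/binary"
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 31338}) // assumed probe port
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 1500)
	for {
		n, peer, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		// Append the receive time after the original payload and echo it back.
		// NOTE: reflecting the probe's DSCP needs extra TOS handling (omitted).
		reply := make([]byte, n+8)
		copy(reply, buf[:n])
		binary.BigEndian.PutUint64(reply[n:], uint64(time.Now().UnixNano()))
		conn.WriteToUDP(reply, peer)
	}
}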
Allocating pingers and targets

Pingers                       Targets
1+ cluster per DC             2+ targets per rack
10+ racks per cluster         Tens of thousands of targets
Two pingers per rack          Consult host alarms
Probe timestamping

[Diagram: send and receive timestamps taken along the probe's round trip]

- Path changes / congestion
- Kernel timestamps
- Application timeout tuning
Why UDP probing?

- No TCP RST packets
- Efficient ECMP
- RSS friendly
- Extensible

Probe format (sketched in code below):
- Signature
- SentTime
- RcvdTime
- ResponseTime
- Traffic Class
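
One possible Go layout and serialization of the probe fields listed above; the field widths, ordering, and signature value are assumptions, not UdpPinger's actual wire format.

// probe.go: one possible encoding of the probe fields on the slide.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Probe mirrors the fields above; field widths are assumptions.
type Probe struct {
	Signature    uint32 // identifies/validates the probe
	SentTime     uint64 // pinger send timestamp (ns)
	RcvdTime     uint64 // responder receive timestamp (ns)
	ResponseTime uint64 // responder send timestamp (ns)
	TrafficClass uint8  // DSCP / traffic class the probe was sent with
}

// Marshal writes the fixed-size fields in order, big-endian.
func (p *Probe) Marshal() []byte {
	var b bytes.Buffer
	binary.Write(&b, binary.BigEndian, p)
	return b.Bytes()
}

func main() {
	p := Probe{Signature: 0xFACEB00C, SentTime: 123, TrafficClass: 26}
	buf := p.Marshal()
	fmt.Printf("%d-byte probe: %x\n", len(buf), buf)
}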
Deployment caveats

Caveat                        Solution
Polarization with ICMP        Use UDP
Slow IPv6 FIB lookups         4.x kernels
High CPU load on hosts        Multi-threaded responder / RSS
Checksum offloading           Disable offloading
NetNORAD
How to ping and process data?
Challenges
- Nx 100Gbps of ping traffic
- Tens of thousands of targets
- Hundreds of pingers
- Lots of data to process
- We really do not care about each host
- The unit of interest is cluster health
The network hierarchy

[Diagram: the backbone connects regions and POPs; each region contains data-centers, each data-center contains clusters, and each cluster contains racks]
Pinging inside clusters

[Diagram: cluster switches (CSW 1-4) above rack switches (RSW 1-3), with targets and pingers under each rack switch]

- Detect issues with rack switches
- Dedicated pingers per cluster
- Probe ALL machines in the cluster
- Store time-series per host/rack
- Think HBase for storage
- Lags real-time by ~2-3 minutes

CSW = cluster switch, RSW = rack switch
Pinging the clusters

[Diagram: a target cluster is probed from three vantage points across the WAN]

- Pinger 1: same DC
- Pinger 2: same region
- Pinger 3: outside of the region
Proximity tagging

Pinging hierarchy (classification sketched in code below):

Proximity           Scope                               Goal
Outside of region   Across the backbone network         WAN / end-to-end issues
Same region         Between data-centers in a region    Issues between DCs
Same DC             Inside one data-center              Issues in cluster switches
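
A sketch of how a probe result could be tagged with one of the proximity scopes above, assuming every pinger and target is annotated with its cluster/DC/region. The Location type and names are hypothetical.

// proximity.go: tag a pinger/target pair with a pinging-hierarchy scope.
package main

import "fmt"

// Location is a hypothetical annotation carried by pingers and targets.
type Location struct {
	Cluster, DC, Region string
}

// Proximity maps a pinger/target pair onto the table above.
func Proximity(pinger, target Location) string {
	switch {
	case pinger.DC == target.DC:
		return "same-dc" // issues in cluster switches
	case pinger.Region == target.Region:
		return "same-region" // issues between DCs in a region
	default:
		return "outside-region" // issues across the backbone / WAN
	}
}

func main() {
	p := Location{Cluster: "c01", DC: "dc1", Region: "region1"}
	t := Location{Cluster: "c07", DC: "dc2", Region: "region1"}
	fmt.Println(Proximity(p, t)) // same-region
}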
Processing the data
Processing pipeline: Scribe

[Diagram: pingers write results into a sharded data-set; processors read from the shards]

- Scribe: distributed logging system
- Similar OSS project: Kafka
- Pingers write results
- Processors consume them
- Propagation delay: ~1-20 seconds
Alarming on packet loss

[Diagram: 90th-percentile packet loss for cluster X built from DC data, with the alarm raised on a rising threshold and cleared on a falling threshold]

- Build packet-loss time-series
- Track percentiles (e.g. the 90th percentile)
- Alarm on a rising threshold
- Clear on a falling threshold
- Time to detect loss: 20 seconds

The raise/clear logic is sketched below.
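
A sketch of the raise/clear hysteresis described above, operating on one window of per-target loss percentages for a cluster; the percentile choice, thresholds, and sample data are illustrative.

// loss_alarm.go: alarm on the 90th-percentile loss of a cluster with
// raise/clear hysteresis. Thresholds and sample data are illustrative.
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the samples.
func percentile(samples []float64, p float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(p / 100 * float64(len(s)-1))
	return s[idx]
}

// Alarm raises when the percentile crosses `raise` and clears below `clear`.
type Alarm struct {
	raise, clear float64 // loss-percent thresholds, raise > clear
	firing       bool
}

// Update feeds one window of per-target loss samples and reports transitions.
func (a *Alarm) Update(lossPct []float64) {
	p90 := percentile(lossPct, 90)
	switch {
	case !a.firing && p90 >= a.raise:
		a.firing = true
		fmt.Printf("ALARM: p90 loss %.1f%%\n", p90)
	case a.firing && p90 <= a.clear:
		a.firing = false
		fmt.Printf("CLEAR: p90 loss %.1f%%\n", p90)
	}
}

func main() {
	a := &Alarm{raise: 1.0, clear: 0.2}
	a.Update([]float64{0, 0, 0.1, 0.1, 0, 0, 0.2, 0, 0, 0}) // quiet
	a.Update([]float64{0, 5, 7, 0.1, 6, 8, 0.2, 9, 4, 5})   // raises
	a.Update([]float64{0, 0, 0.1, 0.1, 0, 0, 0.1, 0, 0, 0}) // clears
}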
Visual analysis: Scuba
- In-memory row-oriented storage
- Scuba: Diving into Data at Facebook
- Similar OSS project: InfluxDB
Detecting false positives
Bad target detection

[Diagram: racks 1-4 with two targets each; machine reboots cause loss spikes on individual targets]

- Baseline loss
- Packet loss spike (e.g. machine reboots)
- Filter outliers (sketched below)
- Done in the pinger
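
A sketch of the outlier filtering idea: if only a few targets in a rack show heavy loss while the rest are clean, treat them as bad hosts (reboots) rather than a network fault. The threshold and majority rule here are made up.

// bad_targets.go: drop targets whose loss looks like a host problem rather
// than a network fault. Thresholds are illustrative, not NetNORAD's values.
package main

import "fmt"

// filterOutliers keeps a target unless it spikes while most of the rack is
// healthy -- if everything is lossy, it is likely the network, not the hosts.
func filterOutliers(lossByTarget map[string]float64) map[string]float64 {
	const spike = 20.0 // percent loss that marks a suspect target
	healthy := 0
	for _, l := range lossByTarget {
		if l < spike {
			healthy++
		}
	}
	kept := map[string]float64{}
	for t, l := range lossByTarget {
		if l >= spike && healthy*2 > len(lossByTarget) {
			continue // isolated spike in an otherwise clean rack: bad target
		}
		kept[t] = l
	}
	return kept
}

func main() {
	rack := map[string]float64{
		"host1": 0.1, "host2": 0.0, "host3": 100.0, // host3 is rebooting
		"host4": 0.2, "host5": 0.1, "host6": 0.0,
	}
	fmt.Println(filterOutliers(rack))
}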
Bad pinger problem

[Diagram: a line-card malfunction on cluster switch 1 makes the pingers under rack switch 1 see loss toward everything]

- Bad cluster switch (e.g. line-card malfunction)
- Pingers see loss everywhere
- Population size is small
- Harder to weed out outliers
Bad pinger detection

[Diagram: DC and region pingers probe cluster X from outside to cross-check its local pingers]

- Need more data
- Monitor the pinger cluster
- Use DC/region pingers
- Mark bad clusters (cross-check sketched below)
- Done in the processor
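
A sketch of the cross-check described above: a pinger cluster that reports loss to almost everything, while DC/region pingers see a clean path toward it, is probably the broken party. The thresholds and input shapes are illustrative.

// bad_pinger.go: cross-check a pinger cluster against external pingers.
package main

import "fmt"

// pingerClusterIsBad returns true when the local pingers see loss to most
// targets but DC/region pingers report a healthy path toward that cluster.
func pingerClusterIsBad(localLossPct []float64, externalLossToCluster float64) bool {
	lossy := 0
	for _, l := range localLossPct {
		if l > 1.0 {
			lossy++
		}
	}
	seesLossEverywhere := lossy*10 > len(localLossPct)*9 // >90% of targets lossy
	externalIsClean := externalLossToCluster < 0.5
	return seesLossEverywhere && externalIsClean
}

func main() {
	local := []float64{12, 15, 9, 14, 20, 11, 13, 18, 16, 10}
	fmt.Println(pingerClusterIsBad(local, 0.0)) // true: mark the cluster bad
}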
Conclusions
- Pinger/responder asymmetry
- Real-time is key
- Pinging hierarchy
- False positive elimination
Isolating network faults
Detecting is not everything
Root cause isolation

[Diagram: cluster X in one data-center shows an issue above it; three pingers triangulate the fault]

- Pinger 1 (same DC): no loss to X
- Pinger 2 (same region): sees loss
- Pinger 3 (outside the region): sees loss too
- Likely problem with the spine switches above cluster X
Downstream suppression

[Diagram: loss at data-center X also shows up as loss in several clusters below it; the multiple cluster alarms are suppressed into a single data-center alarm]

A sketch of the suppression logic follows below.
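
A sketch of collapsing cluster-level alarms under an active data-center alarm, assuming each alarm carries its place in the hierarchy; the entity/parent naming is hypothetical.

// suppress.go: keep only alarms whose parent in the hierarchy is not itself
// alarming, so a data-center issue does not also page for every cluster.
package main

import "fmt"

type Alarm struct {
	Entity string // e.g. "dc-X" or "dc-X/cluster-3"
	Parent string // "" for the top of the hierarchy
}

// suppressDownstream drops alarms already covered by a firing parent alarm.
func suppressDownstream(alarms []Alarm) []Alarm {
	firing := map[string]bool{}
	for _, a := range alarms {
		firing[a.Entity] = true
	}
	var kept []Alarm
	for _, a := range alarms {
		if a.Parent != "" && firing[a.Parent] {
			continue // a higher-level alarm already covers this one
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	alarms := []Alarm{
		{Entity: "dc-X"},
		{Entity: "dc-X/cluster-1", Parent: "dc-X"},
		{Entity: "dc-X/cluster-3", Parent: "dc-X"},
		{Entity: "dc-X/cluster-5", Parent: "dc-X"},
	}
	fmt.Println(suppressDownstream(alarms)) // single alarm: dc-X
}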
Next steps to isolate

[Diagram: loss somewhere among clusters 1-3]

- Approximate location
- Still lots of devices/links
- Check device counters
- If that does not help: remember traceroute?
Fbtracert: fast and wide traceroute

[Diagram: probes from the source fan out over every ECMP path to the target; each UDP source port (32701-32704) pins one path, and sweeping TTL (1-6) exercises every hop on that path; loss appears on only some of the paths]
Example results per source port:

Port 32701          Port 32702          Port 32703          Port 32704
Path  Sent  Rcvd    Path  Sent  Rcvd    Path  Sent  Rcvd    Path  Sent  Rcvd
1     20    20      1     20    20      1     20    20      1     20    20
2     20    20      3     20    20      2     20    20      3     20    20
4     20    20      6     20    20      5     20    20      7     20    20
8     20    20      9     20    20      8     20    20      9     20    20
10    20    14      10    20    20      10    20    16      10    20    20
TGT   20    15      TGT   20    20      TGT   20    17      TGT   20    20

A sketch of the probing loop behind these counts follows below.
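
The counts above come from sending a fixed number of probes per (source port, TTL) pair. The Go sketch below shows the core idea only: keeping the UDP source port fixed pins one ECMP path, and sweeping the TTL makes each hop on that path answer with ICMP time-exceeded (collecting those ICMP replies is omitted). The target address and ports are placeholders; this is not the fbtracert implementation itself.

// tracert_sketch.go: fixed source port = one ECMP path; TTL sweep = every hop.
package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func probePath(srcPort, maxTTL int, target string) error {
	raddr, err := net.ResolveUDPAddr("udp4", target)
	if err != nil {
		return err
	}
	conn, err := net.DialUDP("udp4", &net.UDPAddr{Port: srcPort}, raddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	p := ipv4.NewConn(conn)
	for ttl := 1; ttl <= maxTTL; ttl++ {
		if err := p.SetTTL(ttl); err != nil {
			return err
		}
		// Each hop along this single path should answer with ICMP
		// time-exceeded; counting replies per (srcPort, TTL) yields a
		// table like the one above.
		if _, err := conn.Write([]byte("fbtracert-probe")); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	for srcPort := 32701; srcPort <= 32704; srcPort++ {
		if err := probePath(srcPort, 6, "192.0.2.10:33434"); err != nil {
			log.Println(err)
		}
	}
}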
Fbtracert limitations
- CoPP drops ICMP responses
- Paths may flap (MPLS LSP)
- ICMP gets tunneled with MPLS TE
- ICMP responses from wrong interfaces

Open sourced (Golang): https://github.com/facebook/fbtracert
Conclusions
- Fault isolation is actively evolving
- Traceroute approach looks generic
- Limited by current hardware
- Backbone path churn is a serious challenge
Evolving fault detection & isolation
Near and far future
Support for on-box agents

- Run the same code on routers
- POSIX API
- Other SDKs are welcome
- Some vendors already do that
- Be like FBOSS!
Streaming telemetry

[Diagram: a switch streams its drop counters over Thrift/Protobuf to multiple consumers]

- Publishing device counters
- Faster detection
- Protobuf/Thrift for encoding
- Limited amount of counters
- Platform-specific
In-band telemetry

[Diagram: each switch on the path stacks its device ID and queue depth behind the IP/UDP header of the probe]

- Next generation of silicon is emerging
- Embed device stats in packets
- E.g. device ID, or queue depth
- Use extra space in the UDP probes (sketched below)
- Allows for real-time path tracing
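
A sketch of how the extra space in a UDP probe could carry the per-hop stats mentioned above (device ID, queue depth). The record layout is an assumption for illustration, not a real in-band telemetry header format.

// inband.go: append a per-hop telemetry record to a probe payload, as a
// switch with in-band telemetry support might. Layout is an assumption.
package main

import (
	"encoding/binary"
	"fmt"
)

type HopRecord struct {
	DeviceID   uint32
	QueueDepth uint32 // cells or bytes, depending on the platform
}

// appendHop stacks one 8-byte hop record onto the probe payload.
func appendHop(payload []byte, h HopRecord) []byte {
	rec := make([]byte, 8)
	binary.BigEndian.PutUint32(rec[0:4], h.DeviceID)
	binary.BigEndian.PutUint32(rec[4:8], h.QueueDepth)
	return append(payload, rec...)
}

func main() {
	probe := []byte("netnorad-probe")
	probe = appendHop(probe, HopRecord{DeviceID: 101, QueueDepth: 37})
	probe = appendHop(probe, HopRecord{DeviceID: 205, QueueDepth: 4})
	fmt.Printf("%d hop records embedded, %d bytes total\n",
		(len(probe)-len("netnorad-probe"))/8, len(probe))
}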
