In designing a network device, you make dozens of decisions that affect the speed with which it will perform, sometimes for better and sometimes for worse. Network Algorithmics provides a complete, coherent methodology for maximizing speed while meeting your other design goals.
Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices
Authors: George Varghese, Jun Xu
- Publisher: Elsevier
- Year: 2022
- Language: English
- Pages: 596
- Edition: 2
- Category: Library
Table of Contents
Front Cover
Network Algorithmics
Copyright
Contents
Preface to the second edition
Preface
Audience
What this book is about
Organization of the book
Features
Usage
Why this book was written
Acknowledgments
15 principles used to overcome network bottlenecks
Part 1 The rules of the game
1 Introducing network algorithmics
1.1 The problem: network bottlenecks
1.1.1 Endnode bottlenecks
1.1.2 Router bottlenecks
1.2 The techniques: network algorithmics
1.2.1 Warm-up example: scenting an evil packet
1.2.2 Strawman solution
1.2.3 Thinking algorithmically
1.2.4 Refining the algorithm: exploiting hardware
1.2.5 Cleaning up
1.2.6 Characteristics of network algorithmics
1.3 Exercise
2 Network implementation models
2.1 Protocols
2.1.1 Transport and routing protocols
2.1.2 Abstract protocol model
2.1.3 Performance environment and measures
2.2 Hardware
2.2.1 Combinatorial logic
2.2.2 Timing and power
2.2.3 Raising the abstraction level of hardware design
2.2.4 Memories
Registers
SRAM
Dynamic RAM
Page-mode DRAMs
Interleaved DRAMs
2.2.5 Memory subsystem design techniques
2.2.6 Component-level design
2.2.7 Final hardware lessons
2.3 Network device architectures
2.3.1 Endnode architecture
2.3.2 Router architecture
Lookup
Switching
Queuing
Header validation and checksums
Route processing
Protocol processing
Fragmentation, redirects, and ARPs
2.4 Operating systems
2.4.1 Uninterrupted computation via processes
2.4.2 Infinite memory via virtual memory
2.4.3 Simple I/O via system calls
2.5 Summary
2.6 Exercises
3 Fifteen implementation principles
3.1 Motivating the use of principles—updating ternary content-addressable memories
3.2 Algorithms versus algorithmics
3.3 Fifteen implementation principles—categorization and description
3.3.1 Systems principles
P1: Avoid obvious waste in common situations
P2: Shift computation in time
P3: Relax system requirements
P4: Leverage off system components
P5: Add hardware to improve performance
3.3.2 Principles for modularity with efficiency
P6: Create efficient specialized routines by replacing inefficient general-purpose routines
P7: Avoid unnecessary generality
P8: Don't be tied to reference implementations
P9: Pass hints in module interfaces
P10: Pass hints in protocol headers
3.3.3 Principles for speeding up routines
P11: Optimize the expected case
P11a: Use caches
P12: Add or exploit state to gain speed
P12a: Compute incrementally
P13: Optimize degrees of freedom
P14: Use special techniques for finite universes such as integers
P15: Use algorithmic techniques to create efficient data structures
3.4 Design versus implementation principles
3.5 Caveats
3.5.1 Eight cautionary questions
Q1: Is it worth improving performance?
Q2: Is this really a bottleneck?
Q3: What impact does the change have on the rest of the system?
Q4: Does the initial analysis indicate significant improvement?
Q5: Is it worth adding custom hardware?
Q6: Can protocol changes be avoided?
Q7: Do prototypes confirm the initial promise?
Q8: Will performance gains be lost if the environment changes?
3.6 Summary
3.7 Exercises
4 Principles in action
4.1 Buffer validation of application device channels
4.2 Scheduler for asynchronous transfer mode flow control
4.3 Route computation using Dijkstra's algorithm
4.4 Ethernet monitor using bridge hardware
4.5 Demultiplexing in the x-kernel
4.6 Tries with node compression
4.7 Packet filtering in routers
4.8 Avoiding fragmentation of LSPs
4.9 Policing traffic patterns
4.10 Identifying a resource hog
4.11 Getting rid of the TCP open connection list
4.12 Acknowledgment withholding
4.13 Incrementally reading a large database
4.14 Binary search of long identifiers
4.15 Video conferencing via asynchronous transfer mode
Part 2 Playing with endnodes
5 Copying data
5.1 Why data copies
5.2 Reducing copying via local restructuring
5.2.1 Exploiting adaptor memory
5.2.2 Using copy-on-write
Implementing copy-on-write
5.2.3 Fbufs: optimizing page remapping
Fbufs
5.2.4 Transparently emulating copy semantics
5.2.5 Are zero copies used today?
5.3 Avoiding copying using remote DMA
5.3.1 Avoiding copying in a cluster
5.3.2 Modern-day incarnations of RDMA
Fiber channel
Infiniband
iSCSI via iWarp and RoCE
5.4 Broadening to file systems
5.4.1 Shared memory
5.4.2 IO-lite: A unified view of buffering
5.4.3 Avoiding file system copies via I/O splicing
5.5 Broadening beyond copies
5.6 Broadening beyond data manipulations
5.6.1 Using caches effectively
Data Cache Optimization
Code arrangement
Locality-driven layer processing
Software engineering considerations
5.6.2 Direct memory access versus programmed I/O
5.7 Conclusions
5.8 Exercises
6 Transferring control
6.1 Why control overhead?
6.2 Avoiding scheduling overhead in networking code
6.2.1 Making user-level protocol implementations real
6.3 Avoiding context-switching overhead in applications
6.3.1 Process per client
6.3.2 Thread per client
6.3.3 Event-driven scheduler
6.3.4 Event-driven server with helper processes
6.3.5 Task-based structuring
6.4 Scalable I/O Notification
6.4.1 A server mystery
6.4.2 Problems with implementations of select()
Parameters
Usage in a Web server
Implementation
6.4.3 Analysis of select()
Obvious waste in select() implementation
Strategies and principles to fix select()
6.4.4 Speeding up select() without changing the API
6.4.5 Speeding up select() by changing the API
6.5 Avoiding system calls or Kernel Bypass
6.5.1 The virtual interface architecture proposal
6.5.2 Data Plane Development Kit (DPDK)
6.5.3 Single Root I/O Virtualization (SR-IOV)
6.6 Radical Restructuring of Operating Systems
6.7 Reducing interrupts
6.7.1 Avoiding receiver livelock
6.8 Conclusions
6.9 Exercises
7 Maintaining timers
7.1 Why timers?
7.2 Model and performance measures
7.3 Simplest timer schemes
7.4 Timing wheels
7.5 Hashed wheels
7.6 Hierarchical wheels
7.7 BSD implementation
7.8 Google Carousel implementation
7.9 Obtaining finer granularity timers
7.10 Conclusions
7.11 Exercises
8 Demultiplexing
8.1 Opportunities and challenges of early demultiplexing
8.2 Goals
8.3 CMU/Stanford packet filter: pioneering packet filters
8.4 Berkeley packet filter: enabling high-performance monitoring
8.5 Pathfinder: factoring out common checks
8.6 Dynamic packet filter: compilers to the rescue
8.7 Conclusions
8.8 Exercises
9 Protocol processing
9.1 Buffer management
9.1.1 Buffer allocation
Segregated pool allocator
Linux allocator
Batch allocator
9.1.2 Sharing buffers
Buffer stealing
Dynamic thresholds
9.2 Cyclic redundancy checks and checksums
9.2.1 Cyclic redundancy checks
Naive implementation
Implementation using linear feedback shift registers
Faster implementations
9.2.2 Internet checksums
Header checksum
9.2.3 Finessing checksums
9.3 Generic protocol processing
9.3.1 UDP processing
9.4 Reassembly
9.4.1 Efficient reassembly
9.5 Conclusions
9.6 Exercises
Part 3 Playing with routers
10 Exact-match lookups
10.1 Challenge 1: Ethernet under fire
10.2 Challenge 2: wire speed forwarding
10.3 Challenge 3: scaling lookups to higher speeds
10.3.1 Scaling via hashing
10.3.2 Using hardware parallelism
10.3.3 The d-left approach
10.4 Summary
10.5 Exercise
11 Prefix-match lookups
11.1 Introduction to prefix lookups
11.1.1 Prefix notation
11.1.2 Why variable-length prefixes?
11.1.3 Lookup model
11.2 Finessing lookups
11.2.1 Threaded indices and tag switching
11.2.2 Flow switching
11.2.3 Status of tag switching, flow switching, and multiprotocol label switching
11.3 Non-algorithmic techniques for prefix matching
11.3.1 Caching
11.3.2 Ternary content-addressable memories
11.4 Unibit tries
11.5 Multibit tries
11.5.1 Fixed-stride tries
11.5.2 Variable-stride tries
11.5.3 Incremental update
11.6 Level-compressed (LC) tries
11.7 Lulea-compressed tries
11.8 Tree bitmap
11.8.1 Tree bitmap ideas
11.8.2 Tree bitmap search algorithm
11.8.3 PopTrie: an alternate bitmap algorithm
11.9 Binary search on ranges
11.10 Binary search on ranges with Initial Lookup Table
11.11 Binary search on prefix lengths
11.12 Linear search on prefix lengths with hardware assist
11.12.1 Using Bloom Filters to compress prefix bitmaps
11.12.2 SAIL: Uncompressed Bitmaps up to a pivot level
11.13 Memory allocation in compressed schemes
11.13.1 Frame-based compaction
11.14 Fixed Function Lookup-chip models
11.15 Programmable Lookup Chips and P4
11.15.1 The P4 language
11.15.2 IP Lookups in the P4 Model
11.16 Conclusions
11.17 Exercises
12 Packet classification
12.1 Why packet classification?
12.2 Packet-classification problem
12.3 Requirements and metrics
12.4 Simple solutions
12.4.1 Linear search
12.4.2 Tuple space search
12.4.3 Caching
12.4.4 Demultiplexing algorithms
12.4.5 Passing labels
12.4.6 Content-addressable memories
12.5 Two-dimensional schemes
12.5.1 Fast searching using set-pruning tries
12.5.2 Reducing memory using backtracking
12.5.3 The best of both worlds: grid of tries
12.6 Approaches to general rule sets
12.6.1 Geometric view of classification
12.6.2 Beyond two dimensions: the bad news
12.6.3 Beyond two dimensions: the good news
12.7 Extending two-dimensional schemes
12.8 Using divide-and-conquer
12.9 Bit vector linear search
12.10 Cross-producting
12.11 Equivalenced cross-producting
12.12 Decision tree approaches
12.13 Hybrid algorithms
12.14 Conclusions
12.15 Exercises
13 Switching
13.1 Router versus telephone switches
13.2 Shared-memory switches
13.3 Router history: from buses to crossbars
13.4 The take-a-ticket crossbar scheduler
13.5 Head-of-line blocking
13.6 Avoiding HOL blocking via output queuing
13.7 Avoiding HOL blocking via virtual output queuing
13.8 Input-queued switching as a bipartite matching problem
13.9 Parallel iterative matching (PIM)
13.10 Avoiding randomization with iSLIP
13.10.1 iSLIP implementation notes
13.11 Computing near-optimal matchings via learning
13.12 Sample-and-compare: a stunningly simple adaptive algorithm
13.13 SERENA: an improved adaptive algorithm
13.13.1 Derive R(t) from the arrival graph
13.13.2 Merge R(t) with S(t-1)
13.14 The queue-proportional sampling strategy
13.15 QPS implementation
13.15.1 The QPS algorithm
13.15.2 The QPS data structure
13.16 Small-batch QPS and sliding-window QPS
13.16.1 Batch switching algorithms
13.16.2 The SB-QPS algorithm
13.16.3 The SW-QPS algorithm
13.17 Combined input and output queueing
13.18 Scaling to larger and faster switches
13.18.1 Measuring switch cost
Metric #1: the number of crosspoints
Metric #2: the complexity of matching computation
13.18.2 A divide-and-conquer approach to building large switches
13.18.3 Clos networks for medium-sized routers
Clos networks in telephony
Reducing the size of the middle stage
Clos networks and multichassis routers
13.18.4 Benes networks for larger routers
The Delta networks
13.18.5 Load-balanced switching
13.19 Scaling to faster link speeds
13.19.1 Using bit slicing for higher-speed fabrics
13.19.2 Using short links for higher-speed fabrics
13.19.3 Memory scaling using randomization
13.20 Conclusions
13.21 Exercises
14 Scheduling packets
14.1 Motivation for quality of service
14.2 Random early detection
14.3 Approximate fair dropping
14.4 Token bucket policing
14.5 Multiple outbound queues and priority
14.6 A quick detour into reservation protocols
14.7 Providing bandwidth guarantees
14.7.1 The parochial parcel service
14.7.2 Deficit round-robin
14.7.3 Implementation and extensions of deficit round-robin
Extensions of deficit round-robin
Hierarchical deficit round-robin
Deficit round-robin plus priority
14.8 Schedulers that provide delay guarantees
14.9 Generalized processor sharing
14.9.1 A simple example
14.9.2 Recursive definition of GPS virtual start and finish times
14.9.3 Another packet arrival scenario
14.9.4 Tracking the GPS clock
14.10 Weighted fair queueing
14.11 Worst-case fair weighted fair queueing
14.12 The data structure and algorithm for efficient GPS clock tracking
14.12.1 The staircase query problem
14.12.2 Augmented data structures
14.12.3 The "shape" data structure
The staircase representation of GPS graph
"Staircase query processing" by shape data structure
Detailed data structure design
Invariant maintenance under insertions and deletions
14.13 Implementing WFQ and WF2Q
14.14 Quick fair queueing (QFQ)
14.14.1 Fair round robin (FRR) algorithm
14.14.2 How QFQ improves upon FRR
14.15 Towards programmable packet scheduling
14.15.1 Push-In First-Out (PIFO) framework
14.15.2 Universal packet scheduler
14.16 Scalable fair queuing
14.16.1 Random aggregation
14.16.2 Edge aggregation
14.16.3 Edge aggregation with policing
14.17 Summary
14.18 Exercises
15 Routers as distributed systems
15.1 Internal flow control
15.1.1 Improving performance
15.1.2 Rescuing reliability
15.2 Internal Link Striping
15.2.1 Improving performance
15.2.2 Rescuing reliability
15.3 Distributed Memory
15.3.1 Improving performance
15.3.2 Rescuing reliability
15.4 Asynchronous updates
15.4.1 Improving performance
15.4.2 Rescuing reliability
15.5 Conclusions
15.6 Exercises
Part 4 Endgame
16 Measuring network traffic
16.1 Why measurement is hard
16.1.1 Why counting is hard
16.2 Reducing SRAM width using DRAM backing store
16.3 A randomized counter scheme
16.4 Maintaining active counters using BRICK
16.4.1 Motivation and design objectives
16.4.2 Overview of BRICK
16.4.3 Detailed design
Partition into subcounters
Indexing for efficiently locating a counter
Handling increments
16.5 Extending BRICK for maintaining associated states
16.6 Reducing counter width using approximate counting
16.7 Reducing counters using threshold aggregation
16.8 Reducing counters using flow counting
16.9 Reducing processing using sampled NetFlow
16.10 Reducing reporting using sampled charging
16.11 Correlating measurements using trajectory sampling
16.12 A concerted approach to accounting
16.13 Computing traffic matrices
16.13.1 Approach 1: Internet tomography
16.13.2 Approach 2: per-prefix counters
16.13.3 Approach 3: class counters
16.14 Sting as an example of passive measurement
16.15 Generating better traffic logs via data streaming
16.16 Counting the number of distinct flows
16.16.1 The min-hash algorithm
16.16.2 An extension of min-hash for estimating |A∪B|
16.16.3 Application to data mining
16.16.4 Bitmap sketch: a worthy alternative to min-hash
16.17 Detection of heavy hitters
16.18 Estimation of flow-size distribution
16.18.1 Motivation
16.18.2 A data streaming algorithm solution
16.19 The Tug-of-War algorithm for estimating F2
16.20 Conclusion
16.21 Exercises
17 Network security
17.1 Searching for multiple strings in packet payloads
17.1.1 Integrated string matching using Aho–Corasick
17.1.2 Integrated string matching using Boyer–Moore
17.2 Approximate string matching
17.3 IP traceback via probabilistic marking
17.4 IP traceback via logging
17.4.1 Bloom filters
17.4.2 Bloom filter implementation of packet logging
17.4.3 Scale to higher link speeds
17.5 Detecting worms
17.6 EarlyBird system for worm detection
17.7 Carousel: scalable logging for intrusion prevention systems
17.8 Conclusion
17.9 Exercises
18 Conclusions
18.1 What this book has been about
18.1.1 Endnode algorithmics
18.1.2 Router algorithmics
18.1.3 Toward a synthesis
18.2 What network algorithmics is about
18.2.1 Interdisciplinary thinking
18.2.2 Systems thinking
18.2.3 Algorithmic thinking
18.3 Network algorithmics and real products
18.4 Network algorithmics: back to the future
18.4.1 New abstractions
18.4.2 New connecting disciplines
18.4.3 New requirements
18.5 The inner life of a networking device
A Detailed models
A.1 TCP and IP
A.1.1 Transport protocols
A.1.2 Routing protocols
A.2 Hardware models
A.2.1 From transistors to logic gates
A.2.2 Timing delays
A.2.3 Hardware design building blocks
Programmable logic arrays and programmable array logics
Standard cells
A.2.4 Memories: the inside scoop
Registers
Static RAM
Dynamic RAM
A.2.5 Chip design
Interconnects, power, and packaging
A.3 Switching theory
A.3.1 Matching algorithms for Clos networks with k = n
A.4 The interconnection network Zoo
References
Index
Back Cover