RIFT Working Group                                     T. Przygienda, Ed.
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                               A. Sharma
Expires: September 2, 2018                                       Comcast
                                                                A. Atlas
                                                                J. Drake
                                                        Juniper Networks
                                                            Mar 01, 2018

                       RIFT: Routing in Fat Trees
                        draft-przygienda-rift-05

Abstract

   This document outlines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies.  The protocol (1) deals with
   automatic construction of fat-tree topologies based on detection of
   links, (2) minimizes the amount of routing state held at each level,
   (3) automatically prunes the topology distribution exchanges to a
   sufficient subset of links, (4) supports automatic disaggregation of
   prefixes on link and node failures to prevent black-holing and
   suboptimal routing, (5) allows traffic steering and re-routing
   policies, (6) allows non-ECMP forwarding, (7) automatically re-
   balances traffic towards the spines based on bandwidth available and
   ultimately (8) provides mechanisms to synchronize a limited key-value
   data-store that can be used after protocol convergence to e.g.
   bootstrap higher levels of functionality on nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 2, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reference Frame
     2.1.  Terminology
     2.2.  Topology
   3.  Requirement Considerations
   4.  RIFT: Routing in Fat Trees
     4.1.  Overview
     4.2.  Specification
       4.2.1.  Transport
       4.2.2.  Link (Neighbor) Discovery (LIE Exchange)
       4.2.3.  Topology Exchange (TIE Exchange)
         4.2.3.1.  Topology Information Elements
         4.2.3.2.  South- and Northbound Representation
         4.2.3.3.  Flooding
         4.2.3.4.  TIE Flooding Scopes
         4.2.3.5.  Initial and Periodic Database Synchronization
         4.2.3.6.  Purging
         4.2.3.7.  Southbound Default Route Origination
         4.2.3.8.  Optional Automatic Flooding Reduction and
                   Partitioning
       4.2.4.  Policy-Guided Prefixes
         4.2.4.1.  Ingress Filtering
         4.2.4.2.  Applying Policy
         4.2.4.3.  Store Policy-Guided Prefix for Route Computation
                   and Regeneration
         4.2.4.4.  Re-origination
         4.2.4.5.  Overlap with Disaggregated Prefixes
       4.2.5.  Reachability Computation
         4.2.5.1.  Northbound SPF
         4.2.5.2.  Southbound SPF
         4.2.5.3.  East-West Forwarding Within a Level
       4.2.6.  Attaching Prefixes
       4.2.7.  Attaching Policy-Guided Prefixes
       4.2.8.  Automatic Disaggregation on Link & Node Failures
       4.2.9.  Optional Autoconfiguration
         4.2.9.1.  Terminology
         4.2.9.2.  Automatic SystemID Selection
         4.2.9.3.  Generic Fabric Example
         4.2.9.4.  Level Determination Procedure
         4.2.9.5.  Resulting Topologies
       4.2.10. Stability Considerations
     4.3.  Further Mechanisms
       4.3.1.  Overload Bit
       4.3.2.  Optimized Route Computation on Leafs
       4.3.3.  Key/Value Store
         4.3.3.1.  Southbound
         4.3.3.2.  Northbound
       4.3.4.  Interactions with BFD
       4.3.5.  Fabric Bandwidth Balancing
         4.3.5.1.  Northbound Direction
         4.3.5.2.  Southbound Direction
       4.3.6.  Segment Routing Support with RIFT
         4.3.6.1.  Global Segment Identifiers Assignment
         4.3.6.2.  Distribution of Topology Information
       4.3.7.  Leaf to Leaf Procedures
       4.3.8.  Other End-to-End Services
       4.3.9.  Address Family and Multi Topology Considerations
       4.3.10. Reachability of Internal Nodes in the Fabric
       4.3.11. One-Hop Healing of Levels with East-West Links
   5.  Examples
     5.1.  Normal Operation
     5.2.  Leaf Link Failure
     5.3.  Partitioned Fabric
     5.4.  Northbound Partitioned Router and Optional East-West Links
   6.  Implementation and Operation: Further Details
     6.1.  Considerations for Leaf-Only Implementation
     6.2.  Adaptations to Other Proposed Data Center Topologies
     6.3.  Originating Non-Default Route Southbound
   7.  Security Considerations
   8.  Information Elements Schema
     8.1.  common.thrift
     8.2.  encoding.thrift
   9.  IANA Considerations
   10. Acknowledgments
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.
Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
   in today's networking, primarily as a result of the paradigm shift
   towards a centralized data-center architecture that is poised to
   deliver a majority of computation and storage services in the
   future.  Today's routing protocols were originally designed for
   networks with irregular topologies and a low degree of connectivity,
   but because they were the only available mechanisms, several
   attempts have been made to apply them to Clos networks.  Most
   successfully, BGP [RFC4271] [RFC7938] has been extended to this
   purpose, not so much due to any inherent suitability to the problem
   but rather because of the perceived ability to modify it "quicker"
   and the well-known difficulties of link-state [DIJKSTRA] based
   protocols in large, densely meshed topologies.

   Looking at the problem through the lens of its requirements, an
   optimal approach does not seem to be a simple modification of either
   a link-state (distributed computation) or distance-vector (diffused
   computation) approach, but rather a mixture of both, colloquially
   best described as "link-state towards the spine" and "distance
   vector towards the leafs".  In other words, "bottom" levels flood
   their link-state information in the "northern" direction while each
   switch, under normal conditions, generates a default route and
   floods it in the "southern" direction.  Obviously, such aggregation
   can blackhole traffic in cases of misconfiguration or failures and
   this has to be addressed somehow.

   For the visually oriented reader, Figure 1 presents a first
   simplified view of the resulting information and routes on a RIFT
   fabric.  The top of the fabric holds in its link-state database the
   nodes below it and routes to them.
In the second row of the
   database we indicate that partial information about other nodes in
   the same level is available as well; the details of how this is
   achieved are postponed for the moment.  When we look at the "bottom"
   of the fabric, we see that the topology of the leafs is basically
   empty and they only hold a load-balanced default route to the next
   level.

   The balance of this document details the resulting protocol and
   fills in the missing details.

.               [A,B,C,D]
.              [E]
.            +-----+      +-----+
.            |  E  |      |  F  |  A/32 @ A
.            +-+-+-+      +-+-+-+  B/32 @ B
.              | |          | |    C/32 @ C
.              | |    +-----+ |    D/32 @ D
.              | |    |       |
.              | +------+     |
.              |      | |     |
.  [A,B]     +-+---+  | | +---+-+ [A,B]
.  [D]       | C   +--+ +-+   D | [C]
.            +-+-+-+      +-+-+-+
. 0/0 @ [E,F]  | |          | |  0/0 @ [E,F]
. A/32 @ A     | |    +-----+ |    A/32 @ A
. B/32 @ B     | |    |       |    B/32 @ B
.              | +------+     |
.              |      | |     |
.            +-+---+  | | +---+-+
.            | A   +--+ +-+   B |
. 0/0 @ [C,D]+-----+      +-----+ 0/0 @ [C,D]

                Figure 1: RIFT information distribution

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Reference Frame

2.1.  Terminology

   This section presents the terminology used in this document.  It is
   assumed that the reader is thoroughly familiar with the terms and
   concepts used in OSPF [RFC2328] and IS-IS [RFC1142], [ISO10589], as
   well as the corresponding graph-theoretical concepts of shortest
   path first (SPF) [DIJKSTRA] computation and directed acyclic graphs
   (DAG).

   Level:  Clos and Fat Tree networks are trees and 'level' denotes the
      set of nodes at the same height in such a network, where the
      bottom level is level 0.  A node has links to nodes one level
      down and/or one level up.
Under some circumstances, a node may
      have links to nodes at the same level.  As a footnote: Clos
      terminology often uses the concept of "stage", but due to the
      folded nature of the Fat Tree we do not use it, to prevent
      misunderstandings.

   Spine/Aggregation/Edge Levels:  Traditional names for Level 2, 1 and
      0 respectively.  Level 0 is often called leaf as well.

   Point of Delivery (PoD):  A self-contained vertical slice of a Clos
      or Fat Tree network, normally containing only level 0 and level 1
      nodes.  It communicates with nodes in other PoDs via the spine.
      We number PoDs to distinguish them and use PoD #0 to denote the
      "undefined" PoD.

   Spine:  The set of nodes that provide inter-PoD communication.
      These nodes are also organized into levels (typically one, three,
      or five levels).  Spine nodes do not belong to any PoD and are
      assigned the PoD value 0 to indicate this.

   Leaf:  A node without southbound adjacencies.  Its level is 0
      (except in cases where it is deriving its level via ZTP and is
      running without LEAF_ONLY, which will be explained in
      Section 4.2.9).

   Connected Spine:  In case a spine level represents a connected graph
      (discounting links terminating at different levels), we call it a
      "connected spine"; in case a spine level consists of multiple
      partitions, we call it a "disconnected" or "partitioned" spine.
      In other terms, a spine without east-west links is disconnected
      and is the typical configuration for Clos and Fat Tree networks.

   South/Southbound and North/Northbound (Direction):  When describing
      protocol elements and procedures, we will use the directionality
      of the compass in different situations.  I.e., 'south' or
      'southbound' means moving towards the bottom of the Clos or Fat
      Tree network and 'north' and 'northbound' means moving towards
      the top of the Clos or Fat Tree network.
   Northbound Link:  A link to a node one level up or, in other words,
      one level further north.

   Southbound Link:  A link to a node one level down or, in other
      words, one level further south.

   East-West Link:  A link between two nodes at the same level.  East-
      west links are normally not part of Clos or "fat-tree"
      topologies.

   Leaf shortcuts (L2L):  East-west links at leaf level; they will need
      to be differentiated from east-west links at other levels.

   Southbound representation:  Information sent towards a lower level,
      representing only a limited amount of information.

   TIE:  This is an acronym for a "Topology Information Element".  TIEs
      are exchanged between RIFT nodes to describe parts of a network
      such as links and address prefixes.  A TIE can be thought of as
      largely equivalent to an IS-IS LSP or an OSPF LSA.  We will talk
      about N-TIEs when talking about TIEs in the northbound
      representation and S-TIEs for the southbound equivalent.

   Node TIE:  This is an acronym for a "Node Topology Information
      Element", largely equivalent to an OSPF Router LSA, i.e. it
      contains all neighbors the node discovered and information about
      the node itself.

   Prefix TIE:  This is an acronym for a "Prefix Topology Information
      Element"; it contains all prefixes directly attached to this node
      in case of an N-TIE, and in case of an S-TIE the necessary
      default and de-aggregated prefixes the node passes southbound.

   Policy-Guided Information:  Information that is passed in either the
      southbound or the northbound direction by means of diffusion and
      can be filtered via policies.  Policy-Guided Prefixes and KV TIEs
      are examples of Policy-Guided Information.

   Key Value TIE:  An S-TIE that is carrying a set of key value pairs
      [DYNAMO].  It can be used to distribute information in the
      southbound direction within the protocol.

   TIDE:  Topology Information Description Element, equivalent to CSNP
      in IS-IS.
   TIRE:  Topology Information Request Element, equivalent to PSNP in
      IS-IS.  It can both confirm received TIEs and request missing
      ones.

   PGP:  Policy-Guided Prefixes allow supporting traffic engineering
      that cannot be achieved by means of SPF computation or normal
      node and prefix S-TIE origination.  S-PGPs are propagated in the
      south direction only and N-PGPs strictly follow the northern
      direction.

   De-aggregation/Disaggregation:  Process in which a node decides to
      advertise certain prefixes it received in N-TIEs to prevent
      black-holing and suboptimal routing upon link failures.

   LIE:  This is an acronym for a "Link Information Element", largely
      equivalent to HELLOs in IGPs; LIEs are exchanged over all the
      links between systems running RIFT to form adjacencies.

   FL:  Flooding Leader for a specific system has a dedicated role to
      flood TIEs of that system.

   BAD:  This is an acronym for Bandwidth Adjusted Distance.  RIFT
      calculates the amount of northbound bandwidth available for a
      node compared to other nodes at the same level and adjusts the
      default route distance accordingly, to allow the lower level to
      weight its forwarding load balancing.

   Overloaded:  Applies to a node advertising the `overload` attribute
      as set.  The semantics closely follow the meaning of the same
      attribute in [RFC1142].

2.2.  Topology

.           +--------+          +--------+
.           |        |          |        |           ^ N
.           |Spine 21|          |Spine 22|           |
.Level 2    ++-+--+-++          ++-+--+-++         <-*-> E/W
.            | |  | |            | |  | |            |
.        P111/2|  |P121          | |  | |          S v
.            ^ ^  ^ ^            | |  | |
.            | |  | |            | |  | |
.   +--------------+ |  +-----------+ | |  | +---------------+
.   |              | |  |           | | |  |                 |
. South +-----------------------------+ |  |                 ^
.   |   |          | |  |           |   |  |                 | All TIEs
.  0/0  0/0        0/0  +-----------------------------+      |
.   v   v           v   |           |   |  |          |      |
.   |   |           +-+ +<-0/0----------+  |          |      |
.   |   |             | |           |   |  |          |      |
.+-+----++ optional +-+----++     ++----+-+         ++-----++
.|       | E/W link |       |     |       |         |       |
.|Node111+----------+Node112|     |Node121|         |Node122|
.+-+---+-+          ++----+-+     +-+---+-+         ++---+--+
.  |   |             | South       |   |              |  |
.  |   +---0/0--->-----+  0/0      |   +--------------+  |
. 0/0  |             |    |        |                  |  |
.  |   +---<-0/0-----+    v        |  +---------------+  |
.  v   |             |    |        |  |               |  |
.+-+---+-+        +--+--+-+      +-+---+-+        +---+-+-+
.|       | (L2L)  |       |      |       | Level 0|       |
.|Leaf111~~~~~~~~~~~~Leaf112|    |Leaf121|        |Leaf122|
.+-+-----+        +-+---+-+      +--+--+-+        +-+-----+
.   +                +    \      /   +                +
.   Prefix111        Prefix112\ /  Prefix121        Prefix122
.                       multi-homed
.                         Prefix
.+---------- Pod 1 ---------+    +---------- Pod 2 ---------+

             Figure 2: A two level spine-and-leaf topology

   We will use this topology (commonly called a fat tree/network in
   modern DC considerations [VAHDAT08], as a homonym to the original
   definition of the term [FATTREE]) in all further considerations.
   It depicts a generic "fat-tree" and the concepts explained here in
   three levels carry by induction to further levels and higher
   degrees of connectivity.  However, this document will also deal
   with designs that provide only sparser connectivity.

3.  Requirement Considerations

   [RFC7938] gives the original set of requirements, augmented here
   based upon recent experience in the operation of fat-tree networks.

   REQ1:   The control protocol should discover the physical links
           automatically and be able to detect cabling that violates
           fat-tree topology constraints.  It must react accordingly
           to such mis-cabling attempts, at a minimum preventing
           adjacencies between nodes from being formed and traffic
           from being forwarded on those mis-cabled links.  E.g.
           connecting a leaf to a spine at level 2 should be detected
           and ideally prevented.
   REQ2:   A node without any configuration besides default values
           should come up at the correct level in any PoD it is
           introduced into.  Optionally, it must be possible to
           configure nodes to restrict their participation to the
           PoD(s) targeted at any level.

   REQ3:   Optionally, the protocol should allow provisioning of data
           centers where the individual switches carry no
           configuration information and all derive their level from a
           "seed".  Observe that this requirement may collide with the
           desire to detect cabling misconfiguration, and with that
           only one of the requirements can be fully met in a chosen
           configuration mode.

   REQ4:   The solution should allow for minimum size routing
           information base and forwarding tables at leaf level for
           speed, cost and simplicity reasons.  Holding excessive
           amounts of information away from leaf nodes simplifies
           operation and lowers cost of the underlay.

   REQ5:   A very high degree of ECMP must be supported.  Maximum ECMP
           is currently understood as the most efficient routing
           approach to maximize the throughput of switching fabrics
           [MAKSIC2013].

   REQ6:   Non equal cost anycast must be supported to allow for easy
           and robust multi-homing of services without regressing to
           careful balancing of link costs.

   REQ7:   Traffic engineering should be allowed by modification of
           prefixes and/or their next-hops.

   REQ8:   The solution should allow for access to link states of the
           whole topology to enable efficient support for modern
           control architectures like SPRING [RFC7855] or PCE
           [RFC4655].

   REQ9:   The solution should easily accommodate opaque data to be
           carried throughout the topology to subsets of nodes.  This
           can be used for many purposes, one of them being a
           key-value store that allows bootstrapping of nodes right at
           the time of topology discovery.
   REQ10:  Nodes should be taken out of and introduced into production
           with minimum wait-times and a minimum of "shaking" of the
           network, i.e. the radius of propagation (often called
           "blast radius") of changed information should be as small
           as feasible.

   REQ11:  The protocol should allow for maximum aggregation of
           carried routing information while at the same time
           automatically de-aggregating the prefixes to prevent
           black-holing in case of failures.  The de-aggregation
           should support the maximum possible ECMP/N-ECMP remaining
           after failure.

   REQ12:  Reducing the scope of communication needed throughout the
           network on link and state failure, as well as reducing
           advertisements of repeating, idiomatic or policy-guided
           information in stable state, is highly desirable since it
           leads to better stability and faster convergence behavior.

   REQ13:  Once a packet traverses a link in a "southbound" direction,
           it must not take any further "northbound" steps along its
           path to delivery to its destination under normal
           conditions.  Taking a path through the spine in cases where
           a shorter path is available is highly undesirable.

   REQ14:  Parallel links between the same set of nodes must be
           distinguishable for SPF, failure and traffic engineering
           purposes.

   REQ15:  The protocol must not rely on interfaces having discernible
           unique addresses, i.e. it must operate in the presence of
           unnumbered links (even parallel ones) or links of a single
           node having the same addresses.

   REQ16:  It would be desirable to achieve fast re-balancing of flows
           when links, especially towards the spines, are lost or
           provisioned, without regressing to per-flow traffic
           engineering, which introduces a significant amount of
           complexity while possibly not being reactive enough to
           account for short-lived flows.
   The following list represents possible requirements and
   requirements under discussion:

   PEND1:   Supporting anything but point-to-point links is a non-
            requirement.  Questions remain: for connecting to the
            leaves, is there a case where multipoint is desirable?
            One could still model it as point-to-point links; it seems
            there is no need for anything more than an NBMA-type
            construct.

   PEND2:   What is the maximum scale of the number of leaf prefixes
            we need to carry?  Is 500'000 enough?

   Finally, the following are the non-requirements:

   NONREQ1:  Broadcast media support is unnecessary.

   NONREQ2:  Purging is unnecessary given its fragility and complexity
             and today's large memory size on even modest switches and
             routers.

   NONREQ3:  Special support for layer 3 multi-hop adjacencies is not
             part of the protocol specification.  Such support can be
             easily provided by using tunneling technologies, the same
             way IGPs today solve the problem.

4.  RIFT: Routing in Fat Trees

   Derived from the above requirements, we present a detailed outline
   of a protocol optimized for Routing in Fat Trees (RIFT) that in
   most abstract terms has many properties of a modified link-state
   protocol [RFC2328][RFC1142] when "pointing north" and of a
   path-vector [RFC4271] protocol when "pointing south".  Albeit an
   unusual combination, it does quite naturally exhibit the desirable
   properties we seek.

4.1.  Overview

   The singular property of RIFT is that it floods northbound "flat"
   link-state information so that each level understands the full
   topology of the levels south of it.  In contrast, in the southbound
   direction the protocol operates like a path-vector protocol, or
   rather a distance-vector with implicit split horizon, since the
   topology constraints make a diffused computation front propagating
   in all directions unnecessary.
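   The resulting per-level routing state can be sketched as follows.
   This is an illustrative sketch only, not part of the specification;
   the function and variable names are hypothetical and the data is
   taken from Figure 1.

```python
# Illustrative sketch (not normative): the routing state that results
# at different levels of the fabric in Figure 1.

def spine_routes(leaf_prefixes):
    # Top-of-fabric nodes learn the full southern topology northbound,
    # so they hold a specific route towards every leaf prefix.
    return {prefix: owner for prefix, owner in leaf_prefixes.items()}

def leaf_routes(northbound_neighbors):
    # Leaves hold essentially no topology, only a default route
    # load-balanced over all northbound next-hops.
    return {"0/0": sorted(northbound_neighbors)}

prefixes = {"A/32": "A", "B/32": "B", "C/32": "C", "D/32": "D"}
assert spine_routes(prefixes)["A/32"] == "A"
assert leaf_routes({"C", "D"}) == {"0/0": ["C", "D"]}
```

   Note how the leaf's table stays constant in size regardless of
   fabric width, which is exactly the property REQ4 asks for.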
   To account for the "northern" and the "southern" information split,
   the link-state database is partitioned into "north representation"
   and "south representation" TIEs, whereas in simplest terms the
   N-TIEs contain a link-state topology description of lower levels
   and S-TIEs simply carry default routes.  This oversimplified view
   will be refined gradually in the following sections while
   introducing protocol procedures aimed to fulfill the described
   requirements.

4.2.  Specification

4.2.1.  Transport

   All protocol elements are carried over UDP.  Once QUIC [QUIC]
   achieves the desired stability in deployments, it may prove a
   valuable candidate for TIE transport.

   All packet formats are defined in Thrift models in Section 8.

   Future versions may include a [PROTOBUF] schema.

4.2.2.  Link (Neighbor) Discovery (LIE Exchange)

   LIE exchange happens over a well-known administratively locally
   scoped IPv4 multicast address [RFC2365] or link-local multicast
   scope for IPv6 [RFC4291] and SHOULD be sent with a TTL of 1 to
   prevent RIFT information from reaching beyond a single L3 next-hop
   in the topology.  LIEs are exchanged over all links running RIFT.

   Unless Section 4.2.9 is used, each node is provisioned with the
   level at which it is operating and its PoD (or otherwise a default
   level and "undefined" PoD are assumed, meaning that leafs do not
   need to be configured at all).  Nodes in the spine are configured
   with an "undefined" PoD.  This information is propagated in the
   LIEs exchanged.

   A node tries to form a three-way adjacency if and only if
   (definitions of LEAF_ONLY are found in Section 4.2.9)

   1.  the node is in the same PoD or either the node or the neighbor
       advertises "undefined" PoD membership (PoD# = 0) AND

   2.  the neighboring node is running the same MAJOR schema version
       AND

   3.
the neighbor is not a member of some PoD while the node has a
       northbound adjacency already joining another PoD AND

   4.  the neighboring node uses a valid System ID AND

   5.  the neighboring node uses a different System ID than the node
       itself AND

   6.  the advertised MTUs match on both sides AND

   7.  both nodes advertise defined level values AND

   8.  [

       i)    the node is at level 0 and has no three-way adjacencies
             already to nodes with a level higher than the neighboring
             node OR

       ii)   the neighboring node is at level 0 OR

       iii)  both nodes are at level 0 AND both indicate support for
             Section 4.3.7 OR

       iv)   neither node is at level 0 and the neighboring node is at
             most one level away

       ].

   The rule in Paragraph 3 MAY be optionally disregarded by a node if
   PoD detection is undesirable or has to be disregarded.

   A node configured with "undefined" PoD membership MUST, after
   building its first northbound adjacency making it a participant in
   a PoD, advertise that PoD as part of its LIEs.

   LIEs arriving with a TTL larger than 1 MUST be ignored.

   A node SHOULD NOT send out LIEs without a defined level in the
   header, but in certain scenarios it may be beneficial for trouble-
   shooting purposes.

   LIE exchange uses a three-way handshake mechanism [RFC5303].
   Precise finite state machines will be provided in later versions of
   this specification.  LIE packets contain nonces and may contain an
   SHA-1 [RFC6234] hash over the nonces and some of the LIE data,
   which prevents corruption and replay attacks.  TIE flooding reuses
   those nonces to prevent mismatches and can use those for security
   purposes in case it is using QUIC [QUIC].  Section 7 will address
   the precise security mechanisms in the future.

4.2.3.  Topology Exchange (TIE Exchange)

4.2.3.1.
Topology Information Elements 624 Topology and reachability information in RIFT is conveyed by 625 means of TIEs which have a good amount of commonality with LSAs in 626 OSPF. 628 The TIE exchange mechanism uses the port indicated by each node in the LIE 629 exchange and the interface on which the adjacency has been formed as 630 destination. It SHOULD use a TTL of 1 as well. 632 TIEs contain sequence numbers, lifetimes and a type. Each type has a 633 large identifying number space and information is spread across 634 possibly many TIEs of a certain type by means of a hash function 635 that a node or deployment can individually determine. One extreme 636 point of the design space is a prefix per TIE, which leads to BGP-like 637 behavior, vs. dense packing into few TIEs, leading to a more traditional 638 IGP trade-off. An implementation may even rehash at 639 the cost of a significant amount of TIE re-advertisement. 641 More information about the TIE structure can be found in the schema 642 in Section 8. 644 4.2.3.2. South- and Northbound Representation 646 As a central concept to RIFT, each node represents itself differently 647 depending on the direction in which it is advertising information. 648 More precisely, a spine node presents two different databases to 649 its neighbors depending on whether it advertises TIEs to the north or to 650 the south/sideways. We call those differing TIE databases either 651 south- or northbound (S-TIEs and N-TIEs) depending on the direction 652 of distribution. 654 The N-TIEs hold all of the node's adjacencies, local prefixes and 655 northbound policy-guided prefixes while the S-TIEs hold only 656 the node's adjacencies and the default prefix with necessary 657 disaggregated prefixes and southbound policy-guided prefixes. We 658 will explain this in detail further in Section 4.2.8 and 659 Section 4.2.4.
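As a non-normative illustration of this split, the differing per-direction content can be sketched in Python. The dictionary fields ("adjacencies", "local_prefixes", "north_pgps", "south_pgps") are assumptions for this sketch only and do not follow the normative Thrift schema in Section 8.

```python
# Non-normative sketch: how a node's advertised database differs by
# direction. Field names are illustrative assumptions, not the schema.

DEFAULT_PREFIXES = ["0/0", "::/0"]

def north_tie_content(node):
    # N-TIEs: all adjacencies plus the node's local prefixes and
    # northbound policy-guided prefixes.
    return {"adjacencies": list(node["adjacencies"]),
            "prefixes": node["local_prefixes"] + node.get("north_pgps", [])}

def south_tie_content(node, disaggregated=()):
    # S-TIEs: all adjacencies, but only the default prefix plus any
    # necessary disaggregated prefixes and southbound PGPs.
    return {"adjacencies": list(node["adjacencies"]),
            "prefixes": DEFAULT_PREFIXES + list(disaggregated)
                        + node.get("south_pgps", [])}
```

Observe that the adjacency set is identical in both directions; only the prefix content changes with the direction of distribution.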
661 The TIE types are symmetric in both directions and Table 1 provides a 662 quick reference to the different TIE types including direction and 663 their function. 665 +----------+--------------------------------------------------------+ 666 | TIE-Type | Content | 667 +----------+--------------------------------------------------------+ 668 | node | node properties, adjacencies and information helping | 669 | N-TIE | in complex disaggregation scenarios | 670 +----------+--------------------------------------------------------+ 671 | node | same content as node N-TIE except the information to | 672 | S-TIE | help disaggregation | 673 +----------+--------------------------------------------------------+ 674 | Prefix | contains nodes' directly reachable prefixes | 675 | N-TIE | | 676 +----------+--------------------------------------------------------+ 677 | Prefix | contains originated defaults and de-aggregated | 678 | S-TIE | prefixes | 679 +----------+--------------------------------------------------------+ 680 | PGP | contains nodes' north PGPs | 681 | N-TIE | | 682 +----------+--------------------------------------------------------+ 683 | PGP | contains nodes' south PGPs | 684 | S-TIE | | 685 +----------+--------------------------------------------------------+ 686 | KV | contains nodes' northbound KVs | 687 | N-TIE | | 688 +----------+--------------------------------------------------------+ 689 | KV | contains nodes' southbound KVs | 690 | S-TIE | | 691 +----------+--------------------------------------------------------+ 693 Table 1: TIE Types 695 As an example illustrating databases holding both representations, 696 consider the topology in Figure 2 with the optional link between node 697 111 and node 112 (so that the flooding on an east-west link can be 698 shown). This example assumes unnumbered interfaces. First, here are 699 the TIEs generated by some nodes.
For simplicity, the key value 700 elements and the PGP elements which may be included in their S-TIEs 701 or N-TIEs are not shown. 703 Spine21 S-TIEs: 704 Node S-TIE: 705 NodeElement(layer=2, neighbors((Node111, layer 1, cost 1), 706 (Node112, layer 1, cost 1), (Node121, layer 1, cost 1), 707 (Node122, layer 1, cost 1))) 708 Prefix S-TIE: 709 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 711 Node111 S-TIEs: 713 Node S-TIE: 714 NodeElement(layer=1, neighbors((Spine21, layer 2, cost 1, links(...)), 715 (Spine22, layer 2, cost 1, links(...)), 716 (Node112, layer 1, cost 1, links(...)), 717 (Leaf111, layer 0, cost 1, links(...)), 718 (Leaf112, layer 0, cost 1, links(...)))) 719 Prefix S-TIE: 720 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 722 Node111 N-TIEs: 723 Node N-TIE: 724 NodeElement(layer=1, 725 neighbors((Spine21, layer 2, cost 1, links(...)), 726 (Spine22, layer 2, cost 1, links(...)), 727 (Node112, layer 1, cost 1, links(...)), 728 (Leaf111, layer 0, cost 1, links(...)), 729 (Leaf112, layer 0, cost 1, links(...)))) 730 Prefix N-TIE: 731 NorthPrefixesElement(prefixes(Node111.loopback)) 733 Node121 S-TIEs: 734 Node S-TIE: 735 NodeElement(layer=1, neighbors((Spine21,layer 2,cost 1), 736 (Spine22, layer 2, cost 1), (Leaf121, layer 0, cost 1), 737 (Leaf122, layer 0, cost 1))) 738 Prefix S-TIE: 739 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 741 Node121 N-TIEs: 742 Node N-TIE: 743 NodeLinkElement(layer=1, 744 neighbors((Spine21, layer 2, cost 1, links(...)), 745 (Spine22, layer 2, cost 1, links(...)), 746 (Leaf121, layer 0, cost 1, links(...)), 747 (Leaf122, layer 0, cost 1, links(...)))) 748 Prefix N-TIE: 749 NorthPrefixesElement(prefixes(Node121.loopback)) 751 Leaf112 N-TIEs: 752 Node N-TIE: 753 NodeLinkElement(layer=0, 754 neighbors((Node111, layer 1, cost 1, links(...)), 755 (Node112, layer 1, cost 1, links(...)))) 756 Prefix N-TIE: 757 NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112, 758 Prefix_MH)) 760
Figure 3: example TIEs generated in a 2 level spine-and-leaf topology 762 4.2.3.3. Flooding 764 The mechanism used to distribute TIEs is the well-known (albeit 765 modified in several respects to address fat tree requirements) 766 flooding mechanism used by today's link-state protocols. Albeit 767 initially more demanding to implement, it avoids many problems of the 768 diffused computation update style used by path vector protocols. As described 769 before, TIEs themselves are transported over UDP with the ports 770 indicated in the LIE exchanges and using the destination address (for 771 unnumbered IPv4 interfaces the same considerations apply as in the equivalent 772 OSPF case) on which the LIE adjacency has been formed. 774 On reception of a TIE with an undefined level value in the packet 775 header the node SHOULD issue a warning and indiscriminately discard 776 the packet. 778 Precise finite state machines and procedures will be provided in 779 later versions of this specification. 781 4.2.3.4. TIE Flooding Scopes 783 In a somewhat analogous fashion to link-local, area and domain 784 flooding scopes, RIFT defines several complex "flooding scopes" 785 depending on the direction and type of TIE propagated. 787 Every N-TIE is flooded northbound, providing a node at a given level 788 with the complete topology of the Clos or Fat Tree network underneath 789 it, including all specific prefixes. This means that a packet 790 received from a node at the same or lower level whose destination is 791 covered by one of those specific prefixes may be routed directly 792 towards the node advertising that prefix rather than sending the 793 packet to a node at a higher level. 795 A node's node S-TIEs, consisting of all the node's adjacencies and prefix 796 S-TIEs with default IP prefix and disaggregated prefixes, are flooded 797 southbound in order to allow the nodes one level down to see 798 connectivity of the higher level as well as reachability to the rest 799 of the fabric.
In order to allow an E-W disconnected node in a given 800 level to receive the S-TIEs of other nodes at its level, every *NODE* 801 S-TIE is "reflected" northbound to the level from which it was received. 802 It should be noted that east-west links are included in South TIE 803 flooding; those TIEs need to be flooded to satisfy algorithms in 804 Section 4.2.5. In that way nodes at the same level can learn about each 805 other without a lower level, e.g. in case of the leaf level. The precise 806 flooding scopes are given in Table 2. Those rules also govern 807 what SHOULD be included in TIDEs towards neighbors. East-West 808 flooding scopes are identical to South flooding scopes. 810 Node S-TIE "reflection" makes it possible to support disaggregation on failures 811 described in Section 4.2.8 and flooding reduction in Section 4.2.3.8. 813 +--------------+----------------------------+-----------------------+ 814 | Packet Type | South | North | 815 | vs. Peer | | | 816 | Direction | | | 817 +--------------+----------------------------+-----------------------+ 818 | node S-TIE | flood self-originated only | flood if TIE | 819 | | | originator's level is | 820 | | | higher than own level | 821 +--------------+----------------------------+-----------------------+ 822 | non-node | flood self-originated only | flood only if TIE | 823 | S-TIE | | originator is equal | 824 | | | peer | 825 +--------------+----------------------------+-----------------------+ 826 | all N-TIEs | never flood | flood always | 827 +--------------+----------------------------+-----------------------+ 828 | TIDE | include TIEs in flooding | include TIEs in | 829 | | scope | flooding scope | 830 +--------------+----------------------------+-----------------------+ 831 | TIRE | include all N-TIEs and all | include only if TIE | 832 | | peer's self-originated | originator is equal | 833 | | TIEs and all node S-TIEs | peer | 834 +--------------+----------------------------+-----------------------+ 836 Table 2: Flooding
Scopes 838 As an example to illustrate these rules, consider using the topology 839 in Figure 2, with the optional link between node 111 and node 112, 840 and the associated TIEs given in Figure 3. The flooding from 841 particular nodes of the TIEs is given in Table 3. 843 +------------+----------+-------------------------------------------+ 844 | Router | Neighbor | TIEs | 845 | floods to | | | 846 +------------+----------+-------------------------------------------+ 847 | Leaf111 | Node112 | Leaf111 N-TIEs, Node111 node S-TIE | 848 | Leaf111 | Node111 | Leaf111 N-TIEs, Node112 node S-TIE | 849 | | | | 850 | Node111 | Leaf111 | Node111 S-TIEs | 851 | Node111 | Leaf112 | Node111 S-TIEs | 852 | Node111 | Node112 | Node111 S-TIEs | 853 | Node111 | Spine21 | Node111 N-TIEs, Leaf111 N-TIEs, Leaf112 | 854 | | | N-TIEs, Spine22 node S-TIE | 855 | Node111 | Spine22 | Node111 N-TIEs, Leaf111 N-TIEs, Leaf112 | 856 | | | N-TIEs, Spine21 node S-TIE | 857 | | | | 858 | ... | ... | ... | 859 | Spine21 | Node111 | Spine21 S-TIEs | 860 | Spine21 | Node112 | Spine21 S-TIEs | 861 | Spine21 | Node121 | Spine21 S-TIEs | 862 | Spine21 | Node122 | Spine21 S-TIEs | 863 | ... | ... | ... | 864 +------------+----------+-------------------------------------------+ 866 Table 3: Flooding some TIEs from example topology 868 4.2.3.5. Initial and Periodic Database Synchronization 870 The initial exchange of RIFT is modeled after ISIS with TIDE being 871 equivalent to CSNP and TIRE playing the role of PSNP. The content of 872 TIDEs and TIREs is governed by Table 2. 874 4.2.3.6. Purging 876 RIFT does not purge information that has been distributed by the 877 protocol. Purging mechanisms in other routing protocols have proven 878 to be complex and fragile over many years of experience. Abundant 879 amounts of memory are available today even on low-end platforms. 
The 880 information will age out and all computations will deliver correct 881 results if a node leaves the network due to the new information 882 distributed by its adjacent nodes. 884 Once a RIFT node issues a TIE with an ID, it MUST preserve the ID as 885 long as feasible (also when the protocol restarts), even if the TIE 886 loses all content. The re-advertisement of an empty TIE fulfills the 887 purpose of purging any information advertised in previous versions. 888 The originator is free to not re-originate the according empty TIE 889 again or originate an empty TIE with a relatively short lifetime to 890 prevent a large number of long-lived empty stubs from polluting the network. 892 Each node will time out and clean up the according empty TIEs 893 independently. 895 Upon restart a node MUST, as any link-state implementation, be 896 prepared to receive TIEs with its own system ID and supersede them 897 with equivalent, newly generated, empty TIEs with a higher sequence 898 number. As above, the lifetime can be relatively short since it only 899 needs to exceed the necessary propagation and processing delay by all 900 the nodes that are within the TIE's flooding scope. 902 4.2.3.7. Southbound Default Route Origination 904 Under certain conditions nodes issue a default route in their South 905 Prefix TIEs with metrics as computed in Section 4.3.5.1. 907 A node X that 909 1. is NOT overloaded AND 911 2. has southbound or east-west adjacencies 913 originates in its south prefix TIE such a default route IIF 915 1. all other nodes at X's level are overloaded OR 917 2. all other nodes at X's level have NO northbound adjacencies OR 919 3. X has computed reachability to a default route during N-SPF. 921 The term "all other nodes at X's level" obviously describes just the 922 nodes at the same level in the POD with a viable lower layer 923 (otherwise the node S-TIEs cannot be reflected and the nodes in e.g. 924 POD 1 and POD 2 are "invisible" to each other).
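The origination rule above can be condensed into a small predicate. This is a non-normative sketch; the dictionary fields are assumptions, and "others" stands for the other visible nodes at X's level per the reflection caveat just described.

```python
# Non-normative sketch of the southbound default origination rule.
# Node records are dicts with assumed field names.

def originates_sb_default(x, others):
    # Preconditions: X is NOT overloaded AND has southbound or
    # east-west adjacencies.
    if x["overloaded"] or not x["south_or_ew_adjacencies"]:
        return False
    # IIF: all others are overloaded, OR all others have no northbound
    # adjacencies, OR X computed a default route during N-SPF.
    return (all(o["overloaded"] for o in others)
            or all(not o["north_adjacencies"] for o in others)
            or x["nspf_default_reachable"])
```

A node for which the predicate holds while `nspf_default_reachable` is false would additionally install the default discard route required in the following paragraph.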
926 A node originating a southbound default route MUST install a default 927 discard route if it did not compute a default route during N-SPF. 929 4.2.3.8. Optional Automatic Flooding Reduction and Partitioning 931 Several nodes can, but strictly only under conditions defined below, 932 run a hashing function based on TIE originator value and partition 933 flooding between them. 935 Steps for flooding reduction and partitioning: 937 1. select all nodes in the same level for which 939 A. node S-TIEs have been received AND 940 B. which have precisely the same non-empty sets of respectively 941 north and south neighbor adjacencies AND 943 C. have at least one shared southern neighbor including backlink 944 verification AND 946 D. support flooding reduction (overload bits are ignored) 948 and then 950 2. run on the chosen set a hash algorithm using nodes' flood 951 priorities and IDs to select flooding leader and backup per TIE 952 originator ID, i.e. each node immediately floods to all 953 its necessary neighbors the TIEs that it received with an originator 954 ID that makes it the flooding leader or backup for this 955 originator. The preference (higher is better) is computed as 956 XOR(TIE-ORIGINATOR-ID<<1, ~OWN-SYSTEM-ID), where << is a 957 non-circular shift and ~ is bit-wise NOT. 959 3. In the very unlikely case of hash collisions on either of the two 960 nodes with the highest values (i.e. either does NOT produce unique 961 hashes as compared to all other hash values), the node running 962 the election does not attempt to reduce flooding. 964 Additional rules for flooding reduction and partitioning: 966 1. A node always floods its own TIEs. 968 2. A node generates TIDEs as usual but when receiving TIREs with 969 requests for TIEs for a node for which it is not a flooding 970 leader or backup it ignores such requests the first time only.
971 Normally, the flooding leader should satisfy the requestor and 972 with that no further TIREs for such TIEs will be generated. 973 Otherwise, the next set of TIDEs and TIREs will lead to flooding 974 independent of the flooding leader status. 976 3. A node receiving a TIE originated by a node for which it is not a 977 flooding leader floods such TIEs only when receiving an out-of- 978 date TIDE for them, except for the first one. 980 The mechanism can be implemented optionally in each node. The 981 capability is carried in the node S-TIE (and for symmetry purposes in 982 the node N-TIE as well but it serves no purpose there currently). 984 Obviously flooding reduction does NOT apply to self-originated TIEs. 985 Observe further that all policy-guided information consists of self- 986 originated TIEs. 988 4.2.4. Policy-Guided Prefixes 990 In a fat tree, it can sometimes be desirable to guide traffic to 991 particular destinations or keep specific flows to certain paths. In 992 RIFT, this is done by using policy-guided prefixes with their 993 associated communities. Each community is an abstract value whose 994 meaning is determined by configuration. It is assumed that the 995 fabric is under a single administrative control so that the meaning 996 and intent of the communities is understood by all the nodes in the 997 fabric. Any node can originate a policy-guided prefix. 999 Since RIFT uses distance vector concepts in a southbound direction, 1000 it is straightforward to add a policy-guided prefix to an S-TIE. For 1001 easier troubleshooting, the approach taken in RIFT is that a node's 1002 southbound policy-guided prefixes are sent in its S-TIE and the 1003 receiver does inbound filtering based on the associated communities 1004 (an egress policy is imaginable but would possibly lead to different S-TIEs 1005 per neighbor, which is not considered in RIFT protocol 1006 procedures). A southbound policy-guided prefix can only use links in 1007 the south direction.
If a PGP S-TIE is received on an east-west or 1008 northbound link, it must be discarded by ingress filtering. 1010 Conceptually, a southbound policy-guided prefix guides traffic from 1011 the leaves up to at most the north-most layer. It is also necessary 1012 to have northbound policy-guided prefixes to guide traffic from 1013 the north-most layer down to the appropriate leaves. Therefore, RIFT 1014 includes northbound policy-guided prefixes in its N PGP-TIE and the 1015 receiver does inbound filtering based on the associated communities. 1016 A northbound policy-guided prefix can only use links in the northern 1017 direction. If an N PGP TIE is received on an east-west or southbound 1018 link, it must be discarded by ingress filtering. 1020 By separating southbound and northbound policy-guided prefixes and 1021 requiring that the cost associated with a PGP is strictly 1022 monotonically increasing at each hop, the path cannot loop. Because 1023 the costs are strictly increasing, it is not possible to have a loop 1024 between a northbound PGP and a southbound PGP. If east-west links 1025 were to be allowed, then looping could occur and issues such as 1026 counting to infinity would have to be solved. If complete 1027 generality of path is required - such as including east-west links and using both 1028 north and south links in arbitrary sequence - then a Path Vector 1029 protocol or a similar solution must be considered. 1031 If a node has received the same prefix, after ingress filtering, as a 1032 PGP in an S-TIE and in an N-TIE, then the node determines which 1033 policy-guided prefix to use based upon the advertised cost. 1035 A policy-guided prefix is always preferred to a regular prefix, even 1036 if the policy-guided prefix has a larger cost. Section 8 provides 1037 normative indication of prefix preferences.
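The preference rule can be sketched as a comparison key: a policy-guided prefix wins over a regular prefix regardless of cost, and cost breaks ties only within a type. The type names and their relative rank below are illustrative assumptions; Section 8 remains normative.

```python
# Non-normative sketch of prefix preference. The rank table is an
# assumption for illustration; Section 8 is normative.

TYPE_RANK = {"PolicyGuided": 0, "DistVector": 1}

def better_route(a, b):
    # a and b are (type, cost) tuples; the lexicographic key compares
    # type rank first and uses cost only as a tie-breaker within a type.
    return min(a, b, key=lambda r: (TYPE_RANK[r[0]], r[1]))
```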
1039 The set of policy-guided prefixes received in a TIE is subject to 1040 ingress filtering and then re-originated to be sent out in the 1041 receiver's appropriate TIE. Both the ingress filtering and the re- 1042 origination use the communities associated with the policy-guided 1043 prefixes to determine the correct behavior. The cost on re- 1044 advertisement MUST increase in a strictly monotonic fashion. 1046 4.2.4.1. Ingress Filtering 1048 When a node X receives a PGP S-TIE or a PGP N-TIE that is originated 1049 from a node Y which does not have an adjacency with X, all PGPs in 1050 such a TIE MUST be filtered. Similarly, if node Y is at the same 1051 layer as node X, then X MUST filter out PGPs in such S- and N-TIEs to 1052 prevent loops. 1054 Next, policy can be applied to determine which policy-guided prefixes 1055 to accept. Since ingress filtering is chosen rather than egress 1056 filtering and per-neighbor PGPs, policy that applies to links is done 1057 at the receiver. Because the RIFT adjacency is between nodes and 1058 there may be parallel links between the two nodes, the policy-guided 1059 prefix is considered to start with the next-hop set that has all 1060 links to the originating node Y. 1062 A policy-guided prefix has or is assigned the following attributes: 1064 cost: This is initialized to the cost received. 1066 community_list: This is initialized to the list of the communities 1067 received. 1069 next_hop_set: This is initialized to the set of links to the 1070 originating node Y. 1072 4.2.4.2. Applying Policy 1074 The specific action to apply based upon a community is deployment 1075 specific. Here are some examples of things that can be done with 1076 communities. A community is a 64-bit number and it 1077 can be written as a single field M or as a multi-field (S = M[0-31], 1078 T = M[32-63]) in these examples.
For simplicity, the policy-guided 1079 prefix is referred to as P, the processing node as X and the 1080 originator as Y. 1082 Prune Next-Hops: Community Required: For each next-hop in 1083 P.next_hop_set, if the next-hop does not have the community, prune 1084 that next-hop from P.next_hop_set. 1086 Prune Next-Hops: Avoid Community: For each next-hop in 1087 P.next_hop_set, if the next-hop has the community, prune that 1088 next-hop from P.next_hop_set. 1090 Drop if Community: If node X has community M, discard P. 1092 Drop if not Community: If node X does not have the community M, 1093 discard P. 1095 Prune to ifIndex T: For each next-hop in P.next_hop_set, if the 1096 next-hop's ifIndex is not the value T specified in the community 1097 (S,T), then prune that next-hop from P.next_hop_set. 1099 Add Cost T: For each appearance of community S in P.community_list, 1100 if the node X has community S, then add T to P.cost. 1102 Accumulate Min-BW T: Let bw be the sum of the bandwidth for 1103 P.next_hop_set. If that sum is less than T, then replace (S,T) 1104 with (S, bw). 1106 Add Community T if Node matches S: If the node X has community S, 1107 then add community T to P.community_list. 1109 4.2.4.3. Store Policy-Guided Prefix for Route Computation and 1110 Regeneration 1112 Once a policy-guided prefix has completed ingress filtering and 1113 policy, it is almost ready to store and use. It is still necessary 1114 to adjust the cost of the prefix to account for the link from the 1115 computing node X to the originating neighbor node Y. 1117 There are three different policies that can be used: 1119 Minimum Equal-Cost: Find the lowest cost C next-hops in 1120 P.next_hop_set and prune to those. Add C to P.cost. 1122 Minimum Unequal-Cost: Find the lowest cost C next-hop in 1123 P.next_hop_set. Add C to P.cost. 1125 Maximum Unequal-Cost: Find the highest cost C next-hop in 1126 P.next_hop_set. Add C to P.cost. 
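The three cost-adjustment policies can be sketched as follows. This is a non-normative illustration in which a next-hop is represented as an assumed (link, cost) pair.

```python
# Non-normative sketch of the three cost-adjustment policies applied
# before storing a PGP. A next-hop is an assumed (link, cost) pair.

def adjust_pgp(cost, next_hop_set, policy="min_unequal"):
    link_costs = [c for (_link, c) in next_hop_set]
    if policy == "min_equal":
        # Minimum Equal-Cost: prune to the lowest-cost next-hops,
        # add that cost C.
        c = min(link_costs)
        next_hop_set = {nh for nh in next_hop_set if nh[1] == c}
    elif policy == "min_unequal":
        c = min(link_costs)   # keep all next-hops, add minimum cost C
    else:                     # "max_unequal"
        c = max(link_costs)   # keep all next-hops, add maximum cost C
    return cost + c, next_hop_set
```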
1128 The default policy is Minimum Unequal-Cost but well-known communities 1129 can be defined to get the other behaviors. 1131 Regardless of the policy used, a node MUST store a PGP cost that is 1132 at least 1 greater than the PGP cost received. This enforces the 1133 strictly monotonically increasing condition that avoids loops. 1135 Two databases of PGPs - one from N-TIEs and one from S-TIEs - are stored. When 1136 a PGP is inserted into the appropriate database, the usual tie- 1137 breaking on cost is performed. Observe that the node retains all PGP 1138 TIEs due to normal flooding behavior and hence loss of the best 1139 prefix will lead to re-evaluation of TIEs present and re- 1140 advertisement of a new best PGP. 1142 4.2.4.4. Re-origination 1144 A node must re-originate policy-guided prefixes and retransmit them. 1145 The node has its database of southbound policy-guided prefixes to 1146 send in its S-TIE and its database of northbound policy-guided 1147 prefixes to send in its N-TIE. 1149 Of course, a leaf does not need to re-originate southbound policy- 1150 guided prefixes. 1152 4.2.4.5. Overlap with Disaggregated Prefixes 1154 PGPs may overlap with prefixes introduced by automatic de- 1155 aggregation. The topic is under further discussion. The break in 1156 connectivity that leads to infeasibility of a PGP is mirrored in 1157 adjacency tear-down and according removal of such PGPs. 1158 Nevertheless, the underlying link-state flooding will likely 1159 react significantly faster than a hop-by-hop redistribution and 1160 with that the preference for PGPs may cause intermittent black-holes. 1162 4.2.5. Reachability Computation 1164 A node has three sources of relevant information. A node knows the 1165 full topology south from the received N-TIEs. A node has the set of 1166 prefixes with associated distances and bandwidths from received 1167 S-TIEs. A node can also have a set of PGPs.
1169 To compute reachability, a node conceptually runs a northbound and a 1170 southbound SPF. We call those N-SPF and S-SPF. 1172 Since neither computation can "loop" (with due considerations given 1173 to PGPs), it is possible to compute non-equal-cost or even k-shortest 1174 paths [EPPSTEIN] and "saturate" the fabric to the extent desired. 1176 4.2.5.1. Northbound SPF 1178 N-SPF uses northbound and east-west adjacencies in North Node TIEs 1179 when progressing Dijkstra. Observe that this is really just a one 1180 hop variety since South Node TIEs are not re-flooded southbound 1181 beyond a single level (or east-west) and with that the computation 1182 cannot progress beyond adjacent nodes. 1184 A default route found when crossing an E-W link is used IIF 1186 1. the node itself does NOT have any northbound adjacencies AND 1188 2. the adjacent node has one or more northbound adjacencies 1190 This rule forms a "one-hop default route split-horizon" and prevents 1191 looping over default routes while allowing for "one-hop protection" 1192 of nodes that lost all northbound adjacencies. 1194 Other south prefixes found when crossing an E-W link MAY be used IIF 1196 1. no north neighbors are advertising the same or a supersuming non- 1197 default prefix AND 1199 2. the node does not originate a non-default supersuming prefix 1200 itself. 1202 i.e. the E-W link can be used as the gateway of last resort for a 1203 specific prefix only. Using south prefixes across an E-W link can be 1204 beneficial, e.g. on automatic de-aggregation in pathological fabric 1205 partitioning scenarios. 1207 A detailed example can be found in Section 5.4. 1209 For N-SPF we are using the South Node TIEs to find according 1210 adjacencies to verify backlink connectivity. Just as in the case of IS- 1211 IS or OSPF, two unidirectional links are associated together to 1212 confirm bidirectional connectivity. 1214 4.2.5.2. Southbound SPF 1216 S-SPF uses only the southbound adjacencies in the south node TIEs, 1217 i.e.
progresses towards nodes at lower levels. Observe that E-W 1218 adjacencies are NEVER used in the computation. This enforces the 1219 requirement that a packet traversing in a southbound direction must 1220 never change its direction. 1222 S-SPF uses northbound adjacencies in north node TIEs to verify 1223 backlink connectivity. 1225 4.2.5.3. East-West Forwarding Within a Level 1227 Ultimately, it should be observed that in the presence of a "ring" of E-W 1228 links in a level neither SPF will provide a "ring protection" scheme 1229 since such a computation would necessarily have to deal with breaking 1230 of "loops" in the generic Dijkstra sense; an application for which RIFT 1231 is not intended. It is outside the scope of this document how an 1232 underlay can be used to provide full-mesh connectivity between 1233 nodes in the same layer that would allow for N-SPF to provide 1234 protection for a single node losing all its northbound adjacencies 1235 (as long as any of the other nodes in the level are northbound 1236 connected). 1238 Using south prefixes over horizontal links is optional and can 1239 protect against pathological fabric partitioning cases that leave 1240 only paths to destinations that would necessitate multiple changes of 1241 forwarding direction between north and south. 1243 4.2.6. Attaching Prefixes 1245 After the SPF is run, it is necessary to attach the according prefixes. 1246 For S-SPF, prefixes from an N-TIE are attached to the originating 1247 node with that node's next-hop set and a distance equal to the 1248 prefix's cost plus the node's minimized path distance. The RIFT 1249 route database, a set of (prefix, type=spf, path_distance, next-hop 1250 set), accumulates these results. Obviously, the prefix retains its 1251 type which is used to tie-break between the same prefix advertised 1252 with different types. 1254 In the case of N-SPF, prefixes from each S-TIE also need to be added to 1255 the RIFT route database.
The N-SPF is really just a stub so the 1256 computing node needs simply to determine, for each prefix in an S-TIE 1257 that originated from an adjacent node, what next-hops to use to reach 1258 that node. Since there may be parallel links, the next-hops to use 1259 can be a set; presence of the computing node in the associated Node 1260 S-TIE is sufficient to verify that at least one link has 1261 bidirectional connectivity. The set of minimum cost next-hops from 1262 the computing node X to the originating adjacent node is determined. 1264 Each prefix has its cost adjusted before being added into the RIFT 1265 route database. The cost of the prefix is set to the cost received 1266 plus the cost of the minimum cost next-hop to that neighbor. Then 1267 each prefix can be added into the RIFT route database with the 1268 next_hop_set; ties are broken based upon type first and then 1269 distance. RIFT route preferences are normalized by the according 1270 thrift model type. 1272 An exemplary implementation for node X follows: 1274 for each S-TIE 1275 if S-TIE.layer > X.layer 1276 next_hop_set = set of minimum cost links to the S-TIE.originator 1277 next_hop_cost = minimum cost link to S-TIE.originator 1278 for each prefix P in the S-TIE 1279 P.cost = P.cost + next_hop_cost 1280 if P not in route_database: 1281 add (P, type=DistVector, P.cost, next_hop_set) to route_database 1282 else if route_database[P].type is not PolicyGuided: 1283 if route_database[P].cost > P.cost: 1284 update route_database[P] with (P, DistVector, P.cost, next_hop_set) 1285 else if route_database[P].cost == P.cost: 1286 update route_database[P] with (P, DistVector, P.cost, 1287 merge(next_hop_set, route_database[P].next_hop_set)) 1288 else 1289 // Not preferred route so ignore 1290 end if 1291 end if 1292 end for 1293 end if 1295 end for 1298 Figure 4: Adding Routes from S-TIE Prefixes 1300 4.2.7.
Attaching Policy-Guided Prefixes 1302 Each policy-guided prefix P has its cost and next_hop_set already 1303 stored in the associated database, as specified in Section 4.2.4.3; 1304 the cost stored for the PGP is already updated to consider the 1305 cost of the link to the advertising neighbor. By definition, a 1306 policy-guided prefix is preferred to a regular prefix. 1308 for each policy-guided prefix P: 1309 if P not in route_database: 1310 add (P, type=PolicyGuided, P.cost, next_hop_set) to route_database 1311 else if (route_database[P].type is not PolicyGuided) or 1313 (route_database[P].cost > P.cost): 1315 update route_database[P] with (P, PolicyGuided, P.cost, next_hop_set) 1316 else if route_database[P].cost == P.cost: 1317 update route_database[P] with (P, PolicyGuided, P.cost, 1318 merge(next_hop_set, route_database[P].next_hop_set)) 1319 else 1320 // Not preferred route so ignore 1321 end if 1323 end for 1325 Figure 5: Adding Routes from Policy-Guided Prefixes 1327 4.2.8. Automatic Disaggregation on Link & Node Failures 1329 Under normal circumstances, a node's S-TIEs contain just the 1330 adjacencies, a default route and policy-guided prefixes. However, if 1331 a node detects that its default IP prefix covers one or more prefixes 1332 that are reachable through it but not through one or more other nodes 1333 at the same level, then it MUST explicitly advertise those prefixes 1334 in an S-TIE. Otherwise, some percentage of the northbound traffic 1335 for those prefixes would be sent to nodes without according 1336 reachability, causing that traffic to be black-holed. Even when not black- 1337 holing, the resulting forwarding could 'backhaul' packets through the 1338 higher level spines, clearly an undesirable condition affecting the 1339 blocking probabilities of the fabric. 1341 We refer to the process of advertising additional prefixes as 'de- 1342 aggregation' or 'dis-aggregation'.
A node determines the set of prefixes needing de-aggregation using
the following steps:

1. A DAG computation in the southern direction is performed first,
   i.e. the N-TIEs are used to find all prefixes it can reach and
   the set of next-hops in the lower level for each. Such a
   computation can be easily performed on a fat tree by e.g. setting
   all link costs in the southern direction to 1 and all northern
   directions to infinity. We term the set of those prefixes |R,
   and for each prefix, r, in |R, we define its set of next-hops to
   be |H(r). Observe that policy-guided prefixes are NOT affected
   since their scope is controlled by configuration.

2. The node uses reflected S-TIEs to find all nodes at the same
   level in the same PoD and the set of southbound adjacencies for
   each. The set of nodes at the same level is termed |N and for
   each node, n, in |N, we define its set of southbound adjacencies
   to be |A(n).

3. For a given r, if the intersection of |H(r) and |A(n), for any n,
   is null then that prefix r must be explicitly advertised by the
   node in an S-TIE.

4. An identical set of de-aggregated prefixes is flooded on each of
   the node's southbound adjacencies. In accordance with the normal
   flooding rules for an S-TIE, a node at the lower level that
   receives this S-TIE will not propagate it south-bound. Neither
   is it necessary for the receiving node to reflect the
   disaggregated prefixes back over its adjacencies to nodes at the
   level from which it was received.

To summarize the above in simplest terms: if a node detects that its
default route encompasses prefixes for which one of the other nodes
in its level has no possible next-hops in the level below, it has to
disaggregate them to prevent black-holing or suboptimal routing.
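The test in step 3 is a plain set intersection; a minimal sketch,
assuming the maps |H and |A have already been computed per steps 1
and 2 (all names here are illustrative):

```python
def prefixes_to_disaggregate(H, A):
    """Return the prefixes a node must explicitly advertise in an
    S-TIE.

    H maps each southern-reachable prefix r to its set of southbound
    next-hops |H(r); A maps each same-level node n to its set of
    southbound adjacencies |A(n).  A prefix needs disaggregation if
    some same-level node shares no southbound adjacency with |H(r).
    """
    return {
        r for r, next_hops in H.items()
        if any(not (next_hops & adjacencies) for adjacencies in A.values())
    }
```

For instance, if Prefix 121 is only reachable via Leaf 121 and Leaf
122 while a same-level node only has adjacencies to Leaf 111 and
Leaf 112, the intersection is empty and Prefix 121 is returned.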
Hence a node X needs to determine if it can reach a different set of
south neighbors than other nodes at the same level that are connected
to it via at least one common south or east-west neighbor. If it
can, then prefix disaggregation may be required. If it can't, then
no prefix disaggregation is needed. An example of disaggregation is
provided in Section 5.3.

A possible algorithm is described last:

1. Create partial_neighbors = (empty), a set of neighbors with
   partial connectivity to node X's layer from X's perspective.
   Each entry is a south neighbor of X and a list of nodes of
   X.layer that can't reach that neighbor.

2. A node X determines its set of southbound neighbors
   X.south_neighbors.

3. For each S-TIE that X holds, originated by a node Y at X.layer:
   if Y.south_neighbors is not the same as X.south_neighbors but the
   nodes share at least one southern neighbor, then for each
   neighbor N in X.south_neighbors but not in Y.south_neighbors, add
   (N, (Y)) to partial_neighbors if N isn't there or add Y to the
   list for N.

4. If partial_neighbors is empty, then node X does not need to
   disaggregate any prefixes. If node X is advertising
   disaggregated prefixes in its S-TIE, X SHOULD remove them and re-
   advertise its according S-TIEs.

A node X computes its SPF based upon the received N-TIEs. This
results in a set of routes, each categorized by (prefix,
path_distance, next-hop-set). Alternately, for clarity in the
following procedure, these can be organized by next-hop-set as
((next-hops), {(prefix, path_distance)}). If partial_neighbors isn't
empty, then the following procedure describes how to identify
prefixes to disaggregate.
   disaggregated_prefixes = { empty }
   nodes_same_layer = { empty }
   for each S-TIE
     if (S-TIE.layer == X.layer and
         S-TIE.originator shares at least one S-neighbor with X)
       add S-TIE.originator to nodes_same_layer
     end if
   end for

   for each next-hop-set NHS
     isolated_nodes = nodes_same_layer
     for each NH in NHS
       if NH in partial_neighbors
         isolated_nodes = intersection(isolated_nodes,
                                       partial_neighbors[NH].nodes)
       end if
     end for

     if isolated_nodes is not empty
       for each prefix using NHS
         add (prefix, distance) to disaggregated_prefixes
       end for
     end if
   end for

   copy disaggregated_prefixes to X's S-TIE
   if X's S-TIE is different
     schedule S-TIE for flooding
   end if

         Figure 6: Computation to Disaggregate Prefixes

Each disaggregated prefix is sent with the accurate path_distance.
This allows a node to send the same S-TIE to each south neighbor.
The south neighbor which is connected to that prefix will thus have
a shorter path.

Finally, to summarize the less obvious points partially omitted in
the algorithms to keep them more tractable:

1. all neighbor relationships MUST perform backlink checks.

2. overload bits as introduced in Section 4.3.1 have to be respected
   during the computation.

3. all the lower level nodes are flooded the same disaggregated
   prefixes since we don't want to build an S-TIE per node and
   complicate things unnecessarily. The PoD containing the prefix
   will prefer the southbound route anyway.

4. disaggregated prefixes do NOT have to propagate to lower levels.
   With that, the disturbance in terms of new flooding is contained
   to the single level experiencing failures.

5. disaggregated prefix S-TIEs are not "reflected" by the lower
   layer, i.e. nodes within the same level do NOT need to be aware
   which node computed the need for disaggregation.
6. The fabric still supports maximum load balancing properties while
   not trying to send traffic northbound unless necessary.

Ultimately, complex partitions of the superspine on sparsely
connected fabrics can lead to the necessity of transitive
disaggregation through multiple layers. The topic will be described
and standardized in later versions of this document.

4.2.9. Optional Autoconfiguration

Each RIFT node can optionally operate in zero touch provisioning
(ZTP) mode, i.e. it has no configuration (unless it is a superspine
at the top of the topology or it MUST operate as leaf and/or support
leaf-2-leaf procedures) and it will fully configure itself after
being attached to the topology. Configured nodes and nodes operating
in ZTP can be mixed and will form a valid topology if achievable.
This section describes the necessary concepts and procedures.

4.2.9.1. Terminology

Automatic Level Derivation: Procedures which allow nodes without a
   configured level to derive it automatically. Only applied if
   CONFIGURED_LEVEL is undefined.

UNDEFINED_LEVEL: An imaginary value that indicates that the level
   has not been determined and has not been configured. Schemas
   normally indicate that by a missing optional value without an
   available defined default.

LEAF_ONLY: An optional configuration flag that can be configured on
   a node to make sure it never leaves the "bottom of the
   hierarchy". SUPERSPINE_FLAG and CONFIGURED_LEVEL cannot be
   defined at the same time as this flag. It implies a
   CONFIGURED_LEVEL value of 0.

CONFIGURED_LEVEL: A level value provided manually. When this is
   defined (i.e. it is not UNDEFINED_LEVEL) the node is not
   participating in ZTP. SUPERSPINE_FLAG is ignored when this value
   is defined. LEAF_ONLY can be set only if this value is undefined
   or set to 0.
DERIVED_LEVEL: Level value computed via automatic level derivation
   when CONFIGURED_LEVEL is equal to UNDEFINED_LEVEL.

LEAF_2_LEAF: An optional flag that can be configured on a node to
   make sure it supports procedures defined in Section 4.3.7.
   SUPERSPINE_FLAG is ignored when set at the same time as this
   flag. LEAF_2_LEAF implies LEAF_ONLY and the according
   restrictions.

LEVEL_VALUE: In the ZTP case the original definition of "level" in
   Section 2.1 is both extended and relaxed. First, level is now
   defined as LEVEL_VALUE, the first defined value of
   CONFIGURED_LEVEL followed by DERIVED_LEVEL. Second, it is
   possible for nodes more than one level apart to form adjacencies
   if any of the nodes is at least LEAF_ONLY.

Valid Offered Level (VOL): A neighbor's level received on a valid
   LIE (i.e. passing all checks for adjacency formation while
   disregarding all clauses involving level values), persisting for
   the duration of the holdtime interval on the LIE. Observe that
   offers from nodes offering a level value of 0 do not constitute
   VOLs (since no valid DERIVED_LEVEL can be obtained from those).
   Offers from LIEs with `not_a_ztp_offer` being true are not VOLs
   either.

Highest Available Level (HAL): Highest defined level value seen from
   all VOLs received.

Highest Adjacency Three Way (HAT): Highest neighbor level of all the
   formed three way adjacencies for the node.

SUPERSPINE_FLAG: Configuration flag provided to all superspines.
   LEAF_FLAG and CONFIGURED_LEVEL cannot be defined at the same time
   as this flag. It implies a CONFIGURED_LEVEL value of 16. In
   fact, it is basically a shortcut for configuring the same level
   at all superspine nodes, which is unavoidable since an initial
   'seed' is needed for other ZTP nodes to derive their level in the
   topology.
4.2.9.2. Automatic SystemID Selection

RIFT identifies each node via a SystemID which is a 64 bits wide
integer. It is relatively simple to derive a, for all practical
purposes collision free, value for each node on startup. As simple
examples, either the system MAC and two random bytes can be used, or
an IPv4/IPv6 router ID interface address recycled as SystemID. The
router MUST ensure that such an identifier is not changing very
frequently (at least not without sending all its TIEs with fairly
short lifetimes) since otherwise the network may be left with large
amounts of stale TIEs in other nodes (though this is not necessarily
a serious problem if the procedures suggested in Section 7 are
implemented).

4.2.9.3. Generic Fabric Example

ZTP forces us to think about a miscabled or unusually cabled fabric
and how such a topology can be forced into the "lattice" structure
which a fabric represents (with further restrictions). Let us
consider a necessary and sufficient physical cabling in Figure 7.
We assume all nodes are in the same PoD.

. +---+
. | A |                s   = SUPERSPINE_FLAG
. | S |                l   = LEAF_ONLY
. ++-++                l2l = LEAF_2_LEAF
.  | |
. +--+ +--+
. |       |
. +--++ ++--+
. | E | | F |
. | +-+ | +-----------+
. ++--+ | ++-++       |
.  | |  |  | |        |
.  | +-------+        |
.  | |  |  | |        |
.  | |  +----+ |      |
.  | |  |      |      |
. ++-++ ++-++  |      |
. | I +-----+ J |     |
. |   | |   +-+ |     |
. ++-++ +--++ | |     |
.  | |   | |  | |     |
.  +---------+ | +------+
.  | |   | |  | |     |
.  +-----------------+ |
.  | |   | |  | |     |
. ++-++ ++-++         |
. | X +-----+ Y +-+
. |l2l|     | l |
. +---+     +---+

           Figure 7: Generic ZTP Cabling Considerations

First, we need to anchor the "top" of the cabling and that's what
the SUPERSPINE_FLAG at node A is for.
Then things look smooth until we have to decide whether node Y is at
the same level as X, or at the same level as I and J so that X is
south of it. This is unresolvable here until we "nail down the
bottom" of the topology. To achieve that we use the leaf flags. We
will see further on whether Y chooses to form adjacencies to F or to
I and J successively.

4.2.9.4. Level Determination Procedure

A node starting up with UNDEFINED_LEVEL (i.e. without a
CONFIGURED_LEVEL or any leaf or superspine flag) MUST follow these
additional procedures:

1. It advertises its LEVEL_VALUE on all LIEs (observe that this can
   be UNDEFINED_LEVEL which in terms of the schema is simply an
   omitted optional value).

2. It chooses on an ongoing basis from all VOLs the value of
   MAX(HAL-1,0) as its DERIVED_LEVEL. The node then starts to
   advertise this derived level.

3. A node that lost all adjacencies with HAL value MUST hold down
   computation of a new DERIVED_LEVEL for a short period of time
   unless it has no VOLs from southbound adjacencies. After the
   holddown expired, it MUST discard all received offers, recompute
   DERIVED_LEVEL and announce it to all neighbors.

4. A node MUST reset any adjacency that has changed the level it is
   offering and is in three way state.

5. A node that changed its defined level value MUST readvertise its
   own TIEs (since the new `PacketHeader` will contain a different
   level than before). The sequence number of each TIE MUST be
   increased.

6. After a level has been derived the node MUST set the
   `not_a_ztp_offer` on LIEs towards all systems extending a VOL for
   HAL.

A node starting with LEVEL_VALUE being 0 (i.e. it assumes a leaf
function or has a CONFIGURED_LEVEL of 0) MUST follow these
additional procedures:

1. It computes HAT per the procedures above but does NOT use it to
   compute DERIVED_LEVEL.
   HAT is used to limit adjacency formation per Section 4.2.2.

Precise finite state machines will be provided in later versions of
this specification.

4.2.9.5. Resulting Topologies

The procedures defined in Section 4.2.9.4 will lead to the RIFT
topology and levels depicted in Figure 8.

. +---+
. | As|
. | 64|
. ++-++
.  | |
. +--+ +--+
. |       |
. +--++ ++--+
. | E | | F |
. | 63+-+ | 63+-----------+
. ++--+ | ++-++           |
.  | |  |  | |            |
.  | +-------+            |
.  | |  |  | |            |
.  | |  +----+ |          |
.  | |  |      |          |
. ++-++ ++-++  |          |
. | I +-----+ J |         |
. | 62|     | 62|         |
. ++--+     +--++         |
.  |           |          |
.  +---------+ |          |
.            | |          |
. ++-++     +---+         |
. | X |     | Y +---------+
. | 0 |     | 0 |
. +---+     +---+

         Figure 8: Generic ZTP Topology Autoconfigured

If we imagine that the LEAF_ONLY restriction on Y is removed, the
outcome would be very different and would result in Figure 9. This
demonstrates that autoconfiguration prevents miscabling detection
and with that can lead to undesirable effects when leafs are not
"nailed down" and are arbitrarily cabled.

. +---+
. | As|
. | 64|
. ++-++
.  | |
. +--+ +--+
. |       |
. +--++ ++--+
. | E | | F |
. | 63+-+ | 63+-------+
. ++--+ | ++-++       |
.  | |  |  | |        |
.  | +-------+        |
.  | |  |  | |        |
.  | |  +----+ |      |
.  | |  |      |      |
. ++-++ ++-++  |    +-+-+
. | I +-----+ J +-----+ Y |
. | 62|     | 62|   | 62|
. ++-++     +--++   ++-++
.  | |       | |     | |
.  | +-----------------+
.  |         | |     |
.  +---------+ |     |
.  |           |     |
. ++-++        |     |
. | X +--------+-----+
. | 0 |
. +---+

         Figure 9: Generic ZTP Topology Autoconfigured

4.2.10. Stability Considerations

The autoconfiguration mechanism computes a global maximum of levels
by diffusion.
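A toy sketch of this diffusion follows, under the simplifying
assumptions that only the MAX(HAL-1,0) rule of Section 4.2.9.4
applies and that holddowns, `not_a_ztp_offer` and adjacency resets
are ignored; the link lists are read off Figure 7 as best as can be
reconstructed, and all names are illustrative:

```python
def diffuse_levels(links, seeds, rounds=16):
    """Iterate until every ZTP node derives MAX(HAL-1, 0), where HAL
    is the highest defined level among its neighbors' offers.
    Offers of level 0 are not VOLs and are therefore ignored."""
    levels = dict(seeds)           # configured nodes, e.g. superspines
    for _ in range(rounds):
        changed = False
        for node, neighbors in links.items():
            if node in seeds:      # configured nodes never derive
                continue
            vols = [levels[n] for n in neighbors
                    if levels.get(n) not in (None, 0)]
            derived = max(max(vols) - 1, 0) if vols else None
            if derived != levels.get(node):
                levels[node], changed = derived, True
        if not changed:
            break
    return levels

# Cabling of Figure 7: A is the superspine seed at level 64; X and Y
# carry leaf flags, i.e. a configured level of 0.
links = {"A": ["E", "F"], "E": ["A", "I", "J"],
         "F": ["A", "I", "J", "Y"], "I": ["E", "F", "J", "X"],
         "J": ["E", "F", "I", "X", "Y"], "X": ["I", "J", "Y"],
         "Y": ["F", "J", "X"]}
levels = diffuse_levels(links, seeds={"A": 64, "X": 0, "Y": 0})
```

This settles at E and F deriving level 63 and I and J deriving 62,
matching the outcome shown in Figure 8.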
The achieved equilibrium can be disturbed massively by all nodes
with the highest level either leaving or entering the domain (with
some finer distinctions not explained further). It is therefore
recommended that each node is multi-homed towards nodes with
respective HAL offerings. Fortunately, this is the natural state of
things for the topology variants considered in RIFT.

4.3. Further Mechanisms

4.3.1. Overload Bit

The Overload Bit MUST be respected in all according reachability
computations. A node with the overload bit set SHOULD NOT advertise
any reachability prefixes southbound except locally hosted ones.

A leaf node SHOULD set the 'overload' bit on its node TIEs, since if
the spine nodes were to forward traffic not meant for the local
node, the leaf node does not have the topology information to
prevent a routing/forwarding loop.

4.3.2. Optimized Route Computation on Leafs

Since leafs see only "one hop away", they do not need to run a full
SPF but can simply gather prefix candidates from their neighbors and
build the according routing table.

A leaf will have no N-TIEs except its own and optionally those from
its east-west neighbors. A leaf will have S-TIEs from its
neighbors.

Instead of creating a network graph from its N-TIEs and neighbors'
S-TIEs and then running an SPF, a leaf node can simply compute the
minimum cost and next_hop_set to each leaf neighbor by examining its
local interfaces, determining bi-directionality from the associated
N-TIE, and specifying the neighbor's next_hop_set and cost from the
minimum cost local interfaces to that neighbor.

Then a leaf attaches prefixes as in Section 4.2.6 as well as the
policy-guided prefixes as in Section 4.2.7.

4.3.3. Key/Value Store

4.3.3.1. Southbound

The protocol supports a southbound distribution of key-value pairs
that can be used to e.g.
distribute configuration information during
topology bring-up. The KV S-TIEs can arrive from multiple nodes and
hence need tie-breaking per key. We use the following rules:

1. Only KV TIEs originated by a node to which the receiver has an
   adjacency are considered.

2. Within all valid KV S-TIEs containing the key, the value of the
   KV S-TIE whose according node S-TIE is present and has the
   highest level is preferred; within the same level the highest
   originator ID wins. If keys in the most preferred TIEs are
   overlapping, the behavior is undefined.

Observe that if a node goes down, the node south of it loses its
adjacencies to it and with that the KVs will be disregarded; on a
tie-break change, new KVs are re-advertised to prevent stale
information being used by nodes further south. KV information in
the southbound direction is not the result of an independent
computation by every node but of a diffused computation.

4.3.3.2. Northbound

Certain use cases seem to necessitate distribution of essentially KV
information that is generated in the leafs in the northbound
direction. Such information is flooded in KV N-TIEs. Since the
originator of a northbound KV is preserved during northbound
flooding, overlapping keys could be used. However, to omit further
protocol complexity, only the value of the key in the TIE tie-broken
in the same fashion as southbound KV TIEs is used.

4.3.4. Interactions with BFD

RIFT MAY incorporate BFD [RFC5881] to react quickly to link
failures. In such a case the following procedures are introduced:

   After RIFT 3-way hello adjacency convergence a BFD session MAY be
   formed automatically between the RIFT endpoints without further
   configuration.

   In case RIFT loses the 3-way hello adjacency, the BFD session
   should be brought down until the 3-way adjacency is formed again.
   In case an established BFD session goes Down after it was Up, the
   RIFT adjacency should be re-initialized from scratch.

   In case of parallel links between nodes, each link may run its
   own independent BFD session.

   In case RIFT changes link identifiers, both the hello as well as
   the BFD sessions will be brought down and back up again.

4.3.5. Fabric Bandwidth Balancing

A well understood problem in fabrics is that in case of link losses
it would be ideal to rebalance how much traffic is offered to
switches in the next layer based on the ingress and egress bandwidth
they have. Current attempts rely mostly on specialized traffic
engineering via a controller or on leafs being aware of the complete
topology, with according cost and complexity.

RIFT presents a very lightweight mechanism that can deal with the
problem in an approximate way based on the fact that RIFT is loop-
free.

4.3.5.1. Northbound Direction

In a first step, a node can compare the amount of northbound
bandwidth available to neighbors at the same level and modify the
metric on its advertised default route (or even other routes) to
present a different distance, leading to e.g. weighted ECMP
forwarding on leafs. We call such a distance Bandwidth Adjusted
Distance or BAD. This is best illustrated by a simple example.

.    | x |      |  |
.    | x |      |  |
. +-+---+-+  +-+---+-+
. |       |  |       |
. |Node111|  |Node112|
. +-+---+++  ++----+++
.  |x ||      ||   ||
.  || |+---------------+ ||
.  || +---------------+| ||
.  || ||               || ||
.  || +------------+|  || ||
.  || |+------------+  || ||
.  |x ||               || ||
. +-+---+++  +--++-+++
. |       |  |       |
. |Leaf111|  |Leaf112|
. +-------+  +-------+

                 Figure 10: Balancing Bandwidth

All links in Figure 10 are assumed to have the same bandwidth for
simplicity.
Node 111 sees in the node S-TIE of Node 112 that Node 112 has twice
the amount of bandwidth going northbound and therefore Node 111 will
advertise its default route cost (BAD) as twice the default, which
without further failures would lead to Leaf 111 and Leaf 112
distributing 1/3 of the traffic to Node 111 and 2/3 to Node 112.

Further, in Figure 10 we assume that Leaf 111 lost one of the
parallel links to Node 111 and with that wants to push more traffic
onto Node 112. This leads to local modification of the received
BADs; each node can choose the ratio here independently based on its
understanding of e.g. traffic distribution between E-W and N-S or
queue occupancy. If we assume that 50% of the leaf's traffic is for
Leaf 112 and 50% exits northbound, we would modify the BADs
according to the bandwidth available towards each of them and end up
in Leaf 111 with a weight of 4 to Node 111 and a weight of 1 to Node
112, which gives us roughly 4/5 of the traffic going to Node 112.

A future version of this document will provide the precise algorithm
to compute BADs from all other nodes at the same level using the
same algorithm as Section 4.2.3.8 while ignoring overloaded nodes.

Observe that since BAD is only computed for default routes, any
disaggregated prefixes or PGPs are not affected.

Observe further that a change in available bandwidth will only
affect one level down in the fabric, i.e. the blast radius of
bandwidth changes is contained.

4.3.5.2. Southbound Direction

Due to its loop-free properties, a node could take into account
during S-SPF the available bandwidth on the nodes in lower layers
and modify the amount of traffic offered to the next level's
"southbound" nodes based on what it sees as the total achievable
maximum flow through those nodes.
It is worth observing that such computations will work better if
standardized but do not necessarily have to be. As long as the
packet keeps heading south it will take one of the available paths
and arrive at the intended destination.

Future versions of this document will fill in more details.

4.3.6. Segment Routing Support with RIFT

Recently, an alternative architecture reusing labels as segment
identifiers [I-D.ietf-spring-segment-routing] has gained traction
and may present use cases in the DC fabric that would justify its
deployment. Such use cases will either precondition an assignment
of a label per node (or other entities where the mechanisms are
equivalent) or a global assignment and a knowledge of the topology
everywhere to compute segment stacks of interest. We deal with the
two issues separately.

4.3.6.1. Global Segment Identifiers Assignment

Global segment identifiers are normally assumed to be provided by
some kind of a centralized "controller" instance and distributed to
other entities. This can be performed in RIFT by attaching a
controller to the superspine nodes at the top of the fabric where
the whole topology is always visible, assigning such identifiers
there and then distributing those via the KV mechanism towards all
nodes so they can perform things like probing the fabric for
failures using a stack of segments.

4.3.6.2. Distribution of Topology Information

Some segment routing use cases seem to precondition full knowledge
of the fabric topology in all nodes, which can be achieved albeit at
the loss of one of the highly desirable properties of RIFT, namely
minimal blast radius. Basically, RIFT can function as a flat IGP by
switching off its flooding scopes.
All nodes will end up with a full topology view and, albeit the
N-SPF and S-SPF are still performed based on RIFT rules, any
computation with segment identifiers that needs full topology can
use it.

Beside the blast radius problem, excessive flooding may present a
significant load on implementations. RIFT can be extended beside
the mechanism in Section 4.2.3.8 to provide an algorithm for
globally optimized flooding minimization should demand for such a
use case solidify.

4.3.7. Leaf to Leaf Procedures

RIFT can optionally allow special leaf East-West adjacencies under
an additional set of rules. A leaf supporting those procedures
MUST:

   advertise the LEAF_2_LEAF flag in its node capabilities AND

   set the overload bit on all of its node TIEs AND

   flood only its own north and south TIEs over E-W leaf
   adjacencies AND

   always use the E-W leaf adjacency in both the north as well as
   the south computation AND

   install a discard route for any aggregate advertised in its
   TIEs AND

   never form southbound adjacencies.

This will allow the E-W leaf nodes to exchange traffic strictly for
the prefixes advertised in each other's north prefix TIEs (since the
southbound computation will find the reverse direction in the other
node's TIE and install its north prefixes).

4.3.8. Other End-to-End Services

Losing full, flat topology information at every node will have an
impact on some of the end-to-end network services. This is the
price paid for minimal disturbance in case of failures and reduced
flooding and memory requirements on nodes lower south in the level
hierarchy.

4.3.9. Address Family and Multi Topology Considerations

Multi-Topology (MT) [RFC5120] and Multi-Instance (MI) [RFC6822] are
used today in link-state routing protocols to support several
domains on the same physical topology.
RIFT supports this capability by carrying transport ports in the LIE
protocol exchanges. Multiplexing of LIEs can be achieved by either
choosing varying multicast addresses or ports on the same address.

BFD interactions in Section 4.3.4 are implementation dependent when
multiple RIFT instances run on the same link.

4.3.10. Reachability of Internal Nodes in the Fabric

RIFT does not precondition that its nodes have reachable addresses,
albeit for operational purposes this is clearly desirable. Under
normal operating conditions this can be easily achieved by e.g.
injecting the node's loopback address into North Prefix TIEs.

Things get more interesting in case a node loses all its northbound
adjacencies but is not at the top of the fabric. In such a case a
node that detects that some other members at its level are
advertising northbound adjacencies MAY inject its loopback address
into a southbound PGP TIE and become reachable "from the south" that
way. Further, a solution may be implemented where, based on e.g. a
"well known" community, such a southbound PGP is reflected at level
0 and advertised as a northbound PGP again to allow for
"reachability from the north" at the cost of additional flooding.

4.3.11. One-Hop Healing of Levels with East-West Links

Based on the rules defined in Section 4.2.5 and Section 4.2.3.7, and
given the presence of E-W links, RIFT can provide one-hop protection
for nodes that lost all their northbound links or in other complex
link set failure scenarios. Section 5.4 explains the resulting
behavior based on one such example.

5. Examples

5.1. Normal Operation

This section describes RIFT deployment in the example topology
without any node or link failures. We disregard flooding reduction
for simplicity's sake.
As a first step, the following bi-directional adjacencies will be
created (and any other links that do not fulfill the LIE rules in
Section 4.2.2 are disregarded):

1. Spine 21 (PoD 0) to Node 111, Node 112, Node 121, and Node 122

2. Spine 22 (PoD 0) to Node 111, Node 112, Node 121, and Node 122

3. Node 111 to Leaf 111, Leaf 112

4. Node 112 to Leaf 111, Leaf 112

5. Node 121 to Leaf 121, Leaf 122

6. Node 122 to Leaf 121, Leaf 122

Consequently, N-TIEs would be originated by Node 111 and Node 112
and each set would be sent to both Spine 21 and Spine 22. N-TIEs
also would be originated by Leaf 111 (w/ Prefix 111) and Leaf 112
(w/ Prefix 112 and the multi-homed prefix) and each set would be
sent to Node 111 and Node 112. Node 111 and Node 112 would then
flood these N-TIEs to Spine 21 and Spine 22.

Similarly, N-TIEs would be originated by Node 121 and Node 122 and
each set would be sent to both Spine 21 and Spine 22. N-TIEs also
would be originated by Leaf 121 (w/ Prefix 121 and the multi-homed
prefix) and Leaf 122 (w/ Prefix 122) and each set would be sent to
Node 121 and Node 122. Node 121 and Node 122 would then flood these
N-TIEs to Spine 21 and Spine 22.

At this point both Spine 21 and Spine 22, as well as any controller
to which they are connected, would have the complete network
topology. At the same time, Nodes 111/112/121/122 hold only the
N-TIEs of level 0 of their respective PoD. Leafs hold only their
own N-TIEs.

S-TIEs with adjacencies and a default IP prefix would then be
originated by Spine 21 and Spine 22 and each would be flooded to
Node 111, Node 112, Node 121, and Node 122. Node 111, Node 112,
Node 121, and Node 122 would each send the S-TIE from Spine 21 to
Spine 22 and the S-TIE from Spine 22 to Spine 21.
(S-TIEs are reflected up to the level from which they are received
but they are NOT propagated southbound.)

An S-TIE with a default IP prefix would be originated by Node 111
and Node 112 and each would be sent to Leaf 111 and Leaf 112. Leaf
111 and Leaf 112 would each send the S-TIE from Node 111 to Node 112
and the S-TIE from Node 112 to Node 111.

Similarly, an S-TIE with a default IP prefix would be originated by
Node 121 and Node 122 and each would be sent to Leaf 121 and Leaf
122. Leaf 121 and Leaf 122 would each send the S-TIE from Node 121
to Node 122 and the S-TIE from Node 122 to Node 121. At this point
IP connectivity with maximum possible ECMP has been established
between the leafs while constraining the amount of information held
by each node to the minimum necessary for normal operation and
dealing with failures.

5.2. Leaf Link Failure

.  |   |              |   |
.+-+---+-+         +-+---+-+
.|       |         |       |
.|Node111|         |Node112|
.+-+---+-+         ++----+-+
.  |   |            |    |
.  |   +---------------+ X
.  |                |  |  X  Failure
.  |   +------------+  |  X
.  |   |               |
.+-+---+-+         +--+--+-+
.|       |         |       |
.|Leaf111|         |Leaf112|
.+-------+         +-------+
.    +                 +
. Prefix111         Prefix112

               Figure 11: Single Leaf link failure

In case of a failing leaf link between Node 112 and Leaf 112, the
link-state information will cause re-computation of the necessary
SPF and the higher levels will stop forwarding towards Prefix 112
through Node 112. Only Nodes 111 and 112, as well as both spines,
will see control traffic. Leaf 111 will receive a new S-TIE from
Node 112 and reflect it back to Node 111. Node 111 will
de-aggregate Prefix 111 and Prefix 112 but we will not describe it
further here since de-aggregation is emphasized in the next example.
It is worth observing, 2112 however, that in this example if leaf 111 kept forwarding 2113 traffic towards prefix 112 using the advertised south-bound default 2114 of node 112, the traffic would end up on spine 21 and spine 22 and 2115 cross back into pod 1 via node 111. This is arguably not as bad as 2116 the black-holing present in the next example but clearly undesirable. 2117 Fortunately, de-aggregation prevents this type of behavior except for 2118 a transitory period of time. 2120 5.3. Partitioned Fabric 2122 . +--------+ +--------+ S-TIE of Spine21 2123 . | | | | received by 2124 . |Spine 21| |Spine 22| reflection of 2125 . ++-+--+-++ ++-+--+-++ Nodes 112 and 111 2126 . | | | | | | | | 2127 . | | | | | | | 0/0 2128 . | | | | | | | | 2129 . | | | | | | | | 2130 . +--------------+ | +--- XXXXXX + | | | +---------------+ 2131 . | | | | | | | | 2132 . | +-----------------------------+ | | | 2133 . 0/0 | | | | | | | 2134 . | 0/0 0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+ | 2135 . | 1.1/16 | | | | | | 2136 . | | +-+ +-0/0-----------+ | | 2137 . | | | 1.1/16 | | | | 2138 .+-+----++ +-+-----+ ++-----0/0 ++----0/0 2139 .| | | | | 1.1/16 | 1.1/16 2140 .|Node111| |Node112| |Node121| |Node122| 2141 .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 2142 . | | | | | | | | 2143 . | +---------------+ | | +----------------+ | 2144 . | | | | | | | | 2145 . | +-------------+ | | | +--------------+ | | 2146 . | | | | | | | | 2147 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 2148 .| | | | | | | | 2149 .|Leaf111| |Leaf112| |Leaf121| |Leaf122| 2150 .+-+-----+ ++------+ +-----+-+ +-+-----+ 2151 . + + + + 2152 . Prefix111 Prefix112 Prefix121 Prefix122 2153 . 1.1/16 2155 Figure 12: Fabric partition 2157 Figure 12 shows the arguably most catastrophic but also the most 2158 interesting case. Spine 21 is completely severed from access to 2159 Prefix 121 (we use 1.1/16 as example in the figure) by a double link 2160 failure.
However unlikely, if left unresolved, forwarding from leaf 2161 111 and leaf 112 to prefix 121 would suffer 50% black-holing based on 2162 pure default route advertisements by spine 21 and spine 22. 2164 The mechanism used to resolve this scenario hinges on the 2165 distribution of a southbound representation by spine 21 that is 2166 reflected by node 111 and node 112 to spine 22. Spine 22, having 2167 computed reachability to all prefixes in the network, advertises 2168 alongside the default route those prefixes that are reachable only via 2169 lower level neighbors to which spine 21 shows no adjacency. That results 2170 in node 111 and node 112 obtaining a longest-prefix match for prefix 2171 121 which leads through spine 22 and prevents black-holing through 2172 spine 21, which still advertises the 0/0 aggregate only. 2174 The prefix 121 advertised by spine 22 does not have to be propagated 2175 further towards leafs since they do not benefit from this information. 2176 Hence the amount of flooding is restricted to spine 21 reissuing its 2177 S-TIEs and the reflection of those by node 111 and node 112. The 2178 resulting SPF in spine 22 issues new prefix S-TIEs containing 2179 1.1/16. None of the leafs become aware of the changes and the 2180 failure is constrained strictly to the level that became partitioned.
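The forwarding effect of this de-aggregation can be sketched with a short longest-prefix-match example. This is a hypothetical illustration only, not part of the protocol; the RIB entries and next-hop names ("spine21", "spine22") are assumptions derived from Figure 12:

```python
from ipaddress import ip_address, ip_network

# Hypothetical RIB of node 111/112 after spine 22 de-aggregates 1.1/16.
# Before de-aggregation only the 0/0 entry exists and traffic towards
# 1.1/16 is ECMP-split over both spines, black-holing 50% on spine 21.
rib = [
    (ip_network("0.0.0.0/0"), ["spine21", "spine22"]),  # southbound default
    (ip_network("1.1.0.0/16"), ["spine22"]),            # de-aggregated prefix
]

def lookup(dst):
    """Longest-prefix match: among matching entries, pick the longest mask."""
    matches = [(net, nhs) for net, nhs in rib if ip_address(dst) in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("1.1.0.1"))   # traffic to prefix 121 now avoids spine 21
print(lookup("2.2.0.1"))   # all other traffic still uses the ECMP default
```

The sketch only shows why a single more-specific S-TIE from spine 22 suffices: the default route stays in place for all other destinations.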
2182 To finish with an example of the resulting sets computed using the 2183 notation introduced in Section 4.2.8, spine 22 constructs the 2184 following sets: 2186 |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122 2188 |H (for r=Prefix 111) = Node 111, Node 112 2190 |H (for r=Prefix 112) = Node 111, Node 112 2192 |H (for r=Prefix 121) = Node 121, Node 122 2194 |H (for r=Prefix 122) = Node 121, Node 122 2196 |A (for Spine 21) = Node 111, Node 112 2198 With that and |H (for r=prefix 121) and |H (for r=prefix 122) being 2199 disjoint from |A (for spine 21), spine 22 will originate an S-TIE 2200 with prefix 121 and prefix 122, that is flooded to nodes 111, 112, 2201 121 and 122. 2203 5.4. Northbound Partitioned Router and Optional East-West Links 2205 . + + + 2206 . X N1 | N2 | N3 2207 . X | | 2208 .+--+----+ +--+----+ +--+-----+ 2209 .| |0/0> <0/0| |0/0> <0/0| | 2210 .| A01 +----------+ A02 +----------+ A03 | Level 1 2211 .++-+-+--+ ++--+--++ +---+-+-++ 2212 . | | | | | | | | | 2213 . | | +----------------------------------+ | | | 2214 . | | | | | | | | | 2215 . | +-------------+ | | | +--------------+ | 2216 . | | | | | | | | | 2217 . | +----------------+ | +-----------------+ | 2218 . | | | | | | | | | 2219 . | | +------------------------------------+ | | 2220 . | | | | | | | | | 2221 .++-+-+--+ | +---+---+ | +-+---+-++ 2222 .| | +-+ +-+ | | 2223 .| L01 | | L02 | | L03 | Level 0 2224 .+-------+ +-------+ +--------+ 2226 Figure 13: North Partitioned Router 2228 Figure 13 shows a part of a fabric where level 1 is horizontally 2229 connected and A01 lost its only northbound adjacency. Based on N-SPF 2230 rules in Section 4.2.5.1 A01 will compute northbound reachability by 2231 using the link A01 to A02 (whereas A02 will NOT use this link during 2232 N-SPF). Hence A01 will still advertise the default towards level 0 2233 and route unidirectionally using the horizontal link.
Moreover, 2234 based on Section 4.3.10 it may advertise its loopback address as a 2235 south PGP to remain reachable "from the south" for operational 2236 purposes. This is necessary since A02 will NOT route towards A01 2237 using the E-W link (doing otherwise may form routing loops). 2239 As a further consideration, the moment A02 loses link N2 the situation 2240 evolves again. A01 will have no more northbound reachability while 2241 still seeing A03 advertising northbound adjacencies in its south node 2242 TIE. With that it will stop advertising a default route due to 2243 Section 4.2.3.7. Moreover, A02 may now inject its loopback address 2244 as a south PGP. 2246 6. Implementation and Operation: Further Details 2248 6.1. Considerations for Leaf-Only Implementation 2250 Ideally RIFT can be stretched out to the lowest level in the IP 2251 fabric to integrate ToRs or even servers. Since those entities would 2252 run as leafs only, it is worth observing that a leaf-only version is 2253 significantly simpler to implement and requires far fewer resources: 2255 1. Under normal conditions, the leaf needs to support a multipath 2256 default route only. In the worst partitioning case it has to be 2257 capable of accommodating all the leaf routes in its own POD to 2258 prevent black-holing. 2260 2. Leaf nodes hold only their own N-TIEs and S-TIEs of Level 1 nodes 2261 they are connected to; so overall few in number. 2263 3. A leaf node does not have to support flooding reduction and 2264 de-aggregation. 2266 4. Unless optional leaf-2-leaf procedures are desired, default route 2267 origination and S-TIE origination are unnecessary. 2269 6.2. Adaptations to Other Proposed Data Center Topologies 2271 . +-----+ +-----+ 2272 .
| | | | 2273 .+-+ S0 | | S1 | 2274 .| ++---++ ++---++ 2275 .| | | | | 2276 .| | +------------+ | 2277 .| | | +------------+ | 2278 .| | | | | 2279 .| ++-+--+ +--+-++ 2280 .| | | | | 2281 .| | A0 | | A1 | 2282 .| +-+--++ ++---++ 2283 .| | | | | 2284 .| | +------------+ | 2285 .| | +-----------+ | | 2286 .| | | | | 2287 .| +-+-+-+ +--+-++ 2288 .+-+ | | | 2289 . | L0 | | L1 | 2290 . +-----+ +-----+ 2292 Figure 14: Level Shortcut 2294 Strictly speaking, RIFT is not limited to Clos variations only. The 2295 protocol preconditions only a sense of 'compass rose direction' 2296 achieved by configuration (or derivation) of levels and other 2297 topologies are possible within this framework. So, conceptually, one 2298 could include leaf-to-leaf links and even shortcuts between levels but 2299 certain requirements in Section 3 will not be met anymore. As an 2300 example, the level shortcut illustrated in Figure 14 will lead 2301 either to suboptimal routing when L0 sends traffic to L1 (since using 2302 S0's default route will lead to the traffic being sent back to A0 or 2303 A1) or the leafs need each other's routes installed to understand 2304 that only A0 and A1 should be used to talk to each other. 2306 Whether such modifications of topology constraints make sense depends 2307 on many technology variables and an exhaustive treatment 2308 of the topic is definitely outside the scope of this document. 2310 6.3. Originating Non-Default Route Southbound 2312 Obviously, an implementation may choose to originate southbound 2313 a shorter prefix P' instead of a strict default route (as described 2314 in Section 4.2.3.7), but in such a scenario all addresses carried within 2315 the RIFT domain must be contained within P'. 2317 7. Security Considerations 2319 The protocol has provisions for nonces and can include authentication 2320 mechanisms in the future comparable to [RFC5709] and [RFC7987].
2322 One can additionally consider attack vectors where a router may 2323 reboot many times while changing its system ID and pollute the 2324 network with many stale TIEs, or where TIEs are sent with very long 2325 lifetimes and not cleaned up when the routes vanish. Those attack 2326 vectors are not unique to RIFT. Given the large memory footprints 2327 available today those attacks should be relatively benign. Otherwise 2328 a node can implement a strategy of e.g. discarding contents of all 2329 TIEs of nodes that were not present in the SPF tree over a certain 2330 period of time. Since the protocol, like all modern link-state 2331 protocols, is self-stabilizing and will advertise the presence of 2332 such TIEs to its neighbors, they can be re-requested again if a 2333 computation finds that it sees an adjacency formed towards the system 2334 ID of the discarded TIEs. 2336 Section 4.2.9 presents many attack vectors in untrusted environments, 2337 starting with nodes that oscillate their level offers to the 2338 possibility of a node offering a three-way adjacency with the highest 2339 possible level value with a very long holdtime trying to put itself 2340 "on top of the lattice" and with that gaining access to the whole 2341 southbound topology. Session authentication mechanisms are necessary 2342 in environments where this is possible. 2344 8. Information Elements Schema 2346 This section introduces the schema for information elements. 2348 On schema changes that 2350 1. change field numbers or 2352 2. add new required fields or 2354 3. remove fields or 2356 4. change lists into sets, unions into structures or 2358 5. change multiplicity of fields or 2360 6. change name of any field or 2362 7. change datatypes of any field or 2364 8. change default value of any field 2366 the major version of the schema MUST increase. All other changes MUST 2367 increase the minor version within the same major version.
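The versioning rules above can be expressed as a small decision helper. This is a hypothetical sketch only; the change-kind names below are illustrative labels, not part of the schema:

```python
# Kinds of schema changes that force a MAJOR version bump per the
# rules above; any other change only bumps the MINOR version.
MAJOR_CHANGES = {
    "change_field_number",
    "add_required_field",
    "remove_field",
    "change_container_kind",   # e.g. list -> set, union -> struct
    "change_multiplicity",
    "rename_field",
    "change_datatype",
    "change_default_value",
}

def next_version(major, minor, changes):
    """Return the (major, minor) version required after applying `changes`."""
    if any(c in MAJOR_CHANGES for c in changes):
        return major + 1, 0
    return major, minor + 1

print(next_version(8, 0, {"add_optional_field"}))   # minor bump only
print(next_version(8, 0, {"rename_field"}))         # major bump, minor reset
```

Note that a single major-class change in a batch of otherwise minor changes is enough to force the major bump.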
2369 Thrift serializers/deserializers MUST NOT discard optional, unknown 2370 fields but preserve and serialize them again when re-flooding, whereas 2371 missing optional fields MAY be replaced with the corresponding default 2372 values if present. 2374 All signed integers, as forced by Thrift support, MUST be cast for 2375 internal purposes to the equivalent unsigned values without discarding 2376 the sign bit. An implementation SHOULD try to avoid using the 2377 sign bit when generating values. 2379 The schema is normative. 2381 8.1. common.thrift 2383 /** 2384 Thrift file with common definitions for RIFT 2385 */ 2387 namespace * common 2388 /** @note MUST be interpreted in implementation as unsigned 64 bits. 2389 * The implementation SHOULD NOT use the MSB. 2390 */ 2391 typedef i64 SystemIDType 2392 typedef i32 IPv4Address 2393 /** this has to be long enough to accommodate a prefix */ 2394 typedef binary IPv6Address 2395 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2396 typedef i16 UDPPortType 2397 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2398 typedef i32 TIENrType 2399 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2400 typedef i32 MTUSizeType 2401 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2402 typedef i32 SeqNrType 2403 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2404 typedef i32 LifeTimeInSecType 2405 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2406 typedef i16 LevelType 2407 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2408 typedef i32 PodType 2409 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2410 typedef i16 VersionType 2411 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2412 typedef i32 MetricType 2413 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2414 typedef i32 BandwithInMegaBitsType 2415
typedef string KeyIDType 2416 /** node local, unique identification for a link (interface/tunnel 2417 * etc. Basically anything RIFT runs on). This is kept 2418 * at 32 bits so it aligns with BFD [RFC5880] discriminator size. 2419 */ 2420 typedef i32 LinkIDType 2421 typedef string KeyNameType 2422 typedef i8 PrefixLenType 2423 /** timestamp in seconds since the epoch */ 2424 typedef i64 TimestampInSecsType 2425 /** security nonce */ 2426 typedef i64 NonceType 2427 /** adjacency holdtime */ 2428 typedef i16 HoldTimeInSecType 2430 /** Flags indicating a node's behavior in case of ZTP and support 2431 for special optimization procedures. It will force the level to 2432 `leaf_level` */ 2433 enum LeafIndications { 2434 leaf_only =0, 2435 leaf_only_and_leaf_2_leaf_procedures =1, 2437 } 2439 /** default bandwidth on a link */ 2440 const BandwithInMegaBitsType default_bandwidth = 10 2441 /** fixed leaf level when ZTP is not used */ 2442 const LevelType leaf_level = 0 2443 const LevelType default_level = leaf_level 2444 /** This MUST be used when a node is configured as superspine in ZTP. 2445 This is kept reasonably low to allow for fast ZTP convergence on 2446 failures.
*/ 2447 const LevelType default_superspine_level = 24 2448 const PodType default_pod = 0 2449 const LinkIDType undefined_linkid = 0 2450 /** default distance used */ 2451 const MetricType default_distance = 1 2452 /** any distance larger than this will be considered infinity */ 2453 const MetricType infinite_distance = 0x70000000 2454 /** any element with 0 distance will be ignored, 2455 * missing metrics will be replaced with default_distance 2456 */ 2457 const MetricType invalid_distance = 0 2458 const bool overload_default = false 2459 const bool flood_reduction_default = true 2460 const HoldTimeInSecType default_holdtime = 3 2461 /** by default LIE levels are ZTP offers */ 2462 const bool default_not_a_ztp_offer = false 2463 /** 0 is illegal for SystemID */ 2464 const SystemIDType IllegalSystemID = 0 2465 /** empty set of nodes */ 2466 const set<SystemIDType> empty_set_of_nodeids = {} 2468 /** normalized bandwidth metric maximum, i.e. node with lowest northbound bandwidth 2469 * at its level uses this metric to advertise its default route */ 2470 const MetricType normalized_bw_metric_max = 0x1fff 2471 /** normalized bandwidth metric minimum, i.e.
node with highest northbound bandwidth 2472 * at its level uses this metric to advertise its default route */ 2473 const MetricType normalized_bw_metric_min = 0x00ff 2475 /** default UDP port to run LIEs on */ 2476 const UDPPortType default_lie_udp_port = 6949 2477 const UDPPortType default_tie_udp_flood_port = 6950 2479 /** default MTU size to use */ 2480 const MTUSizeType default_mtu_size = 1400 2481 /** default mcast is v4 224.0.1.150, we make it i64 to 2482 * help languages struggling with highest bit */ 2483 const i64 default_lie_v4_mcast_group = 3758096790 2484 /** indicates whether the direction is northbound/east-west 2485 * or southbound */ 2486 enum TieDirectionType { 2487 Illegal = 0, 2488 South = 1, 2489 North = 2, 2490 DirectionMaxValue = 3, 2491 } 2493 enum AddressFamilyType { 2494 Illegal = 0, 2495 AddressFamilyMinValue = 1, 2496 IPv4 = 2, 2497 IPv6 = 3, 2498 AddressFamilyMaxValue = 4, 2499 } 2501 struct IPv4PrefixType { 2502 1: required IPv4Address address; 2503 2: required PrefixLenType prefixlen; 2504 } 2506 struct IPv6PrefixType { 2507 1: required IPv6Address address; 2508 2: required PrefixLenType prefixlen; 2509 } 2511 union IPAddressType { 2512 1: optional IPv4Address ipv4address; 2513 2: optional IPv6Address ipv6address; 2514 } 2516 union IPPrefixType { 2517 1: optional IPv4PrefixType ipv4prefix; 2518 2: optional IPv6PrefixType ipv6prefix; 2519 } 2521 enum TIETypeType { 2522 Illegal = 0, 2523 TIETypeMinValue = 1, 2524 /** first legal value */ 2525 NodeTIEType = 2, 2526 PrefixTIEType = 3, 2527 TransitivePrefixTIEType = 4, 2528 PGPrefixTIEType = 5, 2529 KeyValueTIEType = 6, 2530 TIETypeMaxValue = 7, 2531 } 2532 /** @note: route types which MUST be ordered on their preference 2533 * PGP prefixes are most preferred attracting 2534 * traffic north (towards spine) and then south 2535 * normal prefixes are attracting traffic south (towards leafs), 2536 * i.e. 
prefix in NORTH PREFIX TIE is preferred over SOUTH PREFIX TIE 2537 */ 2538 enum RouteType { 2539 Illegal = 0, 2540 RouteTypeMinValue = 1, 2541 /** First legal value. */ 2542 /** Discard routes are most preferred */ 2543 Discard = 2, 2545 /** Local prefixes are directly attached prefixes on the 2546 * system such as e.g. interface routes. 2547 */ 2548 LocalPrefix = 3, 2549 /** advertised in S-TIEs */ 2550 SouthPGPPrefix = 4, 2551 /** advertised in N-TIEs */ 2552 NorthPGPPrefix = 5, 2553 /** advertised in N-TIEs */ 2554 NorthPrefix = 6, 2555 /** advertised in S-TIEs */ 2556 SouthPrefix = 7, 2557 /** transitive southbound are least preferred */ 2558 TransitiveSouthPrefix = 8, 2559 RouteTypeMaxValue = 9 2560 } 2562 8.2. encoding.thrift 2564 /** 2565 Thrift file for packet encodings for RIFT 2566 */ 2568 include "common.thrift" 2570 /** represents protocol encoding schema major version */ 2571 const i32 protocol_major_version = 8 2572 /** represents protocol encoding schema minor version */ 2573 const i32 protocol_minor_version = 0 2575 /** common RIFT packet header */ 2576 struct PacketHeader { 2577 1: required common.VersionType major_version = protocol_major_version; 2578 2: required common.VersionType minor_version = protocol_minor_version; 2579 /** this is the node sending the packet, in case of LIE/TIRE/TIDE 2580 also the originator of it */ 2581 3: required common.SystemIDType sender; 2582 /** level of the node sending the packet, required on everything except 2583 * LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL and is used 2584 in ZTP procedures.
2585 */ 2586 4: optional common.LevelType level; 2587 } 2589 /** Community serves as community for PGP purposes */ 2590 struct Community { 2591 1: required i32 top; 2592 2: required i32 bottom; 2593 } 2595 /** Neighbor structure */ 2596 struct Neighbor { 2597 1: required common.SystemIDType originator; 2598 2: required common.LinkIDType remote_id; 2599 } 2601 /** Capabilities the node supports */ 2602 struct NodeCapabilities { 2603 /** can this node participate in flood reduction, 2604 only relevant at level > 0 */ 2605 1: optional bool flood_reduction = 2606 common.flood_reduction_default; 2607 /** does this node restrict itself to be leaf only (in ZTP) and 2608 does it support leaf-2-leaf procedures */ 2609 2: optional common.LeafIndications leaf_indications; 2610 } 2612 /** RIFT LIE packet 2614 @note this node's level is already included in the packet header */ 2615 struct LIEPacket { 2616 /** optional node or adjacency name */ 2617 1: optional string name; 2618 /** local link ID */ 2619 2: required common.LinkIDType local_id; 2620 /** UDP port on which we can receive flooded TIEs */ 2621 3: required common.UDPPortType flood_port = 2622 common.default_tie_udp_flood_port; 2623 /** layer 3 MTU */ 2624 4: optional common.MTUSizeType link_mtu_size = 2625 common.default_mtu_size; 2626 /** this will reflect the neighbor once received to provide 2627 3-way connectivity */ 2628 5: optional Neighbor neighbor; 2629 6: optional common.PodType pod = common.default_pod; 2630 /** optional nonce used for security computations */ 2631 7: optional common.NonceType nonce; 2632 /** optional node capabilities shown in the LIE. The capabilities 2633 MUST match the capabilities shown in the Node TIEs, otherwise 2634 the behavior is unspecified. A node detecting the mismatch 2635 SHOULD generate a corresponding error. 2636 */ 2637 8: optional NodeCapabilities capabilities; 2638 /** required holdtime of the adjacency, i.e.
how much time 2639 MUST expire without LIE for the adjacency to drop 2640 */ 2641 9: required common.HoldTimeInSecType holdtime = 2642 common.default_holdtime; 2643 /** indicates that the level on the LIE MUST NOT be used 2644 to derive a ZTP level by the receiving node. */ 2645 10: optional bool not_a_ztp_offer = 2646 common.default_not_a_ztp_offer; 2647 } 2649 /** LinkID pair describes one of parallel links between two nodes */ 2650 struct LinkIDPair { 2651 /** node-wide unique value for the local link */ 2652 1: required common.LinkIDType local_id; 2653 /** received remote link ID for this link */ 2654 2: required common.LinkIDType remote_id; 2655 /** more properties of the link can go in here */ 2656 } 2658 /** ID of a TIE 2660 @note: TIEID space is a total order achieved by comparing the elements 2661 in the sequence defined, comparing each value as an 2662 unsigned integer of the corresponding length 2663 */ 2664 struct TIEID { 2665 /** indicates direction of the TIE */ 2666 1: required common.TieDirectionType direction; 2667 /** indicates originator of the TIE */ 2668 2: required common.SystemIDType originator; 2669 3: required common.TIETypeType tietype; 2670 4: required common.TIENrType tie_nr; 2671 } 2672 /** Header of a TIE */ 2673 struct TIEHeader { 2674 2: required TIEID tieid; 2675 3: required common.SeqNrType seq_nr; 2676 /** lifetime expires down to 0 just like in ISIS */ 2677 4: required common.LifeTimeInSecType lifetime; 2678 } 2680 /** A sorted TIDE packet, if unsorted, behavior is undefined */ 2681 struct TIDEPacket { 2682 /** all 00s mark the start */ 2683 1: required TIEID start_range; 2684 /** all FFs mark the end */ 2685 2: required TIEID end_range; 2686 /** _sorted_ list of headers */ 2687 3: required list<TIEHeader> headers; 2688 } 2690 /** A TIRE packet */ 2691 struct TIREPacket { 2692 1: required set<TIEHeader> headers; 2693 } 2695 /** Neighbor of a node */ 2696 struct NodeNeighborsTIEElement { 2697 2: required common.LevelType level; 2698 /** Cost to neighbor.
2700 @note: All parallel links to the same node 2701 incur the same cost; in case the neighbor has multiple 2702 parallel links at different cost, the largest distance 2703 (highest numerical value) MUST be advertised 2704 @note: any neighbor with cost <= 0 MUST be ignored in computations */ 2705 3: optional common.MetricType cost = common.default_distance; 2706 /** can carry description of multiple parallel links in a TIE */ 2707 4: optional set<common.LinkIDType> link_ids; 2709 /** total bandwidth to neighbor, this will normally be the sum of the 2710 * bandwidths of all the parallel links. 2711 **/ 2712 5: optional common.BandwithInMegaBitsType bandwidth = 2713 common.default_bandwidth; 2714 } 2716 /** Flags the node sets */ 2717 struct NodeFlags { 2718 /** node is in overload, do not transit traffic through it */ 2719 1: optional bool overload = common.overload_default; 2721 } 2723 /** Description of a node. 2725 It may occur multiple times in different TIEs but if either 2726 * capabilities values do not match or 2727 * flags values do not match or 2728 * neighbors repeat with different values or 2729 * visible in same level/having partition upper do not match 2730 the behavior is undefined and a warning SHOULD be generated. 2731 Neighbors can, however, be distributed across multiple TIEs if 2732 the sets are disjoint. 2734 @note: observe that absence of fields implies defined defaults 2735 */ 2736 struct NodeTIEElement { 2737 1: required common.LevelType level; 2738 /** if neighbor systemID repeats in other node TIEs of same node 2739 the behavior is undefined. Equivalent to |A_(n,s)(N) in spec. */ 2740 2: required map<common.SystemIDType, NodeNeighborsTIEElement> neighbors; 2742 3: optional NodeCapabilities capabilities; 2743 4: optional NodeFlags flags; 2744 /** optional node name for easier operations */ 2745 5: optional string name; 2747 /** Nodes seen at the same level through reflection via nodes 2748 having a backlink to both nodes. They are equivalent to |V(N) in 2749 future specifications. Ignored in Node S-TIEs if present.
2750 */ 2751 6: optional set<common.SystemIDType> visible_in_same_level 2752 = common.empty_set_of_nodeids; 2753 /** Non-overloaded nodes in |V seen as attached to another north 2754 * level partition due to the fact that some nodes in its |V have 2755 * adjacencies to higher level nodes that this node doesn't see. 2756 * This may be used in the computation at higher levels to prevent 2757 * blackholing. Ignored in Node S-TIEs if present. 2758 * Equivalent to |PUL(N) in spec. */ 2759 7: optional set<common.SystemIDType> same_level_unknown_north_partitions 2760 = common.empty_set_of_nodeids; 2761 } 2763 struct PrefixAttributes { 2764 /** Observe that in the default metric case the node is supposed to advertise 2765 * a metric calculated from comparison of bandwidths at all nodes at its 2766 * level. **/ 2767 2: required common.MetricType metric = common.default_distance; 2768 } 2769 /** multiple prefixes */ 2770 struct PrefixTIEElement { 2771 /** prefixes with the associated attributes. 2772 if the same prefix repeats in multiple TIEs of same node 2773 behavior is unspecified */ 2774 1: required map<common.IPPrefixType, PrefixAttributes> prefixes; 2775 } 2777 /** keys with their values */ 2778 struct KeyValueTIEElement { 2779 /** if the same key repeats in multiple TIEs of same node 2780 or with different values, behavior is unspecified */ 2781 1: required map<common.KeyIDType, string> keyvalues; 2782 } 2784 /** single element in a TIE. enum common.TIETypeType 2785 in TIEID indicates which elements MUST be present 2786 in the TIEElement. In case of mismatch the unexpected 2787 elements MUST be ignored. 2788 */ 2789 union TIEElement { 2790 /** in case of enum common.TIETypeType.NodeTIEType */ 2791 1: optional NodeTIEElement node; 2792 /** in case of enum common.TIETypeType.PrefixTIEType */ 2793 2: optional PrefixTIEElement prefixes; 2794 /** transitive prefixes (always southbound) which SHOULD be propagated 2795 * southwards towards lower levels to heal 2796 * pathological upper level partitioning, otherwise 2797 * blackholes may occur.
MUST NOT be advertised within a North TIE. 2798 */ 2799 3: optional PrefixTIEElement transitive_prefixes; 2800 4: optional KeyValueTIEElement keyvalues; 2801 /** @todo: policy guided prefixes */ 2802 } 2804 /** @todo: flood header separately in UDP to allow caching of TIEs 2805 while changing lifetime? 2806 */ 2807 struct TIEPacket { 2808 1: required TIEHeader header; 2809 2: required TIEElement element; 2810 } 2812 union PacketContent { 2813 1: optional LIEPacket lie; 2814 2: optional TIDEPacket tide; 2815 3: optional TIREPacket tire; 2816 4: optional TIEPacket tie; 2818 } 2820 /** protocol packet structure */ 2821 struct ProtocolPacket { 2822 1: required PacketHeader header; 2823 2: required PacketContent content; 2824 } 2826 9. IANA Considerations 2828 This specification will request at an opportune time multiple 2829 registry points to exchange protocol packets in a standardized way, 2830 amongst them multicast address assignments and standard port numbers. 2831 The schema itself defines many values and codepoints which can be 2832 considered registries themselves. 2834 10. Acknowledgments 2836 Many thanks to Naiming Shen for some of the early discussions around 2837 the topic of using IGPs for routing in topologies related to Clos. 2838 Russ White is to be especially acknowledged for the key conversation on 2839 epistemology that allowed tying current asynchronous distributed 2840 systems theory results to the modern protocol design presented here. 2841 Adrian Farrel, Joel Halpern and Jeffrey Zhang provided thoughtful 2842 comments that improved the readability of the document and found a good 2843 number of corners where the light failed to shine. Kris Price was 2844 the first to mention single router, single arm default considerations. 2845 Jeff Tantsura helped out with some initial thoughts on BFD 2846 interactions while Jeff Haas corrected several misconceptions about 2847 BFD's finer points.
Artur Makutunowicz pointed out many possible 2848 improvements and acted as a sounding board in regard to modern protocol 2849 implementation techniques RIFT is exploring. Barak Gafni was the 2850 first to clearly formalize the problem of partitioned spine on a (clean) 2851 napkin in Singapore. 2853 11. References 2855 11.1. Normative References 2857 [ISO10589] 2858 ISO "International Organization for Standardization", 2859 "Intermediate system to Intermediate system intra-domain 2860 routeing information exchange protocol for use in 2861 conjunction with the protocol for providing the 2862 connectionless-mode Network Service (ISO 8473), ISO/IEC 2863 10589:2002, Second Edition.", Nov 2002. 2865 [RFC1142] Oran, D., Ed., "OSI IS-IS Intra-domain Routing Protocol", 2866 RFC 1142, DOI 10.17487/RFC1142, February 1990, 2867 . 2869 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2870 Requirement Levels", BCP 14, RFC 2119, 2871 DOI 10.17487/RFC2119, March 1997, 2872 . 2874 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 2875 DOI 10.17487/RFC2328, April 1998, 2876 . 2878 [RFC2365] Meyer, D., "Administratively Scoped IP Multicast", BCP 23, 2879 RFC 2365, DOI 10.17487/RFC2365, July 1998, 2880 . 2882 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 2883 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 2884 DOI 10.17487/RFC4271, January 2006, 2885 . 2887 [RFC4291] Hinden, R. and S. Deering, "IP Version 6 Addressing 2888 Architecture", RFC 4291, DOI 10.17487/RFC4291, February 2889 2006, . 2891 [RFC4655] Farrel, A., Vasseur, J., and J. Ash, "A Path Computation 2892 Element (PCE)-Based Architecture", RFC 4655, 2893 DOI 10.17487/RFC4655, August 2006, 2894 . 2896 [RFC5120] Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi 2897 Topology (MT) Routing in Intermediate System to 2898 Intermediate Systems (IS-ISs)", RFC 5120, 2899 DOI 10.17487/RFC5120, February 2008, 2900 . 2902 [RFC5303] Katz, D., Saluja, R., and D.
Eastlake 3rd, "Three-Way 2903 Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303, 2904 DOI 10.17487/RFC5303, October 2008, 2905 . 2907 [RFC5709] Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M., 2908 Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic 2909 Authentication", RFC 5709, DOI 10.17487/RFC5709, October 2910 2009, . 2912 [RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection 2913 (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, 2914 DOI 10.17487/RFC5881, June 2010, 2915 . 2917 [RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms 2918 (SHA and SHA-based HMAC and HKDF)", RFC 6234, 2919 DOI 10.17487/RFC6234, May 2011, 2920 . 2922 [RFC6822] Previdi, S., Ed., Ginsberg, L., Shand, M., Roy, A., and D. 2923 Ward, "IS-IS Multi-Instance", RFC 6822, 2924 DOI 10.17487/RFC6822, December 2012, 2925 . 2927 [RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., 2928 Litkowski, S., Horneffer, M., and R. Shakir, "Source 2929 Packet Routing in Networking (SPRING) Problem Statement 2930 and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 2931 2016, . 2933 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 2934 BGP for Routing in Large-Scale Data Centers", RFC 7938, 2935 DOI 10.17487/RFC7938, August 2016, 2936 . 2938 [RFC7987] Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and 2939 H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987, 2940 DOI 10.17487/RFC7987, October 2016, 2941 . 2943 11.2. Informative References 2945 [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer 2946 Communication Environments", IEEE International Parallel & 2947 Distributed Processing Symposium, 2011. 2949 [DIJKSTRA] 2950 Dijkstra, E., "A Note on Two Problems in Connexion with 2951 Graphs", Journal Numer. Math. , 1959. 2953 [DYNAMO] De Candia et al., G., "Dynamo: amazon's highly available 2954 key-value store", ACM SIGOPS symposium on Operating 2955 systems principles (SOSP '07), 2007. 
2957 [EPPSTEIN] 2958 Eppstein, D., "Finding the k-Shortest Paths", 1997. 2960 [FATTREE] Leiserson, C., "Fat-Trees: Universal Networks for 2961 Hardware-Efficient Supercomputing", 1985. 2963 [I-D.ietf-spring-segment-routing] 2964 Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., 2965 Litkowski, S., and R. Shakir, "Segment Routing 2966 Architecture", draft-ietf-spring-segment-routing-15 (work 2967 in progress), January 2018. 2969 [MAKSIC2013] 2970 Maksic et al., N., "Improving Utilization of Data Center 2971 Networks", IEEE Communications Magazine, Nov 2013. 2973 [PROTOBUF] 2974 Google, Inc., "Protocol Buffers, 2975 https://developers.google.com/protocol-buffers". 2977 [QUIC] Iyengar et al., J., "QUIC: A UDP-Based Multiplexed and 2978 Secure Transport", 2016. 2980 [VAHDAT08] 2981 Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, 2982 Commodity Data Center Network Architecture", SIGCOMM , 2983 2008. 2985 Authors' Addresses 2987 Tony Przygienda (editor) 2988 Juniper Networks 2989 1194 N. Mathilda Ave 2990 Sunnyvale, CA 94089 2991 US 2993 Email: prz@juniper.net 2995 Alankar Sharma 2996 Comcast 2997 1800 Bishops Gate Blvd 2998 Mount Laurel, NJ 08054 2999 US 3001 Email: Alankar_Sharma@comcast.com 3002 Alia Atlas 3003 Juniper Networks 3004 10 Technology Park Drive 3005 Westford, MA 01886 3006 US 3008 Email: akatlas@juniper.net 3010 John Drake 3011 Juniper Networks 3012 1194 N. Mathilda Ave 3013 Sunnyvale, CA 94089 3014 US 3016 Email: jdrake@juniper.net