idnits 2.17.00 (12 Aug 2021) /tmp/idnits63619/draft-ietf-rift-applicability-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 163: '...carried within the RIFT domain MUST be...' RFC 2119 keyword, line 1260: '...scovery. A node MUST NOT originate LI...' RFC 2119 keyword, line 1263: '...e on. An implementation MUST be ready...' RFC 2119 keyword, line 1364: '...acency is active MAY be supported. Th...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (12 May 2021) is 374 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- ** Obsolete normative reference: RFC 5549 (Obsoleted by RFC 8950) == Outdated reference: A later version (-15) exists of draft-ietf-rift-rift-12 Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 RIFT WG Yuehua. Wei, Ed. 3 Internet-Draft Zheng. Zhang 4 Intended status: Informational ZTE Corporation 5 Expires: 13 November 2021 Dmitry. Afanasiev 6 Yandex 7 P. Thubert 8 Cisco Systems 9 Tom. Verhaeg 10 Juniper Networks 11 Jaroslaw. Kowalczyk 12 Orange Polska 13 12 May 2021 15 RIFT Applicability 16 draft-ietf-rift-applicability-06 18 Abstract 20 This document discusses the properties, applicability and operational 21 considerations of RIFT in different network scenarios. It intends to 22 provide a rough guide how RIFT can be deployed to simplify routing 23 operations in Clos topologies and their variations. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on 13 November 2021. 42 Copyright Notice 44 Copyright (c) 2021 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 49 license-info) in effect on the date of publication of this document. 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. 
Code Components 52 extracted from this document must include Simplified BSD License text 53 as described in Section 4.e of the Trust Legal Provisions and are 54 provided without warranty as described in the Simplified BSD License. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Problem Statement of Routing in Modern IP Fabric Fat Tree 60 Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 4 62 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 4 63 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 6 64 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 7 65 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 7 66 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 7 67 3.2.4. Reachability of Internal Nodes in the Fabric . . . . 9 68 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 9 69 3.3.1. Data Center Topologies . . . . . . . . . . . . . . . 9 70 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 11 71 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 11 72 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 11 73 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 11 74 4. Operational Considerations . . . . . . . . . . . . . . . . . 13 75 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 14 76 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 14 77 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 16 78 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 17 79 4.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 18 80 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 20 81 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 22 82 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 24 83 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 
25 84 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 26 85 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 27 86 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 27 87 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 28 88 4.12. Internet Connectivity With Underlay . . . . . . . . . . . 28 89 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 28 90 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 28 91 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 28 92 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 29 93 4.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 30 94 4.16. Key Management . . . . . . . . . . . . . . . . . . . . . 30 96 5. Security Considerations . . . . . . . . . . . . . . . . . . . 31 97 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 31 98 7. Normative References . . . . . . . . . . . . . . . . . . . . 31 99 8. Informative References . . . . . . . . . . . . . . . . . . . 33 100 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33 102 1. Introduction 104 This document discusses the properties and applicability of "Routing 105 in Fat Trees" [RIFT] (RIFT) in different deployment scenarios and 106 highlights the operational simplicity of the technology compared to 107 traditional routing solutions. It also documents special 108 considerations when RIFT is used with or without overlays and/or 109 controllers, and how RIFT identifies topology mis-cablings and 110 reroutes around node and link failures. 112 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 114 Clos [CLOS] and fat tree [FATTREE] topologies have gained prominence 115 in today's networking, primarily as a result of the paradigm shift 116 towards a centralized data-center based architecture that delivers a 117 majority of computation and storage services.
119 Current routing protocols were geared towards a network with 120 an irregular topology with isotropic properties, and a low degree of 121 connectivity. When applied to Fat Tree topologies: 123 * They tend to need extensive configuration or provisioning during 124 bring up and re-dimensioning. 126 * All nodes including spine and leaf nodes learn the entire network 127 topology and routing information, which is, in fact, not needed on 128 the leaf nodes during normal operation. 130 * They flood significant amounts of duplicate link state information 131 between spine and leaf nodes during topology updates and 132 convergence events, requiring that additional CPU and link 133 bandwidth be consumed. This may impact the stability and 134 scalability of the fabric, make the fabric less reactive to 135 failures, and prevent the use of cheaper hardware at the lower 136 levels (i.e., spine and leaf nodes). 138 3. Applicability of RIFT to Clos IP Fabrics 140 The remainder of this document assumes that the reader is familiar 141 with the terms and concepts used in OSPF [RFC2328] and IS-IS 142 [ISO10589-Second-Edition] link-state protocols. The sections of RIFT 143 [RIFT] outline the requirements of routing in IP fabrics and RIFT 144 protocol concepts. 146 3.1. Overview of RIFT 148 RIFT is a dynamic routing protocol that is tailored for use in Clos, 149 Fat-Tree, and other anisotropic topologies. A core property of RIFT 150 is that its operation is sensitive to the structure of the fabric - 151 it is anisotropic. RIFT acts as a link-state protocol when "pointing 152 north" - advertising southwards routes to northwards peer routers 153 (parents) through flooding and database synchronization - but operates 154 hop-by-hop like a distance-vector protocol when "pointing south" - 155 typically advertising a fabric default route directed towards the Top 156 of Fabric (ToF, aka superspine) to southwards peer routers 157 (children).
159 The fabric default is typically the default route, as described in 160 Section 3.2.3.8 "Southbound Default Route Origination" of RIFT 161 [RIFT]. The ToF nodes may alternatively originate more specific 162 prefixes (P') southbound instead of the default route. In such a 163 scenario, all addresses carried within the RIFT domain MUST be 164 contained within P', and it is possible for a leaf that acts as 165 a gateway to the Internet to advertise the default route instead. 167 RIFT floods flat link-state information northbound only so that each 168 level obtains the full topology of levels south of it. That 169 information is never flooded east-west or back south again. So a top 170 tier node has the full set of prefixes from the Shortest Path First (SPF) 171 calculation. 173 In the southbound direction, the protocol operates like a "fully 174 summarizing, unidirectional" path-vector protocol or rather a 175 distance-vector with implicit split horizon. Routing information, 176 normally just the default route, propagates one hop south and is 177 're-advertised' by nodes at the next lower level.
179 +-----------+ +-----------+ 180 | ToF | | ToF | LEVEL 2 181 + +-----+--+--+ +-+--+------+ 182 | | | | | | | | | ^ 183 + | | | +-------------------------+ | 184 Distance | +-------------------+ | | | | | 185 Vector | | | | | | | | + 186 South | | | | +--------+ | | | Link-State 187 + | | | | | | | | Flooding 188 | | | +-------------+ | | | North 189 v | | | | | | | | + 190 +-+--+-+ +------+ +-------+ +--+--+-+ | 191 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 192 + ++----++ ++---+-+ +--+--+-+ ++----+-+ | 193 + | | | | | | | | | ^ N 194 Distance | +-------+ | | +--------+ | | | E 195 Vector | | | | | | | | | +------> 196 South | +-------+ | | | +-------+ | | | | 197 + | | | | | | | | | + 198 v ++--++ +-+-++ ++-+-+ +-+--++ + 199 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 200 +----+ +----+ +----+ +-----+ 202 Figure 1: RIFT overview 204 A spine node has only information necessary for its level, which is 205 all destinations south of the node based on SPF calculation, default 206 route, and potential disaggregated routes. 208 RIFT combines the advantage of both link-state and distance-vector: 210 * Fastest possible convergence 212 * Automatic detection of topology 214 * Minimal routes/info on Top-of-Rack (ToR) switches, aka leaf nodes 216 * High degree of ECMP 218 * Fast de-commissioning of nodes 220 * Maximum propagation speed with flexible prefixes in an update 222 So there are two types of link-state database which are "north 223 representation" North Topology Information Elements (N-TIEs) and 224 "south representation" South Topology Information Elements (S-TIEs). 225 The N-TIEs contain a link-state topology description of lower levels 226 and S-TIEs carry simply default routes for the lower levels. 
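The per-level view described above (a node holds all destinations south of it, plus a default route towards the north) can be illustrated with a toy model. The Python sketch below is not part of RIFT and implements no protocol machinery such as TIE flooding or SPF; the node names, prefixes, and helper functions are hypothetical and exist only for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy fabric: each node maps to its southbound children.
CHILDREN: Dict[str, List[str]] = {
    "ToF1": ["Spine1", "Spine2"],
    "ToF2": ["Spine1", "Spine2"],
    "Spine1": ["Leaf1", "Leaf2"],
    "Spine2": ["Leaf1", "Leaf2"],
    "Leaf1": [],
    "Leaf2": [],
}
# Hypothetical prefixes attached to the leaves (documentation address space).
PREFIXES: Dict[str, List[str]] = {
    "Leaf1": ["2001:db8:1::/64"],
    "Leaf2": ["2001:db8:2::/64"],
}
TOFS = {"ToF1", "ToF2"}


def south_prefixes(node: str) -> Set[str]:
    """Prefixes at or south of `node`, i.e. what northbound flooding delivers."""
    known = set(PREFIXES.get(node, []))
    for child in CHILDREN[node]:
        known |= south_prefixes(child)
    return known


def routing_view(node: str) -> List[str]:
    """Specific routes for the south, plus a default route towards the north."""
    view = sorted(south_prefixes(node))
    if node not in TOFS:
        # Below the ToF, everything else is reached through the default route.
        view.append("::/0 via north")
    return view


for name in ("ToF1", "Spine1", "Leaf1"):
    print(name, routing_view(name))
```

In this model a leaf holds only its own prefix and the default route, while a ToF node holds every prefix in the fabric and no default, which is the asymmetry that keeps the state on the lower levels small.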
228 RIFT also eliminates major disadvantages of link-state and 229 distance-vector with: 231 * Reduced and balanced flooding 233 * Automatic neighbor detection 235 To achieve this, RIFT builds on the art of IGPs, not only OSPF and 236 IS-IS but also MANET and IoT, to provide unique features: 238 * Automatic (positive or negative) route disaggregation of 239 northwards routes upon fallen leaves 241 * Recursive operation in the case of negative route disaggregation 243 * Anisotropic routing that extends a principle seen in RPL [RFC6550] 244 to wide superspines 246 * Optimal Flooding Reduction that derives from the concept of a 247 "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the 248 flooding load over northbound links and nodes. 250 Additional advantages that are unique to RIFT are listed below, the 251 details of which can be found in RIFT [RIFT]. 253 * True ZTP 255 * Minimal blast radius on failures 257 * Can utilize all paths through the fabric without looping 259 * Simple leaf implementation that can scale down to servers 261 * Key-Value store 263 * Horizontal links used for protection only 265 * Supports non-equal cost multipath (NECMP) and can replace 266 multi-chassis link aggregation group (MLAG or MC-LAG) 268 3.2. Applicable Topologies 270 Although RIFT is specified primarily for "proper" Clos or Fat Tree 271 topologies, the protocol natively supports Points of Delivery (PoD) 272 concepts, which, strictly speaking, are not found in the original 273 Clos concept. 275 Further, the specification explains and supports operations of 276 multi-plane Clos variants where the protocol recommends the use of 277 inter-plane rings at the Top-of-Fabric level to allow the reconciliation of 278 the topology views of different planes, which makes negative disaggregation 279 viable in case of failures within a plane.
These observations hold 280 not only in the case of RIFT but also in the generic case of dynamic 281 routing on Clos variants with multiple planes and failures in 282 bi-sectional bandwidth, especially on the leaves. 284 3.2.1. Horizontal Links 286 RIFT is not limited to pure Clos divided into PoD and multi-planes 287 but supports horizontal (East-West) links below the top of fabric 288 level. Those links are used only for last resort northbound routes 289 when a spine loses all its northbound links or cannot compute a 290 default route through them. 292 A possible configuration is a "ring" of horizontal links at a level. 293 In the presence of such a "ring" in any level (except Top of Fabric (ToF) 294 level) neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a 295 "ring-based protection" scheme since such a computation would 296 necessarily have to deal with breaking of "loops" in the Dijkstra sense, an 297 application for which RIFT is not intended. 299 Full-mesh connectivity between nodes on the same level can be 300 employed; it allows N-SPF to let any node that loses all 301 its northbound adjacencies (as long as any of the other nodes in the 302 level is northbound connected) still participate in northbound 303 forwarding. 305 3.2.2. Vertical Shortcuts 307 Through relaxations of the specified adjacency forming rules, RIFT 308 implementations can be extended to support vertical "shortcuts" as 309 proposed by, e.g., [I-D.white-distoptflood]. The RIFT specification 310 itself does not provide the exact details since the resulting 311 solution suffers either from a much larger blast radius with increased 312 flooding volumes or, in the case of maximum aggregation routing, from bow-tie 313 problems. 315 3.2.3. Generalizing to any Directed Acyclic Graph 317 RIFT is an anisotropic routing protocol, meaning that it has a sense 318 of direction (northbound, southbound, east-west) and that it operates 319 differently depending on the direction.
321 * Northbound, RIFT operates as a link-state protocol, whereby the 322 control packets are reflooded first all the way north and only 323 interpreted later. All the individual fine-grained routes are 324 advertised. 326 * Southbound, RIFT operates as a distance-vector protocol, whereby 327 the control packets are flooded only one hop, interpreted, and the 328 consequence of that computation is what gets flooded one more hop 329 south. In the most common use cases, a ToF node can reach most of 330 the prefixes in the fabric. If that is the case, the ToF node 331 advertises the fabric default and disaggregates the prefixes that 332 it cannot reach. On the other hand, a ToF node that can reach 333 only a small subset of the prefixes in the fabric will preferably 334 advertise those prefixes and refrain from aggregating. 336 In the general case, what gets advertised south is, in more 337 detail: 339 1. A fabric default that aggregates all the prefixes that are 340 reachable within the fabric, and that could be a default route 341 or a prefix that is dedicated to this particular fabric. 343 2. The loopback addresses of the northbound nodes, e.g., for 344 inband management. 346 3. The disaggregated prefixes for the dynamic exceptions to the 347 fabric default, advertised to route around the black hole that 348 may form. 350 * East-West routing can optionally be used, with specific 351 restrictions. It is used when a sibling has access to the fabric 352 default but this node does not. 354 A Directed Acyclic Graph (DAG) provides a sense of north (the 355 direction of the DAG) and of south (the reverse), which can be used 356 to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has 357 only incoming edges is a ToF node. 359 There are a number of caveats though: 361 * The DAG structure must exist before RIFT starts, so there is a 362 need for a companion protocol to establish the logical DAG 363 structure.
365 * A generic DAG does not have a sense of east and west. The 366 operation specified for east-west links and the southbound 367 reflection between nodes are not applicable. Also, ZTP will derive 368 a sense of depth that will eliminate some links. Variations of 369 ZTP could be derived to meet specific objectives, e.g., make it so 370 that most routers have at least 2 parents to reach the ToF. 372 * RIFT applies to any Destination-Oriented DAG (DODAG) where there is 373 only one ToF node and the problem of disaggregation does not 374 exist. In that case, RIFT operates very much like RPL [RFC6550], 375 but using Link State for southbound routes (downwards in RPL's 376 terms). For an arbitrary DAG with multiple destinations (ToFs) 377 the way disaggregation happens has to be considered. 379 * Positive disaggregation expects that most of the ToF nodes reach 380 most of the leaves, so disaggregation is the exception as opposed 381 to the rule. When this is no longer true, it makes sense to turn 382 off disaggregation and route between the ToF nodes over a ring, a 383 full mesh, a transit network, or a form of area zero. There again, 384 this operation is similar to RPL operating as a single DODAG with 385 a virtual root. 387 * In order to aggregate and disaggregate routes, RIFT requires that 388 all the ToF nodes share the full knowledge of the prefixes in the 389 fabric. 391 * This can be achieved with a ring as suggested by the RIFT main 392 specification, by some preconfiguration, or by 393 synchronization with a common repository where all the active 394 prefixes are registered. 396 3.2.4. Reachability of Internal Nodes in the Fabric 398 RIFT does not require that nodes have reachable addresses in the 399 fabric, though it is clearly desirable for operational purposes. 400 Under normal operating conditions this can be easily achieved by 401 injecting the node's loopback address into North and South Prefix 402 TIEs or through other implementation-specific mechanisms.
404 Special considerations arise when a node loses all northbound 405 adjacencies, but is not at the top of the fabric. These are outside 406 the scope of this document and could be discussed in a separate 407 document. 409 3.3. Use Cases 411 3.3.1. Data Center Topologies 412 3.3.1.1. Data Center Fabrics 414 RIFT is suited for underlay routing in data center (DC) IP fabrics, 415 the vast majority of which seem to be currently (and for the 416 foreseeable future) Clos architectures. It significantly simplifies 417 operation and deployment of such fabrics, as described in Section 4, 418 compared to extensive proprietary provisioning and 419 operational solutions. 421 3.3.1.2. Adaptations to Other Proposed Data Center Topologies 423 . +-----+ +-----+ 424 . | | | | 425 .+-+ S0 | | S1 | 426 .| ++---++ ++---++ 427 .| | | | | 428 .| | +------------+ | 429 .| | | +------------+ | 430 .| | | | | 431 .| ++-+--+ +--+-++ 432 .| | | | | 433 .| | A0 | | A1 | 434 .| +-+--++ ++---++ 435 .| | | | | 436 .| | +------------+ | 437 .| | +-----------+ | | 438 .| | | | | 439 .| +-+-+-+ +--+-++ 440 .+-+ | | | 441 . | L0 | | L1 | 442 . +-----+ +-----+ 444 Figure 2: Level Shortcut 446 RIFT is not strictly limited to Clos topologies. The protocol only 447 requires a sense of "compass rose directionality" either achieved 448 through configuration or derivation of levels. So, conceptually, 449 shortcuts between levels could be included. Figure 2 depicts an 450 example of a shortcut between levels. In this example, sub-optimal 451 routing will occur when traffic is sent from L0 to L1 via S0's 452 default route and back down through A0 or A1. In order to ensure 453 that only default routes from A0 or A1 are used, all leaves would be 454 required to install each other's routes. 456 While various technical and operational challenges may require the 457 use of such modifications, discussion of those topics is outside the 458 scope of this document. 460 3.3.2.
Metro Fabrics 462 The demand for bandwidth is increasing steadily, driven primarily by 463 environments close to content producers (server farms connected via 464 DC fabrics) but in proximity to content consumers as well. Consumers 465 are often clustered in metro areas with their own network 466 architectures that can benefit from simplified, regular Clos 467 structures and hence from RIFT. 469 3.3.3. Building Cabling 471 Commercial edifices are often cabled in topologies that are either 472 Clos or its isomorphic equivalents. The Clos can grow rather high 473 with many floors. That presents a challenge for traditional routing 474 protocols (except BGP and by now largely phased-out PNNI), which do 475 not support an arbitrary number of levels, whereas RIFT does so naturally. 476 Moreover, due to the limited sizes of forwarding tables in network 477 elements of building cabling, the minimum FIB size RIFT maintains 478 under normal conditions is cost-effective in terms of hardware and 479 operational costs. 481 3.3.4. Internal Router Switching Fabrics 483 It is common in high-speed communications switching and routing 484 devices to use fabrics when a crossbar is not feasible due to cost, 485 head-of-line blocking or size trade-offs. Normally such fabrics are 486 not self-healing or rely on 1:/+1 protection schemes, but it is 487 conceivable to use RIFT to operate Clos fabrics that can deal 488 effectively with interconnection or subsystem failures in such 489 modules. RIFT is not IP specific either, and hence any link addressing 490 connecting internal device subnets is conceivable. 492 3.3.5. CloudCO 494 The Cloud Central Office (CloudCO) is a new stage of the telecom Central 495 Office. It takes advantage of Software Defined Networking (SDN) 496 and Network Function Virtualization (NFV) in conjunction with general 497 purpose hardware to optimize current networks. The following figure 498 illustrates this architecture at a high level.
It describes a single 499 instance or macro-node of cloud CO that provides a number of Value 500 Added Services (VAS), a Broadband Access Abstraction (BAA), and 501 virtualized network services. An Access I/O module faces a Cloud CO 502 access node, and the Customer Premises Equipments (CPEs) behind it. 503 A Network I/O module is facing the core network. The two I/O modules 504 are interconnected by a leaf and spine fabric [TR-384]. 506 +---------------------+ +----------------------+ 507 | Spine | | Spine | 508 | Switch | | Switch | 509 +------+---+------+-+-+ +--+-+-+-+-----+-------+ 510 | | | | | | | | | | | | 511 | | | | | +-------------------------------+ | 512 | | | | | | | | | | | | 513 | | | | +-------------------------+ | | | 514 | | | | | | | | | | | | 515 | | +----------------------+ | | | | | | | | 516 | | | | | | | | | | | | 517 | +---------------------------------+ | | | | | | | 518 | | | | | | | | | | | | 519 | | | +-----------------------------+ | | | | | 520 | | | | | | | | | | | | 521 | | | | | +--------------------+ | | | | 522 | | | | | | | | | | | | 523 +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ 524 |L | | Leaf | | Leaf | | Leaf | | Leaf | |L | 525 |S | | Switch | | Switch | | Switch | | Switch| |S | 526 ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ 527 | | | | | | | | | | | | | | 528 | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | 529 | |Compute | |Compute | | Compute | |Compute| | 530 | |Node | |Node | | Node | |Node | | 531 | +--------+ +--------+ +----------+ +-------+ | 532 | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | 533 | |--------| |--------| |----------| |-------| | 534 | |--------| |--------| |----------| |-------| | 535 | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | 536 | |--------| |--------| |----------| |-------| | 537 | |--------| |--------| |----------| |-------| | 538 | || VAS7 || || VAS4 || || vIGMP || ||BAA || | 539 | |--------| |--------| |----------| |-------| | 540 | +--------+ +--------+ 
+----------+ +-------+ | 541 | | 542 ++-----------+ +---------++ 543 |Network I/O | |Access I/O| 544 +------------+ +----------+ 546 Figure 3: An example of CloudCO architecture 548 The Spine-Leaf architecture deployed inside CloudCO meets the network 549 requirements of adaptable, agile, scalable and dynamic. 551 4. Operational Considerations 553 RIFT presents the opportunity for organizations building and 554 operating IP fabrics to simplify their operation and deployments 555 while achieving many desirable properties of a dynamic routing on 556 such a substrate: 558 * RIFT only floods routing information to the devices that 559 absolutely need it. RIFT design follows minimum blast radius and 560 minimum necessary epistemological scope philosophy which leads to 561 good scaling properties while delivering maximum reactiveness. 563 * RIFT allows for extensive Zero Touch Provisioning within the 564 protocol. In its most extreme version RIFT does not rely on any 565 specific addressing and for IP fabric can operate using IPv6 ND 566 [RFC4861] only. 568 * RIFT has provisions to detect common IP fabric mis-cabling 569 scenarios. 571 * RIFT negotiates automatically BFD per link allowing this way for 572 IP and micro-BFD [RFC7130] to replace Link Aggregation Groups 573 (LAGs) which do hide bandwidth imbalances in case of constituent 574 failures. Further automatic link validation techniques similar to 575 [RFC5357] could be supported as well. 577 * RIFT inherently solves many difficult problems associated with the 578 use of traditional routing topologies with dense meshes and high 579 degrees of ECMP by including automatic bandwidth balancing, flood 580 reduction and automatic disaggregation on failures while providing 581 maximum aggregation of prefixes in default scenarios. 
583 * RIFT reduces FIB size towards the bottom of the IP fabric where 584 most nodes reside and thereby allows for cheaper hardware on the 585 edges and the introduction of modern IP fabric architectures that 586 encompass, e.g., server multi-homing. 588 * RIFT provides valley-free routing and with that is loop free. 589 This allows the use of any such valley-free path in bi-sectional 590 fabric bandwidth between two destinations irrespective of their 591 metrics, which can be used to balance load on the fabric in 592 different ways. 594 * RIFT includes a key-value distribution mechanism which allows for 595 many future applications such as automatic provisioning of basic 596 overlay services or automatic key roll-overs over whole fabrics. 598 * RIFT is designed for minimum delay in case of prefix mobility on 599 the fabric. In conjunction with [RFC8505], RIFT can differentiate 600 anycast advertisements from mobility events and retain only the 601 most recent advertisement in the latter case. 603 * Many further operational and design points collected over many 604 years of routing protocol deployments have been incorporated in 605 RIFT, such as fast flooding rates, protection of information 606 lifetimes, and operationally easily recognizable remote ends of 607 links and node names. 609 4.1. South Reflection 611 South reflection is a mechanism by which South Node TIEs are "reflected" 612 back up north to allow nodes in the same level without east-west links to 613 "see" each other. 615 In the topology of Figure 4, for example, Spine111, Spine112, Spine121, and Spine122 each reflect Node S-TIEs 616 from ToF21 to ToF22. Likewise, 617 Spine111, Spine112, Spine121, and Spine122 each reflect Node S-TIEs from ToF22 618 to ToF21. So ToF22 and ToF21 see each other's node 619 information as level 2 nodes. 621 In an equivalent fashion, as the result of the south reflection 622 between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, 623 Spine121 and Spine122 know each other at level 1. 625 4.2.
Suboptimal Routing on Link Failures 626 +--------+ +--------+ 627 | ToF21 | | ToF22 | LEVEL 2 628 ++--+-+-++ ++-+--+-++ 629 | | | | | | | + 630 | | | | | | | linkTS8 631 +-------------+ | +-+linkTS3+-+ | | | +-------------+ 632 | | | | | | + | 633 | +----------------------------+ | linkTS7 | 634 | | | | + + + | 635 | | | +-------+linkTS4+------------+ | 636 | | | + + | | | 637 | | | +------------+--+ | | 638 | | | | | linkTS6 | | 639 +-+----+-+ +-----+--+ ++--------+ +-+----+-+ 640 |Spine111| |Spine112| |Spine121 | |Spine122| LEVEL 1 641 +-+---+--+ +----+---+ +-+---+---+ +-+---+--+ 642 | | | | | | | | 643 | +--------------+ | + ++XX+linkSL6+---+ + 644 | | | | linkSL5 | | linkSL8 645 | +------------+ | | + +---+linkSL7+-+ | + 646 | | | | | | | | 647 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 648 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 649 +-+-----+ ++------+ +-----+-+ +-+-----+ 650 + + + + 651 Prefix111 Prefix112 Prefix121 Prefix122 653 Figure 4: Suboptimal routing upon link failure use case 655 As shown in Figure 4, as the result of the south reflection between 656 Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and 657 Spine122 know each other at level 1. 659 Without the disaggregation mechanism, when linkSL6 fails, a packet from 660 leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 661 then go down through linkTS4 to linkSL8 to Leaf122, or go up through 662 linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to 663 Leaf122, based on the pure default route. This is a case of suboptimal 664 routing or bow-tieing. 666 With the disaggregation mechanism, when linkSL6 fails, Spine122 will 667 detect the failure according to the reflected node S-TIE from 668 Spine121. Based on the disaggregation algorithm provided by RIFT, 669 Spine122 will explicitly advertise prefix122 in Disaggregated Prefix 670 S-TIE PrefixesElement(prefix122, cost 1).
The packet from leaf121 to 671 prefix122 will only be sent to linkSL7 following a longest-prefix 672 match to prefix 122 directly then go down through linkSL8 to Leaf122 673 . 675 4.3. Black-Holing on Link Failures 677 +--------+ +--------+ 678 | ToF 21 | | ToF 22 | LEVEL 2 679 ++-+--+-++ ++-+--+-++ 680 | | | | | | | + 681 | | | | | | | linkTS8 682 +--------------+ | +-+linkTS3+X+ | | | +--------------+ 683 linkTS1 | | | | | + | 684 + +-----------------------------+ | linkTS7 | 685 | | + | + + + | 686 | | linkTS2 +-------+linkTS4+X+----------+ | 687 | + + + + | | | 688 | linkTS5 +-+ +------------+--+ | | 689 | + | | | linkTS6 | | 690 +-+----+-+ +-+----+-+ ++-------+ +-+-----++ 691 |Spine111| |Spine112| |Spine121| |Spine122| LEVEL 1 692 +-+---+--+ ++----+--+ +-+---+--+ +-+---+--+ 693 | | | | | | | | 694 + +---------------+ | + +---+linkSL6+---+ + 695 linkSL1 | | | linkSL5 | | linkSL8 696 + +--+linkSL3+--+ | | + +---+linkSL7+-+ | + 697 | | | | | | | | 698 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 699 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 700 +-+-----+ ++------+ +-----+-+ +-+-----+ 701 + + + + 702 Prefix111 Prefix112 Prefix121 Prefix122 704 Figure 5: Black-holing upon link failure use case 706 This scenario illustrates a case when double link failure occurs and 707 with that black-holing can happen. 709 Without disaggregation mechanism, when linkTS3 and linkTS4 both fail, 710 the packet from leaf111 to prefix122 would suffer 50% black-holing 711 based on pure default route. The packet supposed to go up through 712 linkSL1 to linkTS1 then go down through linkTS3 or linkTS4 will be 713 dropped. The packet supposed to go up through linkSL3 to linkTS2 714 then go down through linkTS3 or linkTS4 will be dropped as well. 715 It's the case of black-holing. 717 With disaggregation mechanism, when linkTS3 and linkTS4 both fail, 718 ToF22 will detect the failure according to the reflected node S-TIE 719 of ToF21 from Spine111\Spine112. 
   Based on the disaggregation algorithm provided by RIFT, ToF22 will
   explicitly originate an S-TIE with prefix121 and prefix122 that is
   flooded to Spine111, Spine112, Spine121 and Spine122.

   The packet from Leaf111 to prefix122 will not be routed to linkTS1 or
   linkTS2.  It will only be routed to linkTS5 or linkTS7, following a
   longest-prefix match on prefix122.

4.4.  Zero Touch Provisioning (ZTP)

   RIFT is designed to require only minimal configuration, to simplify
   its operation and avoid human errors.  Based on that minimal
   information, Zero Touch Provisioning (ZTP) autoconfigures the key
   operational parameters of all the RIFT nodes: on the one hand, the
   SystemID of the node, which must be unique in the RIFT network, and
   on the other hand, the level of the node in the Fat Tree, which
   determines which peers are northwards "parents" and which are
   southwards "children".

   ZTP is always on, but its decisions can be overridden when a network
   administrator prefers to impose their own configuration.  In that
   case, it is the responsibility of the administrator to ensure that
   the configured parameters are correct, in other words, that the
   SystemID of each node is unique and that the administratively set
   levels truly reflect the relative position of the nodes in the
   fabric.  It is recommended to let ZTP configure the network; when
   that is not done, it is recommended to configure the level of all the
   nodes except those that are forced as leaves, to avoid an undesirable
   interaction between ZTP and the manual configuration.

   ZTP requires that the administrator points out the Top-of-Fabric
   (ToF) nodes to set the baseline from which the fabric topology is
   derived.  The Top-of-Fabric nodes are configured with the
   TOP_OF_FABRIC flag and act as the initial 'seeds' needed for other
   ZTP nodes to derive their level in the topology.
   ZTP computes the level of each node based on the Highest Available
   Level (HAL) of the potential parent(s) nearest that baseline, which
   represents the superspine.  In a sense, RIFT can be seen as a
   distance-vector protocol that computes a set of feasible successors
   towards the superspine and auto-configures the rest of the topology.

   The autoconfiguration mechanism computes a global maximum of levels
   by diffusion.  The level of each node is then derived from the Link
   Information Elements (LIEs) received from its neighbors, whereby each
   node (with the possible exception of configured leaves) tries to
   attach at the highest possible point in the fabric.  This guarantees
   that even if the diffusion front reaches a node from "below" faster
   than from "above", the node will greedily abandon the already
   negotiated level derived from nodes topologically below it and
   properly peer with nodes above.

   The achieved equilibrium can be disturbed massively by all nodes with
   the highest level either leaving or entering the domain (with some
   finer distinctions not explained further).  It is therefore
   recommended that each node is multi-homed towards nodes with
   respective HAL offerings.  Fortunately, this is the natural state of
   things for the topology variants considered in RIFT.

   A RIFT node may also be confined to the leaf role with the LEAF_ONLY
   flag.  A leaf node can also be configured to support leaf-2-leaf
   procedures with the LEAF_2_LEAF flag.  In either case, the node
   cannot be TOP_OF_FABRIC and its level cannot be configured.  RIFT
   will fully configure the node's level after it is attached to the
   topology and ensure that the node is at the "bottom of the hierarchy"
   (southernmost).
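   The level derivation described in this section can be illustrated
   with a small sketch.  It is not the RIFT state machine: the function
   name derive_levels, the link-list input, the default top level of 24
   and the rule "own level = HAL minus one" are assumptions made for
   this example only.  Nodes flagged as ToF are seeded with the top
   level, LEAF_ONLY nodes are pinned to level 0, and every other node
   greedily moves up to one level below the highest level offered by
   its neighbors.  A derived level below 0 is treated as invalid.

```python
def derive_levels(links, tof_nodes, leaf_only=frozenset(), top_level=24):
    """Hypothetical sketch of ZTP-style level diffusion.

    links     -- iterable of (node_a, node_b) undirected adjacencies
    tof_nodes -- nodes configured with the TOP_OF_FABRIC flag (seeds)
    leaf_only -- nodes configured with the LEAF_ONLY flag (level 0)
    top_level -- level given to ToF nodes (24 is an assumption here)
    Returns a dict mapping node name to derived level; nodes that cannot
    derive a valid level are absent from the result.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    levels = {node: top_level for node in tof_nodes}
    levels.update({node: 0 for node in leaf_only})
    pinned = set(tof_nodes) | set(leaf_only)

    changed = True
    while changed:  # repeat until the diffusion reaches equilibrium
        changed = False
        for node, peers in neighbors.items():
            if node in pinned:
                continue
            offers = [levels[p] for p in peers if p in levels]
            if not offers:
                continue
            derived = max(offers) - 1  # one level below the HAL (assumed)
            if derived < 0:
                continue  # no valid level can be derived from this offer
            if derived > levels.get(node, -1):
                levels[node] = derived  # greedily attach higher up
                changed = True
    return levels


# A two-ToF, two-spine, two-leaf fabric with leaves forced as LEAF_ONLY.
fabric = [
    ("ToF21", "Spine111"), ("ToF22", "Spine111"),
    ("ToF21", "Spine112"), ("ToF22", "Spine112"),
    ("Spine111", "Leaf111"), ("Spine112", "Leaf111"),
    ("Spine111", "Leaf112"), ("Spine112", "Leaf112"),
]
fabric_levels = derive_levels(
    fabric, {"ToF21", "ToF22"}, leaf_only={"Leaf111", "Leaf112"}
)
```

   Because a node only ever moves to a higher level, an offer arriving
   first from "below" is simply overridden once an offer from "above"
   is seen, which mirrors the greedy behaviour described in this
   section.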
4.5.  Mis-cabling Examples

   [Diagram: ToF21 and ToF22 at LEVEL 2; Spine111, Spine112, Spine121
   and Spine122 at LEVEL 1; Leaf111, Leaf112, Leaf121 and Leaf122 at
   LEVEL 0.  link-M connects Leaf112 directly to ToF22.]

              Figure 6: A single plane mis-cabling example

   Figure 6 shows a single plane mis-cabling example.  It is a perfect
   Fat Tree fabric except for link-M, which connects Leaf112 to ToF22.

   The RIFT control protocol can discover the physical links
   automatically and detect cabling that violates Fat Tree topology
   constraints.  It reacts accordingly to such mis-cabling attempts, at
   a minimum preventing adjacencies from being formed between nodes and
   traffic from being forwarded on those mis-cabled links.  In such a
   scenario, Leaf112 will use link-M to derive its level (unless it is
   configured as a leaf) and can report the links to Spine111 and
   Spine112 as mis-cabled unless the implementation allows horizontal
   links.

   Figure 7 shows a multiple plane mis-cabling example.  Since Leaf112
   and Spine121 belong to two different PoDs, the adjacency between
   Leaf112 and Spine121 cannot be formed.  link-W would be detected and
   prevented.
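   The two checks described above (the level violation of link-M in
   Figure 6 and the PoD mismatch of link-W in Figure 7) can be sketched
   as a single predicate.  This is a deliberately simplified
   illustration, not the adjacency rules of [RIFT]: the Node class, the
   convention that PoD 0 means "any PoD" and the allow_horizontal switch
   are assumptions made for this example.

```python
from dataclasses import dataclass

ANY_POD = 0  # assumption: 0 stands for "not bound to a specific PoD"


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a node as seen during link discovery."""
    name: str
    level: int
    pod: int = ANY_POD


def link_is_miscabled(a, b, allow_horizontal=False):
    """Return True if a link between nodes a and b violates the fabric shape."""
    # Nodes bound to different PoDs must not form an adjacency.
    if a.pod != ANY_POD and b.pod != ANY_POD and a.pod != b.pod:
        return True
    level_gap = abs(a.level - b.level)
    if level_gap == 1:
        return False  # regular north-south link
    if level_gap == 0:
        return not allow_horizontal  # horizontal link, optionally allowed
    return True  # the link skips at least one level


# Node and level values below follow Figures 6 and 7.
leaf112 = Node("Leaf112", level=0, pod=1)
spine111 = Node("Spine111", level=1, pod=1)
spine121 = Node("Spine121", level=1, pod=2)
tof22 = Node("ToF22", level=2)

link_m = link_is_miscabled(leaf112, tof22)     # Leaf112 to ToF22
link_w = link_is_miscabled(leaf112, spine121)  # Leaf112 to Spine121
regular = link_is_miscabled(leaf112, spine111)
```

   If Leaf112 is not configured as a leaf and derives level 1 from
   link-M, the same predicate reports its links to Spine111 and Spine112
   as horizontal, and therefore mis-cabled, unless allow_horizontal is
   set.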
   [Diagram: ToF A1, ToF A2, ToF B1 and ToF B2 at LEVEL 2; Spine111,
   Spine112, Spine121 and Spine122 at LEVEL 1, with uplinks labeled A,
   B, A and B respectively; Leaf111, Leaf112, Leaf121 and Leaf122 at
   LEVEL 0.  PoD#1 contains Spine111, Spine112, Leaf111 and Leaf112;
   PoD#2 contains Spine121, Spine122, Leaf121 and Leaf122.  link-W
   connects Leaf112 to Spine121.]

             Figure 7: A multiple plane mis-cabling example

   RIFT provides an optional level determination procedure in its Zero
   Touch Provisioning mode.  Nodes in the fabric without a configured
   level determine it automatically.  However, this can have
   counter-intuitive consequences.  One extreme failure scenario is
   depicted in Figure 8: if all northbound links of Spine11 fail at the
   same time, Spine11 negotiates a lower level than Leaf111 and Leaf112.

   To prevent such a scenario, where leaves are expected to act as
   switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
   Since level -1 is invalid, Spine11 would not derive a valid level
   from the topology in Figure 8.  It will be isolated from the whole
   fabric, and it would be up to the leaves to declare the links towards
   such a spine as mis-cabled.
   [Diagram: two views separated by an arrow.  Before: ToF A1 and ToF A2
   above Spine11 and Spine12, with the northbound links of Spine11
   marked as failed (X); Leaf111 and Leaf112 below the spines.  After:
   ToF A1 and ToF A2 above Spine12 only, Leaf111 and Leaf112 below
   Spine12, and Spine11 attached below Leaf111 and Leaf112.]

                         Figure 8: Fallen spine

4.6.  Positive vs. Negative Disaggregation

   Disaggregation is the procedure whereby [RIFT] advertises a more
   specific route southwards as an exception to the aggregated fabric-
   default north.  Disaggregation is useful when a prefix within the
   aggregation is reachable via some of the parents but not the others
   at the same level of the fabric.  It is mandatory when the level is
   the ToF, since a ToF node that cannot reach a prefix becomes a black
   hole for that prefix.  The hard problem is to know which prefixes are
   reachable by whom.

   In the general case, [RIFT] solves that problem by interconnecting
   the ToF nodes, so that the ToF nodes can exchange the full list of
   prefixes that exist in the fabric and figure out when a ToF node
   lacks reachability to an existing prefix.  This requires additional
   ports at the ToF, typically 2 ports per ToF node to form a ToF-
   spanning ring.  [RIFT] also defines the southbound reflection
   procedure that enables a parent to explore the direct connectivity of
   its peers, meaning their own parents and children; based on the
   advertisements received from the shared parents and children, it may
   enable the parent to infer the prefixes its peers can reach.
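   The inference made possible by southbound reflection can be sketched
   as a set comparison.  This is an illustration under simplifying
   assumptions, not the disaggregation algorithm of [RIFT]: the function
   name prefixes_to_disaggregate and both input dictionaries are
   hypothetical, and each prefix is assumed to sit behind exactly one
   child.  A parent compares its own set of children with the set
   learned for each peer at the same level and collects the prefixes
   behind children that some peer has lost.

```python
def prefixes_to_disaggregate(me, south_adjacencies, prefixes_behind):
    """Return prefixes that `me` still reaches but some same-level peer does not.

    south_adjacencies -- dict: parent name -> set of children it is adjacent to
                         (for peers, as learned through reflection)
    prefixes_behind   -- dict: child name -> set of prefixes attached to it
    """
    my_children = south_adjacencies[me]
    result = set()
    for peer, peer_children in south_adjacencies.items():
        if peer == me:
            continue
        # Children that I can reach but this peer cannot.
        for child in my_children - peer_children:
            result |= prefixes_behind.get(child, set())
    return result


# Example following Section 4.2 (Figure 4): linkSL6 has failed, so
# Spine121 has lost its adjacency to Leaf122.
south = {
    "Spine121": {"Leaf121"},
    "Spine122": {"Leaf121", "Leaf122"},
}
behind = {"Leaf121": {"prefix121"}, "Leaf122": {"prefix122"}}
from_spine122 = prefixes_to_disaggregate("Spine122", south, behind)
from_spine121 = prefixes_to_disaggregate("Spine121", south, behind)
```

   With these inputs only Spine122 ends up with something to advertise
   (prefix122), which matches the behaviour described for Figure 4.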
   When a parent lacks reachability to a prefix, it may disaggregate the
   prefix negatively, i.e., advertise that this parent can be used to
   reach any prefix in the aggregation except that one.  The Negative
   Disaggregation signaling is simple and functions transitively from
   ToF to top-of-pod (ToP) and then from ToP to Leaf.  But it is hard
   for a parent to figure out which prefix it needs to disaggregate,
   because it does not know what it does not know; as a result, a
   spanning ring at the ToF is required to operate the Negative
   Disaggregation.  Also, though it is only an implementation problem,
   programming the FIB is complex compared to normal routes and may
   incur recursions.

   The more classical alternative is for the parents that can reach a
   prefix that peers at the same level cannot to advertise a more
   specific route to that prefix.  This leverages the normal longest
   prefix match in the FIB and does not require a special
   implementation.  But as opposed to the Negative Disaggregation, the
   Positive Disaggregation is difficult and inefficient to operate
   transitively.

   Transitivity to a grandchild is not needed if all its parents
   received the Positive Disaggregation, meaning that they will all
   avoid the black hole; when that is the case, they collectively build
   a ceiling that protects the grandchild.  But until then, a parent
   that received a Positive Disaggregation may believe that some peers
   are lacking the reachability and readvertise too early, or defer and
   maintain a black hole situation longer than necessary.

   In a non-partitioned fabric, all the ToF nodes see one another
   through the reflection and can figure out if one is missing a child.
   In that case it is possible to compute the prefixes that the peer
   cannot reach and disaggregate positively without a ToF-spanning ring.
   The ToF nodes can also ascertain that the ToP nodes are each
   connected to at least one ToF node that can still reach the prefix,
   meaning that the transitive operation is not required.

   The bottom line is that in a fabric that is partitioned (e.g., using
   multiple planes) and/or where the ToP nodes are not guaranteed to
   always form a ceiling for their children, it is mandatory to use the
   Negative Disaggregation.  On the other hand, in a highly symmetrical
   and fully connected fabric (e.g., a canonical Clos Network), the
   Positive Disaggregation method saves the complexity and cost
   associated with the ToF-spanning ring.

   Note that in the case of Positive Disaggregation, the first ToF
   node(s) that announces a more-specific route attracts all the traffic
   for that route and may suffer from a transient incast.  A ToP node
   that defers injecting the longer prefix in the FIB, in order to
   receive more advertisements and spread the packets better, also keeps
   on sending a portion of the traffic to the black hole in the
   meantime.  In the case of Negative Disaggregation, the last ToF
   node(s) that injects the route may also incur an incast issue; this
   problem would occur if a prefix that becomes totally unreachable is
   disaggregated, but doing so is mostly useless and is not recommended.

4.7.  Mobile Edge and Anycast

   When a physical or a virtual node changes its point of attachment in
   the fabric from a previous-leaf to a next-leaf, new routes must be
   installed that supersede the old ones.  Since the flooding flows
   northwards, the nodes (if any) between the previous-leaf and the
   common parent are not immediately aware that the path via the
   previous-leaf is obsolete, and a stale route may exist for a while.
   The common parent needs to select the freshest route advertisement in
   order to install the correct route via the next-leaf.
   This requires that the fabric determines the sequence of the
   movements of the mobile node.

   On the one hand, a classical sequence counter provides a total order
   for a while, but it will eventually wrap.  On the other hand, a
   timestamp provides a permanent order, but it may miss a movement that
   happens too quickly relative to the granularity of the timing
   information.  It is not envisioned in the short term that the average
   fabric supports the Precision Time Protocol [IEEEstd1588], and the
   precision that may be available with the Network Time Protocol
   [RFC5905], in the order of 100 to 200 ms, may not necessarily be
   enough to cover, e.g., the fast mobility of a Virtual Machine.

   Section 4.3.3, "Mobility", of [RIFT] specifies a hybrid method that
   combines a sequence counter from the mobile node and a timestamp from
   the network taken at the leaf when the route is injected.  If the
   timestamps of the concurrent advertisements are comparable (i.e.,
   more distant than the precision of the timing protocol), then the
   timestamp alone is used to determine the relative freshness of the
   routes.  Otherwise, the sequence counter from the mobile node, if
   available, is used.  One caveat is that the sequence counter must not
   wrap within the precision of the timing protocol.  Another is that
   the mobile node may not even provide a sequence counter, in which
   case the mobility itself must be slower than the precision of the
   timing.

   Mobility must not be confused with anycast.  In both cases, the same
   address is injected in RIFT at different leaves.  In the case of
   mobility, only the freshest route must be conserved, since the mobile
   node changed its point of attachment from one leaf to the next.
   In the case of anycast, the node may be either multihomed (attached
   to multiple leaves in parallel) or reachable beyond the fabric via
   multiple routes that are redistributed to different leaves; either
   way, the multiple routes are equally valid and should be conserved.
   Without further information from the redistributed routing protocol,
   it is impossible to distinguish a movement from a redistribution that
   happens asynchronously on different leaves.  [RIFT] expects that
   anycast addresses are advertised within the timing precision, which
   is typically the case with a low-precision timing and a multihomed
   node.  Beyond that time interval, RIFT interprets the lag as
   mobility, and only the freshest route is retained.

   When using IPv6 [RFC8200], RIFT suggests leveraging "Registration
   Extensions for IPv6 over Low-Power Wireless Personal Area Network
   (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND
   interaction between the mobile node and the leaf.  This provides not
   only a sequence counter but also a lifetime and a security token that
   may be used to protect the ownership of an address [RFC8928].  When
   using [RFC8505], the parallel registration of an anycast address to
   multiple leaves is done with the same sequence counter, whereas the
   sequence counter is incremented when the point of attachment changes.
   This way, it is possible to differentiate a mobile node from a
   multihomed node, even when the mobility happens within the timing
   precision.  It is also possible for a mobile node to be multihomed as
   well, e.g., to change only one of its points of attachment.

4.8.  IPv4 over IPv6

   RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network.  The
   IPv6 Address Family (AF) is configured via the usual Neighbor
   Discovery (ND) mechanisms, and then V4 can use V6 next hops,
   analogous to [RFC5549].
   It is expected that the whole fabric supports the same type of
   forwarding of address families on all the links.  RIFT provides an
   indication of whether a node is capable of v4 forwarding, and
   implementations are possible where different routing tables are
   computed per address family as long as the computation remains loop-
   free.

   [Diagram: two ToF nodes, two Spine nodes and two Leaf nodes in three
   tiers, with "V6 Forwarding" spanning the tiers from ToF down to Leaf.
   Each Leaf connects to a V4 subnet over a link labeled "IPv4
   prefixes".]

                        Figure 9: IPv4 over IPv6

4.9.  In-Band Reachability of Nodes

   RIFT does not require that nodes of the fabric have reachable
   addresses, but there may be operational reasons to reach the internal
   nodes.  Figure 10 shows an example in which the network management
   station (NMS) attaches to Leaf1.

   [Diagram: ToF1 and ToF2 at the top, Spine1 and Spine2 in the middle,
   Leaf1 and Leaf2 at the bottom.  The NMS is attached to Leaf1.]

                 Figure 10: In-Band reachability of node

   If the NMS wants to access Leaf2, it simply works, because the
   loopback address of Leaf2 is flooded in its Prefix North TIE.

   If the NMS wants to access Spine2, it simply works too, because a
   spine node always advertises its loopback address in the Prefix North
   TIE.
   The NMS may reach Spine2 via Leaf1-Spine2 or via Leaf1-Spine1-ToF1/
   ToF2-Spine2.

   If the NMS wants to access ToF2, ToF2's loopback address needs to be
   injected into its Prefix South TIE.  This TIE must be seen by all
   nodes at the level below (the spine nodes in Figure 10), which must
   form a ceiling for all the traffic coming from below (south).
   Otherwise, the traffic from the NMS may follow the default route to
   the wrong ToF node, e.g., ToF1.

   In a fully connected ToF, in case of failure between ToF2 and spine
   nodes, ToF2's loopback address must be disaggregated recursively all
   the way to the leaves.

   In a partitioned ToF, a ToF node is only reachable within its plane,
   and the disaggregation to the leaves is also required.  A possible
   alternative is to use the ring that interconnects the ToF nodes to
   transmit packets between them for their loopback addresses only.  The
   idea is that this is mostly control traffic and should not alter the
   load balancing properties of the fabric.

4.10.  Dual Homing Servers

   Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode.  It
   has no configuration (unless it is a Top-of-Fabric at the top of the
   topology or it must operate in the topology as a leaf and/or support
   leaf-2-leaf procedures) and it will fully configure itself after
   being attached to the topology.
   [Diagram: three ToF nodes (ToF tier) above ToR1 and ToR2 (Spine
   tier), above the servers SV(1), SV(2), ..., SV(n+1), SV(n) (Leaf
   tier).  Some links between the ToR tier and the servers are marked as
   failed (X).]

                     Figure 11: Dual-homing servers

   In the single plane, the worst condition is disaggregation of every
   other server at the same level.  Suppose the links from ToR1 (Top of
   Rack) to all the leaves become unavailable.  All the servers' routes
   are disaggregated, and the FIB of the servers will be expanded with
   n-1 more specific routes.

   Sometimes, people may prefer to disaggregate from ToR to servers from
   the start, i.e., the servers have a few tens of routes in the FIB
   from the start, besides the default routes, to avoid breakages at
   rack level.  Full disaggregation of the fabric could be achieved by
   configuration supported by RIFT.

4.11.  Fabric With A Controller

   There are many different ways to deploy the controller.  One
   possibility is attaching a controller to the RIFT domain from the
   ToF, and another possibility is attaching a controller from the leaf.
   [Diagram: a Controller connected to two ToF nodes.  The RIFT domain
   spans the two ToF nodes, two Spine nodes and two Leaf nodes below.]

                  Figure 12: Fabric with a controller

4.11.1.  Controller Attached to ToFs

   If a controller is attached to the RIFT domain from the ToF, it
   usually uses dual-homing connections.  The loopback prefix of the
   controller should be advertised down by the ToF and spine nodes to
   the leaves.  If the controller loses its link to a ToF, make sure
   that the ToF withdraws the prefix of the controller (different
   mechanisms can be used).

4.11.2.  Controller Attached to Leaf

   If the controller is attached to the fabric from a leaf, no special
   provisions are needed.

4.12.  Internet Connectivity With Underlay

   If global addressing is running without overlay, an external default
   route needs to be advertised through the RIFT fabric to achieve
   internet connectivity.  For the purpose of forwarding within the
   entire RIFT fabric, an internal fabric prefix needs to be advertised
   in the South Prefix TIE by ToF and spine nodes.

4.12.1.  Internet Default on the Leaf

   If an internet access request comes from a leaf and the internet
   gateway is another leaf, the leaf node acting as the internet gateway
   needs to advertise a default route in its Prefix North TIE.

4.12.2.  Internet Default on the ToFs

   If an internet access request comes from a leaf and the internet
   gateway is a ToF, the ToF and spine nodes need to advertise a default
   route in the Prefix South TIE.

4.13.
Subnet Mismatch and Address Families

      +--------+                     +--------+
      |        |  LIE           LIE  |        |
      |   A    |  +---->     <----+  |   B    |
      |        +---------------------+        |
      +--------+                     +--------+
         X/24                           Y/24

                       Figure 13: subnet mismatch

   LIEs are exchanged over all links running RIFT to perform Link
   (Neighbor) Discovery.  A node MUST NOT originate LIEs on an address
   family if it does not process received LIEs on that family.  LIEs on
   the same link are considered part of the same negotiation independent
   of the address family they arrive on.  An implementation MUST be
   ready to accept TIEs on all addresses it used as source of LIE
   frames.

   As shown in Figure 13, without further checks an adjacency between
   node A and node B may form, but the forwarding between node A and
   node B may fail because subnet X mismatches with subnet Y.

   To prevent this, a RIFT implementation should check for subnet
   mismatch, just as, e.g., IS-IS does.  This can lead to scenarios
   where, despite the exchange of LIEs in both address families, an
   adjacency ends up being formed in a single AF only.  This is a
   consideration especially in the scenarios of Section 4.8.

4.14.
Anycast Considerations

   [Diagram: traffic enters from above at a single ToF, which sits above
   Spine11, Spine12, Spine21 and Spine22 at LEVEL 1; Leaf111, Leaf112,
   Leaf121 and Leaf122 are at LEVEL 0.  PrefixA is attached to Leaf111
   and to Leaf121, PrefixB to Leaf112 and PrefixC to Leaf122.  Traffic
   also enters from below at Leaf122.]

                           Figure 14: Anycast

   If the traffic comes from the ToF towards Leaf111 or Leaf121, which
   both have the anycast prefix PrefixA, RIFT can deal with this case
   well.  But if the traffic comes from Leaf122, it arrives at Spine21
   or Spine22 at level 1.  Spine21 and Spine22 do not know about the
   other PrefixA attached to Leaf111, so the traffic will always go to
   Leaf121 and never to Leaf111.  If the intention is that the traffic
   should be offloaded to Leaf111, then use the policy-guided prefixes
   defined in "Routing in Fat Trees" [RIFT].

4.15.  IoT Applicability

   The design of RIFT inherits from RPL [RFC6550] the anisotropic design
   of a default route upwards (northwards); it also inherits the
   capability to inject external host routes at the Leaf level using
   Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host
   and a RIFT router.  Both the RPL and the RIFT protocols are meant for
   large scale, and WiND enables device mobility at the edge the same
   way in both cases.

   The main difference between RIFT and RPL is that with RPL there is a
   single Root, whereas RIFT has many ToF nodes.  This adds huge
   capabilities for leaf-2-leaf ECMP paths, but also additional
   complexity with the need to disaggregate.
   Also, RIFT uses Link State flooding northwards and is not designed
   for low-power operation.

   Still, nothing prevents the IP devices connected at the Leaf from
   being IoT (Internet of Things) devices, which typically expose their
   address using WiND, which is an upgrade from 6LoWPAN ND [RFC6775].

   A network that serves high-speed/high-power IoT devices should
   typically provide deterministic capabilities for applications such as
   high-speed control loops or movement detection.  The Fat Tree is
   highly reliable and under normal conditions provides an equivalent
   multipath operation, but ECMP does not provide hard guarantees for
   either delivery or latency.  As long as the fabric is non-blocking
   the result is the same, but there can be load imbalances resulting in
   incast and possibly congestion loss that will prevent delivery within
   bounded latency.

   This could be alleviated with Packet Replication, Elimination and
   Reordering (PREOF) [RFC8655] leaf-2-leaf, but PREOF is hard to
   provide at the scale of all flows, and the replication may increase
   the probability of the overload that it attempts to solve.

   Note that load balancing is not RIFT's problem, but it is key to
   serving IoT adequately.

4.16.  Key Management

   As outlined in Section "Security Considerations" of [RIFT], either a
   private shared key or a public/private key pair is used to
   authenticate the adjacency.  Both the key distribution and key
   synchronization methods are out of scope for this document.  Both
   nodes in the adjacency must share the same keys, key type, and
   algorithm for a given key ID.  Mismatched keys will not inter-operate
   as their security envelopes will be unverifiable.

   Key roll-over while the adjacency is active MAY be supported.  The
   specific mechanism is well documented in [RFC6518].

5.
Security Considerations

   This document presents the applicability of RIFT.  As such, it does
   not introduce any security considerations.  However, a number of
   security concerns are discussed in [RIFT].

6.  Contributors

   The following people (listed in alphabetical order) contributed
   significantly to the content of this document and should be
   considered co-authors:

   Tony Przygienda
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA 94089
   US

   Email: prz@juniper.net

7.  Normative References

   [ISO10589-Second-Edition]
              International Organization for Standardization,
              "Intermediate system to Intermediate system intra-domain
              routeing information exchange protocol for use in
              conjunction with the protocol for providing the
              connectionless-mode Network Service (ISO 8473)", November
              2002.

   [TR-384]   Broadband Forum Technical Report, "TR-384 Cloud Central
              Office Reference Architectural Framework", January 2018.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC7130]  Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed.,
              Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional
              Forwarding Detection (BFD) on Link Aggregation Group (LAG)
              Interfaces", RFC 7130, DOI 10.17487/RFC7130, February
              2014.

   [RFC5549]  Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network
              Layer Reachability Information with an IPv6 Next Hop",
              RFC 5549, DOI 10.17487/RFC5549, May 2009.

   [RFC6518]  Lebovitz, G. and M.
              Bhatia, "Keying and Authentication for Routing Protocols
              (KARP) Design Guidelines", RFC 6518,
              DOI 10.17487/RFC6518, February 2012.

   [RFC6550]  Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J.,
              Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur,
              JP., and R. Alexander, "RPL: IPv6 Routing Protocol for
              Low-Power and Lossy Networks", RFC 6550,
              DOI 10.17487/RFC6550, March 2012.

   [RFC6775]  Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C.
              Bormann, "Neighbor Discovery Optimization for IPv6 over
              Low-Power Wireless Personal Area Networks (6LoWPANs)",
              RFC 6775, DOI 10.17487/RFC6775, November 2012.

   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
              "Deterministic Networking Architecture", RFC 8655,
              DOI 10.17487/RFC8655, October 2019.

   [RIFT]     Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and
              D. Afanasiev, "RIFT: Routing in Fat Trees", Work in
              Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May
              2020.

   [I-D.white-distoptflood]
              White, R., Hegde, S., and S. Zandi, "IS-IS Optimal
              Distributed Flooding for Dense Topologies", Work in
              Progress, Internet-Draft, draft-white-distoptflood-04, 27
              July 2020.

8.  Informative References

   [IEEEstd1588]
              IEEE standard for Information Technology, "IEEE Standard
              for a Precision Clock Synchronization Protocol for
              Networked Measurement and Control Systems".

   [CLOS]     Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
              Communication Environments", IEEE International Parallel &
              Distributed Processing Symposium, 2011.

   [FATTREE]  Leiserson, C. E., "Fat-Trees: Universal Networks for
              Hardware-Efficient Supercomputing", 1985.

   [RFC3626]  Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link
              State Routing Protocol (OLSR)", RFC 3626,
              DOI 10.17487/RFC3626, October 2003.
   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8505]  Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C.
              Perkins, "Registration Extensions for IPv6 over Low-Power
              Wireless Personal Area Network (6LoWPAN) Neighbor
              Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018.

   [RFC8928]  Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik,
              "Address-Protected Neighbor Discovery for Low-Power and
              Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November
              2020.

Authors' Addresses

   Yuehua Wei (editor)
   ZTE Corporation
   No.50, Software Avenue
   Nanjing
   210012
   China

   Email: wei.yuehua@zte.com.cn

   Zheng Zhang
   ZTE Corporation
   No.50, Software Avenue
   Nanjing
   210012
   China

   Email: zhang.zheng@zte.com.cn

   Dmitry Afanasiev
   Yandex

   Email: fl0w@yandex-team.ru

   Pascal Thubert
   Cisco Systems, Inc
   Building D
   45 Allee des Ormes - BP1200
   06254 MOUGINS - Sophia Antipolis
   France

   Phone: +33 497 23 26 34
   Email: pthubert@cisco.com

   Tom Verhaeg
   Juniper Networks

   Email: tverhaeg@juniper.net

   Jaroslaw Kowalczyk
   Orange Polska

   Email: jaroslaw.kowalczyk2@orange.com