RIFT WG                                                 Yuehua. Wei, Ed.
Internet-Draft                                              Zheng. Zhang
Intended status: Informational                           ZTE Corporation
Expires: 21 March 2022                                 Dmitry. Afanasiev
                                                                  Yandex
                                                              P. Thubert
                                                           Cisco Systems
                                                     Jaroslaw. Kowalczyk
                                                           Orange Polska
                                                       17 September 2021

                           RIFT Applicability
                    draft-ietf-rift-applicability-07

Abstract

This document discusses the properties, applicability and operational
considerations of RIFT in different network scenarios. It intends to
provide a rough guide to how RIFT can be deployed to simplify routing
operations in Clos topologies and their variations.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current
Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on 21 March 2022.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Terminology
3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks
4. Applicability of RIFT to Clos IP Fabrics
  4.1. Overview of RIFT
  4.2. Applicable Topologies
    4.2.1. Horizontal Links
    4.2.2. Vertical Shortcuts
    4.2.3. Generalizing to any Directed Acyclic Graph
    4.2.4. Reachability of Internal Nodes in the Fabric
  4.3. Use Cases
    4.3.1. Data Center Topologies
    4.3.2. Metro Fabrics
    4.3.3. Building Cabling
    4.3.4. Internal Router Switching Fabrics
    4.3.5. CloudCO
5. Operational Considerations
  5.1. South Reflection
  5.2. Suboptimal Routing on Link Failures
  5.3. Black-Holing on Link Failures
  5.4. Zero Touch Provisioning (ZTP)
  5.5. Mis-cabling Examples
  5.6. Positive vs. Negative Disaggregation
  5.7. Mobile Edge and Anycast
  5.8. IPv4 over IPv6
  5.9. In-Band Reachability of Nodes
  5.10. Dual Homing Servers
  5.11. Fabric With A Controller
    5.11.1. Controller Attached to ToFs
    5.11.2. Controller Attached to Leaf
  5.12. Internet Connectivity With Underlay
    5.12.1. Internet Default on the Leaf
    5.12.2. Internet Default on the ToFs
  5.13. Subnet Mismatch and Address Families
  5.14. Anycast Considerations
  5.15. IoT Applicability
  5.16. Key Management
6. Security Considerations
7. IANA Considerations
8. Contributors
9. Normative References
10. Informative References
Authors' Addresses

1. Introduction

This document discusses the properties and applicability of "Routing
in Fat Trees" [RIFT] in different deployment scenarios and highlights
the operational simplicity of the technology compared to traditional
routing solutions. It also documents special considerations when
RIFT is used with or without overlays and/or controllers, and how
RIFT identifies topology mis-cablings and reroutes around node and
link failures.

2. Terminology

Clos/Fat Tree:

This document uses the terms Clos and Fat Tree interchangeably; both
always refer to a folded spine-and-leaf topology with possibly
multiple Points of Delivery (PoDs) and one or multiple Top of Fabric
(ToF) planes.

Directed Acyclic Graph (DAG):

A finite directed graph with no directed cycles (loops). If the
links in a Clos are considered as either all directed towards the top
or all directed towards the bottom, each of these two graphs is a
DAG.

Disaggregation:

Process in which a node decides to advertise more specific prefixes
southwards, either positively to attract the corresponding traffic,
or negatively to repel it. Disaggregation is performed to prevent
black-holing and suboptimal routing to the more specific prefixes.

TIE:

This is an acronym for a "Topology Information Element". TIEs are
exchanged between RIFT nodes to describe parts of a network such as
links and address prefixes. A TIE always has a direction and a type.
North TIEs (sometimes abbreviated as N-TIEs) are used when dealing
with TIEs in the northbound representation and South TIEs (sometimes
abbreviated as S-TIEs) for the southbound equivalent. TIEs have
different types such as node and prefix TIEs.

Node TIE:

This is an acronym for a "Node Topology Information Element", which
contains all adjacencies the node discovered and information about
the node itself. A Node TIE should NOT be confused with a North TIE,
since "node" defines the type of TIE rather than its direction.
Consequently, both North Node TIEs and South Node TIEs exist.

Prefix TIE:

This is an acronym for a "Prefix Topology Information Element". In
the case of a North TIE, it contains all prefixes directly attached
to this node; in the case of a South TIE, it contains the necessary
default routes the node advertises southbound.

South Reflection:

Often abbreviated just as "reflection", it defines a mechanism where
South Node TIEs are "reflected" from the level south back up north to
allow nodes in the same level without East-West links to "see" each
other's node Topology Information Elements (TIEs).

LIE:

This is an acronym for a "Link Information Element" exchanged on all
the system's links running RIFT to form ThreeWay adjacencies and
carry information used to perform Zero Touch Provisioning (ZTP) of
levels.

Shortest-Path First (SPF):

A well-known graph algorithm attributed to Dijkstra that establishes
a tree of shortest paths from a source to destinations on the graph.
The acronym SPF is used because it is a familiar general term for the
node reachability calculations that RIFT can employ to ultimately
calculate routes, of which Dijkstra's algorithm is one possible
choice.

North SPF (N-SPF):

A reachability calculation that progresses northbound, for example
an SPF that uses South Node TIEs only.
Normally it progresses a single hop only and installs default routes.

South SPF (S-SPF):

A reachability calculation that progresses southbound, for example
an SPF that uses North Node TIEs only.

3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks

Clos [CLOS] topologies (commonly called a fat tree/network in modern
IP fabric considerations, as a homonym to the original definition of
the term Fat Tree [FATTREE]) have gained prominence in today's
networking, primarily as a result of the paradigm shift towards
centralized data-center based architectures that deliver a majority
of computation and storage services.

Current routing protocols were geared towards a network with an
irregular topology with isotropic properties and a low degree of
connectivity. When applied to Fat Tree topologies:

* They tend to need extensive configuration or provisioning during
  bring up and re-dimensioning.

* All nodes, including spine and leaf nodes, learn the entire network
  topology and routing information, which is in fact not needed on
  the leaf nodes during normal operation.

* They flood significant amounts of duplicate link-state information
  between spine and leaf nodes during topology updates and
  convergence events, requiring that additional CPU and link
  bandwidth be consumed. This may impact the stability and
  scalability of the fabric, make the fabric less reactive to
  failures, and prevent the use of cheaper hardware at the lower
  levels (i.e., spine and leaf nodes).

4. Applicability of RIFT to Clos IP Fabrics

The remainder of this document assumes that the reader is familiar
with the terms and concepts used in the OSPF [RFC2328] and IS-IS
[ISO10589-Second-Edition] link-state protocols. The sections of RIFT
[RIFT] outline the requirements of routing in IP fabrics and RIFT
protocol concepts.

4.1. Overview of RIFT

RIFT is a dynamic routing protocol that is tailored for use in Clos,
Fat-Tree, and other anisotropic topologies. A core property of RIFT
is that its operation is sensitive to the structure of the fabric -
it is anisotropic. RIFT acts as a link-state protocol when "pointing
north" - advertising southwards routes to northwards peer routers
(parents) through flooding and database synchronization - but
operates hop-by-hop like a distance-vector protocol when "pointing
south" - typically advertising a fabric default route directed
towards the Top of Fabric (ToF, aka superspine) to southwards peer
routers (children).

The fabric default is typically the default route, as described in
Section 4.2.3.8 "Southbound Default Route Origination" of RIFT
[RIFT]. The ToF nodes may alternatively originate more specific
prefixes (P') southbound instead of the default route. In such a
scenario, all addresses carried within the RIFT domain must be
contained within P', and it is possible for a leaf that acts as
gateway to the internet to advertise the default route instead.

RIFT floods flat link-state information northbound only, so that each
level obtains the full topology of the levels south of it. That
information is never flooded east-west or back south again. A top
tier node therefore has the full set of prefixes from the Shortest
Path First (SPF) calculation.

In the southbound direction, the protocol operates like a "fully
summarizing, unidirectional" path-vector protocol, or rather a
distance-vector protocol with implicit split horizon. Routing
information, normally just the default route, propagates one hop
south and is "re-advertised" by nodes at the next lower level.
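
The asymmetry between the two directions can be illustrated with a
minimal Python sketch. It is not part of RIFT and does not model TIE
flooding or SPF; the node names, levels, prefixes and the helper
functions flood_north and southbound_routes are hypothetical, chosen
only to show that prefix knowledge accumulates northward while only a
default route is handed south.

```python
# Hypothetical three-level fabric; names and prefixes are illustrative only.
LEVEL = {"tof1": 2, "spine1": 1, "spine2": 1, "leaf1": 0, "leaf2": 0, "leaf3": 0}

# Each node maps to its northbound neighbors (parents).
PARENTS = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "leaf3": ["spine2"],
    "spine1": ["tof1"],
    "spine2": ["tof1"],
    "tof1": [],
}

# Prefixes directly attached to the leaves.
ATTACHED = {"leaf1": "10.0.1.0/24", "leaf2": "10.0.2.0/24", "leaf3": "10.0.3.0/24"}


def flood_north(parents, attached):
    """Return, per node, the prefixes it learned from nodes south of it."""
    learned = {node: set() for node in parents}
    # Visit the lowest level first so information accumulates northward.
    for node in sorted(parents, key=LEVEL.get):
        known = set(learned[node])
        if node in attached:
            known.add(attached[node])
        for parent in parents[node]:
            learned[parent] |= known
    return learned


def southbound_routes(parents):
    """Every node with a parent gets only a default route over its parents."""
    return {node: {"0.0.0.0/0": sorted(ps)} for node, ps in parents.items() if ps}


LEARNED = flood_north(PARENTS, ATTACHED)
DEFAULTS = southbound_routes(PARENTS)
```

In this toy fabric the ToF node ends up knowing all three prefixes,
spine1 knows only the two prefixes below it, and the leaves learn
nothing from the north except a default route.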

[Figure 1: RIFT overview. Two ToF nodes at level 2, four spine nodes
at level 1 and four leaf nodes at level 0. Link-state flooding runs
north; distance-vector advertisement runs south.]

A spine node holds only the information necessary for its level: all
destinations south of the node based on the SPF calculation, the
default route, and any disaggregated routes.

RIFT combines the advantages of both link-state and distance-vector:

* Fastest possible convergence

* Automatic detection of topology

* Minimal routes/info on Top-of-Rack (ToR) switches, aka leaf nodes

* High degree of ECMP

* Fast de-commissioning of nodes

* Maximum propagation speed with flexible prefixes in an update

There are therefore two types of link-state database: the "north
representation", made up of North Topology Information Elements
(N-TIEs), and the "south representation", made up of South Topology
Information Elements (S-TIEs). The N-TIEs contain a link-state
topology description of lower levels, and the S-TIEs simply carry
default routes for the lower levels.
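
The split into a north and a south representation can be pictured
with a small data model. The following Python sketch is only an
illustration under stated assumptions: the class and field names
(Tie, Direction, TieType, content) are hypothetical and do not
reflect the encoding defined in [RIFT]. It captures only the point
made in the terminology section that every TIE has both a direction
and a type.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    NORTH = "N"
    SOUTH = "S"


class TieType(Enum):
    NODE = "node"
    PREFIX = "prefix"


@dataclass(frozen=True)
class Tie:
    """Illustrative TIE: an originator, a direction and a type."""
    originator: str
    direction: Direction
    tie_type: TieType
    content: tuple  # adjacencies for node TIEs, prefixes for prefix TIEs


def label(tie):
    """Short label such as 'N-TIE/node' or 'S-TIE/prefix'."""
    return f"{tie.direction.value}-TIE/{tie.tie_type.value}"


def split_databases(ties):
    """Separate the north representation from the south representation."""
    north = [t for t in ties if t.direction is Direction.NORTH]
    south = [t for t in ties if t.direction is Direction.SOUTH]
    return north, south


# Hypothetical TIEs for one leaf and one spine.
TIES = [
    Tie("leaf1", Direction.NORTH, TieType.NODE, ("spine1", "spine2")),
    Tie("leaf1", Direction.NORTH, TieType.PREFIX, ("10.0.1.0/24",)),
    Tie("spine1", Direction.SOUTH, TieType.NODE, ("leaf1", "tof1")),
    Tie("spine1", Direction.SOUTH, TieType.PREFIX, ("0.0.0.0/0",)),
]
NORTH_DB, SOUTH_DB = split_databases(TIES)
```

Here the South Prefix TIE of the hypothetical spine1 carries nothing
but the default route, while the North TIEs of leaf1 describe its
adjacencies and attached prefix.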

RIFT also eliminates major disadvantages of link-state and
distance-vector with:

* Reduced and balanced flooding

* Automatic neighbor detection

To achieve this, RIFT builds on the art of IGPs, not only OSPF and
IS-IS but also MANET and IoT, to provide unique features:

* Automatic (positive or negative) route disaggregation of
  northwards routes upon fallen leaves

* Recursive operation in the case of negative route disaggregation

* Anisotropic routing that extends a principle seen in RPL [RFC6550]
  to wide superspines

* Optimal flooding reduction that derives from the concept of a
  "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the
  flooding load over northbound links and nodes.

Additional advantages that are unique to RIFT are listed below; the
details can be found in RIFT [RIFT].

* True Zero Touch Provisioning (ZTP)

* Minimal blast radius on failures

* Can utilize all paths through fabric without looping

* Simple leaf implementation that can scale down to servers

* Key-Value store

* Horizontal links used for protection only

* Supports non-equal cost multipath and can replace multi-chassis
  link aggregation group (MLAG or MC-LAG)

4.2. Applicable Topologies

Although RIFT is specified primarily for "proper" Clos or Fat Tree
topologies, the protocol natively supports Points of Delivery (PoD)
concepts, which, strictly speaking, are not found in the original
Clos concept.

Further, the specification explains and supports operations of
multi-plane Clos variants, where the protocol recommends the use of
inter-plane rings at the Top-of-Fabric level to allow the
reconciliation of the topology view of different planes, which makes
negative disaggregation viable in case of failures within a plane.
These observations hold not only in the case of RIFT but also in the
generic case of dynamic routing on Clos variants with multiple planes
and failures in bi-sectional bandwidth, especially on the leaves.

4.2.1. Horizontal Links

RIFT is not limited to pure Clos divided into PoDs and multiple
planes but supports horizontal (East-West) links below the top of
fabric level. Those links are used only for last resort northbound
routes when a spine loses all its northbound links or cannot compute
a default route through them.

A possible configuration is a "ring" of horizontal links at a level.
In the presence of such a "ring" in any level (except the Top of
Fabric (ToF) level), neither North SPF (N-SPF) nor South SPF (S-SPF)
will provide a "ring-based protection" scheme, since such a
computation would necessarily have to deal with breaking of "loops"
in the Dijkstra sense, an application for which RIFT is not intended.

Full-mesh connectivity between nodes on the same level can be
employed; it allows N-SPF to let any node that loses all its
northbound adjacencies still participate in northbound forwarding,
as long as at least one of the other nodes in the level is
northbound connected.

4.2.2. Vertical Shortcuts

Through relaxations of the specified adjacency forming rules, RIFT
implementations can be extended to support vertical "shortcuts". The
RIFT specification itself does not provide the exact details, since
the resulting solution suffers from either a much larger blast radius
with increased flooding volumes or, in the case of maximum
aggregation routing, bow-tie problems.

4.2.3. Generalizing to any Directed Acyclic Graph

RIFT is an anisotropic routing protocol, meaning that it has a sense
of direction (northbound, southbound, east-west) and that it operates
differently depending on the direction.

* Northbound, RIFT operates as a link-state protocol, whereby the
  control packets are reflooded first all the way north and only
  interpreted later. All the individual fine-grained routes are
  advertised.

* Southbound, RIFT operates as a distance-vector protocol, whereby
  the control packets are flooded only one hop, interpreted, and the
  consequence of that computation is what gets flooded one more hop
  south. In the most common use cases, a ToF node can reach most of
  the prefixes in the fabric. If that is the case, the ToF node
  advertises the fabric default and disaggregates the prefixes that
  it cannot reach. On the other hand, a ToF node that can reach
  only a small subset of the prefixes in the fabric will preferably
  advertise those prefixes and refrain from aggregating.

  In the general case, what gets advertised south is, in more
  detail:

  1. A fabric default that aggregates all the prefixes that are
     reachable within the fabric, and that could be a default route
     or a prefix that is dedicated to this particular fabric.

  2. The loopback addresses of the northbound nodes, e.g., for
     inband management.

  3. The disaggregated prefixes for the dynamic exceptions to the
     fabric default, advertised to route around the black hole that
     may form.

* East-West routing can optionally be used, with specific
  restrictions. It is used when a sibling has access to the fabric
  default but this node does not.

A Directed Acyclic Graph (DAG) provides a sense of north (the
direction of the DAG) and of south (the reverse), which can be used
to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has
only incoming edges is a ToF node.

There are a number of caveats though:

* The DAG structure must exist before RIFT starts, so there is a
  need for a companion protocol to establish the logical DAG
  structure.

* A generic DAG does not have a sense of east and west. The
  operation specified for east-west links and the southbound
  reflection between nodes are not applicable. Also, Zero Touch
  Provisioning (ZTP) will derive a sense of depth that will eliminate
  some links. Variations of ZTP could be derived to meet specific
  objectives, e.g., make it so that most routers have at least 2
  parents to reach the ToF.

* RIFT applies to any Destination-Oriented DAG (DODAG) where there's
  only one ToF node and the problem of disaggregation does not
  exist. In that case, RIFT operates very much like RPL [RFC6550],
  but using Link State for southbound routes (downwards in RPL's
  terms). For an arbitrary DAG with multiple destinations (ToFs),
  the way disaggregation happens has to be considered.

* Positive disaggregation expects that most of the ToF nodes reach
  most of the leaves, so disaggregation is the exception as opposed
  to the rule. When this is no longer true, it makes sense to turn
  off disaggregation and route between the ToF nodes over a ring, a
  full mesh, a transit network, or a form of area zero. There again,
  this operation is similar to RPL operating as a single DODAG with
  a virtual root.

* In order to aggregate and disaggregate routes, RIFT requires that
  all the ToF nodes share the full knowledge of the prefixes in the
  fabric.

* This can be achieved with a ring as suggested by "RIFT" [RIFT], by
  some preconfiguration, or by synchronization with a common
  repository where all the active prefixes are registered.

4.2.4. Reachability of Internal Nodes in the Fabric

RIFT does not require that nodes have reachable addresses in the
fabric, though it is clearly desirable for operational purposes.
Under normal operating conditions this can be easily achieved by
injecting the node's loopback address into North and South Prefix
TIEs or by other implementation-specific mechanisms.

Special considerations arise when a node loses all northbound
adjacencies but is not at the top of the fabric. These are outside
the scope of this document and could be discussed in a separate
document.

4.3. Use Cases

4.3.1. Data Center Topologies

4.3.1.1. Data Center Fabrics

RIFT is suited for underlay routing in data center (DC) IP fabrics,
the vast majority of which seem to be currently (and for the
foreseeable future) Clos architectures. It significantly simplifies
operation and deployment of such fabrics, as described in Section 5,
compared to environments that rely on extensive proprietary
provisioning and operational solutions.

4.3.1.2. Adaptations to Other Proposed Data Center Topologies

[Figure 2: Level Shortcut. Nodes S0 and S1 at the top, A0 and A1 in
the middle, and L0 and L1 at the bottom, with an additional direct
link (shortcut) between S0 and L0.]

RIFT is not strictly limited to Clos topologies. The protocol only
requires a sense of "compass rose directionality", achieved either
through configuration or through derivation of levels. So,
conceptually, shortcuts between levels could be included. Figure 2
depicts an example of a shortcut between levels. In this example,
sub-optimal routing will occur when traffic is sent from L0 to L1 via
S0's default route and back down through A0 or A1. In order to
ensure that only default routes from A0 or A1 are used, all leaves
would be required to install each other's routes.
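
The detour in Figure 2 can be quantified with a short Python sketch.
This is only an illustration: the link list is an assumption read off
the drawing (a full mesh between S and A nodes, a full mesh between
A and L nodes, plus the S0-L0 shortcut), and the breadth-first
hop-count helper is hypothetical and not how RIFT computes routes.

```python
from collections import deque

# Links assumed from Figure 2: S-A full mesh, A-L full mesh, shortcut L0-S0.
LINKS = [
    ("S0", "A0"), ("S0", "A1"), ("S1", "A0"), ("S1", "A1"),
    ("A0", "L0"), ("A0", "L1"), ("A1", "L0"), ("A1", "L1"),
    ("L0", "S0"),
]


def neighbors(links):
    """Build an undirected adjacency map from a list of links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hops(adj, src, dst, first_hop=None):
    """Breadth-first hop count from src to dst, optionally forcing the first hop."""
    if first_hop is None:
        queue, seen = deque([(src, 0)]), {src}
    else:
        queue, seen = deque([(first_hop, 1)]), {src, first_hop}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in sorted(adj[node] - seen):
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


ADJ = neighbors(LINKS)
VIA_SHORTCUT = hops(ADJ, "L0", "L1", first_hop="S0")  # follows S0's default
VIA_A0 = hops(ADJ, "L0", "L1", first_hop="A0")        # stays one level up
```

Under these assumptions, following the default learned over the
shortcut costs three hops (L0, S0, A0 or A1, L1) where two would
suffice through A0 or A1 directly.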

While various technical and operational challenges may require the
use of such modifications, discussion of those topics is outside the
scope of this document.

4.3.2. Metro Fabrics

The demand for bandwidth is increasing steadily, driven primarily by
environments close to content producers (server farms connected via
DC fabrics) but in proximity to content consumers as well. Consumers
are often clustered in metro areas with their own network
architectures that can benefit from simplified, regular Clos
structures and hence from RIFT.

4.3.3. Building Cabling

Commercial edifices are often cabled in topologies that are either
Clos or its isomorphic equivalents. The Clos can grow rather high
with many floors. This presents a challenge for traditional routing
protocols (except BGP and the by now largely phased-out PNNI), which
do not support an arbitrary number of levels, whereas RIFT does so
naturally. Moreover, due to the limited sizes of forwarding tables
in network elements of building cabling, the minimum FIB size RIFT
maintains under normal conditions is cost-effective in terms of
hardware and operational costs.

4.3.4. Internal Router Switching Fabrics

It is common in high-speed communications switching and routing
devices to use fabrics when a crossbar is not feasible due to cost,
head-of-line blocking or size trade-offs. Normally such fabrics are
not self-healing or rely on 1:/+1 protection schemes, but it is
conceivable to use RIFT to operate Clos fabrics that can deal
effectively with interconnection or subsystem failures in such
modules. RIFT is also not IP specific, and hence any link addressing
connecting internal device subnets is conceivable.

4.3.5. CloudCO

The Cloud Central Office (CloudCO) is a new stage of telecom Central
Office.
It takes advantage of Software Defined Networking (SDN) and Network
Function Virtualization (NFV) in conjunction with general purpose
hardware to optimize current networks. The following figure
illustrates this architecture at a high level. It describes a single
instance or macro-node of cloud CO that provides a number of Value
Added Services (VAS), a Broadband Access Abstraction (BAA), and
virtualized network services. An Access I/O module faces a Cloud CO
access node and the Customer Premises Equipment (CPE) behind it.
A Network I/O module faces the core network. The two I/O modules
are interconnected by a leaf and spine fabric [TR-384].

[Figure 3: An example of CloudCO architecture. Two spine switches are
connected to four leaf switches and to two edge leaf switches (LS).
Each of the four leaf switches attaches a compute node; the compute
nodes host VAS5, VAS6 and VAS7; vDHCP, VAS3 and VAS4; vRouter,
v802.1x and vIGMP; and VAS1, VAS2 and BAA. The Network I/O and
Access I/O modules attach to the edge leaf switches.]

The Spine-Leaf architecture deployed inside CloudCO meets the network
requirements of adaptability, agility, scalability and dynamism.

5. Operational Considerations

RIFT presents the opportunity for organizations building and
operating IP fabrics to simplify their operation and deployments
while achieving many desirable properties of a dynamic routing
protocol on such a substrate:

* RIFT only floods routing information to the devices that
  absolutely need it. RIFT's design follows a philosophy of minimum
  blast radius and minimum necessary epistemological scope, which
  leads to good scaling properties while delivering maximum
  reactiveness.

* RIFT allows for extensive Zero Touch Provisioning within the
  protocol. In its most extreme version, RIFT does not rely on any
  specific addressing and, for IP fabrics, can operate using IPv6 ND
  [RFC4861] only.

* RIFT has provisions to detect common IP fabric mis-cabling
  scenarios.

* RIFT automatically negotiates BFD per link. This allows IP and
  micro-BFD [RFC7130] to replace Link Aggregation Groups (LAGs),
  which hide bandwidth imbalances in case of constituent failures.
  Further automatic link validation techniques similar to [RFC5357]
  could be supported as well.

* RIFT inherently solves many difficult problems associated with the
  use of traditional routing topologies with dense meshes and high
  degrees of ECMP by including automatic bandwidth balancing, flood
  reduction and automatic disaggregation on failures while providing
  maximum aggregation of prefixes in default scenarios.

* RIFT reduces FIB size towards the bottom of the IP fabric, where
  most nodes reside. This allows for cheaper hardware on the edges
  and the introduction of modern IP fabric architectures that
  encompass, e.g., server multi-homing.

* RIFT provides valley-free routing and is therefore loop free.
  This allows the use of any such valley-free path in bi-sectional
  fabric bandwidth between two destinations irrespective of their
  metrics, which can be used to balance load on the fabric in
  different ways.

* RIFT includes a key-value distribution mechanism which allows for
  many future applications such as automatic provisioning of basic
  overlay services or automatic key roll-overs over whole fabrics.

* RIFT is designed for minimum delay in case of prefix mobility on
  the fabric. In conjunction with [RFC8505], RIFT can differentiate
  anycast advertisements from mobility events and retain only the
  most recent advertisement in the latter case.

* Many further operational and design points collected over many
  years of routing protocol deployments have been incorporated in
  RIFT, such as fast flooding rates, protection of information
  lifetimes, and operationally easily recognizable remote ends of
  links and node names.

5.1. South Reflection

South reflection is a mechanism by which South Node TIEs are
"reflected" back up north to allow nodes in the same level without
east-west links to "see" each other.

For example, Spine111, Spine112, Spine121 and Spine122 each reflect
Node S-TIEs from ToF21 to ToF22 separately.
   Likewise, Spine111, Spine112, Spine121, and Spine122 each reflect
   Node S-TIEs from ToF22 to ToF21.  As a result, ToF22 and ToF21 see
   each other's node information as level 2 nodes.

   In an equivalent fashion, as the result of the south reflection
   between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122,
   Spine121 and Spine122 know each other at level 1.

5.2.  Suboptimal Routing on Link Failures

                  +--------+          +--------+
                  | ToF21  |          |  ToF22 |                 LEVEL 2
                  ++--+-+-++          ++-+--+-++
                   |  | | |            | |  | +
                   |  | | |            | |  | linkTS8
      +------------+  | +-+linkTS3+-+  | |  | +-------------+
      |               |   |         |  | |  +               |
      |    +---------------------------+ |  linkTS7         |
      |    |          |   |         +    +  +               |
      |    |          |   +-------+linkTS4+------------+    |
      |    |          |             +    +  |          |    |
      |    |          |    +-------------+--+          |    |
      |    |          |    |        | linkTS6          |    |
    +-+----+-+      +-+----+-+     ++--------+       +-+----+-+
    |Spine111|      |Spine112|     |Spine121 |       |Spine122|  LEVEL 1
    +-+---+--+      +-+----+-+     +-+---+---+       +-+----+-+
      |   |           |    |         |   |             |    |
      |   +-------------+  |         +   ++XX+linkSL6+---+  +
      |               | |  |     linkSL5 |               | linkSL8
      |   +-----------+ |  |         +   +---+linkSL7+-+ |  +
      |   |             |  |         |   |               |  |
    +-+---+-+        +--+--+-+     +-+---+-+          +--+--+-+
    |Leaf111|        |Leaf112|     |Leaf121|          |Leaf122|  LEVEL 0
    +-+-----+        +-+-----+     +-----+-+          +-+-----+
      +                +                 +              +
  Prefix111        Prefix112         Prefix121      Prefix122

        Figure 4: Suboptimal routing upon link failure use case

   As shown in Figure 4, as the result of the south reflection between
   Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
   Spine122 know each other at level 1.

   Without the disaggregation mechanism, when linkSL6 fails, a packet
   from Leaf121 to Prefix122 will probably go up through linkSL5 to
   linkTS3 and then down through linkTS4 and linkSL8 to Leaf122, or go
   up through linkSL5 to linkTS6 and then down through linkTS8 and
   linkSL8 to Leaf122, based on the pure default route.  This is a case
   of suboptimal routing, or bow-tieing.
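
   The effect of a pure default route can be sketched with a small
   longest-prefix-match lookup.  The following is a minimal Python
   sketch, not RIFT code: the FIB layout, the lookup helper, and the
   address block 2001:db8:122::/64 standing in for Prefix122 are
   assumptions made for illustration; only the link names come from
   Figure 4.  With the default route alone, Leaf121 may pick either
   uplink; a more-specific route for Prefix122, such as the one that
   disaggregation provides, narrows the choice to linkSL7.

```python
import ipaddress

# Hypothetical stand-in for Prefix122; the draft does not give addresses.
PREFIX122 = ipaddress.ip_network("2001:db8:122::/64")
DEFAULT = ipaddress.ip_network("::/0")


def lookup(fib, destination):
    """Return the next-hop links of the longest prefix covering destination."""
    address = ipaddress.ip_address(destination)
    candidates = [prefix for prefix in fib if address in prefix]
    best = max(candidates, key=lambda prefix: prefix.prefixlen)
    return fib[best]


# Leaf121 with only the default route learned from both spines.
leaf121_fib = {DEFAULT: {"linkSL5", "linkSL7"}}
before = lookup(leaf121_fib, "2001:db8:122::1")

# Same FIB once a more-specific route for Prefix122 is known via Spine122.
leaf121_fib[PREFIX122] = {"linkSL7"}
after = lookup(leaf121_fib, "2001:db8:122::1")

print(sorted(before))  # both uplinks are eligible, including the bow-tie path
print(sorted(after))   # only the uplink towards Spine122 remains
```

   The lookup itself is ordinary longest-prefix matching, which is why
   this kind of steering needs no special forwarding support on the
   leaf.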
   With the disaggregation mechanism, when linkSL6 fails, Spine122
   detects the failure from the reflected Node S-TIE of Spine121.  Based
   on the disaggregation algorithm provided by RIFT, Spine122 explicitly
   advertises Prefix122 in a Disaggregated Prefix S-TIE
   PrefixesElement(prefix122, cost 1).  A packet from Leaf121 to
   Prefix122 is then sent only over linkSL7, following a longest-prefix
   match on Prefix122, and goes down through linkSL8 to Leaf122.

5.3.  Black-Holing on Link Failures

                    +--------+          +--------+
                    | ToF 21 |          | ToF 22 |                LEVEL 2
                    ++-+--+-++          ++-+--+-++
                     | |  | |            | |  | +
                     | |  | |            | |  | linkTS8
      +--------------+ |  +-+linkTS3+X+  | |  | +--------------+
   linkTS1             |    |         |  | |  +                |
      +    +-----------------------------+ |  linkTS7          |
      |    |           +    |         +    +  +                |
      |    |        linkTS2 +-------+linkTS4+X+----------+     |
      |    +           +              +    +  |          |     |
      | linkTS5        +-+    +------------+--+          |     |
      |    +             |    |       | linkTS6          |     |
    +-+----+-+         +-+----+-+    ++-------+        +-+-----++
    |Spine111|         |Spine112|    |Spine121|        |Spine122| LEVEL 1
    +-+---+--+         ++----+--+    +-+---+--+        +-+----+-+
      |   |             |    |         |   |             |    |
      +   +---------------+  |         +   +---+linkSL6+---+  +
   linkSL1              | |  |     linkSL5 |               | linkSL8
      +   +--+linkSL3+--+ |  |         +   +---+linkSL7+-+ |  +
      |   |               |  |         |   |               |  |
    +-+---+-+          +--+--+-+     +-+---+-+          +--+--+-+
    |Leaf111|          |Leaf112|     |Leaf121|          |Leaf122| LEVEL 0
    +-+-----+          +-+-----+     +-----+-+          +-----+-+
      +                  +                 +                  +
  Prefix111          Prefix112         Prefix121          Prefix122

           Figure 5: Black-holing upon link failure use case

   This scenario illustrates a case where a double link failure occurs
   and black-holing can happen as a result.

   Without the disaggregation mechanism, when linkTS3 and linkTS4 both
   fail, packets from Leaf111 to Prefix122 suffer 50% black-holing
   based on the pure default route.  A packet that goes up through
   linkSL1 to linkTS1 and is then supposed to go down through linkTS3
   or linkTS4 will be dropped.
   A packet that goes up through linkSL3 to linkTS2 and is then
   supposed to go down through linkTS3 or linkTS4 will be dropped as
   well.  This is a case of black-holing.

   With the disaggregation mechanism, when linkTS3 and linkTS4 both
   fail, ToF22 detects the failure from the reflected Node S-TIE of
   ToF21 received from Spine111 and Spine112.  Based on the
   disaggregation algorithm provided by RIFT, ToF22 explicitly
   originates an S-TIE with Prefix121 and Prefix122, which is flooded
   to Spine111, Spine112, Spine121, and Spine122.

   A packet from Leaf111 to Prefix122 will then not be routed to
   linkTS1 or linkTS2.  It will only be routed to linkTS5 or linkTS7,
   following a longest-prefix match on Prefix122.

5.4.  Zero Touch Provisioning (ZTP)

   RIFT is designed to require only minimal configuration in order to
   simplify its operation and avoid human errors.  Based on that
   minimal information, Zero Touch Provisioning (ZTP) autoconfigures
   the key operational parameters of all the RIFT nodes: on the one
   hand, the SystemID of the node, which must be unique in the RIFT
   network, and on the other hand, the level of the node in the Fat
   Tree, which determines which peers are northwards "parents" and
   which are southwards "children".

   ZTP is always on, but its decisions can be overridden when a network
   administrator prefers to impose their own configuration.  In that
   case, it is the responsibility of the administrator to ensure that
   the configured parameters are correct, in other words, that the
   SystemID of each node is unique and that the administratively set
   levels truly reflect the relative position of the nodes in the
   fabric.  It is recommended to let ZTP configure the network.  When
   that is not done, it is recommended to configure the level of all
   the nodes, except those that are forced as leaves, to avoid an
   undesirable interaction between ZTP and the manual configuration.
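
   The two responsibilities that fall on the administrator when ZTP is
   overridden can be sketched as a simple pre-deployment check.  This
   is a minimal Python sketch under stated assumptions: the node and
   link data layout and the function name are hypothetical, and the
   level rule is simplified to "every link joins adjacent levels",
   which ignores horizontal links and other topology variations.

```python
from collections import Counter


def check_manual_config(nodes, links):
    """Return a list of problems found in a manually configured fabric.

    nodes maps a node name to a (system_id, level) tuple and links is a
    list of (node_a, node_b) name pairs; both layouts are illustrative.
    """
    problems = []
    id_counts = Counter(system_id for system_id, _ in nodes.values())
    for system_id, count in sorted(id_counts.items()):
        if count > 1:
            problems.append(f"SystemID {system_id} is used by {count} nodes")
    for a, b in links:
        level_a, level_b = nodes[a][1], nodes[b][1]
        # Assumption: only north-south links between adjacent levels exist.
        if abs(level_a - level_b) != 1:
            problems.append(f"link {a}-{b} joins levels {level_a} and {level_b}")
    return problems


nodes = {
    "ToF21": (1, 2),
    "Spine111": (2, 1),
    "Leaf111": (3, 0),
    "Leaf112": (3, 0),  # duplicate SystemID, configured by mistake
}
links = [("ToF21", "Spine111"), ("Spine111", "Leaf111"), ("ToF21", "Leaf112")]
for problem in check_manual_config(nodes, links):
    print(problem)
```

   A check of this kind only validates the intended design; it does not
   replace the detection that RIFT performs on the live adjacencies.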
   ZTP requires that the administrator points out the Top-of-Fabric
   (ToF) nodes to set the baseline from which the fabric topology is
   derived.  The ToF nodes are configured with the TOP_OF_FABRIC flag
   and act as the initial 'seeds' needed for other ZTP nodes to derive
   their level in the topology.  ZTP computes the level of each node
   based on the Highest Available Level (HAL) of the potential
   parent(s) nearest that baseline, which represents the superspine.
   In a way, RIFT can be seen as a distance-vector protocol that
   computes a set of feasible successors towards the superspine and
   autoconfigures the rest of the topology.

   The autoconfiguration mechanism computes a global maximum of levels
   by diffusion.  The level of each node is then derived from the Link
   Information Elements (LIEs) received from its neighbors, whereby
   each node (with the possible exception of configured leaves) tries
   to attach at the highest possible point in the fabric.  This
   guarantees that even if the diffusion front reaches a node from
   "below" faster than from "above", the node will greedily abandon an
   already negotiated level derived from nodes topologically below it
   and properly peer with nodes above.

   The achieved equilibrium can be disturbed massively by all nodes
   with the highest level either leaving or entering the domain (with
   some finer distinctions not explained further).  It is therefore
   recommended that each node is multi-homed towards nodes with
   respective HAL offerings.  Fortunately, this is the natural state of
   things for the topology variants considered in RIFT.

   A RIFT node may also be configured to confine it to the leaf role
   with the LEAF_ONLY flag.  A leaf node can also be configured to
   support leaf-2-leaf procedures with the LEAF_2_LEAF flag.  In either
   case, the node cannot be TOP_OF_FABRIC and its level cannot be
   configured.
   RIFT will fully configure the node's level after it is attached to
   the topology and ensure that the node is at the "bottom of the
   hierarchy" (southernmost).

5.5.  Mis-cabling Examples

     +----------------+              +-----------------+
     |     ToF21      |       +------+     ToF22       |    LEVEL 2
     +-------+----+---+       |      +----+---+--------+
     |       |    |   |       |      |    |   |        |
     |       |    |   +----------------------------+   |
     |   +---------------------------+    |   |    |   |
     |   |   |    |           |           |   |    |   |
     |   |   |    |   +-----------------------+    |   |
     |   |   +------------------------+   |        |   |
     |   |        |   |       |       |   |        |   |
   +-+---+--+   +-+---+--+    |    +--+---+-+   +--+---+-+
   |Spine111|   |Spine112|    |    |Spine121|   |Spine122|  LEVEL 1
   +-+---+--+   ++----+--+    |    +--+---+-+   +-+----+-+
     |   |       |    |       |       |   |       |    |
     |   +---------+  |    link-M     |   +---------+  |
     |           | |  |       |       |           | |  |
     |   +-------+ |  |       |       |   +-------+ |  |
     |   |         |  |       |       |   |         |  |
   +-+---+-+    +--+--+-+     |     +-+---+-+    +--+--+-+
   |Leaf111|    |Leaf112+-----+     |Leaf121|    |Leaf122|  LEVEL 0
   +-------+    +-------+           +-------+    +-------+

              Figure 6: A single plane mis-cabling example

   Figure 6 shows a single plane mis-cabling example.  It is a perfect
   Fat Tree fabric except for link-M, which connects Leaf112 to ToF22.

   The RIFT control protocol can discover the physical links
   automatically and is able to detect cabling that violates Fat Tree
   topology constraints.  It reacts accordingly to such mis-cabling
   attempts, at a minimum preventing adjacencies between nodes from
   being formed and traffic from being forwarded on those mis-cabled
   links.  In such a scenario, Leaf112 will use link-M to derive its
   level (unless it is configured as a leaf) and can report the links
   to Spine111 and Spine112 as mis-cabled, unless the implementation
   allows horizontal links.

   Figure 7 shows a multiple plane mis-cabling example.  Since Leaf112
   and Spine121 belong to two different PoDs, the adjacency between
   Leaf112 and Spine121 cannot be formed.  The mis-cabled link-W would
   be detected and the adjacency prevented.
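
   The two kinds of check behind the single plane and multiple plane
   examples can be sketched as one adjacency test.  This is a minimal
   Python sketch and not the adjacency rules of [RIFT]: the Node type,
   the classify_link function, and the simplified rules (levels already
   fixed, PoD compared only when both ends have one, adjacent levels
   required) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    level: int
    pod: Optional[int] = None  # None: not tied to a PoD (e.g., a ToF node)


def classify_link(local, remote, allow_horizontal=False):
    """Return "ok" or a reason why the adjacency should not be formed."""
    both_in_pods = local.pod is not None and remote.pod is not None
    if both_in_pods and local.pod != remote.pod:
        return "mis-cabled: different PoDs"
    difference = abs(local.level - remote.level)
    if difference == 0:
        return "ok" if allow_horizontal else "mis-cabled: horizontal link"
    if difference > 1:
        return "mis-cabled: levels are not adjacent"
    return "ok"


# Leaf112 is assumed to be pinned at level 0 in PoD 1 for this example.
leaf112 = Node("Leaf112", level=0, pod=1)
print(classify_link(leaf112, Node("Spine112", level=1, pod=1)))  # regular link
print(classify_link(leaf112, Node("ToF22", level=2)))            # like link-M
print(classify_link(leaf112, Node("Spine121", level=1, pod=2)))  # like link-W
```

   When Leaf112 is not pinned as a leaf, the outcome differs, as the
   text above notes: it derives its level from link-M and its links to
   Spine111 and Spine112 become the suspicious ones.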
   +-------+    +-------+           +-------+    +-------+
   |ToF  A1|    |ToF  A2|           |ToF  B1|    |ToF  B2|  LEVEL 2
   +-------+    +-------+           +-------+    +-------+
   |       |    |       |           |       |    |       |
   |       |    |       +-----------------+ |    |       |
   |       +--------------------------+   | |    |       |
   |            |                   | |   | |    |       |
   |     +------+                   | |   | +------+     |
   |     |        +-----------------+ |   |      | |     |
   |     |        |   +--------------------------+ |     |
   |  A  |        | B |               | A |        |  B  |
   +-----+--+   +-+---+--+         +--+---+-+   +--+-----+
   |Spine111|   |Spine112|     +---+Spine121|   |Spine122|  LEVEL 1
   +-+---+--+   ++----+--+     |   +--+---+-+   +-+----+-+
     |   |       |    |        |      |   |       |    |
     |   +---------+  |        |      |   +---------+  |
     |           | |  |     link-W    |           | |  |
     |   +-------+ |  |        |      |   +-------+ |  |
     |   |         |  |        |      |   |         |  |
   +-+---+-+    +--+--+-+      |    +-+---+-+    +--+--+-+
   |Leaf111|    |Leaf112+------+    |Leaf121|    |Leaf122|  LEVEL 0
   +-------+    +-------+           +-------+    +-------+
 +--------PoD#1----------+        +---------PoD#2---------+

             Figure 7: A multiple plane mis-cabling example

   RIFT provides an optional level determination procedure in its Zero
   Touch Provisioning mode.  Nodes in the fabric without their level
   configured determine it automatically.  However, this can have
   counter-intuitive consequences.  One extreme failure scenario is
   depicted in Figure 8: if all northbound links of Spine11 fail at the
   same time, Spine11 negotiates a lower level than Leaf111 and
   Leaf112.

   To prevent such a scenario, where leaves are expected to act as
   switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
   Since level -1 is invalid, Spine11 would not derive a valid level
   from the topology in Figure 8.  It will be isolated from the whole
   fabric, and it would be up to the leaves to declare the links
   towards such a spine as mis-cabled.
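
   The arithmetic behind the fallen spine can be sketched as follows.
   This is a minimal Python sketch, assuming that a node derives its
   level as one below the highest level offered by its neighbors and
   that a negative result is invalid; the function name and the level
   values 24 and 22 are illustrative assumptions and are not taken from
   the figure.

```python
def derive_level(offered_levels):
    """Derive a level one below the highest offer; None if that is invalid."""
    if not offered_levels:
        return None
    level = max(offered_levels) - 1
    return level if level >= 0 else None


# Illustrative level values only.
print(derive_level([24, 24]))  # spine with its ToF uplinks intact
print(derive_level([22, 22]))  # uplinks lost, leaves kept their derived level
print(derive_level([0, 0]))    # uplinks lost, leaves pinned with LEAF_ONLY
```

   In the second case the spine ends up below its former leaves, which
   is the counter-intuitive outcome described above; in the third case
   no valid level exists and the spine stays isolated.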
   +-------+    +-------+             +-------+    +-------+
   |ToF  A1|    |ToF  A2|             |ToF  A1|    |ToF  A2|
   +-------+    +-------+             +-------+    +-------+
   |       |    |       |                     |            |
   |    +-------+       |                     |            |
   +    +  |            |    ====>            |            |
   X    X  +------+     |                     +------+     |
   +    +         |     |                            |     |
   +----+--+    +-+-----+                          +-+-----+
   |Spine11|    |Spine12|                          |Spine12|
   +-+---+-+    ++----+-+                          ++----+-+
     |   |       |    |                             |    |
     |   +---------+  |                             |    |
     |           | |  |                             |    |
     |   +-------+ |  |                     +-------+    |
     |   |         |  |                     |            |
   +-+---+-+    +--+--+-+             +-----+-+    +-----+-+
   |Leaf111|    |Leaf112|             |Leaf111|    |Leaf112|
   +-------+    +-------+             +-+-----+    +-+-----+
                                        |            |
                                        |   +--------+
                                        |   |
                                      +-+---+-+
                                      |Spine11|
                                      +-------+

                         Figure 8: Fallen spine

5.6.  Positive vs. Negative Disaggregation

   Disaggregation is the procedure whereby [RIFT] advertises a more
   specific route southwards as an exception to the aggregated
   fabric-default north.  Disaggregation is useful when a prefix within
   the aggregation is reachable via some of the parents but not the
   others at the same level of the fabric.  It is mandatory when the
   level is the ToF, since a ToF node that cannot reach a prefix
   becomes a black hole for that prefix.  The hard problem is to know
   which prefixes are reachable by whom.

   In the general case, [RIFT] solves that problem by interconnecting
   the ToF nodes, so that they can exchange the full list of prefixes
   that exist in the fabric and figure out when a ToF node lacks
   reachability to an existing prefix.  This requires additional ports
   at the ToF, typically 2 ports per ToF node to form a ToF-spanning
   ring.  [RIFT] also defines the southbound reflection procedure that
   enables a parent to explore the direct connectivity of its peers,
   meaning their own parents and children; based on the advertisements
   received from the shared parents and children, it may enable the
   parent to infer the prefixes its peers can reach.
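
   Once the ToF nodes share the full list of prefixes, finding out who
   lacks reachability is a set difference.  The following minimal
   Python sketch illustrates that step only; the function name and the
   data layout are assumptions, and the reachability sets model the
   double failure of Figure 5 rather than any real exchange of TIEs.

```python
def missing_prefixes(reachable):
    """Map each ToF node to the fabric prefixes it cannot reach."""
    all_prefixes = set().union(*reachable.values())
    return {tof: all_prefixes - prefixes for tof, prefixes in reachable.items()}


# Illustrative reachability after linkTS3 and linkTS4 fail in Figure 5.
reachable = {
    "ToF21": {"Prefix111", "Prefix112"},
    "ToF22": {"Prefix111", "Prefix112", "Prefix121", "Prefix122"},
}
for tof, missing in missing_prefixes(reachable).items():
    print(tof, sorted(missing))
```

   The prefixes reported for one ToF node are the candidates for
   disaggregation, either advertised positively by its peers or
   negatively by the node itself.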
   When a parent lacks reachability to a prefix, it may disaggregate the
   prefix negatively, i.e., advertise that this parent can be used to
   reach any prefix in the aggregation except that one.  The Negative
   Disaggregation signaling is simple and functions transitively from
   ToF to Top-of-Pod (ToP) and then from ToP to Leaf.  But it is hard
   for a parent to figure out which prefix it needs to disaggregate,
   because it does not know what it does not know; as a result, a
   spanning ring at the ToF is required to operate the Negative
   Disaggregation.  Also, though it is only an implementation problem,
   programming the FIB is complex compared to normal routes and may
   incur recursions.

   The more classical alternative is, for the parents that can reach a
   prefix that peers at the same level cannot, to advertise a more
   specific route to that prefix.  This leverages the normal
   longest-prefix match in the FIB and does not require a special
   implementation.  But as opposed to the Negative Disaggregation, the
   Positive Disaggregation is difficult and inefficient to operate
   transitively.

   Transitivity is not needed to a grandchild if all its parents
   received the Positive Disaggregation, meaning that they shall all
   avoid the black hole; when that is the case, they collectively build
   a ceiling that protects the grandchild.  But until then, a parent
   that received a Positive Disaggregation may believe that some peers
   are lacking the reachability and readvertise too early, or defer and
   maintain a black hole situation longer than necessary.

   In a non-partitioned fabric, all the ToF nodes see one another
   through the reflection and can figure out if one is missing a child.
   In that case, it is possible to compute the prefixes that the peer
   cannot reach and disaggregate positively without a ToF-spanning
   ring.
   The ToF nodes can also ascertain that the ToP nodes are each
   connected to at least one ToF node that can still reach the prefix,
   meaning that the transitive operation is not required.

   The bottom line is that in a fabric that is partitioned (e.g., using
   multiple planes) and/or where the ToP nodes are not guaranteed to
   always form a ceiling for their children, it is mandatory to use the
   Negative Disaggregation.  On the other hand, in a highly symmetrical
   and fully connected fabric (e.g., a canonical Clos Network), the
   Positive Disaggregation method saves the complexity and cost
   associated with the ToF-spanning ring.

   Note that in the case of Positive Disaggregation, the first ToF
   node(s) that announces a more-specific route attracts all the
   traffic for that route and may suffer from a transient incast.  A
   ToP node that defers injecting the longer prefix in the FIB, in
   order to receive more advertisements and spread the packets better,
   also keeps on sending a portion of the traffic to the black hole in
   the meantime.  In the case of Negative Disaggregation, the last ToF
   node(s) that injects the route may also incur an incast issue; this
   problem would occur if a prefix that becomes totally unreachable is
   disaggregated, but doing so is mostly useless and is not
   recommended.

5.7.  Mobile Edge and Anycast

   When a physical or a virtual node changes its point of attachment in
   the fabric from a previous-leaf to a next-leaf, new routes must be
   installed that supersede the old ones.  Since the flooding flows
   northwards, the nodes (if any) between the previous-leaf and the
   common parent are not immediately aware that the path via the
   previous-leaf is obsolete, and a stale route may exist for a while.
   The common parent needs to select the freshest route advertisement
   in order to install the correct route via the next-leaf.
   This requires that the fabric determines the sequence of the
   movements of the mobile node.

   On the one hand, a classical sequence counter provides a total order
   for a while, but it will eventually wrap.  On the other hand, a
   timestamp provides a permanent order, but it may miss a movement that
   happens too quickly with respect to the granularity of the timing
   information.  It is not envisioned in the short term that the average
   fabric supports the Precision Time Protocol [IEEEstd1588], and the
   precision that may be available with the Network Time Protocol
   [RFC5905], in the order of 100 to 200 ms, may not necessarily be
   enough to cover, e.g., the fast mobility of a Virtual Machine.

   Section 4.3.3, "Mobility", of [RIFT] specifies a hybrid method that
   combines a sequence counter from the mobile node and a timestamp from
   the network taken at the leaf when the route is injected.  If the
   timestamps of the concurrent advertisements are comparable (i.e.,
   more distant than the precision of the timing protocol), then the
   timestamp alone is used to determine the relative freshness of the
   routes.  Otherwise, the sequence counter from the mobile node, if
   available, is used.  One caveat is that the sequence counter must not
   wrap within the precision of the timing protocol.  Another is that
   the mobile node may not even provide a sequence counter, in which
   case the mobility itself must be slower than the precision of the
   timing.

   Mobility must not be confused with anycast.  In both cases, the same
   address is injected in RIFT at different leaves.  In the case of
   mobility, only the freshest route must be conserved, since the mobile
   node changed its point of attachment from one leaf to the next.
   In the case of anycast, the node may be either multihomed (attached
   to multiple leaves in parallel) or reachable beyond the fabric via
   multiple routes that are redistributed to different leaves; either
   way, the multiple routes are equally valid and should be conserved.
   Without further information from the redistributed routing protocol,
   it is impossible to distinguish a movement from a redistribution
   that happens asynchronously on different leaves.  [RIFT] expects that
   anycast addresses are advertised within the timing precision, which
   is typically the case with a low-precision timing and a multihomed
   node.  Beyond that time interval, RIFT interprets the lag as a
   mobility event and only the freshest route is retained.

   When using IPv6 [RFC8200], RIFT suggests leveraging "Registration
   Extensions for IPv6 over Low-Power Wireless Personal Area Network
   (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND
   interaction between the mobile node and the leaf.  This provides not
   only a sequence counter but also a lifetime and a security token that
   may be used to protect the ownership of an address [RFC8928].  When
   using [RFC8505], the parallel registration of an anycast address to
   multiple leaves is done with the same sequence counter, whereas the
   sequence counter is incremented when the point of attachment
   changes.  This way, it is possible to differentiate a mobile node
   from a multihomed node, even when the mobility happens within the
   timing precision.  It is also possible for a mobile node to be
   multihomed as well, e.g., to change only one of its points of
   attachment.

5.8.  IPv4 over IPv6

   RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network.  The
   IPv6 Address Family (AF) is configured via the usual Neighbor
   Discovery (ND) mechanisms, and IPv4 can then use IPv6 next hops
   analogous to [RFC8950].
   It is expected that the whole fabric supports the same type of
   forwarding of address families on all the links.  RIFT provides an
   indication whether a node is capable of IPv4 forwarding, and
   implementations are possible where different routing tables are
   computed per address family, as long as the computation remains
   loop-free.

                    +-----+        +-----+
    +---+---+       | ToF |        | ToF |
        ^           +--+--+        +-----+
        |           |  |           |     |
        |           |  +-------------+   |
        |           |     +--------+ |   |
        +           |     |          |   |
       V6           +-----+        +-+---+
   Forwarding       |Spine|        |Spine|
        +           +--+--+        +-----+
        |           |  |           |     |
        |           |  +-------------+   |
        |           |     +--------+ |   |
        |           |     |          |   |
        v           +-----+        +-+---+
    +---+---+       |Leaf |        | Leaf|
                    +--+--+        +--+--+
                       |              |
          IPv4 prefixes|              |IPv4 prefixes
                       |              |
                   +---+----+     +---+----+
                   |   V4   |     |   V4   |
                   | subnet |     | subnet |
                   +--------+     +--------+

                        Figure 9: IPv4 over IPv6

5.9.  In-Band Reachability of Nodes

   RIFT does not require that nodes of the fabric have reachable
   addresses, but there may be operational reasons to reach the
   internal nodes.  Figure 10 shows an example in which the network
   management station (NMS) attaches to Leaf1.

                  +-------+      +-------+
                  | ToF1  |      | ToF2  |
                  ++-----++      ++-----++
                   |     |        |     |
                   |     +----------+   |
                   |     +--------+ |   |
                   |     |          |   |
                  ++-----++      +--+---++
                  |Spine1 |      |Spine2 |
                  ++-----++      ++-----++
                   |     |        |     |
                   |     +----------+   |
                   |     +--------+ |   |
                   |     |          |   |
                  ++-----++      +--+---++
                  | Leaf1 |      | Leaf2 |
                  +---+---+      +-------+
                      |
                      |NMS

                Figure 10: In-Band reachability of node

   If the NMS wants to access Leaf2, it simply works, because the
   loopback address of Leaf2 is flooded in its Prefix North TIE.

   If the NMS wants to access Spine2, it simply works too, because a
   spine node always advertises its loopback address in the Prefix
   North TIE.
   The NMS may reach Spine2 via Leaf1-Spine2 or via
   Leaf1-Spine1-ToF1/ToF2-Spine2.

   If the NMS wants to access ToF2, ToF2's loopback address needs to be
   injected into its Prefix South TIE.  This TIE must be seen by all
   nodes at the level below (the spine nodes in Figure 10), which must
   form a ceiling for all the traffic coming from below (south).
   Otherwise, the traffic from the NMS may follow the default route to
   the wrong ToF node, e.g., ToF1.

   In a fully connected ToF, in case of failure between ToF2 and the
   spine nodes, ToF2's loopback address must be disaggregated
   recursively all the way to the leaves.

   In a partitioned ToF, a ToF node is only reachable within its plane,
   and the disaggregation to the leaves is also required.  A possible
   alternative is to use the ring that interconnects the ToF nodes to
   transmit packets between them for their loopback addresses only.
   The idea is that this is mostly control traffic and should not alter
   the load-balancing properties of the fabric.

5.10.  Dual Homing Servers

   Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode.
   It has no configuration (unless it is a Top-of-Fabric at the top of
   the topology, or it must operate in the topology as a leaf and/or
   support leaf-2-leaf procedures), and it will fully configure itself
   after being attached to the topology.
              +---+         +---+         +---+
              |ToF|         |ToF|         |ToF|         ToF
              +---+         +---+         +---+
              |   |         |   |         |   |
              |   +----------------+      |   |
              |             |   |  |      |   |
              |          +----------------+   |
              |          |  |   |  |          |
              +----------+--+   +--+----------+
              |    ToR1     |   |    ToR2     |         Spine
              +--+------+---+   +--+-------+--+
          +---+  |      |   |   |  |       |  +---+
          |      |      |   |   |  |       |      |
          |   +-----------------+  |       |      |
          |   |  |   +-------------+       |      |
          +   |  +   |  |   |-----------------+   |
          X   |  X   |  +--------x-----+   |  X   |
          +   |  +   |                 |   |  +   |
          +---+  +---+                 +---+  +---+
          |   |  |   |                 |   |  |   |
          +---+  +---+  ...............+---+  +---+
          SV(1)  SV(2)               SV(n+1)  SV(n)     Leaf

                     Figure 11: Dual-homing servers

   In the single plane, the worst condition is disaggregation of every
   other server at the same level.  Suppose the links from ToR1 (Top of
   Rack) to all the leaves become unavailable.  All the servers' routes
   are disaggregated, and the FIB of each server will be expanded with
   n-1 more-specific routes.

   Sometimes, people may prefer to disaggregate from the ToR to the
   servers from the start, i.e., the servers have a couple of tens of
   routes in the FIB from the start, besides default routes, to avoid
   breakages at the rack level.  Full disaggregation of the fabric
   could be achieved by configuration supported by RIFT.

5.11.  Fabric With A Controller

   There are many different ways to deploy the controller.  One
   possibility is attaching a controller to the RIFT domain from the
   ToF, and another possibility is attaching a controller from the
   leaf.
                        +------------+
                        | Controller |
                        ++----------++
                         |          |
                         |          |
                    +----++        ++----+
    -------         | ToF |        | ToF |
    |               +--+--+        +-----+
    |               |  |           |     |
    |               |  +-------------+   |
    |               |     +--------+ |   |
    |               |     |          |   |
                    +-----+        +-+---+
  RIFT domain       |Spine|        |Spine|
                    +--+--+        +-----+
    |               |  |           |     |
    |               |  +-------------+   |
    |               |     +--------+ |   |
    |               |     |          |   |
    |               +-----+        +-+---+
    -------         |Leaf |        | Leaf|
                    +-----+        +-----+

                  Figure 12: Fabric with a controller

5.11.1.  Controller Attached to ToFs

   If a controller is attached to the RIFT domain from the ToF, it
   usually uses dual-homing connections.  The loopback prefix of the
   controller should be advertised down by the ToF and spine nodes to
   the leaves.  If the controller loses its link to a ToF, make sure
   that the ToF withdraws the prefix of the controller (different
   mechanisms can be used).

5.11.2.  Controller Attached to Leaf

   If the controller is attached to the fabric from a leaf, no special
   provisions are needed.

5.12.  Internet Connectivity With Underlay

   If global addressing is running without overlay, an external default
   route needs to be advertised through the RIFT fabric to achieve
   internet connectivity.  For the purpose of forwarding across the
   entire RIFT fabric, an internal fabric prefix needs to be advertised
   in the South Prefix TIE by the ToF and spine nodes.

5.12.1.  Internet Default on the Leaf

   In case an internet access request comes from a leaf and the
   internet gateway is another leaf, the leaf node acting as the
   internet gateway needs to advertise a default route in its Prefix
   North TIE.

5.12.2.  Internet Default on the ToFs

   In case an internet access request comes from a leaf and the
   internet gateway is a ToF, the ToF and spine nodes need to advertise
   a default route in the Prefix South TIE.

5.13.  Subnet Mismatch and Address Families

            +--------+                     +--------+
            |        |  LIE           LIE  |        |
            |   A    |  +---->     <----+  |   B    |
            |        +---------------------+        |
            +--------+                     +--------+
               X/24                           Y/24

                       Figure 13: Subnet mismatch

   LIEs are exchanged over all links running RIFT to perform Link
   (Neighbor) Discovery.  A node must NOT originate LIEs on an address
   family if it does not process received LIEs on that family.  LIEs on
   the same link are considered part of the same negotiation
   independent of the address family they arrive on.  An implementation
   must be ready to accept TIEs on all addresses it used as source of
   LIE frames.

   As shown in Figure 13, without further checks, an adjacency between
   node A and node B may form, but the forwarding between node A and
   node B may fail because subnet X mismatches with subnet Y.

   To prevent this, a RIFT implementation should check for subnet
   mismatch just like, e.g., IS-IS does.  This can lead to scenarios
   where an adjacency, despite the exchange of LIEs in both address
   families, may end up being formed in a single AF only.  This is a
   consideration especially in the scenarios of Section 5.8.

5.14.  Anycast Considerations

                           + traffic
                           |
                           v
                    +------+------+
                    |     ToF     |
                    +---+-----+---+
                    |   |     |   |
       +------------+   |     |   +------------+
       |                |     |                |
   +---+---+    +-------+     +-------+    +---+---+
   |       |    |       |     |       |    |       |
   |Spine11|    |Spine12|     |Spine21|    |Spine22|  LEVEL 1
   +-+---+-+    ++----+-+     +-+---+-+    ++----+-+
     |   |       |    |         |   |       |    |
     |   +---------+  |         |   +---------+  |
     |           | |  |         |           | |  |
     |   +-------+ |  |         |   +-------+ |  |
     |   |         |  |         |   |         |  |
   +-+---+-+    +--+--+-+     +-+---+-+    +--+--+-+
   |       |    |       |     |       |    |       |
   |Leaf111|    |Leaf112|     |Leaf121|    |Leaf122|  LEVEL 0
   +-+-----+    ++------+     +-----+-+    +-----+-+
     +           +                  +        ^   |
  PrefixA     PrefixB            PrefixA     | PrefixC
                                             |
                                             + traffic

                           Figure 14: Anycast

   If the traffic comes from the ToF to Leaf111 or Leaf121, which both
   have the anycast prefix PrefixA, RIFT can deal with this case well.
   But if the traffic comes from Leaf122, it arrives at Spine21 or
   Spine22 at level 1.  Spine21 and Spine22 do not know about the other
   PrefixA attached to Leaf111, so the traffic will always get to
   Leaf121 and never to Leaf111.  If the intention is that the traffic
   should be offloaded to Leaf111, then policy-guided prefixes as
   defined in [RIFT] should be used.

5.15.  IoT Applicability

   The design of RIFT inherits from RPL [RFC6550] the anisotropic design
   of a default route upwards (northwards); it also inherits the
   capability to inject external host routes at the Leaf level using
   Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host
   and a RIFT router.  Both the RPL and the RIFT protocols are meant for
   large scale, and WiND enables device mobility at the edge the same
   way in both cases.

   The main difference between RIFT and RPL is that with RPL, there is
   a single Root, whereas RIFT has many ToF nodes.  This adds huge
   capabilities for leaf-2-leaf ECMP paths, but also additional
   complexity with the need to disaggregate.
   Also, RIFT uses Link State flooding northwards and is not designed
   for low-power operation.

   Still, nothing prevents the IP devices connected at the Leaf from
   being IoT (Internet of Things) devices, which typically expose their
   address using WiND, which is an upgrade from 6LoWPAN ND [RFC6775].

   A network that serves high-speed/high-power IoT devices should
   typically provide deterministic capabilities for applications such
   as high-speed control loops or movement detection.  The Fat Tree is
   highly reliable and, in normal conditions, provides an equivalent
   multipath operation; but ECMP does not provide hard guarantees for
   either delivery or latency.  As long as the fabric is non-blocking,
   the result is the same; but there can be load imbalances resulting
   in incast and possibly congestion loss that will prevent delivery
   within bounded latency.

   This could be alleviated with Packet Replication, Elimination, and
   Ordering Functions (PREOF) [RFC8655] leaf-2-leaf, but PREOF is hard
   to provide at the scale of all flows, and the replication may
   increase the probability of the overload that it attempts to solve.

   Note that load balancing is not RIFT's problem, but it is key to
   serving IoT adequately.

5.16.  Key Management

   As outlined in the "Security Considerations" section of [RIFT],
   either a private shared key or a public/private key pair is used to
   authenticate the adjacency.  Both the key distribution and key
   synchronization methods are out of scope for this document.  Both
   nodes in the adjacency must share the same keys, key type, and
   algorithm for a given key ID.  Mismatched keys will not
   interoperate, as their security envelopes will be unverifiable.

   Key roll-over while the adjacency is active may be supported.  The
   specific mechanism is well documented in [RFC6518].

6.  Security Considerations

   This document presents the applicability of RIFT.  As such, it does
   not introduce any security considerations.  However, a number of
   security concerns are discussed in [RIFT].

7.  IANA Considerations

   This document has no IANA actions.

8.  Contributors

   The following people (listed in alphabetical order) contributed
   significantly to the content of this document and should be
   considered co-authors:

   Tom Verhaeg
   Juniper Networks
   Email: tverhaeg@juniper.net

   Tony Przygienda
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA 94089
   US
   Email: prz@juniper.net

9.  Normative References

   [ISO10589-Second-Edition]
              International Organization for Standardization,
              "Intermediate system to Intermediate system intra-domain
              routeing information exchange protocol for use in
              conjunction with the protocol for providing the
              connectionless-mode Network Service (ISO 8473)", November
              2002.

   [TR-384]   Broadband Forum Technical Report, "TR-384 Cloud Central
              Office Reference Architectural Framework", January 2018.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC7130]  Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed.,
              Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional
              Forwarding Detection (BFD) on Link Aggregation Group (LAG)
              Interfaces", RFC 7130, DOI 10.17487/RFC7130, February
              2014.

   [RFC8950]  Litkowski, S., Agrawal, S., Ananthamurthy, K., and K.
              Patel, "Advertising IPv4 Network Layer Reachability
              Information (NLRI) with an IPv6 Next Hop", RFC 8950,
              DOI 10.17487/RFC8950, November 2020.

   [RFC6518]  Lebovitz, G. and M. Bhatia, "Keying and Authentication for
              Routing Protocols (KARP) Design Guidelines", RFC 6518,
              DOI 10.17487/RFC6518, February 2012.

   [RFC6550]  Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J.,
              Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur,
              JP., and R. Alexander, "RPL: IPv6 Routing Protocol for
              Low-Power and Lossy Networks", RFC 6550,
              DOI 10.17487/RFC6550, March 2012.

   [RFC6775]  Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C.
              Bormann, "Neighbor Discovery Optimization for IPv6 over
              Low-Power Wireless Personal Area Networks (6LoWPANs)",
              RFC 6775, DOI 10.17487/RFC6775, November 2012.

   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
              "Deterministic Networking Architecture", RFC 8655,
              DOI 10.17487/RFC8655, October 2019.

   [RIFT]     Sharma, A., Thubert, P., Rijsman, B., and D. Afanasiev,
              "RIFT: Routing in Fat Trees", Work in Progress,
              Internet-Draft, draft-ietf-rift-rift-13, 12 July 2021.

10.  Informative References

   [IEEEstd1588]
              IEEE standard for Information Technology, "IEEE Standard
              for a Precision Clock Synchronization Protocol for
              Networked Measurement and Control Systems".

   [CLOS]     Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
              Communication Environments", IEEE International Parallel &
              Distributed Processing Symposium, 2011.

   [FATTREE]  Leiserson, C. E., "Fat-Trees: Universal Networks for
              Hardware-Efficient Supercomputing", 1985.

   [RFC3626]  Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link
              State Routing Protocol (OLSR)", RFC 3626,
              DOI 10.17487/RFC3626, October 2003.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W.
              Kasch, "Network Time Protocol Version 4: Protocol and
              Algorithms Specification", RFC 5905,
              DOI 10.17487/RFC5905, June 2010.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8505]  Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C.
              Perkins, "Registration Extensions for IPv6 over Low-Power
              Wireless Personal Area Network (6LoWPAN) Neighbor
              Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018.

   [RFC8928]  Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik,
              "Address-Protected Neighbor Discovery for Low-Power and
              Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November
              2020.

Authors' Addresses

   Yuehua Wei (editor)
   ZTE Corporation
   No.50, Software Avenue
   Nanjing
   210012
   China

   Email: wei.yuehua@zte.com.cn

   Zheng Zhang
   ZTE Corporation
   No.50, Software Avenue
   Nanjing
   210012
   China

   Email: zhang.zheng@zte.com.cn

   Dmitry Afanasiev
   Yandex

   Email: fl0w@yandex-team.ru

   Pascal Thubert
   Cisco Systems, Inc
   Building D
   45 Allee des Ormes - BP1200
   06254 MOUGINS - Sophia Antipolis
   France

   Phone: +33 497 23 26 34
   Email: pthubert@cisco.com

   Jaroslaw Kowalczyk
   Orange Polska

   Email: jaroslaw.kowalczyk2@orange.com