RIFT WG                                                 Yuehua. Wei, Ed.
Internet-Draft                                              Zheng. Zhang
Intended status: Informational                           ZTE Corporation
Expires: 28 October 2021                               Dmitry. Afanasiev
                                                                  Yandex
                                                              P. Thubert
                                                           Cisco Systems
                                                            Tom. Verhaeg
                                                        Juniper Networks
                                                     Jaroslaw. Kowalczyk
                                                           Orange Polska
                                                           26 April 2021


                           RIFT Applicability
                    draft-ietf-rift-applicability-05

Abstract

   This document discusses the properties, applicability and operational
   considerations of RIFT in different network scenarios. It is
   intended as a rough guide to how RIFT can be deployed to simplify
   routing operations in Clos topologies and their variations.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 28 October 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Problem Statement of Routing in Modern IP Fabric Fat Tree
       Networks
   3.  Applicability of RIFT to Clos IP Fabrics
     3.1.  Overview of RIFT
     3.2.  Applicable Topologies
       3.2.1.  Horizontal Links
       3.2.2.  Vertical Shortcuts
       3.2.3.  Generalizing to any Directed Acyclic Graph
     3.3.  Use Cases
       3.3.1.  Data Center Fabrics
       3.3.2.  Metro Fabrics
       3.3.3.  Building Cabling
       3.3.4.  Internal Router Switching Fabrics
       3.3.5.  CloudCO
   4.  Operational Considerations
     4.1.  South Reflection
     4.2.  Suboptimal Routing on Link Failures
     4.3.  Black-Holing on Link Failures
     4.4.  Zero Touch Provisioning (ZTP)
     4.5.  Mis-cabling Examples
     4.6.  Positive vs. Negative Disaggregation
     4.7.  Mobile Edge and Anycast
     4.8.  IPv4 over IPv6
     4.9.  In-Band Reachability of Nodes
     4.10. Dual Homing Servers
     4.11. Fabric With A Controller
       4.11.1.  Controller Attached to ToFs
       4.11.2.  Controller Attached to Leaf
     4.12. Internet Connectivity With Underlay
       4.12.1.  Internet Default on the Leaf
       4.12.2.  Internet Default on the ToFs
     4.13. Subnet Mismatch and Address Families
     4.14. Anycast Considerations
     4.15. IoT Applicability
   5.  Security Considerations
   6.  Contributors
   7.  Normative References
   8.  Informative References
   Authors' Addresses

1.  Introduction

   This document discusses the properties and applicability of "Routing
   in Fat Trees" [RIFT] (RIFT) in different deployment scenarios and
   highlights the operational simplicity of the technology compared to
   traditional routing solutions. It also documents special
   considerations when RIFT is used with or without overlays and/or
   controllers, and how RIFT corrects topology mis-cablings and/or node
   and link failures.

2.  Problem Statement of Routing in Modern IP Fabric Fat Tree Networks

   Clos [CLOS] and fat tree [FATTREE] topologies have gained prominence
   in today's networking, primarily as a result of the paradigm shift
   towards centralized, data-center-based architectures that deliver a
   majority of computation and storage services.

   Today's routing protocols were geared towards networks with an
   irregular topology, isotropic properties, and a low degree of
   connectivity. When applied to Fat Tree topologies:

   *  They tend to need extensive configuration or provisioning during
      bring up and re-dimensioning.

   *  All nodes, including spine and leaf nodes, learn the entire
      network topology and routing information, which is in fact not
      needed on the leaf nodes during normal operation.

   *  Significant duplication of link-state PDU (LSP) flooding between
      spine nodes and leaf nodes occurs during network bring up and
      topology updates.

   *  This duplication consumes both CPU and link bandwidth resources,
      which prevents the use of cheaper hardware at the lower levels
      (leaf and spine) and reduces the scalability and reactivity of
      the network.

3.  Applicability of RIFT to Clos IP Fabrics

   The remainder of this document assumes that the reader is familiar
   with the terms and concepts used in the OSPF [RFC2328] and IS-IS
   [ISO10589-Second-Edition] link-state protocols. The RIFT
   specification [RIFT] outlines the requirements of routing in IP
   fabrics and the RIFT protocol concepts.

3.1.  Overview of RIFT

   RIFT is a dynamic routing protocol that is specifically tailored for
   use in Clos and Fat Tree network topologies. A core property of RIFT
   is that its operation is sensitive to the structure of the fabric:
   it is anisotropic. RIFT acts as a link-state protocol when "pointing
   north", advertising southwards routes to northwards peer routers
   (parents) through flooding and database synchronization, but
   operates hop-by-hop like a distance-vector protocol when "pointing
   south", typically advertising a fabric default route directed
   towards the Top of Fabric (ToF, aka superspine) to southwards peer
   routers (children).

   RIFT floods flat link-state information northbound only, so that
   each level obtains the full topology of the levels south of it. That
   information is never flooded east-west or back south again. As a
   result, a top-tier node has the full set of prefixes from the
   Shortest Path First (SPF) calculation.
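
   The scope of northbound flooding described above can be illustrated
   with a minimal Python sketch. It assumes a hypothetical six-node
   fabric (the node names, the LEVELS table and the LINKS list are
   illustrative and not taken from [RIFT]) and models only which nodes
   end up holding whose link-state information, not TIE formats,
   timers or flooding reduction.

```python
from collections import defaultdict

# Hypothetical three-level fabric (names and links are illustrative only).
LEVELS = {"tof1": 2, "tof2": 2, "spine1": 1, "spine2": 1, "leaf1": 0, "leaf2": 0}
LINKS = [
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("spine1", "tof1"), ("spine1", "tof2"),
    ("spine2", "tof1"), ("spine2", "tof2"),
]


def northbound_view(levels, links):
    """Return, per node, the set of nodes whose north link-state it holds."""
    parents = defaultdict(set)
    for a, b in links:
        if levels[a] == levels[b]:
            continue  # east-west link: nothing is flooded across it in this model
        south, north = (a, b) if levels[a] < levels[b] else (b, a)
        parents[south].add(north)

    view = {node: set() for node in levels}
    # Walk the fabric bottom-up: a node passes north its own information
    # plus everything it has already received from the levels below it.
    for node in sorted(levels, key=levels.get):
        for parent in parents[node]:
            view[parent] |= {node} | view[node]
    return view


VIEW = northbound_view(LEVELS, LINKS)
for name in sorted(VIEW, key=LEVELS.get, reverse=True):
    print(name, sorted(VIEW[name]))
```

   In this toy model the leaves hold no topology information from other
   nodes, each spine holds the information of the leaves below it, and
   each ToF node holds the information of every spine and leaf, which
   mirrors the statement that only the top tier has the full set of
   prefixes.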
   In the southbound direction, the protocol operates like a "fully
   summarizing, unidirectional" path-vector protocol, or rather a
   distance-vector protocol with implicit split horizon. Routing
   information, normally just the default route, propagates one hop
   south and is 're-advertised' by nodes at the next lower level.

   [Diagram: two ToF nodes at LEVEL 2, four SPINE nodes at LEVEL 1 and
   four LEAF nodes at LEVEL 0, with links between adjacent levels.
   Distance-vector advertisement runs south; link-state flooding runs
   north.]

                        Figure 1: RIFT overview

   A spine node has only the information necessary for its level: all
   destinations south of the node based on the SPF calculation, the
   default route, and potential disaggregated routes.

   RIFT combines the advantages of both link-state and distance-vector
   protocols:

   *  Fastest possible convergence

   *  Automatic detection of topology

   *  Minimal routes/info on Top-of-Rack (ToR) switches, aka leaf nodes

   *  High degree of ECMP

   *  Fast de-commissioning of nodes

   *  Maximum propagation speed with flexible prefixes in an update

   There are therefore two types of link-state database: the "north
   representation", made of North Topology Information Elements
   (N-TIEs), and the "south representation", made of South Topology
   Information Elements (S-TIEs).
   The N-TIEs contain a link-state topology description of lower
   levels, and the S-TIEs simply carry default routes for the lower
   levels.

   RIFT also eliminates major disadvantages of link-state and
   distance-vector protocols with:

   *  Reduced and balanced flooding

   *  Automatic neighbor detection

   To achieve this, RIFT builds on the state of the art in IGPs,
   drawing not only on OSPF and IS-IS but also on MANET and IoT
   routing, to provide unique features:

   *  Automatic (positive or negative) route disaggregation of
      northwards routes upon fallen leaves

   *  Recursive operation in the case of negative route disaggregation

   *  Anisotropic routing that extends a principle seen in RPL [RFC6550]
      to wide superspines

   *  Optimal Flooding Reduction that derives from the concept of a
      "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the
      flooding load over northbound links and nodes.

   Additional advantages that are unique to RIFT are listed below; the
   details can be found in [RIFT].

   *  True ZTP

   *  Minimal blast radius on failures

   *  Can utilize all paths through the fabric without looping

   *  Simple leaf implementation that can scale down to servers

   *  Key-Value store

   *  Horizontal links used for protection only

   *  Supports non-equal cost multipath (NECMP) and can replace
      multi-chassis link aggregation group (MLAG or MC-LAG)

3.2.  Applicable Topologies

   Although RIFT is specified primarily for "proper" Clos or Fat Tree
   topologies, the protocol natively supports Points of Delivery (PoD)
   concepts, which, strictly speaking, are not found in the original
   Clos concept.

   Further, the specification explains and supports operations of
   multi-plane Clos variants, where the protocol recommends the use of
   inter-plane rings at the Top-of-Fabric level to allow the
   reconciliation of the topology views of different planes and so make
   negative disaggregation viable in case of failures within a plane.
   These observations hold not only in the case of RIFT but also in the
   generic case of dynamic routing on Clos variants with multiple
   planes and failures in bi-sectional bandwidth, especially on the
   leaves.

3.2.1.  Horizontal Links

   RIFT is not limited to pure Clos divided into PoD and multi-planes
   but supports horizontal (East-West) links below the top of fabric
   level. Those links are used only for last resort northbound routes
   when a spine loses all its northbound links or cannot compute a
   default route through them.

   A possible configuration is a "ring" of horizontal links at a level.
   In the presence of such a "ring" in any level (except the Top of
   Fabric (ToF) level), neither North SPF (N-SPF) nor South SPF (S-SPF)
   will provide a "ring-based protection" scheme, since such a
   computation would necessarily have to deal with breaking of "loops"
   in the Dijkstra sense, an application for which RIFT is not
   intended.

   Full-mesh connectivity between nodes on the same level can be
   employed. This allows N-SPF to let any node losing all its
   northbound adjacencies still participate in northbound forwarding,
   as long as any of the other nodes in the level are northbound
   connected.

3.2.2.  Vertical Shortcuts

   Through relaxations of the specified adjacency forming rules, RIFT
   implementations can be extended to support vertical "shortcuts" as
   proposed by, e.g., [I-D.white-distoptflood]. The RIFT specification
   itself does not provide the exact details, since the resulting
   solution suffers either from a much larger blast radius with
   increased flooding volumes or, in the case of maximum aggregation
   routing, from bow-tie problems.

3.2.3.  Generalizing to any Directed Acyclic Graph

   RIFT is an anisotropic routing protocol, meaning that it has a sense
   of direction (northbound, southbound, east-west) and that it
   operates differently depending on the direction.
   *  Northbound, RIFT operates as a link-state protocol, whereby the
      control packets are reflooded first all the way north and only
      interpreted later. All the individual fine-grained routes are
      advertised.

   *  Southbound, RIFT operates as a distance-vector protocol, whereby
      the control packets are flooded only one hop, interpreted, and
      the consequence of that computation is what gets flooded one more
      hop south. In the most common use cases, a ToF node can reach
      most of the prefixes in the fabric. If that is the case, the ToF
      node advertises the fabric default and disaggregates the prefixes
      that it cannot reach. On the other hand, a ToF node that can
      reach only a small subset of the prefixes in the fabric will
      preferably advertise those prefixes and refrain from aggregating.

      In the general case, what gets advertised south is, in more
      detail:

      1.  A fabric default that aggregates all the prefixes that are
          reachable within the fabric, and that could be a default
          route or a prefix that is dedicated to this particular
          fabric.

      2.  The loopback addresses of the northbound nodes, e.g., for
          in-band management.

      3.  The disaggregated prefixes for the dynamic exceptions to the
          fabric default, advertised to route around the black hole
          that may form.

   *  East-West routing can optionally be used, with specific
      restrictions. It is used when a sibling has access to the fabric
      default but this node does not.

   A Directed Acyclic Graph (DAG) provides a sense of north (the
   direction of the DAG) and of south (the reverse), which can be used
   to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has
   only incoming edges is a ToF node.

   There are a number of caveats though:

   *  The DAG structure must exist before RIFT starts, so there is a
      need for a companion protocol to establish the logical DAG
      structure.

   *  A generic DAG does not have a sense of east and west. The
      operation specified for east-west links and the southbound
      reflection between nodes are not applicable.

   *  In order to aggregate and disaggregate routes, RIFT requires that
      all the ToF nodes share the full knowledge of the prefixes in the
      fabric. This can be achieved with a ring as suggested by the RIFT
      main specification, by some preconfiguration, or by
      synchronization with a common repository where all the active
      prefixes are registered.

3.3.  Use Cases

3.3.1.  Data Center Fabrics

   RIFT is well suited to underlay routing in data center (DC) IP
   fabrics, the vast majority of which seem to be currently (and for
   the foreseeable future) Clos architectures. It significantly
   simplifies operation and deployment of such fabrics, as described in
   Section 4, compared to extensive proprietary provisioning and
   operational solutions.

3.3.2.  Metro Fabrics

   The demand for bandwidth is increasing steadily, driven primarily by
   environments close to content producers (server farms connected via
   DC fabrics) but in proximity to content consumers as well. Consumers
   are often clustered in metro areas with their own network
   architectures that can benefit from simplified, regular Clos
   structures and hence from RIFT.

3.3.3.  Building Cabling

   Commercial edifices are often cabled in topologies that are either
   Clos or its isomorphic equivalents. The Clos can grow rather high
   with many floors. That presents a challenge for traditional routing
   protocols (except BGP and the by now largely phased-out PNNI), which
   do not support an arbitrary number of levels, whereas RIFT does so
   naturally.
   Moreover, due to the limited sizes of forwarding tables in the
   network elements of building cabling, the minimum FIB size that RIFT
   maintains under normal conditions is cost-effective in terms of
   hardware and operational costs.

3.3.4.  Internal Router Switching Fabrics

   It is common in high-speed communications switching and routing
   devices to use fabrics when a crossbar is not feasible due to cost,
   head-of-line blocking or size trade-offs. Normally such fabrics are
   not self-healing or rely on 1:1 or 1+1 protection schemes, but it is
   conceivable to use RIFT to operate Clos fabrics that can deal
   effectively with interconnection or subsystem failures in such
   modules. RIFT is not IP specific either, and hence any link
   addressing connecting internal device subnets is conceivable.

3.3.5.  CloudCO

   The Cloud Central Office (CloudCO) is a new stage of the telecom
   Central Office. It takes advantage of Software Defined Networking
   (SDN) and Network Function Virtualization (NFV) in conjunction with
   general purpose hardware to optimize current networks. Figure 2
   illustrates this architecture at a high level. It describes a single
   instance or macro-node of CloudCO that provides a number of Value
   Added Services (VAS), a Broadband Access Abstraction (BAA), and
   virtualized network services. An Access I/O module faces a CloudCO
   access node and the Customer Premises Equipment (CPE) behind it. A
   Network I/O module faces the core network. The two I/O modules are
   interconnected by a leaf and spine fabric [TR-384].

   [Diagram: two Spine Switches interconnected with four Leaf Switches
   and with two edge switches labelled "LS". Each of the four Leaf
   Switches attaches to one Compute Node. The Compute Nodes host,
   respectively: VAS5, VAS6 and VAS7; vDHCP, VAS3 and VAS4; vRouter,
   v802.1x and vIGMP; VAS1, VAS2 and BAA. One LS attaches to the
   Network I/O module and the other to the Access I/O module.]

              Figure 2: An example of CloudCO architecture

   The Spine-Leaf architecture deployed inside CloudCO meets the
   network requirements of being adaptable, agile, scalable and
   dynamic.

4.  Operational Considerations

   RIFT presents the opportunity for organizations building and
   operating IP fabrics to simplify their operation and deployments
   while achieving many desirable properties of dynamic routing on such
   a substrate:

   *  RIFT only floods routing information to the devices that
      absolutely need it. The RIFT design follows a philosophy of
      minimum blast radius and minimum necessary epistemological scope,
      which leads to good scaling properties while delivering maximum
      reactiveness.

   *  RIFT allows for extensive Zero Touch Provisioning within the
      protocol. In its most extreme version, RIFT does not rely on any
      specific addressing and, for IP fabrics, can operate using IPv6
      ND [RFC4861] only.

   *  RIFT has provisions to detect common IP fabric mis-cabling
      scenarios.

   *  RIFT automatically negotiates BFD per link. This allows IP and
      micro-BFD [RFC7130] to replace Link Aggregation Groups (LAGs),
      which hide bandwidth imbalances in case of constituent failures.
      Further automatic link validation techniques similar to [RFC5357]
      could be supported as well.

   *  RIFT inherently solves many difficult problems associated with
      the use of traditional routing in topologies with dense meshes
      and high degrees of ECMP by including automatic bandwidth
      balancing, flood reduction and automatic disaggregation on
      failures, while providing maximum aggregation of prefixes in
      default scenarios.

   *  RIFT reduces FIB size towards the bottom of the IP fabric, where
      most nodes reside. This allows for cheaper hardware on the edges
      and the introduction of modern IP fabric architectures that
      encompass, e.g., server multi-homing.

   *  RIFT provides valley-free routing and is therefore loop free.
      This allows the use of any such valley-free path in bi-sectional
      fabric bandwidth between two destinations irrespective of their
      metrics, which can be used to balance load on the fabric in
      different ways.

   *  RIFT includes a key-value distribution mechanism which allows for
      many future applications such as automatic provisioning of basic
      overlay services or automatic key roll-overs over whole fabrics.

   *  RIFT is designed for minimum delay in case of prefix mobility on
      the fabric. In conjunction with [RFC8505], RIFT can differentiate
      anycast advertisements from mobility events and retain only the
      most recent advertisement in the latter case.

   *  Many further operational and design points collected over many
      years of routing protocol deployments have been incorporated in
      RIFT, such as fast flooding rates, protection of information
      lifetimes and operationally easily recognizable remote ends of
      links and node names.

4.1.  South Reflection

   South reflection is a mechanism by which South Node TIEs are
   "reflected" back up north to allow nodes in the same level without
   East-West links to "see" each other.

   For example, in Figure 3, Spine111, Spine112, Spine121 and Spine122
   each reflect the Node S-TIEs from ToF21 to ToF22. Likewise, they
   each reflect the Node S-TIEs from ToF22 to ToF21. ToF22 and ToF21
   therefore see each other's node information as level 2 nodes.

   In an equivalent fashion, as the result of the south reflection
   between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122,
   Spine121 and Spine122 know each other at level 1.

4.2.  Suboptimal Routing on Link Failures

   [Diagram: ToF21 and ToF22 at LEVEL 2; Spine111, Spine112, Spine121
   and Spine122 at LEVEL 1; Leaf111, Leaf112, Leaf121 and Leaf122 at
   LEVEL 0, attached to Prefix111, Prefix112, Prefix121 and Prefix122
   respectively. Labelled links between LEVEL 2 and LEVEL 1: linkTS3,
   linkTS4, linkTS6, linkTS7 and linkTS8. Labelled links between
   LEVEL 1 and LEVEL 0: linkSL5, linkSL6, linkSL7 and linkSL8. linkSL6
   is marked as failed.]

        Figure 3: Suboptimal routing upon link failure use case

   As shown in Figure 3, as the result of the south reflection between
   Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121
   and Spine122 know each other at level 1.

   Without the disaggregation mechanism, when linkSL6 fails, a packet
   from Leaf121 to prefix122 will probably go up through linkSL5 to
   linkTS3 then go down through linkTS4 to linkSL8 to Leaf122, or go up
   through linkSL5 to linkTS6 then go down through linkTS4 and linkSL8
   to Leaf122, based on the pure default route. This is a case of
   suboptimal routing, or bow-tieing.

   With the disaggregation mechanism, when linkSL6 fails, Spine122 will
   detect the failure from the reflected node S-TIE of Spine121. Based
   on the disaggregation algorithm provided by RIFT, Spine122 will
   explicitly advertise prefix122 in a Disaggregated Prefix S-TIE
   PrefixesElement(prefix122, cost 1).
   The packet from Leaf121 to prefix122 will then only be sent to
   linkSL7, following a longest-prefix match to prefix122, and then go
   down through linkSL8 to Leaf122.

4.3.  Black-Holing on Link Failures

   [Diagram: ToF21 and ToF22 at LEVEL 2; Spine111, Spine112, Spine121
   and Spine122 at LEVEL 1; Leaf111, Leaf112, Leaf121 and Leaf122 at
   LEVEL 0, attached to Prefix111, Prefix112, Prefix121 and Prefix122
   respectively. Labelled links between LEVEL 2 and LEVEL 1: linkTS1 to
   linkTS8. Labelled links between LEVEL 1 and LEVEL 0: linkSL1,
   linkSL3, linkSL5, linkSL6, linkSL7 and linkSL8. linkTS3 and linkTS4
   are marked as failed.]

           Figure 4: Black-holing upon link failure use case

   This scenario illustrates a case where a double link failure occurs
   and black-holing can result.

   Without the disaggregation mechanism, when linkTS3 and linkTS4 both
   fail, packets from Leaf111 to prefix122 would suffer 50% black-holing
   based on the pure default route. A packet that goes up through
   linkSL1 to linkTS1 and is supposed to go down through linkTS3 or
   linkTS4 will be dropped. A packet that goes up through linkSL3 to
   linkTS2 and is supposed to go down through linkTS3 or linkTS4 will
   be dropped as well. This is a case of black-holing.

   With the disaggregation mechanism, when linkTS3 and linkTS4 both
   fail, ToF22 will detect the failure from the node S-TIE of ToF21
   reflected by Spine111 and Spine112.
   Based on the disaggregation algorithm provided by RIFT, ToF22 will
   explicitly originate an S-TIE with prefix121 and prefix122, which is
   flooded to Spine111, Spine112, Spine121 and Spine122.

   The packet from Leaf111 to prefix122 will not be routed to linkTS1
   or linkTS2. It will only be routed to linkTS5 or linkTS7, following
   a longest-prefix match to prefix122.

4.4.  Zero Touch Provisioning (ZTP)

   RIFT is designed to require very minimal configuration, to simplify
   its operation and avoid human errors. Based on that minimal
   information, Zero Touch Provisioning (ZTP) autoconfigures the key
   operational parameters of all the RIFT nodes: on the one hand, the
   SystemID of the node, which must be unique in the RIFT network, and
   on the other hand, the level of the node in the Fat Tree, which
   determines which peers are northwards "parents" and which are
   southwards "children".

   ZTP is always on, but its decisions can be overridden when a network
   administrator prefers to impose their own configuration. In that
   case, it is the responsibility of the administrator to ensure that
   the configured parameters are correct, in other words that the
   SystemID of each node is unique and that the administratively set
   levels truly reflect the relative position of the nodes in the
   fabric. It is recommended to let ZTP configure the network. When
   that is not done, it is recommended to configure the level of all
   the nodes, except those that are forced as leaves, to avoid an
   undesirable interaction between ZTP and the manual configuration.

   ZTP requires that the administrator points out the Top-of-Fabric
   (ToF) nodes to set the baseline from which the fabric topology is
   derived. The Top-of-Fabric nodes are configured with the
   TOP_OF_FABRIC flag; they are the initial 'seeds' needed for other
   ZTP nodes to derive their level in the topology.
The level of each 670 node is then derived from the Link Information Elements (LIEs) received 671 from its neighbors, whereby each node (with the possible exception of 672 configured leaves) tries to attach at the highest possible point in 673 the fabric. This guarantees that even if the diffusion front reaches 674 a node from "below" faster than from "above", the node will greedily 675 abandon an already negotiated level derived from nodes topologically 676 below it and properly peer with nodes above. 678 A RIFT node may also be configured to confine it to the leaf role 679 with the LEAF_ONLY flag. A leaf node can also be configured to 680 support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either 681 case the node cannot be TOP_OF_FABRIC and its level cannot be 682 configured. RIFT will fully configure the node's level after it is 683 attached to the topology and ensure that the node is at the "bottom 684 of the hierarchy" (southernmost). 686 4.5. Mis-cabling Examples 688 +----------------+ +-----------------+ 689 | ToF21 | +------+ ToF22 | LEVEL 2 690 +-------+----+---+ | +----+---+--------+ 691 | | | | | | | | | 692 | | | +----------------------------+ | 693 | +---------------------------+ | | | | 694 | | | | | | | | | 695 | | | | +-----------------------+ | | 696 | | +------------------------+ | | | 697 | | | | | | | | | 698 +-+---+--+ +-+---+--+ | +--+---+-+ +--+---+-+ 699 |Spine111| |Spine112| | |Spine121| |Spine122| LEVEL 1 700 +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ 701 | | | | | | | | | 702 | +---------+ | link-M | +---------+ | 703 | | | | | | | | | 704 | +-------+ | | | | +-------+ | | 705 | | | | | | | | | 706 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 707 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 708 +-------+ +-------+ +-------+ +-------+ 710 Figure 5: A single plane mis-cabling example 712 Figure 5 shows a single plane mis-cabling example. It is a perfect 713 Fat Tree fabric except for link-M, which connects Leaf112 to ToF22.
715 The RIFT control protocol can discover the physical links 716 automatically and is able to detect cabling that violates Fat Tree 717 topology constraints. It reacts accordingly to such mis-cabling 718 attempts, at a minimum preventing adjacencies between nodes from 719 being formed and traffic from being forwarded on those mis-cabled 720 links. In such a scenario, Leaf112 will use link-M to derive its level 721 (unless it is configured as a leaf) and can report the links to Spine111 and Spine112 as 722 mis-cabled unless the implementation allows horizontal links. 724 Figure 6 shows a multiple plane mis-cabling example. Since Leaf112 725 and Spine121 belong to two different PoDs, the adjacency between 726 Leaf112 and Spine121 cannot be formed. The mis-cabling of link-W would be detected and 727 the adjacency prevented. 729 +-------+ +-------+ +-------+ +-------+ 730 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 731 +-------+ +-------+ +-------+ +-------+ 732 | | | | | | | | 733 | | | +-----------------+ | | | 734 | +--------------------------+ | | | | 735 | | | | | | | | 736 | +------+ | | | +------+ | 737 | | +-----------------+ | | | | | 738 | | | +--------------------------+ | | 739 | A | | B | | A | | B | 740 +-----+--+ +-+---+--+ +--+---+-+ +--+-----+ 741 |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 742 +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ 743 | | | | | | | | | 744 | +---------+ | | | +---------+ | 745 | | | | link-W | | | | 746 | +-------+ | | | | +-------+ | | 747 | | | | | | | | | 748 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 749 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 750 +-------+ +-------+ +-------+ +-------+ 751 +--------PoD#1----------+ +---------PoD#2---------+ 753 Figure 6: A multiple plane mis-cabling example 755 RIFT provides an optional level determination procedure in its Zero 756 Touch Provisioning mode. Nodes in the fabric without their level 757 configured determine it automatically. This can, however, have 758 counter-intuitive consequences.
One extreme failure scenario 759 is depicted in Figure 7: if all northbound links of 760 Spine11 fail at the same time, Spine11 negotiates a lower level than 761 Leaf111 and Leaf112. 763 To prevent such a scenario, where leaves are expected to act as switches, 764 the LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is 765 invalid, Spine11 would not derive a valid level from the topology in 766 Figure 7. It will be isolated from the whole fabric and it would be 767 up to the leaves to declare the links towards such a spine as 768 mis-cabled. 770 +-------+ +-------+ +-------+ +-------+ 771 |ToF A1| |ToF A2| |ToF A1| |ToF A2| 772 +-------+ +-------+ +-------+ +-------+ 773 | | | | | | 774 | +-------+ | | | 775 + + | | ====> | | 776 X X +------+ | +------+ | 777 + + | | | | 778 +----+--+ +-+-----+ +-+-----+ 779 |Spine11| |Spine12| |Spine12| 780 +-+---+-+ ++----+-+ ++----+-+ 781 | | | | | | 782 | +---------+ | | | 783 | | | | | | 784 | +-------+ | | +-------+ | 785 | | | | | | 786 +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ 787 |Leaf111| |Leaf112| |Leaf111| |Leaf112| 788 +-------+ +-------+ +-+-----+ +-+-----+ 789 | | 790 | +--------+ 791 | | 792 +-+---+-+ 793 |Spine11| 794 +-------+ 796 Figure 7: Fallen spine 798 4.6. Positive vs. Negative Disaggregation 800 Disaggregation is the procedure whereby [RIFT] advertises a more 801 specific route southwards as an exception to the aggregated 802 fabric-default north. Disaggregation is useful when a prefix within the 803 aggregation is reachable via some of the parents but not the others 804 at the same level of the fabric. It is mandatory when the level is 805 the ToF since a ToF node that cannot reach a prefix becomes a black 806 hole for that prefix. The hard problem is to know which prefixes are 807 reachable by which nodes. 809 In the general case, [RIFT] solves that problem by interconnecting 810 the ToF nodes.
This way, the ToF nodes can exchange the full list of 811 prefixes that exist in the fabric and figure out when a ToF node lacks 812 reachability to an existing prefix. This requires additional ports 813 at the ToF, typically 2 ports per ToF node to form a ToF-spanning 814 ring. [RIFT] also defines the southbound reflection procedure that 815 enables a parent to explore the direct connectivity of its peers, 816 meaning their own parents and children; based on the advertisements 817 received from the shared parents and children, it may enable the 818 parent to infer the prefixes its peers can reach. 820 When a parent lacks reachability to a prefix, it may disaggregate the 821 prefix negatively, i.e., advertise that this parent can be used to 822 reach any prefix in the aggregation except that one. The Negative 823 Disaggregation signaling is simple and functions transitively from 824 ToF to top-of-pod (ToP) and then from ToP to Leaf. But it is hard 825 for a parent to figure out which prefix it needs to disaggregate, because 826 it does not know what it does not know; as a result, the use of a 827 spanning ring at the ToF is required to operate the Negative 828 Disaggregation. Also, though it is only an implementation problem, 829 the programming of the FIB is complex compared to normal routes, 830 and may incur recursions. 832 The more classical alternative is for the parents that can reach a 833 prefix that peers at the same level cannot reach to advertise a more 834 specific route to that prefix. This leverages the normal longest 835 prefix match in the FIB, and does not require a special 836 implementation. But as opposed to the Negative Disaggregation, the 837 Positive Disaggregation is difficult and inefficient to operate 838 transitively.
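The longest-prefix-match behavior that Positive Disaggregation relies on can be sketched as follows. This is only an illustration in Python and not part of [RIFT]: the lookup() helper, the dictionary used as a FIB, and the 10.x addresses are hypothetical, chosen to mirror the double link failure of Figure 4, where a spine holds a default route via both ToF nodes and then receives a more specific route from ToF22.

```python
import ipaddress


def lookup(fib, destination):
    """Return the next hops of the longest prefix in `fib` covering `destination`."""
    addr = ipaddress.ip_address(destination)
    best_net, best_hops = None, []
    for prefix, next_hops in fib.items():
        net = ipaddress.ip_network(prefix)
        # Keep the covering prefix with the greatest prefix length.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hops = net, next_hops
    return best_hops


# Hypothetical FIB of Spine111: only the fabric default, via both ToF nodes.
fib = {"0.0.0.0/0": ["ToF21", "ToF22"]}
before = lookup(fib, "10.1.22.7")  # address assumed to lie in Prefix122

# ToF22 positively disaggregates Prefix122 (assumed here to be 10.1.22.0/24).
fib["10.1.22.0/24"] = ["ToF22"]
after = lookup(fib, "10.1.22.7")
unaffected = lookup(fib, "10.1.11.7")  # address assumed to lie in Prefix111

print(before, after, unaffected)
```

Before the more specific route is installed, traffic to the prefix is spread over both ToF nodes, including the one that lost reachability; once it is installed, only ToF22 is used for that prefix, while other destinations keep following the default route. No special FIB handling is needed, which is the property the paragraph above attributes to Positive Disaggregation.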
840 Transitivity towards a grandchild is not needed if all its parents 841 received the Positive Disaggregation, meaning that they shall all 842 avoid the black hole; when that is the case, they collectively build 843 a ceiling that protects the grandchild. But until then, a parent 844 that received a Positive Disaggregation may believe that some peers 845 are lacking the reachability and readvertise too early, or defer and 846 maintain a black hole situation longer than necessary. 848 In a non-partitioned fabric, all the ToF nodes see one another 849 through the reflection and can figure out if one is missing a child. In 850 that case it is possible to compute the prefixes that the peer cannot 851 reach and disaggregate positively without a ToF-spanning ring. The 852 ToF nodes can also ascertain that the ToP nodes are each connected to 853 at least one ToF node that can still reach the prefix, meaning that the 854 transitive operation is not required. 856 The bottom line is that in a fabric that is partitioned (e.g., using 857 multiple planes) and/or where the ToP nodes are not guaranteed to 858 always form a ceiling for their children, it is mandatory to use the 859 Negative Disaggregation. On the other hand, in a highly symmetrical 860 and fully connected fabric (e.g., a canonical Clos Network), the 861 Positive Disaggregation method saves the complexity and 862 cost associated with the ToF-spanning ring. 864 Note that in the case of Positive Disaggregation, the first ToF 865 node(s) that announce a more-specific route attract all the traffic 866 for that route and may suffer from a transient incast. A ToP node 867 that defers injecting the longer prefix in the FIB, in order to 868 receive more advertisements and spread the packets better, also keeps 869 on sending a portion of the traffic to the black hole in the 870 meantime.
In the case of Negative Disaggregation, the last ToF 871 node(s) that inject the route may also incur an incast issue; this 872 problem would occur if a prefix that becomes totally unreachable is 873 disaggregated, but doing so is mostly useless and is not recommended. 875 4.7. Mobile Edge and Anycast 877 When a physical or a virtual node changes its point of attachment in 878 the fabric from a previous-leaf to a next-leaf, new routes must be 879 installed that supersede the old ones. Since the flooding flows 880 northwards, the nodes (if any) between the previous-leaf and the 881 common parent are not immediately aware that the path via 882 previous-leaf is obsolete, and a stale route may exist for a while. The 883 common parent needs to select the freshest route advertisement in 884 order to install the correct route via the next-leaf. This requires 885 that the fabric determines the sequence of the movements of the 886 mobile node. 888 On the one hand, a classical sequence counter provides a total order 889 for a while but it will eventually wrap. On the other hand, a 890 timestamp provides a permanent order but it may miss a movement that 891 happens too quickly vs. the granularity of the timing information. 892 It is not envisioned in the short term that the average fabric 893 supports the Precision Time Protocol [IEEEstd1588], and the precision 894 that may be available with the Network Time Protocol [RFC5905], in 895 the order of 100 to 200ms, may not necessarily be enough to cover, 896 e.g., the fast mobility of a Virtual Machine. 898 Section 4.3.3, "Mobility", of [RIFT] specifies a hybrid method that 899 combines a sequence counter from the mobile node and a timestamp from 900 the network taken at the leaf when the route is injected.
If the 901 timestamps of the concurrent advertisements are comparable (i.e., 902 more distant than the precision of the timing protocol), then the 903 timestamp alone is used to determine the relative freshness of the 904 routes. Otherwise, the sequence counter from the mobile node, if 905 available, is used. One caveat is that the sequence counter must not 906 wrap within the precision of the timing protocol. Another is that 907 the mobile node may not even provide a sequence counter, in which 908 case the mobility itself must be slower than the precision of the 909 timing. 911 Mobility must not be confused with anycast. In both cases, the same 912 address is injected in RIFT at different leaves. In the case of 913 mobility, only the freshest route must be conserved, since the mobile 914 node changed its point of attachment from one leaf to the next. In the 915 case of anycast, the node may be either multihomed (attached to 916 multiple leaves in parallel) or reachable beyond the fabric via 917 multiple routes that are redistributed to different leaves; either 918 way, in the case of anycast, the multiple routes are equally valid 919 and should be conserved. Without further information from the 920 redistributed routing protocol, it is impossible to distinguish a 921 movement from a redistribution that happens asynchronously on 922 different leaves. [RIFT] expects that anycast addresses are 923 advertised within the timing precision, which is typically the case 924 with a low-precision timing and a multihomed node. Beyond that time 925 interval, RIFT interprets the lag as a mobility and only the freshest 926 route is retained. 928 When using IPv6 [RFC8200], RIFT suggests leveraging "Registration 929 Extensions for IPv6 over Low-Power Wireless Personal Area Network 930 (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND 931 interaction between the mobile node and the leaf.
This provides not 932 only a sequence counter but also a lifetime and a security token that 933 may be used to protect the ownership of an address [RFC8928]. When 934 using [RFC8505], the parallel registration of an anycast address to 935 multiple leaves is done with the same sequence counter, whereas the 936 sequence counter is incremented when the point of attachment 937 changes. This way, it is possible to differentiate a mobile node 938 from a multihomed node, even when the mobility happens within the 939 timing precision. It is also possible for a mobile node to be 940 multihomed as well, e.g., to change only one of its points of 941 attachment. 943 4.8. IPv4 over IPv6 945 RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. The IPv6 946 Address Family (AF) is configured via the usual Neighbor Discovery (ND) 947 mechanisms, and then V4 can use V6 nexthops analogous to [RFC5549]. 948 It is expected that the whole fabric supports the same type of 949 forwarding of address families on all the links. RIFT provides an 950 indication of whether a node is capable of v4 forwarding, and 951 implementations are possible where different routing tables are 952 computed per address family as long as the computation remains 953 loop-free. 955 +-----+ +-----+ 956 +---+---+ | ToF | | ToF | 957 ^ +--+--+ +-----+ 958 | | | | | 959 | | +-------------+ | 960 | | +--------+ | | 961 + | | | | 962 V6 +-----+ +-+---+ 963 Forwarding |Spine| |Spine| 964 + +--+--+ +-----+ 965 | | | | | 966 | | +-------------+ | 967 | | +--------+ | | 968 | | | | | 969 v +-----+ +-+---+ 970 +---+---+ |Leaf | | Leaf| 971 +--+--+ +--+--+ 972 | | 973 IPv4 prefixes| |IPv4 prefixes 974 | | 975 +---+----+ +---+----+ 976 | V4 | | V4 | 977 | subnet | | subnet | 978 +--------+ +--------+ 980 Figure 8: IPv4 over IPv6 982 4.9. In-Band Reachability of Nodes 984 RIFT does not require that nodes of the fabric have reachable 985 addresses. But operational reasons to reach the internal nodes 986 may exist.
Figure 9 shows an example in which the network management 987 station (NMS) attaches to Leaf1. 989 +-------+ +-------+ 990 | ToF1 | | ToF2 | 991 ++---- ++ ++-----++ 992 | | | | 993 | +----------+ | 994 | +--------+ | | 995 | | | | 996 ++-----++ +--+---++ 997 |Spine1 | |Spine2 | 998 ++-----++ ++-----++ 999 | | | | 1000 | +----------+ | 1001 | +--------+ | | 1002 | | | | 1003 ++-----++ +--+---++ 1004 | Leaf1 | | Leaf2 | 1005 +---+---+ +-------+ 1006 | 1007 |NMS 1009 Figure 9: In-Band reachability of node 1011 If the NMS wants to access Leaf2, it simply works, because the loopback 1012 address of Leaf2 is flooded in its Prefix North TIE. 1014 If the NMS wants to access Spine2, it simply works too, because a spine 1015 node always advertises its loopback address in the Prefix North TIE. 1016 The NMS may reach Spine2 via Leaf1-Spine2 or via Leaf1-Spine1-ToF1/ 1017 ToF2-Spine2. 1019 If the NMS wants to access ToF2, ToF2's loopback address needs to be 1020 injected into its Prefix South TIE. This TIE must be seen by all 1021 nodes at the level below - the spine nodes in Figure 9 - that must 1022 form a ceiling for all the traffic coming from below (south). 1023 Otherwise, the traffic from the NMS may follow the default route to the 1024 wrong ToF node, e.g., ToF1. 1026 In a fully connected ToF, in case of failure between ToF2 and spine 1027 nodes, ToF2's loopback address must be disaggregated recursively all 1028 the way to the leaves. 1030 In a partitioned ToF, a ToF node is only reachable within its plane, 1031 and the disaggregation to the leaves is also required. A possible 1032 alternative is to use the ring that interconnects the ToF nodes to 1033 transmit packets between them for their loopback addresses only. The 1034 idea is that this is mostly control traffic and should not alter the 1035 load balancing properties of the fabric. 1037 4.10. Dual Homing Servers 1039 Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode.
It 1040 has no configuration (unless it is a Top-of-Fabric at the top of the 1041 topology or it must operate in the topology as a leaf and/or support 1042 leaf-2-leaf procedures) and it will fully configure itself after 1043 being attached to the topology. 1045 +---+ +---+ +---+ 1046 |ToF| |ToF| |ToF| ToF 1047 +---+ +---+ +---+ 1048 | | | | | | 1049 | +----------------+ | | 1050 | | | | | | 1051 | +----------------+ | 1052 | | | | | | 1053 +----------+--+ +--+----------+ 1054 | ToR1 | | ToR2 | Spine 1055 +--+------+---+ +--+-------+--+ 1056 +---+ | | | | | | +---+ 1057 | | | | | | | | 1058 | +-----------------+ | | | 1059 | | | +-------------+ | | 1060 + | + | | |-----------------+ | 1061 X | X | +--------x-----+ | X | 1062 + | + | | | + | 1063 +---+ +---+ +---+ +---+ 1064 | | | | | | | | 1065 +---+ +---+ ...............+---+ +---+ 1066 SV(1) SV(2) SV(n+1) SV(n) Leaf 1068 Figure 10: Dual-homing servers 1070 In a single plane, the worst case is the disaggregation of every 1071 other server at the same level. Suppose the links from ToR1 (Top of 1072 Rack) to all the leaves become unavailable. All the servers' 1073 routes are then disaggregated and the FIB of each server is expanded 1074 with n-1 more specific routes. 1076 Some operators may prefer to disaggregate from the ToRs to the servers from 1077 the start, i.e., the servers have a few tens of routes in the FIB from 1078 the start, besides the default routes, to avoid breakages at rack level. 1079 Full disaggregation of the fabric can be achieved through configuration 1080 supported by RIFT. 1082 4.11. Fabric With A Controller 1084 There are many different ways to deploy the controller. One 1085 possibility is attaching a controller to the RIFT domain from ToF and 1086 another possibility is attaching a controller from the leaf.
1088 +------------+ 1089 | Controller | 1090 ++----------++ 1091 | | 1092 | | 1093 +----++ ++----+ 1094 ------- | ToF | | ToF | 1095 | +--+--+ +-----+ 1096 | | | | | 1097 | | +-------------+ | 1098 | | +--------+ | | 1099 | | | | | 1100 +-----+ +-+---+ 1101 RIFT domain |Spine| |Spine| 1102 +--+--+ +-----+ 1103 | | | | | 1104 | | +-------------+ | 1105 | | +--------+ | | 1106 | | | | | 1107 | +-----+ +-+---+ 1108 ------- |Leaf | | Leaf| 1109 +-----+ +-----+ 1111 Figure 11: Fabric with a controller 1113 4.11.1. Controller Attached to ToFs 1115 If a controller attaches to the RIFT domain from the ToF, it usually 1116 uses dual-homing connections. The loopback prefix of the controller 1117 should be advertised down by the ToF and spine nodes to the leaves. If the 1118 controller loses its link to a ToF, make sure that the ToF withdraws the prefix 1119 of the controller (different mechanisms can be used). 1121 4.11.2. Controller Attached to Leaf 1123 If the controller attaches to the fabric from a leaf, no special 1124 provisions are needed. 1126 4.12. Internet Connectivity With Underlay 1128 If global addressing is running without overlay, an external default 1129 route needs to be advertised through the RIFT fabric to achieve internet 1130 connectivity. For the purpose of forwarding across the entire RIFT 1131 fabric, an internal fabric prefix needs to be advertised in the Prefix 1132 South TIE by ToF and spine nodes. 1134 4.12.1. Internet Default on the Leaf 1136 In case an internet access request comes from a leaf and the 1137 internet gateway is another leaf, the leaf node acting as the internet 1138 gateway needs to advertise a default route in its Prefix North TIE. 1140 4.12.2. Internet Default on the ToFs 1142 In case an internet access request comes from a leaf and the 1143 internet gateway is a ToF, the ToF and spine nodes need to advertise 1144 a default route in the Prefix South TIE. 1146 4.13.
Subnet Mismatch and Address Families 1148 +--------+ +--------+ 1149 | | LIE LIE | | 1150 | A | +----> <----+ | B | 1151 | +---------------------+ | 1152 +--------+ +--------+ 1153 X/24 Y/24 1155 Figure 12: subnet mismatch 1157 LIEs are exchanged over all links running RIFT to perform Link 1158 (Neighbor) Discovery. A node MUST NOT originate LIEs on an address 1159 family if it does not process received LIEs on that family. LIEs on 1160 the same link are considered part of the same negotiation independent of 1161 the address family they arrive on. An implementation MUST be ready 1162 to accept TIEs on all addresses it used as source of LIE frames. 1164 As shown in the above figure, without further checks an adjacency between 1165 node A and node B may form, but the forwarding between node A and node B 1166 may fail because subnet X does not match subnet Y. 1168 To prevent this, a RIFT implementation should check for subnet 1169 mismatch just as, e.g., IS-IS does. This can lead to scenarios where, 1170 despite the exchange of LIEs in both address families, an adjacency 1171 is formed in a single AF only. This is a 1172 consideration especially in Section 4.8 scenarios. 1174 4.14.
Anycast Considerations 1176 + traffic 1177 | 1178 v 1179 +------+------+ 1180 | ToF | 1181 +---+-----+---+ 1182 | | | | 1183 +------------+ | | +------------+ 1184 | | | | 1185 +---+---+ +-------+ +-------+ +---+---+ 1186 | | | | | | | | 1187 |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 1188 +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ 1189 | | | | | | | | 1190 | +---------+ | | +---------+ | 1191 | | | | | | | | 1192 | +-------+ | | | +-------+ | | 1193 | | | | | | | | 1194 +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ 1195 | | | | | | | | 1196 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 1197 +-+-----+ ++------+ +-----+-+ +-----+-+ 1198 + + + ^ | 1199 PrefixA PrefixB PrefixA | PrefixC 1200 | 1201 + traffic 1203 Figure 13: Anycast 1205 If the traffic comes from the ToF towards Leaf111 or Leaf121, which both have the anycast 1206 prefix PrefixA, RIFT can deal with this case well. But if the 1207 traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. 1208 Spine21 and Spine22 do not know of the other PrefixA attached to 1209 Leaf111, so the traffic will always go to Leaf121 and never to Leaf111. 1210 If the intention is that the traffic should be offloaded to 1211 Leaf111, then use the policy-guided prefixes defined in "Routing in Fat 1212 Trees" [RIFT]. 1214 4.15. IoT Applicability 1216 The design of RIFT inherits from RPL [RFC6550] the anisotropic design 1217 of a default route upwards (northwards); it also inherits the 1218 capability to inject external host routes at the Leaf level using 1219 Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host 1220 and a RIFT router. Both the RPL and the RIFT protocols are meant for 1221 large scale, and WiND enables device mobility at the edge the same 1222 way in both cases. 1224 The main difference between RIFT and RPL is that with RPL, there is a 1225 single Root, whereas RIFT has many ToF nodes. This adds huge 1226 capabilities for leaf-2-leaf ECMP paths, but also additional complexity 1227 with the need to disaggregate.
Also, RIFT uses Link State flooding 1228 northwards, and is not designed for low-power operation. 1230 Still, nothing prevents the IP devices connected at the Leaf from being 1231 IoT (Internet of Things) devices, which typically expose their 1232 address using WiND, which is an upgrade from 6LoWPAN ND [RFC6775]. 1234 A network that serves high-speed/high-power IoT devices should 1235 typically provide deterministic capabilities for applications such as 1236 high speed control loops or movement detection. The Fat Tree is 1237 highly reliable, and in normal conditions provides an equivalent 1238 multipath operation; but ECMP does not provide hard guarantees for 1239 either delivery or latency. As long as the fabric is non-blocking 1240 the result is the same; but there can be load imbalances resulting in 1241 incast and possibly congestion loss that will prevent the delivery 1242 within bounded latency. 1244 This could be alleviated with Packet Replication, Elimination and 1245 Reordering (PREOF) [RFC8655] leaf-2-leaf, but PREOF is hard to provide 1246 at the scale of all flows, and the replication may increase the 1247 probability of the overload that it attempts to solve. 1249 Note that load balancing is not RIFT's problem, but it is key to 1250 serving IoT adequately. 1252 5. Security Considerations 1254 This document presents the applicability of RIFT. As such, it does not 1255 introduce any security considerations. However, a number 1256 of security concerns are discussed in [RIFT]. 1258 6. Contributors 1260 The following people (listed in alphabetical order) contributed 1261 significantly to the content of this document and should be 1262 considered co-authors: 1264 Tony Przygienda 1266 Juniper Networks 1268 1194 N. Mathilda Ave 1270 Sunnyvale, CA 94089 1272 US 1274 Email: prz@juniper.net 1276 7.
Normative References 1278 [ISO10589-Second-Edition] 1279 International Organization for Standardization, 1280 "Intermediate system to Intermediate system intra-domain 1281 routeing information exchange protocol for use in 1282 conjunction with the protocol for providing the 1283 connectionless-mode Network Service (ISO 8473)", November 1284 2002. 1286 [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central 1287 Office Reference Architectural Framework", January 2018. 1289 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 1290 DOI 10.17487/RFC2328, April 1998, 1291 . 1293 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 1294 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 1295 DOI 10.17487/RFC4861, September 2007, 1296 . 1298 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1299 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1300 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1301 . 1303 [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., 1304 Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional 1305 Forwarding Detection (BFD) on Link Aggregation Group (LAG) 1306 Interfaces", RFC 7130, DOI 10.17487/RFC7130, February 1307 2014, . 1309 [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network 1310 Layer Reachability Information with an IPv6 Next Hop", 1311 RFC 5549, DOI 10.17487/RFC5549, May 2009, 1312 . 1314 [RFC6550] Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J., 1315 Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur, 1316 JP., and R. Alexander, "RPL: IPv6 Routing Protocol for 1317 Low-Power and Lossy Networks", RFC 6550, 1318 DOI 10.17487/RFC6550, March 2012, 1319 . 1321 [RFC6775] Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C. 1322 Bormann, "Neighbor Discovery Optimization for IPv6 over 1323 Low-Power Wireless Personal Area Networks (6LoWPANs)", 1324 RFC 6775, DOI 10.17487/RFC6775, November 2012, 1325 . 1327 [RFC8655] Finn, N., Thubert, P., Varga, B., and J. 
Farkas, 1328 "Deterministic Networking Architecture", RFC 8655, 1329 DOI 10.17487/RFC8655, October 2019, 1330 . 1332 [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and 1333 D. Afanasiev, "RIFT: Routing in Fat Trees", Work in 1334 Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May 1335 2020, 1336 . 1338 [I-D.white-distoptflood] 1339 White, R., Hegde, S., and S. Zandi, "IS-IS Optimal 1340 Distributed Flooding for Dense Topologies", Work in 1341 Progress, Internet-Draft, draft-white-distoptflood-04, 27 1342 July 2020, 1343 . 1345 8. Informative References 1347 [IEEEstd1588] 1348 IEEE standard for Information Technology, "IEEE Standard 1349 for a Precision Clock Synchronization Protocol for 1350 Networked Measurement and Control Systems", 1351 . 1353 [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer 1354 Communication Environments", IEEE International Parallel & 1355 Distributed Processing Symposium, 2011. 1357 [FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for 1358 Hardware-Efficient Supercomputing", 1985. 1360 [RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link 1361 State Routing Protocol (OLSR)", RFC 3626, 1362 DOI 10.17487/RFC3626, October 2003, 1363 . 1365 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1366 "Network Time Protocol Version 4: Protocol and Algorithms 1367 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1368 . 1370 [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 1371 (IPv6) Specification", STD 86, RFC 8200, 1372 DOI 10.17487/RFC8200, July 2017, 1373 . 1375 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 1376 Perkins, "Registration Extensions for IPv6 over Low-Power 1377 Wireless Personal Area Network (6LoWPAN) Neighbor 1378 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 1379 . 1381 [RFC8928] Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. 
Struik, 1382 "Address-Protected Neighbor Discovery for Low-Power and 1383 Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November 1384 2020, . 1386 Authors' Addresses 1388 Yuehua Wei (editor) 1389 ZTE Corporation 1390 No.50, Software Avenue 1391 Nanjing 1392 210012 1393 China 1394 Email: wei.yuehua@zte.com.cn 1396 Zheng Zhang 1397 ZTE Corporation 1398 No.50, Software Avenue 1399 Nanjing 1400 210012 1401 China 1403 Email: zhang.zheng@zte.com.cn 1405 Dmitry Afanasiev 1406 Yandex 1408 Email: fl0w@yandex-team.ru 1410 Pascal Thubert 1411 Cisco Systems, Inc 1412 Building D 1413 45 Allee des Ormes - BP1200 1414 06254 MOUGINS - Sophia Antipolis 1415 France 1417 Phone: +33 497 23 26 34 1418 Email: pthubert@cisco.com 1420 Tom Verhaeg 1421 Juniper Networks 1423 Email: tverhaeg@juniper.net 1425 Jaroslaw Kowalczyk 1426 Orange Polska 1428 Email: jaroslaw.kowalczyk2@orange.com