2 RIFT WG Yuehua. Wei, Ed. 3 Internet-Draft Zheng.
Zhang 4 Intended status: Informational ZTE Corporation 5 Expires: 12 April 2021 Dmitry. Afanasiev 6 Yandex 7 Tom. Verhaeg 8 Juniper Networks 9 Jaroslaw. Kowalczyk 10 Orange Polska 11 P. Thubert 12 Cisco Systems 13 9 October 2020 15 RIFT Applicability 16 draft-ietf-rift-applicability-02 18 Abstract 20 This document discusses the properties, applicability and operational 21 considerations of RIFT in different network scenarios. It intends to 22 provide a rough guide how RIFT can be deployed to simplify routing 23 operations in Clos topologies and their variations. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on 12 April 2021. 42 Copyright Notice 44 Copyright (c) 2020 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 49 license-info) in effect on the date of publication of this document. 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. Code Components 52 extracted from this document must include Simplified BSD License text 53 as described in Section 4.e of the Trust Legal Provisions and are 54 provided without warranty as described in the Simplified BSD License. 
56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Problem Statement of Routing in Modern IP Fabric Fat Tree 60 Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 62 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 4 63 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 6 64 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 65 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 66 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 7 67 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 68 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 8 69 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 70 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 71 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 9 72 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 73 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 74 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 75 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 76 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 77 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 78 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 15 79 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 18 80 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 81 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 82 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 22 83 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 23 84 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 24 85 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 86 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 24 87 4.12. 
Internet Connectivity With Underlay . . . . . . . . . . 25 88 4.12.1. Internet Default on the Leaf . . . . . . . . . . 25 89 4.12.2. Internet Default on the ToFs . . . . . . . . . . 25 90 4.13. Subnet Mismatch and Address Families . . . . . . . . 25 91 4.14. Anycast Considerations . . . . . . . . . . . . . . . 26 92 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . 26 93 6. Contributors . . . . . . . . . . . . . . . . . . . . . . 27 94 7. Normative References . . . . . . . . . . . . . . . . . . 27 95 8. Informative References . . . . . . . . . . . . . . . . . 28 96 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 98 1. Introduction 100 This document intends to explain the properties and applicability of 101 "Routing in Fat Trees" [RIFT] in different deployment scenarios and 102 to highlight the operational simplicity of the technology compared to 103 traditional routing solutions. It also documents special 104 considerations when RIFT is used with or without overlays and 105 controllers, and how RIFT handles topology miscablings and/or node and link 106 failures. 108 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 110 Clos and Fat-Tree topologies have gained prominence in today's 111 networking, primarily as a result of the paradigm shift towards a 112 centralized data-center based architecture that is poised to deliver 113 a majority of computation and storage services in the future. 115 Today's routing protocols were originally geared towards networks with 116 an irregular topology and a low degree of connectivity. 117 When they are applied to Fat-Tree topologies: 119 * they tend to need extensive configuration or provisioning during 120 bring-up and re-dimensioning. 122 * spine and leaf nodes have the entire network topology and routing 123 information, which is, in fact, not needed on the leaf nodes during 124 normal operation.
126 * significant duplication of Link State PDU (LSP) flooding between 127 spine and leaf nodes occurs during network bring-up and 128 topology updates. It consumes both spine and leaf nodes' CPU and 129 link bandwidth resources and with that limits protocol 130 scalability. 132 3. Applicability of RIFT to Clos IP Fabrics 134 Further content of this document assumes that the reader is familiar 135 with the terms and concepts used in the OSPF [RFC2328] and IS-IS 136 [ISO10589-Second-Edition] link-state protocols and at least the 137 sections of [RIFT] outlining the requirements of routing in IP fabrics 138 and the RIFT protocol concepts. 140 3.1. Overview of RIFT 142 RIFT is a dynamic routing protocol for Clos and fat-tree network 143 topologies. It defines a link-state protocol when "pointing north" 144 and a path-vector protocol when "pointing south". 146 It floods flat link-state information northbound only so that each 147 level obtains the full topology of the levels south of it. That 148 information is never flooded East-West or back South again. So a top- 149 tier node has the full set of prefixes from the SPF calculation. 151 In the southbound direction the protocol operates like a "fully 152 summarizing, unidirectional" path-vector protocol, or rather a 153 distance vector with implicit split horizon, whereby the information 154 propagates one hop south and is 're-advertised' by nodes at the next 155 lower level, normally as just the default route.
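The directional split described above can be sketched in a few lines of code. This is an illustrative sketch only, not behavior mandated by [RIFT]; the `Tie` class and `reflood` function are hypothetical names invented for this example, and the south-reflection special case is deliberately ignored.

```python
from dataclasses import dataclass

# Illustrative sketch only -- `Tie` and `reflood` are hypothetical names,
# not identifiers from the RIFT specification.
@dataclass
class Tie:
    direction: str         # "north" for N-TIEs, "south" for S-TIEs
    originator_level: int  # level of the node that originated the TIE

def reflood(tie: Tie, my_level: int) -> bool:
    """Decide whether this node re-floods a received TIE."""
    if tie.direction == "north":
        # N-TIEs keep propagating northbound so that every level learns
        # the full topology of the levels south of it.
        return tie.originator_level < my_level
    # S-TIEs travel exactly one hop south: the receiver interprets them
    # and re-originates its own southbound summary (normally just a
    # default route) instead of re-flooding them.
    return False

assert reflood(Tie("north", 0), my_level=1)      # a spine refloods a leaf N-TIE
assert not reflood(Tie("south", 2), my_level=1)  # a spine does not reflood a ToF S-TIE
```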
157 +-----------+ +-----------+ 158 | ToF | | ToF | LEVEL 2 159 + +-----+--+--+ +-+--+------+ 160 | | | | | | | | | ^ 161 + | | | +-------------------------+ | 162 Distance | +-------------------+ | | | | | 163 Vector | | | | | | | | + 164 South | | | | +--------+ | | | Link+State 165 + | | | | | | | | Flooding 166 | | | +-------------+ | | | North 167 v | | | | | | | | + 168 +-+--+-+ +------+ +-------+ +--+--+-+ | 169 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 170 + ++----++ ++---+-+ +--+--+-+ ++----+-+ | 171 + | | | | | | | | | ^ N 172 Distance | +-------+ | | +--------+ | | | E 173 Vector | | | | | | | | | +------> 174 South | +-------+ | | | +-------+ | | | | 175 + | | | | | | | | | + 176 v ++--++ +-+-++ ++-+-+ +-+--++ + 177 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 178 +----+ +----+ +----+ +-----+ 180 Figure 1: RIFT overview 182 A middle-tier node has only the information necessary for its level, 183 which is all destinations south of the node based on the SPF 184 calculation, the default route and potential disaggregated routes. 186 RIFT combines the advantages of both Link-State and Distance Vector: 188 * Fastest Possible Convergence 190 * Automatic Detection of Topology 192 * Minimal Routes/Info on TORs 194 * High Degree of ECMP 196 * Fast De-commissioning of Nodes 198 * Maximum Propagation Speed with Flexible Prefixes in an Update 200 And RIFT eliminates the disadvantages of Link-State and Distance 201 Vector: 203 * Reduced and Balanced Flooding 205 * Automatic Neighbor Detection 207 There are thus two types of link-state databases: the "north 208 representation" N-TIEs and the "south representation" S-TIEs. The N-TIEs 209 contain a link-state topology description of the lower levels and the S-TIEs 210 simply carry default routes for the lower levels. 212 Further advantages unique to RIFT are listed below; 213 they are explained in detail in [RIFT].
215 * True ZTP 217 * Minimal Blast Radius on Failures 219 * Can Utilize All Paths Through Fabric Without Looping 221 * Automatic Disaggregation on Failures 223 * Simple Leaf Implementation that Can Scale Down to Servers 225 * Key-Value Store 227 * Horizontal Links Used for Protection Only 229 * Supports Non-Equal Cost Multipath and Can Replace MC-LAG 231 * Optimal Flooding Reduction and Load-Balancing 233 3.2. Applicable Topologies 235 Although RIFT is specified primarily for "proper" Clos or "fat-tree" 236 structures, it already supports PoD concepts, which are strictly 237 speaking not found in the original Clos concepts. 239 Further, the specification explains and supports operations of multi- 240 plane Clos variants where the protocol relies on a set of rings to 241 allow the reconciliation of the topology views of different planes as the most 242 desirable solution, making proper disaggregation viable in case of 243 failures. This observation holds not only for RIFT but for the 244 generic case of dynamic routing on Clos variants with multiple planes 245 and failures in bi-sectional bandwidth, especially on the leafs. 247 3.2.1. Horizontal Links 249 RIFT is not limited to pure Clos divided into PoDs and multi-planes 250 but supports horizontal links below the top-of-fabric level. However, those 251 links are used only as routes of last resort northbound, when 252 a spine loses all northbound links or cannot compute a default route 253 through them. 255 A possible configuration is a "ring" of horizontal links at a level. 256 In the presence of such a "ring" at any level (except the ToF level) neither 257 N-SPF nor S-SPF will provide a "ring-based protection" scheme, since 258 such a computation would necessarily have to deal with breaking 259 "loops" in the Dijkstra sense; an application for which RIFT is not 260 intended.
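The "route of last resort" role of horizontal links can be illustrated with a small sketch. The function and variable names below are hypothetical, invented for this example; they are not part of the RIFT specification.

```python
# Hedged sketch: horizontal (East-West) links are used northbound only as a
# last resort. `northbound_nexthops` and `ew_nexthops` are hypothetical names.
def default_route_nexthops(northbound_nexthops: list, ew_nexthops: list) -> list:
    """Prefer northbound adjacencies for the default route; fall back to
    East-West links only when all northbound links are lost."""
    if northbound_nexthops:
        return northbound_nexthops
    # Last resort: forward sideways so a horizontally connected sibling
    # that still has northbound connectivity can carry the traffic north.
    return ew_nexthops

# Normal operation: both northbound links are used, the E-W link is idle.
assert default_route_nexthops(["ToF21", "ToF22"], ["Spine112"]) == ["ToF21", "ToF22"]
# All northbound links lost: traffic escapes via the horizontal link.
assert default_route_nexthops([], ["Spine112"]) == ["Spine112"]
```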
262 Full-mesh connectivity between nodes on the same level can be 263 employed; that allows N-SPF to let any node losing all 264 its northbound adjacencies (as long as any of the other nodes at the 265 level are northbound connected) still participate in northbound 266 forwarding. 268 3.2.2. Vertical Shortcuts 270 Through relaxations of the specified adjacency-forming rules, RIFT 271 implementations can be extended to support vertical "shortcuts" as 272 proposed by e.g. [I-D.white-distoptflood]. The RIFT specification 273 itself does not provide the exact details, since the resulting 274 solution suffers either from a much larger blast radius with increased 275 flooding volumes or, in the case of maximum aggregation, from routing bow-tie 276 problems. 278 3.2.3. Generalizing to any Directed Acyclic Graph 280 RIFT is an anisotropic routing protocol, meaning that it has a sense 281 of direction (Northbound, Southbound, East-West) and that it operates 282 differently depending on the direction. 284 * Northbound, RIFT operates as a Link State IGP, whereby the control 285 packets are reflooded first all the way North and only interpreted 286 later. All the individual fine-grained routes are advertised. 288 * Southbound, RIFT operates as a Distance Vector IGP, whereby the 289 control packets are flooded only one hop, interpreted, and the 290 consequence of that computation is what gets flooded one more hop 291 South. In the most common use-cases, a ToF node can reach most of 292 the prefixes in the fabric. If that is the case, the ToF node 293 advertises the fabric default and disaggregates the prefixes that 294 it cannot reach. On the other hand, a ToF node that can reach 295 only a small subset of the prefixes in the fabric will preferably 296 advertise those prefixes and refrain from aggregating. 298 In the general case, what gets advertised South is, in more 299 detail: 301 1.
A fabric default that aggregates all the prefixes that are 302 reachable within the fabric, and that could be a default route 303 or a prefix that is dedicated to this particular fabric. 305 2. The loopback addresses of the Northbound nodes, e.g., for 306 inband management. 308 3. The disaggregated prefixes for the dynamic exceptions to the 309 fabric default, advertised to route around the black hole that 310 may form. 312 * East-West routing can optionally be used, with specific 313 restrictions. It is useful in particular when a sibling has 314 access to the fabric default but this node does not. 316 A Directed Acyclic Graph (DAG) provides a sense of North (the 317 direction of the DAG) and of South (the reverse), which can be used 318 to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has 319 only incoming edges is a ToF node. 321 There are a number of caveats though: 323 * The DAG structure must exist before RIFT starts, so there is a 324 need for a companion protocol to establish the logical DAG 325 structure. 327 * A generic DAG does not have a sense of East and West. The 328 operation specified for East-West links and the Southbound 329 reflection between nodes are not applicable. 331 * In order to aggregate and disaggregate routes, RIFT requires that 332 all the ToF nodes share the full knowledge of the prefixes in the 333 fabric. This can be achieved with a ring as suggested by the RIFT 334 main specification, by some preconfiguration, or by 335 synchronization with a common repository where all the active 336 prefixes are registered. 338 3.3. Use Cases 340 3.3.1. DC Fabrics 342 RIFT is largely driven by data center demands and is hence ideally suited for 343 application in the underlay of data center IP fabrics, the vast majority of 344 which seem to be currently (and for the foreseeable future) Clos 345 architectures.
It significantly simplifies operation and deployment 346 of such fabrics, as described in Section 4, compared 347 to extensive proprietary provisioning and operational solutions. 349 3.3.2. Metro Fabrics 351 The demand for bandwidth is increasing steadily, driven primarily by 352 environments close to content producers (server farms connected via 353 DC fabrics) but in proximity to content consumers as well. Consumers 354 are often clustered in metro areas with their own network 355 architectures that can benefit from simplified, regular Clos 356 structures and hence RIFT. 358 3.3.3. Building Cabling 360 Commercial edifices are often cabled in topologies that are either 361 Clos or its isomorphic equivalents. With many floors the Clos can 362 grow rather high and with that present a challenge for traditional 363 routing protocols (except BGP and the by now largely phased-out PNNI), 364 which do not support an arbitrary number of levels, which RIFT does 365 naturally. Moreover, due to the limited sizes of forwarding tables in 366 the active elements of building cabling, the minimum FIB size RIFT 367 maintains under normal conditions can prove particularly cost- 368 effective in terms of hardware and operational costs. 370 3.3.4. Internal Router Switching Fabrics 372 It is common in high-speed communications switching and routing 373 devices to use fabrics when a crossbar is not feasible due to cost, 374 head-of-line blocking or size trade-offs. Normally such fabrics are 375 not self-healing or rely on 1:1/1+1 protection schemes, but it is 376 conceivable to use RIFT to operate Clos fabrics that can deal 377 effectively with interconnection or subsystem failures in such 378 modules. RIFT is not IP-specific, and hence any link addressing 379 connecting internal device subnets is conceivable. 381 3.3.5. CloudCO 383 The Cloud Central Office (CloudCO) is a new stage of the telecom Central 384 Office.
It takes advantage of Software Defined Networking (SDN) 385 and Network Function Virtualization (NFV) in conjunction with general- 386 purpose hardware to optimize current networks. The following figure 387 illustrates this architecture at a high level. It describes a single 388 instance or macro-node of CloudCO. An Access I/O module faces a 389 Cloud CO Access Node, and the CPEs behind it. A Network I/O module 390 faces the core network. The two I/O modules are interconnected 391 by a leaf and spine fabric. [TR-384] 392 +---------------------+ +----------------------+ 393 | Spine | | Spine | 394 | Switch | | Switch | 395 +------+---+------+-+-+ +--+-+-+-+-----+-------+ 396 | | | | | | | | | | | | 397 | | | | | +-------------------------------+ | 398 | | | | | | | | | | | | 399 | | | | +-------------------------+ | | | 400 | | | | | | | | | | | | 401 | | +----------------------+ | | | | | | | | 402 | | | | | | | | | | | | 403 | +---------------------------------+ | | | | | | | 404 | | | | | | | | | | | | 405 | | | +-----------------------------+ | | | | | 406 | | | | | | | | | | | | 407 | | | | | +--------------------+ | | | | 408 | | | | | | | | | | | | 409 +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ 410 |L | | Leaf | | Leaf | | Leaf | | Leaf | |L | 411 |S | | Switch | | Switch | | Switch | | Switch| |S | 412 ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ 413 | | | | | | | | | | | | | | 414 | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | 415 | |Compute | |Compute | | Compute | |Compute| | 416 | |Node | |Node | | Node | |Node | | 417 | +--------+ +--------+ +----------+ +-------+ | 418 | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | 419 | |--------| |--------| |----------| |-------| | 420 | |--------| |--------| |----------| |-------| | 421 | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | 422 | |--------| |--------| |----------| |-------| | 423 | |--------| |--------| |----------| |-------| | 424 | || VAS7 || || VAS4 || || vIGMP ||
||BAA || | 425 | |--------| |--------| |----------| |-------| | 426 | +--------+ +--------+ +----------+ +-------+ | 427 | | 428 ++-----------+ +---------++ 429 |Network I/O | |Access I/O| 430 +------------+ +----------+ 432 Figure 2: An example of CloudCO architecture 434 The Spine-Leaf architecture deployed inside CloudCO meets the 435 requirements of a network that is adaptable, agile, scalable and dynamic. 437 4. Deployment Considerations 439 RIFT presents the opportunity for organizations building and 440 operating IP fabrics to simplify their operation and deployments 441 while achieving many desirable properties of dynamic routing on 442 such a substrate: 444 * The RIFT design follows a minimum blast radius and minimum necessary 445 epistemological scope philosophy, which leads to very good scaling 446 properties while delivering maximum reactiveness. 448 * RIFT allows for extensive Zero Touch Provisioning within the 449 protocol. In its most extreme version, RIFT does not rely on any 450 specific addressing and for an IP fabric can operate using IPv6 ND 451 [RFC4861] only. 453 * RIFT has provisions to detect common IP fabric mis-cabling 454 scenarios. 456 * RIFT automatically negotiates BFD per link, allowing 457 IP and micro-BFD [RFC7130] to replace LAGs, which hide bandwidth 458 imbalances in case of constituent failures. Further automatic 459 link validation techniques similar to [RFC5357] could be supported 460 as well. 462 * RIFT inherently solves many difficult problems associated with the 463 use of traditional routing topologies with dense meshes and high 464 degrees of ECMP by including automatic bandwidth balancing, flood 465 reduction and automatic disaggregation on failures, while providing 466 maximum aggregation of prefixes in default scenarios.
468 * RIFT reduces FIB size towards the bottom of the IP fabric, where 469 most nodes reside, and thereby allows for cheaper hardware on the 470 edges and the introduction of modern IP fabric architectures that 471 encompass e.g. server multi-homing. 473 * RIFT provides valley-free routing and is thereby loop-free. 474 This allows the use of any valley-free path in the bi-sectional 475 fabric bandwidth between two destinations irrespective of their 476 metrics, which can be used to balance load on the fabric in 477 different ways. 479 * RIFT includes a key-value distribution mechanism, which allows for 480 many future applications such as automatic provisioning of basic 481 overlay services or automatic key roll-overs over whole fabrics. 483 * RIFT is designed for minimum delay in case of prefix mobility on 484 the fabric. 486 * Many further operational and design points collected over many 487 years of routing protocol deployments have been incorporated in 488 RIFT, such as fast flooding rates, protection of information 489 lifetimes and operationally easily recognizable remote ends of 490 links and node names. 492 4.1. South Reflection 494 South reflection is a mechanism whereby South Node TIEs are "reflected" 495 back up north to allow nodes at the same level without E-W links to "see" 496 each other. 498 For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs 499 from ToF21 to ToF22 separately. Respectively, 500 Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 501 to ToF21 separately. So ToF22 and ToF21 see each other's node 502 information as level 2 nodes. 504 In an equivalent fashion, as the result of the south reflection 505 between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, 506 Spine121 and Spine122 know each other at level 1. 508 4.2.
Suboptimal Routing on Link Failures 509 +--------+ +--------+ 510 | ToF21 | | ToF22 | LEVEL 2 511 ++--+-+-++ ++-+--+-++ 512 | | | | | | | + 513 | | | | | | | linkTS8 514 +-------------+ | +-+linkTS3+-+ | | | +--------------+ 515 | | | | | | + | 516 | +----------------------------+ | linkTS7 | 517 | | | | + + + | 518 | | | +-------+linkTS4+------------+ | 519 | | | + + | | | 520 | | | +------------+--+ | | 521 | | | | | linkTS6 | | 522 +-+----++ ++-----++ ++------+ ++-----++ 523 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 524 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 525 | | | | | | | | 526 | +--------------+ | + ++XX+linkSL6+---+ + 527 | | | | linkSL5 | | linkSL8 528 | +------------+ | | + +---+linkSL7+-+ | + 529 | | | | | | | | 530 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 531 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 532 +-+-----+ ++------+ +-----+-+ +-+-----+ 533 + + + + 534 Prefix111 Prefix112 Prefix121 Prefix122 536 Figure 3: Suboptimal routing upon link failure use case 538 As shown in Figure 3, as the result of the south reflection between 539 Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and 540 Spine122 know each other at level 1. 542 Without the disaggregation mechanism, when linkSL6 fails, a packet from 543 Leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 544 then go down through linkTS4 to linkSL8 to Leaf122, or go up through 545 linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to 546 Leaf122, based on the pure default route. This is a case of suboptimal 547 routing, or bow-tieing. 549 With the disaggregation mechanism, when linkSL6 fails, Spine122 will 550 detect the failure according to the reflected node S-TIE from 551 Spine121. Based on the disaggregation algorithm provided by RIFT, 552 Spine122 will explicitly advertise prefix122 in a Disaggregated Prefix 553 S-TIE PrefixesElement(prefix122, cost 1).
The packet from Leaf121 to 554 prefix122 will then only be sent to linkSL7, following a longest-prefix 555 match to prefix122 directly, and go down through linkSL8 to 556 Leaf122. 558 4.3. Black-Holing on Link Failures 560 +--------+ +--------+ 561 | ToF 21 | | ToF 22 | LEVEL 2 562 ++-+--+-++ ++-+--+-++ 563 | | | | | | | | 564 | | | | | | | linkTS8 565 +--------------+ | +--linkTS3-X+ | | | +--------------+ 566 linkTS1 | | | | | | | 567 | +-----------------------------+ | linkTS7 | 568 | | | | | | | | 569 | | linkTS2 +--------linkTS4-X-----------+ | 570 | | | | | | | | 571 | linkTS5 +-+ +---------------+ | | 572 | | | | | linkTS6 | | 573 +-+----++ +-+-----+ ++----+-+ ++-----++ 574 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 575 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 576 | | | | | | | | 577 | +---------------+ | | +----linkSL6----+ | 578 linkSL1 | | | linkSL5 | | linkSL8 579 | +---linkSL3---+ | | | +----linkSL7--+ | | 580 | | | | | | | | 581 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 582 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 583 +-+-----+ ++------+ +-----+-+ +-+-----+ 584 + + + + 585 Prefix111 Prefix112 Prefix121 Prefix122 587 Figure 4: Black-holing upon link failure use case 589 This scenario illustrates a case where a double link failure occurs and 590 with that black-holing can happen. 592 Without the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 593 packets from Leaf111 to prefix122 would suffer 50% black-holing 594 based on the pure default route. A packet supposed to go up through 595 linkSL1 to linkTS1 and then down through linkTS3 or linkTS4 will be 596 dropped. A packet supposed to go up through linkSL3 to linkTS2 597 and then down through linkTS3 or linkTS4 will be dropped as well. 598 This is the black-holing case. 600 With the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 601 ToF22 will detect the failure according to the reflected node S-TIE 602 of ToF21 from Spine111\Spine112.
Based on the disaggregation 603 algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE 604 with prefix121 and prefix122, which is flooded to spines 111, 112, 605 121 and 122. 607 A packet from Leaf111 to prefix122 will not be routed to linkTS1 or 608 linkTS2. It will only be routed to 609 linkTS5 or linkTS7, following a longest-prefix match to prefix122. 611 4.4. Zero Touch Provisioning (ZTP) 613 Each RIFT node may operate in zero touch provisioning (ZTP) mode. It 614 has no configuration (unless it is a Top-of-Fabric at the top of the 615 topology or it is desired to confine it to a leaf role without leaf-2-leaf 616 procedures). In such a case RIFT will fully configure the node's level 617 after it is attached to the topology. 619 The most important component of ZTP is the automatic level derivation 620 procedure. All the Top-of-Fabric nodes are explicitly marked with the 621 TOP_OF_FABRIC flag, which provides the initial 'seeds' needed for other ZTP 622 nodes to derive their level in the topology. The derivation of the 623 level of each node then happens based on the LIEs received from its 624 neighbors, whereby each node (with the possible exception of configured 625 leafs) tries to attach at the highest possible point in the fabric. 626 This guarantees that even if the diffusion front reaches a node from 627 "below" faster than from "above", it will greedily abandon an already 628 negotiated level derived from nodes topologically below it and 629 properly peer with nodes above. 631 4.5.
Miscabling Examples 633 +----------------+ +-----------------+ 634 | ToF21 | +------+ ToF22 | LEVEL 2 635 +-------+----+---+ | +----+---+--------+ 636 | | | | | | | | | 637 | | | +----------------------------+ | 638 | +---------------------------+ | | | | 639 | | | | | | | | | 640 | | | | +-----------------------+ | | 641 | | +------------------------+ | | | 642 | | | | | | | | | 643 +-+---+-+ +-+---+-+ | +-+---+-+ +-+---+-+ 644 |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 645 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 646 | | | | | | | | | 647 | +---------+ | link-M | +---------+ | 648 | | | | | | | | | 649 | +-------+ | | | | +-------+ | | 650 | | | | | | | | | 651 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 652 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 653 +-------+ +-------+ +-------+ +-------+ 654 Figure 5: A single plane miscabling example 656 Figure 5 shows a single plane miscabling example. It is a perfect 657 fat-tree fabric except for link-M connecting Leaf112 to ToF22. 659 The RIFT control protocol can discover the physical links 660 automatically and detect cabling that violates fat-tree 661 topology constraints. It reacts accordingly to such mis-cabling 662 attempts, at a minimum preventing adjacencies between nodes from 663 being formed and traffic from being forwarded on those mis-cabled 664 links. Leaf112 will in such a scenario use link-M to derive its level 665 (unless it is a leaf) and can report its links to spines 111 and 112 as 666 miscabled, unless the implementation allows horizontal links. 668 Figure 6 shows a multiple plane miscabling example. Since Leaf112 669 and Spine121 belong to two different PoDs, the adjacency between 670 Leaf112 and Spine121 cannot be formed. link-W would be detected and 671 prevented.
673 +-------+ +-------+ +-------+ +-------+ 674 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 675 +-------+ +-------+ +-------+ +-------+ 676 | | | | | | | | 677 | | | +-----------------+ | | | 678 | +--------------------------+ | | | | 679 | | | | | | | | 680 | +------+ | | | +------+ | 681 | | +-----------------+ | | | | | 682 | | | +--------------------------+ | | 683 | A | | B | | A | | B | 684 +-----+-+ +-+---+-+ +-+---+-+ +-+-----+ 685 |Spin111| |Spin112| +----+Spin121| |Spin122| LEVEL 1 686 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 687 | | | | | | | | | 688 | +---------+ | | | +---------+ | 689 | | | | link-W | | | | 690 | +-------+ | | | | +-------+ | | 691 | | | | | | | | | 692 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 693 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 694 +-------+ +-------+ +-------+ +-------+ 695 +--------PoD#1----------+ +---------PoD#2---------+ 697 Figure 6: A multiple plane miscabling example 699 RIFT provides an optional level determination procedure in its Zero 700 Touch Provisioning mode. Nodes in the fabric without their level 701 configured determine it automatically. This can, however, have possibly 702 counter-intuitive consequences. One extreme failure scenario 703 is depicted in Figure 7: if all northbound links of 704 Spine11 fail at the same time, Spine11 negotiates a lower level than 705 Leaf111 and Leaf112. 707 To prevent such a scenario, where leafs are expected to act as switches, the 708 LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is 709 invalid, Spine11 would not derive a valid level from the topology in 710 Figure 7. It will be isolated from the whole fabric and it would be 711 up to the leafs to declare the links towards such a spine as miscabled.
713     +-------+  +-------+          +-------+  +-------+
714     |ToF A1|   |ToF A2|           |ToF A1|   |ToF A2|
715     +-------+  +-------+          +-------+  +-------+
716       |  |       |  |               |          |
717       |  +-------+  |               |          |
718       +    +     |  |    ====>      |          |
719       X    X +------+ |             +------+   |
720       +    + |        |             |      |   |
721     +----+--+  +-+-----+          +-+-----+
722     |Spine11|  |Spine12|          |Spine12|
723     +-+---+-+  ++----+-+          ++----+-+
724       |   |     |    |              |    |
725       |   +---------+ |             |    |
726       |         | |   |             |    |
727       |  +-------+ |  |        +-------+ |
728       |  |         |  |        |         |
729     +-+---+-+  +--+--+-+     +-----+-+  +-----+-+
730     |Leaf111|  |Leaf112|     |Leaf111|  |Leaf112|
731     +-------+  +-------+     +-+-----+  +-+-----+
732                                |          |
733                                |  +--------+
734                                |  |
735                              +-+---+-+
736                              |Spine11|
737                              +-------+

739                       Figure 7: Fallen spine

741   4.6.  Positive vs. Negative Disaggregation

743   Disaggregation is the procedure whereby [RIFT] advertises a more specific route Southwards as an exception to the aggregated fabric-default route pointing North.  Disaggregation is useful when a prefix within the aggregation is reachable via some of the parents but not the others at the same level of the fabric.  It is mandatory at the ToF level since a ToF node that cannot reach a prefix becomes a black hole for that prefix.  The hard problem is to know which prefixes are reachable by whom.

752   In the general case, [RIFT] solves that problem by interconnecting the ToF nodes so they can exchange the full list of prefixes that exist in the fabric and figure out when a ToF node lacks reachability to an existing prefix.  This requires additional ports at the ToF, typically 2 ports per ToF node to form a ToF-spanning ring.  [RIFT] also defines the Southbound Reflection procedure that enables a parent to explore the direct connectivity of its peers, meaning their own parents and children; based on the advertisements received from the shared parents and children, it may enable the parent to infer the prefixes its peers can reach.
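The inference enabled by Southbound Reflection can be sketched as follows.  This is an illustrative model with invented names and data shapes, not the algorithm specified in [RIFT]: given, for each shared child, the set of parents that child reports adjacencies with, a parent can infer which prefixes a peer at its own level cannot reach.

```python
# Illustrative-only model of the Southbound Reflection inference: a
# peer that is missing the adjacency to a child cannot reach the
# prefixes advertised by that child.
def prefixes_unreachable_by(peer, child_parents, child_prefixes):
    # child_parents: child -> set of parents the child reports
    # child_prefixes: child -> prefixes the child advertises
    missing = set()
    for child, parents in child_parents.items():
        if peer not in parents:
            missing.update(child_prefixes.get(child, ()))
    return missing
```

A parent observing that a peer lacks the adjacency to a child advertising some prefix is exactly the condition under which that prefix becomes a candidate for disaggregation on behalf of the peer.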
763   When a parent lacks reachability to a prefix, it may disaggregate the prefix negatively, i.e., advertise that this parent can be used to reach any prefix in the aggregation except that one.  The Negative Disaggregation signaling is simple and functions transitively from ToF to ToP and then from ToP to Leaf.  But it is hard for a parent to figure out which prefixes it needs to disaggregate, because it does not know what it does not know; as a result, the use of a spanning ring at the ToF is required to operate Negative Disaggregation.  Also, though it is only an implementation problem, programming the FIB is complex compared to normal routes, and may incur recursions.

775   The more classical alternative is, for the parents that can reach a prefix that peers at the same level cannot, to advertise a more specific route to that prefix.  This leverages the normal longest prefix match in the FIB, and does not require a special implementation.  But as opposed to Negative Disaggregation, Positive Disaggregation is difficult and inefficient to operate transitively.

783   Transitivity is not needed for a grandchild if all its parents received the Positive Disaggregation, meaning that they shall all avoid the black hole; when that is the case, they collectively build a ceiling that protects the grandchild.  But until then, a parent that received a Positive Disaggregation may believe that some peers lack the reachability and readvertise too early, or defer and maintain a black hole situation longer than necessary.

791   In a non-partitioned fabric, all the ToF nodes see one another through the reflection and can figure out if one is missing a child.  In that case it is possible to compute the prefixes that the peer cannot reach and disaggregate positively without a ToF-spanning ring.
The ToF nodes can also ascertain that the ToP nodes are each connected to at least one ToF node that can still reach the prefix, meaning that the transitive operation is not required.

799   The bottom line is that in a fabric that is partitioned (e.g., using multiple planes) and/or where the ToP nodes are not guaranteed to always form a ceiling for their children, it is mandatory to use Negative Disaggregation.  On the other hand, in a highly symmetrical and fully connected fabric (e.g., a canonical Clos Network), the Positive Disaggregation method avoids the complexity and cost associated with the ToF-spanning ring.

807   Note that in the case of Positive Disaggregation, the first ToF node(s) that announces a more specific route attracts all the traffic for that route and may suffer from a transient incast.  A ToP node that defers injecting the longer prefix in the FIB, in order to receive more advertisements and spread the packets better, also keeps on sending a portion of the traffic to the black hole in the meantime.  In the case of Negative Disaggregation, the last ToF node(s) that injects the route may also incur an incast issue; this problem would occur if a prefix that becomes totally unreachable is disaggregated, but doing so is mostly useless and is not recommended.

818   4.7.  Mobile Edge and Anycast

820   When a physical or a virtual node changes its point of attachment in the fabric from a previous-leaf to a next-leaf, new routes must be installed that supersede the old ones.  Since the flooding flows Northwards, the nodes (if any) between the previous-leaf and the common parent are not immediately aware that the path via the previous-leaf is obsolete, and a stale route may exist for a while.  The common parent needs to select the freshest route advertisement in order to install the correct route via the next-leaf.
This requires that the fabric determines the sequence of the movements of the mobile node.

831   On the one hand, a classical sequence counter provides a total order for a while but it will eventually wrap.  On the other hand, a timestamp provides a permanent order but it may miss a movement that happens too quickly compared to the granularity of the timing information.  It is not envisioned in the short term that the average fabric supports a Precision Time Protocol, and the precision that may be available with the Network Time Protocol [RFC5905], in the order of 100 to 200ms, may not necessarily be enough to cover, e.g., the fast mobility of a Virtual Machine.

841   Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that combines a sequence counter from the mobile node and a timestamp from the network taken at the leaf when the route is injected.  If the timestamps of the concurrent advertisements are comparable (i.e., more distant than the precision of the timing protocol), then the timestamp alone is used to determine the relative freshness of the routes.  Otherwise, the sequence counter from the mobile node, if available, is used.  One caveat is that the sequence counter must not wrap within the precision of the timing protocol.  Another is that the mobile node may not even provide a sequence counter, in which case the mobility itself must be slower than the precision of the timing.

854   Mobility must not be confused with Anycast.  In both cases, the same address is injected in RIFT at different leaves.  In the case of mobility, only the freshest route must be conserved, since the mobile node changed its point of attachment from one leaf to the next.
In the case of anycast, the node may be either multihomed (attached to multiple leaves in parallel) or reachable beyond the fabric via multiple routes that are redistributed to different leaves; either way, in the case of anycast, the multiple routes are equally valid and should be conserved.  Without further information from the redistributed routing protocol, it is impossible to sort out a movement from a redistribution that happens asynchronously on different leaves.  [RIFT] expects that anycast addresses are advertised within the timing precision, which is typically the case with low-precision timing and a multihomed node.  Beyond that time interval, RIFT interprets the lag as mobility and only the freshest route is retained.

871   When using IPv6 [RFC8200], RIFT suggests leveraging "Registration Extensions for IPv6 over Low-Power Wireless Personal Area Network (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND interaction between the mobile node and the leaf.  This provides not only a sequence counter but also a lifetime and a security token that may be used to protect the ownership of an address.  When using [RFC8505], the parallel registration of an anycast address to multiple leaves is done with the same sequence counter, whereas the sequence counter is incremented when the point of attachment changes.  This way, it is possible to differentiate a mobile node from a multihomed node, even when the mobility happens within the timing precision.  It is also possible for a mobile node to be multihomed as well, e.g., to change only one of its points of attachment.

886   4.8.  IPv4 over IPv6

888   RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network.  The IPv6 address family is configured via the usual ND mechanisms, and IPv4 routes can then use IPv6 next hops, analogous to RFC 5549.
It is expected that the whole fabric supports the same type of forwarding of address families on all the links.  RIFT provides an indication of whether a node is IPv4 forwarding capable, and implementations are possible where different routing tables are computed per address family as long as the computation remains loop-free.

897                      +-----+     +-----+
898       +---+---+      | ToF |     | ToF |
899           ^          +--+--+     +-----+
900           |           | |  |      |  |
901           |           | +-------------+ |
902           |           | +--------+ | |
903           |           |          | | |
904          V6         +-----+     +-+---+
905      Forwarding     |SPINE|     |SPINE|
906           |         +--+--+     +-----+
907           |           | |  |      |  |
908           |           | +-------------+ |
909           |           | +--------+ | |
910           |           |          | | |
911           v         +-----+     +-+---+
912       +---+---+     |LEAF |     | LEAF|
913                     +--+--+     +--+--+
914                        |           |
915          IPv4 prefixes |           | IPv4 prefixes
916                        |           |
917                    +---+----+  +---+----+
918                    |   V4   |  |   V4   |
919                    | subnet |  | subnet |
920                    +--------+  +--------+

922                      Figure 8: IPv4 over IPv6

924   4.9.  In-Band Reachability of Nodes

926   RIFT does not require that the nodes of the fabric have reachable addresses, but operational reasons to reach the internal nodes may exist.  Figure 9 shows an example where an NMS attaches to LEAF1.

930        +-------+       +-------+
931        | ToF1  |       | ToF2  |
932        ++---- ++       ++-----++
933         |     |         |     |
934         |     +----------+    |
935         |  +--------+    |    |
936         |  |        |    |    |
937        ++-----++   +--+---++
938        |SPINE1 |   |SPINE2 |
939        ++-----++   ++-----++
940         |     |     |     |
941         |     +----------+|
942         |  +--------+ |   |
943         |  |        | |   |
944        ++-----++   +--+---++
945        | LEAF1 |   | LEAF2 |
946        +---+---+   +-------+
947            |
948            |NMS

950              Figure 9: In-Band reachability of node

952   If the NMS wants to access LEAF2, it simply works, because the loopback address of LEAF2 is flooded in its Prefix North TIE.

955   If the NMS wants to access SPINE2, it simply works too, because a spine node always advertises its loopback address in its Prefix North TIE.  The NMS may reach SPINE2 via LEAF1-SPINE2 or via LEAF1-SPINE1-ToF1/ToF2-SPINE2.
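The cases in this section can be condensed into a toy helper.  This is not a RIFT mechanism; the function name and the three-level assumption (leaf=0, spine=1, ToF=2) are invented for illustration: for an NMS attached at a leaf, a leaf's or spine's loopback flooded northbound suffices, while a ToF's loopback must be injected southbound.

```python
# Toy summary, assuming a three-level fabric and an NMS at a leaf:
# leaves and spines are reachable via their loopbacks flooded in the
# Prefix North TIE, while a ToF's loopback must be injected into its
# Prefix South TIE (and, on failures between that ToF and the spines,
# propagated all the way down to the leaves).
def ties_carrying_loopback(node_level, tof_level=2):
    if node_level < tof_level:
        return {"Prefix North TIE"}   # e.g. LEAF2, SPINE2
    return {"Prefix South TIE"}       # e.g. ToF2
```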
960   If the NMS wants to access ToF2, ToF2's loopback address needs to be injected into its Prefix South TIE.  Otherwise, the traffic from the NMS may be sent to ToF1.

964   In case of a failure between ToF2 and the spine nodes, ToF2's loopback address must be sent all the way down to the leaves.

967   4.10.  Dual Homing Servers

969   Each RIFT node may operate in zero touch provisioning (ZTP) mode.  It has no configuration (unless it is a Top-of-Fabric at the top of the topology, or it must operate in the topology as a leaf and/or support leaf-2-leaf procedures) and it will fully configure itself after being attached to the topology.

975          +---+     +---+     +---+
976          |ToF|     |ToF|     |ToF|
977          +---+     +---+     +---+
978           | |       | |       | |
979           | +----------------+ | |
980           | |       | |      | | |
981           |   +----------------+ |
982           | | |     | |      | | |
983          +----------+--+  +--+----------+
984          | Spine|ToR1  |  | Spine|ToR2  |
985          +--+------+---+  +--+-------+--+
986       +---+ |      |   |     | |     | +---+
987       |   | |      |   |     | |     | |   |
988       |   +-----------------+ | |    | |   |
989       |     |      |   +-------------+ |   |
990       +  |  +  |   |   |     |-----------------+ |
991       X  |  X  |   +--------x-----+  |  X  |
992       +  |  +  |   |   |     |       +  |
993      +---+    +---+      +---+      +---+
994      |   |    |   |      |   |      |   |
995      +---+    +---+ ...............+---+      +---+
996      SV(1)    SV(2)      SV(n+1)    SV(n)

998                   Figure 10: Dual-homing servers

1000  In a single plane, the worst condition is disaggregation of all the other servers at the same level.  Suppose the links from ToR1 to all the servers become unavailable.  All the servers' routes are disaggregated and the FIB of each server will be expanded with n-1 more specific routes.

1006  Sometimes, people may prefer to disaggregate from the ToRs to the servers from the start, i.e., the servers have a couple tens of routes in their FIBs beside the default routes from the start, to avoid breakages at rack level.  Full disaggregation of the fabric can be achieved by configuration, which RIFT supports.

1012  4.11.
Fabric With A Controller

1014  There are many different ways to deploy a controller.  One possibility is attaching the controller to the RIFT domain at the ToF, and another possibility is attaching it at a leaf.

1018                  +------------+
1019                  | Controller |
1020                  ++----------++
1021                   |          |
1022                   |          |
1023                  +----++    ++----+
1024      -------     | ToF |    | ToF |
1025         |        +--+--+    +-----+
1026         |          | |  |     |  |
1027         |          | +-------------+ |
1028         |          | +--------+ | |
1029         |          |          | | |
1030                  +-----+     +-+---+
1031    RIFT domain   |SPINE|     |SPINE|
1032                  +--+--+     +-----+
1033         |          | |  |     |  |
1034         |          | +-------------+ |
1035         |          | +--------+ | |
1036         |          |          | | |
1037         |        +-----+     +-+---+
1038      -------     |LEAF |     | LEAF|
1039                  +-----+     +-----+

1041               Figure 11: Fabric with a controller

1043  4.11.1.  Controller Attached to ToFs

1045  If a controller is attached to the RIFT domain at the ToF, it usually uses dual-homed connections.  The loopback prefix of the controller should be advertised down by the ToF and spine nodes to the leaves.  If the controller loses its link to a ToF node, that ToF node must withdraw the controller's prefix (different mechanisms can be used for this).

1051  4.11.2.  Controller Attached to Leaf

1053  If the controller is attached to the fabric at a leaf, no special provisions are needed.

1056  4.12.  Internet Connectivity With Underlay

1058  If global addressing is used without an overlay, an external default route needs to be advertised through the RIFT fabric to achieve internet connectivity.  To enable forwarding across the entire RIFT fabric, an internal fabric prefix needs to be advertised in the Prefix South TIE by the ToF and spine nodes.

1064  4.12.1.  Internet Default on the Leaf

1066  In case internet access is requested from a leaf and the internet gateway is another leaf, the gateway leaf needs to advertise a default route in its Prefix North TIE.

1070  4.12.2.
Internet Default on the ToFs

1072  In case internet access is requested from a leaf and the internet gateway is a ToF, the ToF and spine nodes need to advertise a default route in the Prefix South TIE.

1076  4.13.  Subnet Mismatch and Address Families

1078     +--------+                  +--------+
1079     |        | LIE          LIE |        |
1080     |   A    | +---->    <----+ |   B    |
1081     |        +---------------------+     |
1082     +--------+                  +--------+
1083       X/24                        Y/24

1085                  Figure 12: Subnet mismatch

1087  LIEs are exchanged over all links running RIFT to perform Link (Neighbor) Discovery.  A node MUST NOT originate LIEs on an address family if it does not process received LIEs on that family.  LIEs on the same link are considered part of the same negotiation independent of the address family they arrive on.  An implementation MUST be ready to accept TIEs on all addresses it used as source of LIE frames.

1094  As shown in the figure above, without further checks an adjacency between nodes A and B may form, but forwarding between node A and node B may fail because subnet X mismatches subnet Y.

1098  To prevent this, a RIFT implementation should check for subnet mismatch just as, e.g., IS-IS does.  This can lead to scenarios where, despite the exchange of LIEs in both address families, an adjacency may end up being formed in a single AF only.  This is a consideration especially in the scenarios of Section 4.8.

1104  4.14.
Anycast Considerations

1106                  + traffic
1107                  |
1108                  v
1109           +------+------+
1110           |     ToF     |
1111           +---+-----+---+
1112            | |       | |
1113     +------------+ | | +------------+
1114     |            | | |            |
1115  +---+---+  +-------+  +-------+  +---+---+
1116  |       |  |       |  |       |  |       |
1117  |Spine11|  |Spine12|  |Spine21|  |Spine22|   LEVEL 1
1118  +-+---+-+  ++----+-+  +-+---+-+  ++----+-+
1119    |   |     |    |      |   |     |    |
1120    |   +---------+ |     |   +---------+ |
1121    |         | |   |     |         | |   |
1122    |  +-------+ |  |     |  +-------+ |  |
1123    |  |         |  |     |  |         |  |
1124  +-+---+-+  +--+--+-+  +-+---+-+  +--+--+-+
1125  |       |  |       |  |       |  |       |
1126  |Leaf111|  |Leaf112|  |Leaf121|  |Leaf122|   LEVEL 0
1127  +-+-----+  ++------+  +-----+-+  +-----+-+
1128    +         +               +    ^     |
1129  PrefixA   PrefixB         PrefixA |  PrefixC
1130                                    |
1131                                    + traffic

1133                      Figure 13: Anycast

1135  If the traffic comes from the ToF towards Leaf111 or Leaf121, which both attach the anycast prefix PrefixA, RIFT deals with this case well.  But if the traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1, and Spine21 or Spine22 does not know about the other instance of PrefixA attached to Leaf111.  So the traffic will always get to Leaf121 and never to Leaf111.  If the intention is that the traffic should be offloaded to Leaf111, policy guided prefixes [PGP reference] should be used.

1143  5.  Acknowledgements

1144  6.  Contributors

1146  The following people (listed in alphabetical order) contributed significantly to the content of this document and should be considered co-authors:

1150  Tony Przygienda
1152  Juniper Networks
1154  1194 N. Mathilda Ave
1156  Sunnyvale, CA 94089
1158  US
1160  Email: prz@juniper.net

1162  7.  Normative References

1164  [ISO10589-Second-Edition]
1165             International Organization for Standardization,
1166             "Intermediate system to Intermediate system intra-domain
1167             routeing information exchange protocol for use in
1168             conjunction with the protocol for providing the
1169             connectionless-mode Network Service (ISO 8473)", November
1170             2002.
1172 [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central 1173 Office Reference Architectural Framework", January 2018. 1175 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 1176 DOI 10.17487/RFC2328, April 1998, 1177 . 1179 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 1180 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 1181 DOI 10.17487/RFC4861, September 2007, 1182 . 1184 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1185 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1186 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1187 . 1189 [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., 1190 Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional 1191 Forwarding Detection (BFD) on Link Aggregation Group (LAG) 1192 Interfaces", RFC 7130, DOI 10.17487/RFC7130, February 1193 2014, . 1195 [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and 1196 D. Afanasiev, "RIFT: Routing in Fat Trees", Work in 1197 Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May 1198 2020, 1199 . 1201 [I-D.white-distoptflood] 1202 White, R., Hegde, S., and S. Zandi, "IS-IS Optimal 1203 Distributed Flooding for Dense Topologies", Work in 1204 Progress, Internet-Draft, draft-white-distoptflood-04, 27 1205 July 2020, 1206 . 1208 8. Informative References 1210 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1211 "Network Time Protocol Version 4: Protocol and Algorithms 1212 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1213 . 1215 [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 1216 (IPv6) Specification", STD 86, RFC 8200, 1217 DOI 10.17487/RFC8200, July 2017, 1218 . 1220 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 1221 Perkins, "Registration Extensions for IPv6 over Low-Power 1222 Wireless Personal Area Network (6LoWPAN) Neighbor 1223 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 1224 . 
1226 Authors' Addresses 1228 Yuehua Wei (editor) 1229 ZTE Corporation 1230 No.50, Software Avenue 1231 Nanjing 1232 210012 1233 China 1235 Email: wei.yuehua@zte.com.cn 1236 Zheng Zhang 1237 ZTE Corporation 1238 No.50, Software Avenue 1239 Nanjing 1240 210012 1241 China 1243 Email: zhang.zheng@zte.com.cn 1245 Dmitry Afanasiev 1246 Yandex 1248 Email: fl0w@yandex-team.ru 1250 Tom Verhaeg 1251 Juniper Networks 1253 Email: tverhaeg@juniper.net 1255 Jaroslaw Kowalczyk 1256 Orange Polska 1258 Email: jaroslaw.kowalczyk2@orange.com 1260 Pascal Thubert 1261 Cisco Systems, Inc 1262 Building D 1263 45 Allee des Ormes - BP1200 1264 06254 MOUGINS - Sophia Antipolis 1265 France 1267 Phone: +33 497 23 26 34 1268 Email: pthubert@cisco.com