Network Working Group                                             J. Xie
Internet-Draft                                       Huawei Technologies
Intended status: Informational                                     X. Xu
Expires: May 7, 2020                                        Alibaba Inc.
                                                                  G. Yan
                                                     Huawei Technologies
                                                              M. McBride
                                                               Futurewei
                                                        November 4, 2019


           Use of BIER Entropy for Data Center Clos Networks
               draft-ietf-bier-entropy-staged-dc-clos-02

Abstract

   Bit Index Explicit Replication (BIER) introduces a new
   multicast-specific BIER header.  BIER can be applied to the
   Multiprotocol Label Switching (MPLS) data plane or to a non-MPLS
   data plane.  Entropy is a technique used in BIER to support load
   balancing.  This document examines and describes how BIER entropy
   can be applied to Data Center Clos networks for path selection.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 7, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Problem Statement and Considerations
     3.1.  Problem Statement
     3.2.  Considerations
   4.  Use of BIER Entropy for DC Clos Network
     4.1.  Use of BIER Entropy for DC Clos Network
     4.2.  Steering for elephant flows
     4.3.  Path Division for Tenant flows to different SIs
     4.4.  Link Failure and Convergence
   5.  Data-Plane Processing
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Bit Index Explicit Replication (BIER) [RFC8279] is an architecture
   that provides optimal multicast forwarding without requiring
   intermediate routers to maintain any per-flow state, by using a
   multicast-specific BIER header.  [RFC8296] defines two types of BIER
   encapsulation: MPLS encapsulation and non-MPLS encapsulation.
   Entropy is a technique used in BIER to support load balancing.  This
   document examines and describes how BIER entropy can be applied to
   Data Center Clos networks for path selection.

2.  Terminology

   Readers of this document are assumed to be familiar with the
   terminology and concepts of the documents listed as Normative
   References.

3.  Problem Statement and Considerations

3.1.  Problem Statement

   A common choice for a horizontally scalable topology in a Data Center
   is a Clos topology.  This topology features an odd number of stages;
   [RFC7938] gives a 5-stage Clos topology as an example.

   Equal-Cost Multipath (ECMP) is the fundamental load-sharing mechanism
   used by a Clos topology.  Effectively, every lower-tier device uses
   all of its directly attached upper-tier devices to load-share traffic
   destined to the same IP prefix.  The number of ECMP paths between any
   two Tier 3 devices in a Clos topology is equal to the number of
   devices in the middle stage (Tier 1).  For example, Figure 1
   illustrates a topology where Tier 3 device L1 has four paths to reach
   servers X and Y, via Tier 2 devices S1 and S2 and then Tier 1 devices
   S11, S12, S21 and S22 respectively.

                                     Tier 1
                                     +-----+
  Cluster                            |SUPER|
 +----------------------------+   +--| S11 |--+
 |                            |   |  +-----+  |
 |                   Tier 2   |   |           |  Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------|SPINE|------+--|SUPER|--+--|SPINE|-------------+
 |     |       +-----| S1  |------+  | S12 |  +--| S3  |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------|SPINE|------+  |SUPER|  +--|SPINE|-----------+ |
 |     | |     | +---| S2  |------+--| S21 |--+--| S4  |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | LEAF| | LEAF|          |   +--|SUPER|--+          | LEAF| | LEAF|
 |   | L1  | | L2  | Tier 3   |      | S22 |     Tier 3  | L3  | | L4  |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            X Y     O O
 |       Servers              |                              Servers
 +----------------------------+

                    Figure 1: 5-Stage Clos Topology

   When BIER is deployed in a multi-tenant data center network
   environment for efficient delivery of Broadcast, Unknown-unicast and
   Multicast (BUM) traffic, a network operator may want a deterministic
   path for every packet.  For example, when L1 needs to send a BUM
   packet to L3 and L4, which are in different Set Identifiers (SIs), L1
   has to send the packet twice, and the operator expects the two copies
   to follow the deterministic paths L1->S1->S11-->L3 and
   L1->S2->S21-->L4 respectively.  Another example of using a
   deterministic path in a DC is the per-flow steering of "elephant"
   flows defined in [I-D.ietf-spring-segment-routing-msdc].

   A deterministic path for a multicast packet, with multiple staged
   equal cost paths, is comparable to a traffic-engineering path defined
   in [I-D.ietf-mpls-spring-entropy-label] for a unicast path with
   multiple hop equal cost paths.

3.2.  Considerations

   The idea behind entropy is that the ingress router computes a hash
   based on several fields from a given packet and places the result in
   an additional label, named the "entropy label".  A transit router can
   then use this entropy label as part of its hash keys.  When an
   entropy label is used, the keys used in the hashing functions are
   still a local configuration matter.  A router may use the entropy
   label alone or in combination with other fields from the incoming
   packet.  The purpose of the hashing function is to spread a large
   number of flows randomly across a small number of equal cost paths.

   If one wants, however, to get a deterministic path from the equal
   cost paths, one can use part of the 20-bit entropy field.  For
   example, bit 0 to bit 2 of the entropy label can represent a value of
   0 to 7, and thus can be used to select a deterministic path from 8
   equal cost paths.  Routers in different tiers can therefore each
   select a deterministic path independently by using different parts
   of the 20-bit entropy label, and together form an end-to-end
   deterministic path.

   This is simple and especially applicable to DC Clos networks, because
   data delivery for tenants in DC Clos networks is always multi-staged,
   with the stages in the upstream direction having equal cost paths.

4.  Use of BIER Entropy for DC Clos Network

4.1.  Use of BIER Entropy for DC Clos Network

   Take the 5-stage Clos network in Figure 1 as an example.

   Tier 2 in every cluster has N nodes, and Tier 1 has M nodes.  M is
   equal to N multiplied by P.

   Tier 3 switches, in the upstream direction, act as stage 1 of data
   delivery and have N equal cost paths to every BFER in other clusters.
   Tier 2 switches, in the upstream direction, act as stage 2 of data
   delivery and have P equal cost paths to every BFER in other clusters.

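   As a minimal sketch of the bit-slicing idea from Section 3.2 applied
   to the N and P path counts above, the Python below packs one path
   choice per stage into a 20-bit entropy value and shows what each
   stage would read back.  The function names, the zero-based path
   indices, and the lowest-bits-first layout are assumptions made for
   illustration; this document does not define them.

```python
ENTROPY_BITS = 20  # width of the entropy field discussed in Section 3.2


def bits_needed(paths):
    """Smallest number of bits that can index `paths` equal cost paths."""
    return max(1, (paths - 1).bit_length())


def allocate_fields(paths_per_stage):
    """Give each stage an (offset, width) bit slice, lowest bits first."""
    fields, offset = [], 0
    for paths in paths_per_stage:
        width = bits_needed(paths)
        fields.append((offset, width))
        offset += width
    if offset > ENTROPY_BITS:
        raise ValueError("stages need more than 20 entropy bits")
    return fields


def compose_entropy(choices, fields):
    """Pack one zero-based path index per stage into an entropy value."""
    entropy = 0
    for index, (offset, width) in zip(choices, fields):
        if not 0 <= index < (1 << width):
            raise ValueError("path index does not fit its bit slice")
        entropy |= index << offset
    return entropy


def select_path(entropy, field):
    """Path index that a switch reads from its own slice of the entropy."""
    offset, width = field
    return (entropy >> offset) & ((1 << width) - 1)


# Hypothetical sizing: N = 4 paths at stage 1, P = 48 paths at stage 2.
fields = allocate_fields([4, 48])
entropy = compose_entropy([3, 47], fields)
```

   With N = 4 and P = 48 this gives a 2-bit slice for stage 1 and a
   6-bit slice for stage 2, and choosing the last path at both stages
   yields an entropy value of 191.  Each switch only needs a shift and
   a mask to recover its own path index.
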
   Example 1: One can configure each Tier 3 switch to use bit 0 for path
   selection when N is equal to 2, and each Tier 2 switch to use bit 1
   for path selection when P is equal to 2.

   Example 2: One can configure each Tier 3 switch to use bit 0 to bit 1
   for path selection when N is equal to 4, and each Tier 2 switch to
   use bit 2 to bit 7 for path selection when P is equal to 48.

   Assume that each of the Tier 3 and Tier 2 switches in the example has
   two parameters, X and Y, configured locally to identify the part of
   the entropy label it uses for path selection.  Then, in Example 2:

   o  Each Tier 3 (Stage 1) switch has the pair of parameters (X1=1,
      Y1=4), that is, a field starting at bit 0 with 4 possible values.

   o  Each Tier 2 (Stage 2) switch has the pair of parameters
      (X2=X1*Y1=4, Y2=64), that is, a field starting at bit 2 with 64
      possible values.

   o  Each Tier 3 (Stage 1) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-3.

   o  Each Tier 2 (Stage 2) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-47.

   On each Tier 3 (Stage 1) switch, each BIFT has a preferred
   neighboring BFR.  For example, LEAF L1 has S1 as the preferred
   neighbor for BIFT-0 and S2 as the preferred neighbor for BIFT-1.
   When the BIFT-0 table is formed through the underlay routing to every
   BFER, the preferred neighboring BFR has the highest priority among
   all the locally available ECMP paths.

   An end-to-end deterministic path for a BIER packet can then be
   obtained by calculating an entropy label value as follows:

   o  Entropy = (P1-1)*X1 + (P2-1)*X2

   where P1 represents one of the Stage 1 equal cost paths, with a value
   between 1 and N, and P2 represents one of the Stage 2 equal cost
   paths, with a value between 1 and P.

4.2.  Steering for elephant flows

   One can steer an "elephant" flow onto a single end-to-end
   deterministic path, or onto several separate end-to-end deterministic
   paths across different SIs.

4.3.  Path Division for Tenant flows to different SIs

   When the VNEs for a tenant span multiple SIs, it is useful to divide
   the paths of the BUM packets across the different SIs.

   When BIER is used as the BUM tunnel, one can configure a policy, on
   each VNE for each VNI, to use different paths for different BIER SIs.

4.4.  Link Failure and Convergence

   As stated above, each BIFT on a BFR has a preferred neighboring BFR.
   When the link to the preferred neighbor of some BIFT (say BIFT-X)
   fails, BIFT-X converges normally, but the path through BIFT-X will
   then probably no longer be the 'best optimized' path.  For example,
   if the link between S1 and L2 fails, then S1, the preferred neighbor
   of BIFT-0 on LEAF L1, is no longer a neighboring BFR of LEAF L2.  A
   flow whose entropy selects LEAF L1's BIFT-0 then has to be replicated
   on L1: one packet to S1 for BFERs L3 and L4, and one packet to S2 for
   BFER L2.  If the flow changes to an entropy value that selects LEAF
   L1's BIFT-1, it is on the 'best optimized' path again, because L1 no
   longer has to replicate and needs to forward only one copy to S2 for
   BFERs L2, L3 and L4.  Such a change to a flow's entropy is the
   ingress switch's responsibility, possibly with the assistance of a
   controller.

5.  Data-Plane Processing

   The use of the BIER entropy label to select one of several equal cost
   paths is a local configuration matter.  This draft defines a method
   that uses part of the 20-bit entropy label in each router, which
   requires the data plane to perform some bit operations.  These are
   expected to be simpler than a hashing function.

6.  Security Considerations

   This document introduces no new security considerations beyond those
   already specified in [RFC8279] and [RFC8296].

7.  IANA Considerations

   This document contains no actions for IANA.

8.  Acknowledgements

   The authors wish to thank Tony Przygienda, Greg Shepherd, Alia Atlas,
   Jeffery Zhang, Andrew Dolganow, and Toerless Eckert for their
   reviews, comments and suggestions.

9.  References

9.1.  Normative References

   [I-D.ietf-mpls-spring-entropy-label]
              Kini, S., Kompella, K., Sivabalan, S., Litkowski, S.,
              Shakir, R., and J. Tantsura, "Entropy label for SPRING
              tunnels", draft-ietf-mpls-spring-entropy-label-12 (work in
              progress), July 2018.

   [I-D.ietf-spring-segment-routing-msdc]
              Filsfils, C., Previdi, S., Dawra, G., Aries, E., and P.
              Lapukhov, "BGP-Prefix Segment in large-scale data
              centers", draft-ietf-spring-segment-routing-msdc-11 (work
              in progress), November 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
              for Bit Index Explicit Replication (BIER) in MPLS and
              Non-MPLS Networks", RFC 8296, DOI 10.17487/RFC8296,
              January 2018.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

9.2.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

Authors' Addresses

   Jingrong Xie
   Huawei Technologies

   Email: xiejingrong@huawei.com


   Xiaohu Xu
   Alibaba Inc.

   Email: xiaohu.xxh@alibaba-inc.com


   Gang Yan
   Huawei Technologies

   Email: yangang@huawei.com


   Mike McBride
   Futurewei

   Email: mmcbride7@gmail.com