MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                               O. Komolafe
Expires: August 11, 2019                                 Arista Networks
                                                       February 07, 2019

                  Multicast in the Data Center Overview
                     draft-ietf-mboned-dc-deploy-04

Abstract

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.
Reasons for this increase are discussed and then attention is paid to the manner in which this traffic pattern may be judiciously handled in data centers. The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated. Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 11, 2019.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
   1.1. Requirements Language
2.
Reasons for increasing one-to-many traffic patterns
   2.1. Applications
   2.2. Overlays
   2.3. Protocols
3. Handling one-to-many traffic using conventional multicast
   3.1. Layer 3 multicast
   3.2. Layer 2 multicast
   3.3. Example use cases
   3.4. Advantages and disadvantages
4. Alternative options for handling one-to-many traffic
   4.1. Minimizing traffic volumes
   4.2. Head end replication
   4.3. BIER
   4.4. Segment Routing
5. Conclusions
6. IANA Considerations
7. Security Considerations
8. Acknowledgements
9. References
   9.1. Normative References
   9.2. Informative References
Authors' Addresses

1. Introduction

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.
Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges. These trends, allied with the expectation that future highly virtualized data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, specifically given the bandwidth savings it potentially offers. However, such an assumption would be wrong. In fact, there is widespread reluctance to enable IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers. Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges highlighted. Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Reasons for increasing one-to-many traffic patterns

2.1. Applications

Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.
For example, it is well-known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation). This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future. This increase in East-West traffic flows results from VMs often having to exchange numerous messages between themselves as part of executing a specific workload. For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center which may be subsequently polled for status updates. The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers. The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between cameras, studios, mixers, encoders, servers etc. However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure. The development of pertinent standards by the SMPTE, along with the increasing performance of IP routers, means this transition is gathering pace.
A possible outcome of this transition will be the building of IP data centers in broadcast plants. Traffic flows in the broadcast industry are frequently one-to-many and so if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure. In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

One of the few success stories in using conventional IP multicast has been for disseminating market trading data. For example, IP multicast is commonly used today to deliver stock quotes from the stock exchange to financial services providers and then on to the stock analysts or brokerages. The network must be designed with no single point of failure and in such a way that the network can respond in a deterministic manner to any failure. Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network. Another critical requirement is reliability and traceability; regulatory and legal requirements mean that the producer of the market data must know exactly where the flow was sent and be able to prove conclusively that the data was received within agreed SLAs. The stock exchange generating the one-to-many traffic and the stock analysts/brokerages that receive the traffic will typically have their own data centers. Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

Many data center cloud providers provide publish and subscribe applications.
There can be numerous publishers and subscribers and many message channels within a data center. With conventional publish and subscribe servers, a separate message is sent to each subscriber of a publication. With multicast publish/subscribe, only one message is sent, regardless of the number of subscribers. In a publish/subscribe system, client applications, some of which are publishers and some of which are subscribers, are connected to a network of message brokers that receive publications on a number of topics, and send the publications on to the subscribers for those topics. The more subscribers there are in the publish/subscribe system, the greater the improvement to network utilization there might be with multicast.

2.2. Overlays

The proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014] consists of a tenant's VMs distributed across the data center connected by a virtual network known as the overlay network. A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve]. The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a layer 2 segment between the VMs. Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional layer 2 segment, regardless of their physical location within the data center. Naturally, in a layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.
Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric. Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to consider the manner in which one-to-many communication is handled within the IP fabric.

2.3. Protocols

Conventionally, some key networking protocols used in data centers require one-to-many communication. For example, ARP and ND use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover MAC address to IP address mappings. Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated layer 2 segment, regardless of physical location within the data center. The challenges associated with optimally delivering ARP and ND messages in data centers have attracted much attention [RFC6820]. Popular approaches in use mostly seek to exploit characteristics of data center networks to avoid having to broadcast/multicast these messages, as discussed in Section 4.1.

Some networking protocols are being modified or developed specifically to work in data center Clos environments. BGP has been extended to work in these types of DC environments and supports multicast well. RIFT (Routing in Fat Trees) is a new protocol being developed to work efficiently in DC Clos environments and is also being specified to support multicast addressing and forwarding.

3. Handling one-to-many traffic using conventional multicast

3.1.
Layer 3 multicast

PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center. There are three popular flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015]. It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree against the amount of multicast forwarding state that must be maintained at routers. SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns. State is built and maintained for each (S,G) flow. Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups. At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all (S,G)s, therefore minimizing the amount of state. This state reduction is at the expense of an optimal forwarding path between sources and receivers. This use of a shared tree makes BIDIR particularly well-suited to applications with many-to-many traffic patterns, given that the amount of state is uncorrelated with the number of sources. SSM and BIDIR are optimizations of PIM-SM. PIM-SM is still the most widely deployed multicast routing protocol. PIM-SM can also be the most complex. PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree and subsequently there is the option of switching to the SPT (shortest path tree), similar to SSM, or staying on the shared tree, similar to BIDIR.

3.2. Layer 2 multicast

With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.
With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address. Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4), therefore mapping a multicast IP address to a MAC address ignores 5 bits of the IP address. Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, meaning a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address. Therefore, planning is required within an organization to choose IPv4 multicast addresses judiciously in order to avoid address aliasing. When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128-bit IPv6 multicast address into the 48-bit MAC address. It is possible for more than one IPv6 multicast address to map to the same 48-bit MAC address.

The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic. Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376] report to the router attached to the layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address. The data link layer filters the frames, passing those with matching destination addresses to the IP module. Similarly, a transmitting host simply hands the multicast packet to the data link layer, which adds the layer 2 encapsulation using the MAC address derived in the manner previously discussed.
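The two mappings just described can be sketched in a few lines of Python (the function names are illustrative, not from any standard library); the aliasing of 32 IPv4 groups onto one MAC address falls out directly:

```python
import ipaddress

def ipv4_multicast_mac(addr):
    # OUI 01:00:5e plus the low-order 23 bits of the IPv4 multicast
    # address; the remaining 5 significant bits of the 28-bit group
    # ID are discarded, causing the aliasing discussed above.
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (
        low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)

def ipv6_multicast_mac(addr):
    # 33:33 plus the low-order 32 bits of the IPv6 multicast address.
    low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
    return "33:33:%02x:%02x:%02x:%02x" % (
        low32 >> 24, (low32 >> 16) & 0xFF,
        (low32 >> 8) & 0xFF, low32 & 0xFF)

# 233.252.0.1 and 224.124.0.1 differ only in the 5 ignored bits, so
# both map to 01:00:5e:7c:00:01 -- exactly the address aliasing that
# careful group address planning must avoid.
```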
When this Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the layer 2 segment. Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

IGMP snooping, with proxy reporting or report suppression, actively filters IGMP packets in order to reduce load on the multicast router by ensuring only the minimal quantity of information is sent. The switch is trying to ensure the router has only a single entry for the group, regardless of the number of active listeners. If there are two active listeners in a group and the first one leaves, then the switch determines that the router does not need this information since it does not affect the status of the group from the router's point of view. However, the next time there is a routine query from the router, the switch will forward the reply from the remaining host, to prevent the router from believing there are no active listeners. It follows that with active IGMP snooping, the router will generally only know about the most recently joined member of the group.

In order for IGMP, and thus IGMP snooping, to function, a multicast router must exist on the network and generate IGMP queries. The tables (holding the member ports for each multicast group) created for snooping are associated with the querier. Without a querier the tables are not created and snooping will not work. Furthermore, IGMP general queries must be unconditionally forwarded by all switches involved in IGMP snooping. Some IGMP snooping implementations include full querier capability. Others are able to proxy and retransmit queries from the multicast router.
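The snooping behaviour described above amounts to maintaining a per-group set of member ports and falling back to flooding when no snooped state exists. A minimal, hypothetical sketch (class and method names are illustrative, not taken from any real switch implementation):

```python
class IgmpSnoopingTable:
    """Hypothetical sketch of an IGMP snooping switch's group table:
    learn member ports from IGMP reports, prune them on leaves, and
    flood frames for groups with no snooped state."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.members = {}  # group address -> set of member ports

    def report(self, group, port):
        # An IGMP report for 'group' arrived on 'port'.
        self.members.setdefault(group, set()).add(port)

    def leave(self, group, port):
        # Drop empty entries so a group with no remaining members
        # reverts to the default flooding behaviour.
        m = self.members.get(group)
        if m:
            m.discard(port)
            if not m:
                del self.members[group]

    def egress(self, group, ingress):
        # Forward only to member ports; flood if the group is unknown.
        out = self.members.get(group, self.ports)
        return out - {ingress}
```

For example, after a report for 233.252.0.1 arrives on port 2, a frame for that group entering on port 1 is sent only to port 2, while a frame for a group with no snooped state is flooded to all other ports.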
Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks. MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3. However, in contrast to IGMP, MLD does not send its own distinct protocol messages. Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages. MLD snooping works similarly to IGMP snooping, described earlier.

3.3. Example use cases

A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments. In the original VXLAN specification [RFC7348], a data-driven flood and learn control plane was proposed, requiring the data center IP fabric to support multicast routing. A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI). VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI join the multicast group and use it for the exchange of BUM traffic with the other VTEPs. Essentially, the VTEP encapsulates any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmits the packet to the data center fabric. Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.

Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic. For example, whenever a VTEP receives an IGMP report from a locally connected host, it translates this into a PIM join message which will be propagated into the IP fabric.
In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured. In this approach PIM must be configured on the SVIs associated with the VXLAN interface.

Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users. IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode. Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

Windows Media servers send multicast streams to clients. Windows Media Services streams to an IP multicast address and all clients subscribe to that IP address to receive the same stream. This allows a single stream to be played simultaneously by multiple clients, thus reducing bandwidth utilization.

3.4. Advantages and disadvantages

Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature. Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers. As such, no specialized hardware or relatively immature software is involved in using them in data centers. Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well-understood, with widely available best practices and deployment guides for optimizing their performance.
However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity. Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed. Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit characteristics and idiosyncrasies of data centers. For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating to servers within the data center, new VMs being continually spun up and wishing to join the sessions, while all the time other VMs are leaving. In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.

4. Alternative options for handling one-to-many traffic

Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers, and Section 3 has discussed how conventional multicast may be used to handle this traffic. Having said that, there are a number of alternative options for handling this traffic pattern in data centers, as discussed in the subsequent sections. It should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques. Furthermore, as will be shown, introducing a centralized controller or a distributed control plane makes these techniques more potent.

4.1.
Minimizing traffic volumes

If handling one-to-many traffic in data centers can be challenging, then arguably the most intuitive solution is to aim to minimize the volume of such traffic.

It was previously mentioned in Section 2 that the three main causes of one-to-many traffic in data centers are applications, overlays and protocols. While, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols (and often by protocols within overlays). This reduction is possible by exploiting certain characteristics of data center networks: fixed and regular topology, owned and exclusively controlled by a single organization, well-known overlay encapsulation endpoints etc.

A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller. For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, IP address etc. The controller could subsequently distribute this information to every encapsulation endpoint. Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply. Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.

Alternatively, the functionality supported by the controller can be realized by a distributed control plane. BGP-EVPN [RFC7432, RFC8365] is the most popular control plane used in data centers. Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR).
Thus, information about local MAC addresses, MAC to IP address mappings, virtual network identifiers etc can be disseminated. Consequently, ARP requests from local VMs can be suppressed by the encapsulation endpoint.

4.2. Head end replication

A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER). HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast. Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end. Nevertheless, HER is especially attractive when overlays are in use as the replication can be carried out by the hypervisor or encapsulation end point. Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points. Additionally, it is possible to use a number of approaches for constructing and disseminating the list of which endpoints should receive what traffic and so on.

For example, the reluctance of data center operators to enable PIM and IGMP within the data center fabric means VXLAN is often used with HER. Thus, BUM traffic from each VNI is replicated and sent using unicast to remote VTEPs with VMs in that VNI. The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP. Alternatively, the VTEPs may transmit appropriate state to a centralized controller which in turn sends each VTEP the list of remote VTEPs for each VNI. Lastly, HER also works well when a distributed control plane is used instead of the centralized controller. Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.

4.3. BIER

As discussed in Section 3.4, PIM and IGMP face potential scalability challenges when deployed in data centers.
These challenges are typically due to the requirement to build and maintain a distribution tree and the requirement to hold per-flow state in routers. Bit Index Explicit Replication (BIER) [RFC8279] is a new multicast forwarding paradigm that avoids these two requirements.

When a multicast packet enters a BIER domain, the ingress router, known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header to the packet. This header contains a bit string in which each bit maps to an egress router, known as a Bit-Forwarding Egress Router (BFER). If a bit is set, then the packet should be forwarded to the associated BFER. The routers within the BIER domain, Bit-Forwarding Routers (BFRs), use the BIER header in the packet and information in the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise operations to determine how the packet should be replicated optimally so it reaches all the appropriate BFERs.

BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases]. The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs. Thus, knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast. Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information. A challenge associated with using BIER is that, unlike most of the other approaches discussed in this draft, it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

4.4. Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the source routing paradigm, in which the manner in which a packet traverses a network is determined by an ordered list of instructions.
These instructions are known as segments, and may have a semantic local to an SR node or global within an SR domain. SR allows a flow to be steered through any topological path while maintaining per-flow state only at the ingress node of the SR domain. Segment Routing can be applied to the MPLS and IPv6 data planes: in the former, the list of segments is represented by the label stack, and in the latter it is represented as a routing extension header. Use cases are described in [I-D.ietf-spring-segment-routing] and are being considered in the context of BGP-based large-scale data center (DC) design [RFC7938].

Multicast in SR continues to be discussed in a variety of drafts and working groups. The SPRING WG has not yet been chartered to work on multicast in SR. Multicast support may include locally allocating a Segment Identifier (SID) to existing replication solutions, such as PIM, mLDP, P2MP RSVP-TE and BIER. It may also be that a new way to signal and install trees in SR is developed without creating state in the network.

5.  Conclusions

As the volume and importance of one-to-many traffic in data centers increases, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and its inability to exploit characteristics of data center network architectures. Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built by PIM. Rather, approaches which exploit characteristics of data center network architectures (e.g.
fixed and regular topology; ownership and exclusive control by a single organization; well-known overlay encapsulation endpoints) are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane (particularly one based on BGP-EVPN).

6.  IANA Considerations

This memo includes no request to IANA.

7.  Security Considerations

No new security considerations result from this document.

8.  Acknowledgements

9.  References

9.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

9.2.  Informative References

[I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 (work in progress), January 2018.

[I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-06 (work in progress), March 2018.

[I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work in progress), April 2018.

[I-D.ietf-spring-segment-routing]  Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.

[RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997.

[RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999.
[RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002.

[RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006.

[RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006.

[RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007.

[RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015.

[RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016.

[RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016.
[RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017.

[RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018.

Authors' Addresses

Mike McBride
Huawei

Email: michael.mcbride@huawei.com

Olufemi Komolafe
Arista Networks

Email: femi@arista.com