Internet Engineering Task Force                               M. McBride
Internet-Draft                                                    H. Liu
Intended status: Informational                       Huawei Technologies
Expires: September 11, 2012                               March 10, 2012

                 Multicast in the Data Center Overview
                 draft-mcbride-armd-mcast-overview-01

Abstract

   There has been much interest in issues surrounding massive numbers
   of hosts in the data center.  ARMD discussions have included the
   issues of address resolution for non-ARP/ND multicast traffic in
   data centers with massive numbers of hosts.  This document provides
   a quick survey of multicast in the data center and should serve as
   an aid to further discussion of issues related to large amounts of
   multicast in the data center.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 11, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Multicast Applications in the Data Center
     2.1.  Client-Server Applications
     2.2.  Non Client-Server Multicast Applications
   3.  L2 Multicast Protocols in the Data Center
   4.  L3 Multicast solutions in the Data Center
   5.  Challenges of using multicast in the Data Center
   6.  Layer 3 / Layer 2 Topological Variations
   7.  Address Resolution
     7.1.  Solicited-node Multicast Addresses for IPv6 address
           resolution
     7.2.  Direct Mapping for Multicast address resolution
   8.  Acknowledgements
   9.  IANA Considerations
   10. Security Considerations
   11. Informative References
   Authors' Addresses

1.  Introduction

   Data center servers often use IP multicast to send data to clients
   or other application servers.  IP multicast is expected to help
   conserve bandwidth in the data center and reduce the load on
   servers.  Increased reliance on multicast in next-generation data
   centers requires higher performance and capacity, especially from
   the switches.  If multicast is to continue to be used in the data
   center, it must scale well within and between data centers.
   There has been much interest in issues surrounding massive numbers
   of hosts in the data center.  ARMD discussions have included the
   issues of address resolution for non-ARP/ND multicast traffic in
   data centers.  This document provides a quick survey of multicast
   in the data center and should serve as an aid to further discussion
   of issues related to multicast in the data center.

   ARP/ND issues are not addressed in this document except to explain
   how address resolution occurs with multicast.  ARP/ND issues are
   addressed in [I-D.armd-problem-statement].

2.  Multicast Applications in the Data Center

   There are many data center operators who do not deploy multicast in
   their networks for scalability and stability reasons.  There are
   also many operators for whom multicast is critical and is enabled
   on their data center switches and routers.  For this latter group,
   there are several uses of multicast in their data centers.  An
   understanding of those uses is important in order to properly
   support these applications in ever-evolving data centers.  If, for
   instance, the majority of the applications are discovering/
   signaling each other using multicast, there may be better ways to
   support them than using multicast.  If, however, the multicasting
   of data is occurring in large volumes, there is a need for very
   good data center underlay/overlay multicast support.  The
   applications fall either into the category of those that leverage
   L2 multicast for discovery or into the category of those that
   require L3 support and likely span multiple subnets.

2.1.  Client-Server Applications

   IPTV servers use multicast to deliver content from the data center
   to end users.  IPTV is typically a one-to-many application where
   the hosts are configured for IGMPv3, the switches are configured
   with IGMP snooping, and the routers are running PIM-SSM mode.
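   As a rough sketch of the addressing such a deployment involves,
   PIM-SSM channels use the IPv4 source-specific range 232.0.0.0/8
   [RFC4607].  The helper below (its name is ours, not from any
   standard API) simply classifies a candidate group address:

```python
import ipaddress

# SSM group range for IPv4, per RFC 4607.  Helper name is illustrative.
SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")

def is_ssm_group(group: str) -> bool:
    """Return True if 'group' is an IPv4 source-specific multicast group."""
    addr = ipaddress.ip_address(group)
    return addr.is_multicast and addr in SSM_RANGE

print(is_ssm_group("232.1.1.1"))     # True: a valid (S,G) channel group
print(is_ssm_group("233.252.0.1"))   # False: example range, not SSM
```

   Receivers on such a channel would issue IGMPv3 source-specific
   joins for a particular (S,G) rather than a plain group join.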
Often 114 redundant servers are sending multicast streams into the network and 115 the network is forwarding the data across diverse paths. 117 Windows Media servers send multicast streaming to clients. Windows 118 Media Services streams to an IP multicast address and all clients 119 subscribe to the IP address to receive the same stream. This allows 120 a single stream to be played simultaneously by multiple clients and 121 thus reducing bandwidth utilization. 123 Market data relies extensively on IP multicast to deliver stock 124 quotes from the data center to a financial services provider and then 125 to the stock analysts. The most critical requirement of a multicast 126 trading floor is that it be highly available. The network must be 127 designed with no single point of failure and in a way the network can 128 respond in a deterministic manner to any failure. Typically 129 redundant servers (in a primary/backup or live live mode) are sending 130 multicast streams into the network and the network is forwarding the 131 data across diverse paths (when duplicate data is sent by multiple 132 servers). 134 With publish and subscribe servers a separate message is sent to each 135 subscriber of a publication. With multicast publish/subscribe, only 136 one message is sent, regardless of the number of subscribers. In a 137 publish/subscribe system, client applications, some of which are 138 publishers and some of which are subscribers, are connected to a 139 network of message brokers that receive publications on a number of 140 topics, and send the publications on to the subscribers for those 141 topics. The more subscribers there are in the publish/subscribe 142 system, the greater the improvement to network utilization there 143 might be with multicast. 145 2.2. Non Client-Server Multicast Applications 147 With load balancing protocols, such as VRRP, routers communicate 148 within themselves using a multicast address. 
   Overlays may use IP multicast to virtualize L2 multicasts.  VXLAN,
   for instance, is an encapsulation scheme to carry L2 frames over
   L3 networks.  The VXLAN Tunnel End Point (VTEP) encapsulates
   frames inside an L3 tunnel.  VXLANs are identified by a 24-bit
   VXLAN Network Identifier (VNI).  The VTEP maintains a table of
   known destination MAC addresses and stores, for each, the IP
   address of the tunnel to the remote VTEP to use.  Unicast frames
   between VMs are sent directly to the unicast L3 address of the
   remote VTEP.  Multicast frames are sent to a multicast IP group
   associated with the VNI.  Underlying IP multicast protocols
   (PIM-SM/SSM/BIDIR) are used to forward multicast data across the
   overlay.

   Applications such as Ganglia use multicast for distributed
   monitoring of computing systems such as clusters and grids.

   Windows Server cluster node exchange relies upon the use of
   multicast heartbeats between servers.  Only the other interfaces
   in the same multicast group use the data.  Unlike broadcast,
   multicast traffic does not need to be flooded throughout the
   network, reducing the chance that unnecessary CPU cycles are
   expended filtering traffic on nodes outside the cluster.  As the
   number of nodes increases, the ability to replace several unicast
   messages with a single multicast message improves node performance
   and decreases network bandwidth consumption.  Multicast messages
   replace unicast messages in two components of clustering:

   o  Heartbeats: The clustering failure detection engine is based on
      a scheme whereby nodes send heartbeat messages to other nodes.
      Specifically, for each network interface, a node sends a
      heartbeat message to all other nodes with interfaces on that
      network.  Heartbeat messages are sent every 1.2 seconds.
      In the common case where each node has an interface on each
      cluster network, there are N * (N - 1) unicast heartbeats sent
      per network every 1.2 seconds in an N-node cluster.  With
      multicast heartbeats, the message count drops to N multicast
      heartbeats per network every 1.2 seconds, because each node
      sends 1 message instead of N - 1.  This represents a reduction
      in processing cycles on the sending node and a reduction in
      network bandwidth consumed.

   o  Regroup: The clustering membership engine executes a regroup
      protocol during a membership view change.  The regroup protocol
      algorithm assumes the ability to broadcast messages to all
      cluster nodes.  To avoid unnecessary network flooding and to
      properly authenticate messages, the broadcast primitive is
      implemented by a sequence of unicast messages.  Converting the
      unicast messages to a single multicast message conserves
      processing power on the sending node and reduces network
      bandwidth consumption.

   Multicast addresses in the 224.0.0.x range are considered
   link-local multicast addresses.  They are used for protocol
   discovery and are flooded to every port.  For example, OSPF uses
   224.0.0.5 and 224.0.0.6 for neighbor and DR discovery.  These
   addresses are reserved and will not be constrained by IGMP
   snooping.  These addresses are not to be used by any application.

   These types of multicast applications should be able to be
   supported in data centers which support multicast.

3.  L2 Multicast Protocols in the Data Center

   The switches in between the servers and the routers rely upon IGMP
   snooping to bound the multicast to the ports leading to interested
   hosts and to L3 routers.  A switch will, by default, flood
   multicast traffic to all the ports in a broadcast domain (VLAN).
   IGMP snooping is designed to prevent hosts on a local network from
   receiving traffic for a multicast group they have not explicitly
   joined.
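   A toy model of the constrained forwarding just described, assuming
   a simple group-to-port table (real switches also track queriers,
   timers, and per-VLAN state):

```python
# Toy IGMP snooping table: frames for a group are delivered only to
# ports with listeners, plus the router port.  Illustrative only.
ALL_PORTS = {1, 2, 3, 4}
ROUTER_PORTS = {4}

members: dict[str, set[int]] = {}   # group -> ports with listeners

def join(group: str, port: int) -> None:
    """Record an IGMP join seen on 'port' for 'group'."""
    members.setdefault(group, set()).add(port)

def egress_ports(group: str, ingress: int) -> set[int]:
    """Ports to which a frame for 'group' arriving on 'ingress' goes."""
    if group not in members:
        return ALL_PORTS - {ingress}        # unknown group: flood the VLAN
    return (members[group] | ROUTER_PORTS) - {ingress}

join("233.252.0.1", 2)
print(egress_ports("233.252.0.1", 1))  # {2, 4}: listener plus router port
print(egress_ports("233.252.0.2", 1))  # {2, 3, 4}: unknown group, flooded
```

   The flood-on-unknown-group default in this sketch is the behavior
   the rest of this section describes eliminating.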
It 216 provides switches with a mechanism to prune multicast traffic from 217 links that do not contain a multicast listener (an IGMP client). 218 IGMP snooping is a L2 optimization for L3 IGMP. 220 IGMP snooping, with proxy reporting or report suppression, actively 221 filters IGMP packets in order to reduce load on the multicast router. 222 Joins and leaves heading upstream to the router are filtered so that 223 only the minimal quantity of information is sent. The switch is 224 trying to ensure the router only has a single entry for the group, 225 regardless of how many active listeners there are. If there are two 226 active listeners in a group and the first one leaves, then the switch 227 determines that the router does not need this information since it 228 does not affect the status of the group from the router's point of 229 view. However the next time there is a routine query from the router 230 the switch will forward the reply from the remaining host, to prevent 231 the router from believing there are no active listeners. It follows 232 that in active IGMP snooping, the router will generally only know 233 about the most recently joined member of the group. 235 In order for IGMP, and thus IGMP snooping, to function, a multicast 236 router must exist on the network and generate IGMP queries. The 237 tables (holding the member ports for each multicast group) created 238 for snooping are associated with the querier. Without a querier the 239 tables are not created and snooping will not work. Furthermore IGMP 240 general queries must be unconditionally forwarded by all switches 241 involved in IGMP snooping. Some IGMP snooping implementations 242 include full querier capability. Others are able to proxy and 243 retransmit queries from the multicast router. 245 In source-only networks, however, which presumably describes most 246 data center networks, there are no IGMP hosts on switch ports to 247 generate IGMP packets. 
   Switch ports are connected to multicast source ports and multicast
   router ports.  The switch typically learns about multicast groups
   from the multicast data stream by using a type of source-only
   learning (when it receives only multicast data on the port, with
   no IGMP packets).  The switch forwards traffic only to the
   multicast router ports.  When the switch receives traffic for new
   IP multicast groups, it will typically flood the packets to all
   ports in the same VLAN.  This unnecessary flooding can impact
   switch performance.

4.  L3 Multicast solutions in the Data Center

   There are three flavors of PIM used for multicast routing in the
   data center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR
   [RFC5015].  SSM provides the most efficient forwarding between
   sources and receivers and is most suitable for one-to-many types
   of multicast applications.  State is built for each (S,G) channel;
   therefore, the more sources and groups there are, the more state
   there is in the network.  BIDIR is the most efficient shared-tree
   solution, as one tree is built for all (S,G)s, thereby saving
   state.  But it does not provide the most efficient forwarding path
   between sources and receivers.  SSM and BIDIR are optimizations of
   PIM-SM.  PIM-SM is still the most widely deployed multicast
   routing protocol.  PIM-SM can also be the most complex.  PIM-SM
   relies upon an RP (Rendezvous Point) to set up the multicast tree;
   it then either switches to the SPT (shortest path tree), similar
   to SSM, or stays on the shared tree (similar to BIDIR).  For
   massive numbers of hosts sending (and receiving) multicast, the
   shared tree (particularly with PIM-BIDIR) provides the best
   potential scaling, since no matter how many multicast sources
   exist within a VLAN, the number of trees stays the same.  IGMP
   snooping, IGMP proxy, and PIM-BIDIR have the potential to scale to
   the huge numbers required in a data center.

5.  Challenges of using multicast in the Data Center

   When IGMP/MLD snooping is not implemented, Ethernet switches will
   flood multicast frames out of all switch ports, which turns the
   traffic into something more like broadcast.

   VRRP uses multicast heartbeats to communicate between routers.
   The communication between the host and the default gateway is
   unicast.  The multicast heartbeats can be very chatty when there
   are thousands of VRRP pairs with sub-second heartbeat calls back
   and forth.

   Link-local multicast should scale well within one IP subnet,
   particularly with a large layer-3 domain extending down to the
   access or aggregation switches.  But if multicast traverses beyond
   one IP subnet, which is necessary for an overlay like VXLAN, there
   could be scaling concerns.  If using a VXLAN overlay, it is
   necessary either to map the L2 multicast in the overlay to L3
   multicast in the underlay or to do head-end replication in the
   overlay and receive duplicate frames on the first link from the
   router to the core switch.  The solution could be to run
   potentially thousands of PIM messages to generate/maintain the
   required multicast state in the IP underlay.  The behavior of the
   upper layer, with respect to broadcast/multicast, affects the
   choice of head-end (*,G) or (S,G) replication in the underlay,
   which affects the opex and capex of the entire solution.  A VXLAN
   with thousands of logical groups maps to head-end replication in
   the hypervisor, or to IGMP from the hypervisor and then PIM
   between the TOR and core switches and the gateway router.

   Requiring IP multicast (especially PIM-BIDIR) from the network can
   prove challenging for data center operators, especially at the
   kind of scale that the VXLAN/NVGRE proposals require.  This is
   also true when the L2 topological domain is large and extended all
   the way to the L3 core.
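   One source of the state-versus-replication tradeoff above is how
   VNIs are assigned to underlay groups.  The sketch below uses a
   made-up modulo mapping (not from the VXLAN specification) to show
   how a small underlay group pool forces VNIs to share trees:

```python
import ipaddress

# Hypothetical pool of underlay groups, drawn from the RFC 5771
# example range purely for illustration.
BASE = int(ipaddress.ip_address("233.252.0.0"))
POOL_SIZE = 256

def underlay_group(vni: int) -> str:
    """Map a 24-bit VNI onto the (much smaller) underlay group pool."""
    return str(ipaddress.ip_address(BASE + (vni % POOL_SIZE)))

print(underlay_group(1))    # 233.252.0.1
print(underlay_group(257))  # 233.252.0.1 -- shares a tree with VNI 1
```

   With a pool this small, thousands of VNIs collide onto the same
   (*,G) trees, so VTEPs receive traffic for segments they do not
   host; growing the pool reduces that leakage but multiplies the PIM
   state the underlay must carry.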
   In data centers with highly virtualized servers, even small L2
   domains may spread across many server racks (i.e., multiple
   switches and router ports).

   It is not uncommon for there to be 10-20 VMs per server in a
   virtualized environment.  One vendor reported a customer
   requesting a scale of 400 VMs per server.  For multicast to be a
   viable solution in this environment, the network needs to be able
   to scale to these numbers when these VMs are sending/receiving
   multicast.

   A lot of switching/routing hardware has problems with IP
   multicast, particularly with regard to hardware support of
   PIM-BIDIR.

   Sending L2 multicast over a campus or data center backbone, in any
   sort of significant way, is a new challenge enabled for the first
   time by overlays.  There are interesting challenges when pushing
   large amounts of multicast traffic through a network, and they
   have thus far been dealt with using purpose-built networks.  While
   the overlay proposals have been careful not to impose new protocol
   requirements, they have not addressed the issues of performance
   and scalability, nor the large-scale availability of these
   protocols.

   There is an unnecessary multicast stream flooding problem in the
   link layer switches between the multicast source and the PIM First
   Hop Router (FHR).  An IGMP snooping switch will forward multicast
   streams to router ports, and the PIM FHR must receive all
   multicast streams even if there is no request from a receiver.
   This often wastes switch cache and link bandwidth when the
   multicast streams are not actually required.
   [I-D.pim-umf-problem-statement] details the problem and defines
   design goals for a generic mechanism to restrain the unnecessary
   multicast stream flooding.

6.  Layer 3 / Layer 2 Topological Variations

   As discussed in [I-D.armd-problem-statement], there are a variety
   of data center topological variations, including L3 to access
   switches, L3 to aggregation switches, and L3 in the core only.
   Further analysis is needed in order to understand how these
   variations affect IP multicast scalability.

7.  Address Resolution

7.1.  Solicited-node Multicast Addresses for IPv6 address resolution

   Solicited-node multicast addresses are used with IPv6 Neighbor
   Discovery to provide the same function as the Address Resolution
   Protocol (ARP) in IPv4.  ARP uses broadcasts to send ARP Requests,
   which are received by all end hosts on the local link.  Only the
   host being queried responds.  However, the other hosts still have
   to process and discard the request.  With IPv6, a host is required
   to join a solicited-node multicast group for each of its
   configured unicast or anycast addresses.  Because a solicited-node
   multicast address is a function of the last 24 bits of an IPv6
   unicast or anycast address, the number of hosts subscribed to each
   solicited-node multicast address would typically be one (there
   could be more, because the mapping function is not a 1:1 mapping).
   Compared to ARP in IPv4, a host should not need to be interrupted
   as often to service Neighbor Solicitation requests.

7.2.  Direct Mapping for Multicast address resolution

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With
   multicast address resolution, the mapping from a multicast IP
   address to a multicast MAC address is derived by direct mapping.
   In IPv4, the mapping is done by assigning the low-order 23 bits of
   the multicast IP address to fill the low-order 23 bits of the
   multicast MAC address.
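   The direct mappings in this section can be sketched as follows.
   The helper names are ours; the constants reflect the standard
   mapping rules (01-00-5e for IPv4 groups, ff02::1:ff00:0/104 for
   solicited-node groups), and the example addresses are illustrative
   only:

```python
import ipaddress

def ipv4_mcast_mac(group: str) -> str:
    """IPv4 group -> MAC: 01:00:5e plus the low-order 23 bits."""
    low23 = int(ipaddress.ip_address(group)) & 0x7FFFFF
    return "01:00:5e:{:02x}:{:02x}:{:02x}".format(
        low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)

def solicited_node(unicast: str) -> str:
    """IPv6 unicast/anycast -> ff02::1:ff plus its low-order 24 bits."""
    low24 = int(ipaddress.ip_address(unicast)) & 0xFFFFFF
    base = int(ipaddress.ip_address("ff02::1:ff00:0"))
    return str(ipaddress.ip_address(base + low24))

# Two IPv4 groups differing only in the 5 ignored bits share a MAC:
print(ipv4_mcast_mac("233.252.0.1"))  # 01:00:5e:7c:00:01
print(ipv4_mcast_mac("224.124.0.1"))  # 01:00:5e:7c:00:01 (same MAC)
print(solicited_node("2001:db8::2aa:ff:fe28:9c5a"))  # ff02::1:ff28:9c5a
```

   The collision in the IPv4 case is exactly the 32:1 ambiguity that
   the planning guidance below is meant to avoid.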
   When a host joins an IP multicast group, it instructs the data
   link layer to receive frames that match the MAC address that
   corresponds to the IP address of the multicast group.  The data
   link layer filters the frames and passes frames with matching
   destination addresses to the IP module.  Since the mapping from
   multicast IP address to MAC address ignores 5 bits of the IP
   address, groups of 32 multicast IP addresses are mapped to the
   same MAC address.  As a result, a multicast MAC address cannot be
   uniquely mapped back to a multicast IPv4 address.  Planning is
   required within an organization to select IPv4 groups that are far
   enough apart so as not to end up with the same L2 address.  Any
   multicast address in the 224.0.0.x through 239.0.0.x and
   224.128.0.x through 239.128.0.x ranges should not be considered.
   When sending IPv6 multicast packets on an Ethernet link, the
   corresponding destination MAC address is a direct mapping of the
   last 32 bits of the 128-bit IPv6 multicast address into the 48-bit
   MAC address.  It is possible for more than one IPv6 multicast
   address to map to the same 48-bit MAC address.

8.  Acknowledgements

   The authors would like to thank the many individuals who
   contributed opinions on the ARMD wg mailing list about this topic:
   Linda Dunbar, Anoop Ghanwani, Peter Ashwoodsmith, David Allan,
   Aldrin Isaac, Igor Gashinsky, Michael Smith, Patrick Frejborg,
   Joel Jaeggli, and Thomas Narten.

9.  IANA Considerations

   This memo includes no request to IANA.

10.  Security Considerations

   No security considerations at this time.

11.  Informative References

   [I-D.armd-problem-statement]
              Narten, T., Karir, M., and I. Foo,
              "draft-ietf-armd-problem-statement" (work in progress),
              February 2012.

   [I-D.pim-umf-problem-statement]
              Zhou, D., Deng, H., Shi, Y., Liu, H., and I.
              Bhattacharya, "draft-dizhou-pim-umf-problem-statement"
              (work in progress), October 2010.
   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601, August
              2006.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast
              for IP", RFC 4607, August 2006.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L.
              Vicisano, "Bidirectional Protocol Independent Multicast
              (BIDIR-PIM)", RFC 5015, October 2007.

Authors' Addresses

   Mike McBride
   Huawei Technologies
   2330 Central Expressway
   Santa Clara, CA 95050
   USA

   Email: michael.mcbride@huawei.com

   Helen Liu
   Huawei Technologies
   Building Q14, No. 156, Beiqing Rd.
   Beijing, 100095
   China

   Email: helen.liu@huawei.com