Internet Working Group                            Ali Sajassi
Internet Draft                                    Samer Salam
Category: Standards Track                         Keyur Patel
                                                  Pradosh Mohapatra
                                                  Clarence Filsfils
                                                  Sami Boutros
                                                  Cisco

                                                  Nabil Bitar
                                                  Verizon

Expires: January 7, 2011                          July 7, 2010

                       Routed VPLS using BGP
                draft-sajassi-l2vpn-rvpls-bgp-01.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 7, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   VPLS, as currently defined, has challenges pertaining to the areas
   of redundancy and multicast optimization. In particular,
   multi-homing with all-active forwarding cannot be supported and
   there's no known solution to date for leveraging MP2MP MDTs for
   optimizing the delivery of multi-destination frames. This document
   defines an evolution of the current VPLS solution, referred to as
   Routed VPLS (R-VPLS), to address these shortcomings. In addition,
   this solution offers several benefits over current VPLS such as:
   ease of provisioning, per-flow load-balancing of traffic from/to
   multi-homed sites, optimum traffic forwarding to PEs with both
   single-homed and multi-homed sites, support for flexible
   multi-homing groups and fast convergence upon failures.

Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

Table of Contents

   1. Introduction
   2. Terminology
   3. Requirements
      3.1. All-Active Multi-homing
         3.1.1. Flow-based Load Balancing
         3.1.2. Flow-based Multi-pathing
         3.1.3. Geo-redundant PE Nodes
         3.1.4. Optimal Traffic Forwarding
         3.1.5. Flexible Redundancy Grouping Support
      3.2. Multi-homed Network
      3.3. Multicast Optimization with MP2MP MDT
      3.4. Ease of Provisioning Requirements
      3.5. New Service Interface Requirements
      3.6. Fast Convergence
      3.7. Flood Suppression
   4. VPLS Issues
      4.1. Forwarding Loops
      4.2. Duplicate Frame Delivery
      4.3. MAC Forwarding Table Instability
      4.4. Identifying Source PE in MP2MP MDT
   5. Solution Overview: Routed VPLS (R-VPLS)
      5.1. MAC Learning & Forwarding in Bridge Module
      5.2. MAC Address Distribution in BGP
   6. BGP Encoding
      6.1. R-VPLS MAC NLRI
      6.2. R-VPLS RG NLRI
      6.3. R-VPLS MH-ID NLRI
      6.4. BGP Route Targets
         6.4.1. VPN-RT
         6.4.2. RG-RT
         6.4.3. MH-RT
   7. Operation
      7.1. Auto-discovery
      7.2. Setup of Multicast Tunnels
      7.3. Host MACs Distribution over Core
      7.4. Device Multi-homing
         7.4.1. Special Considerations for Multi-homing
         7.4.2. Multi-homed Site Topology Discovery
         7.4.3. Dynamic Assignment of Site-ID Label
         7.4.4. Load-balancing
         7.4.5. Auto-Derivation of MH-ID RTs
         7.4.6. Site-ID Label for Single-Homed Sites
         7.4.7. LACP State Synchronization
      7.5. Frame Forwarding over MPLS Core
         7.5.1. Unicast
         7.5.2. Multicast/Broadcast
      7.6. MPLS Forwarding at Disposition PE
   8. Acknowledgements
   9. Security Considerations
   10. IANA Considerations
   11. Intellectual Property Considerations
   12. Normative References
   13. Informative References
   14. Authors' Addresses

1. Introduction

   VPLS, as defined in [RFC4664][RFC4761][RFC4762], is a proven and
   widely deployed technology. However, the existing solution has a
   number of challenges when it comes to redundancy and multicast
   optimization.

   In the area of redundancy, current VPLS can only support
   multi-homing with an active/standby resiliency model, e.g., as
   described in [VPLS-BGP-MH]. Flexible multi-homing with all-active
   ACs cannot be supported without adding considerable complexity to
   the VPLS data-path.

   In the area of multicast optimization, [VPLS-MCAST] describes how
   LSM MDTs can be used in conjunction with VPLS. However, this
   solution is limited to P2MP MDTs, as there's no known solution to
   date for leveraging MP2MP MDTs with VPLS. The lack of MP2MP support
   can create scalability issues for certain applications.

   In the area of provisioning simplicity, current VPLS does offer a
   mechanism for single-sided provisioning by relying on BGP-based
   service auto-discovery [RFC4761][L2VPN-Sig]. This, however, still
   requires the operator to configure a number of network-side
   parameters on top of the access-side Ethernet configuration.

   Furthermore, data center interconnect applications are driving the
   need for a new service interface type which is a hybrid combination
   of port-based and VLAN-based service interfaces. This is referred to
   as the 'VLAN-aware Port-Based' service interface.

   This document defines an evolution of the current VPLS solution, to
   address the aforementioned shortcomings. The proposed solution is
   referred to as Routed VPLS (R-VPLS).

   Section 2 provides a summary of the terminology used. Section 3
   discusses the requirements for all-active resiliency and multicast
   optimization. Section 4 describes the issues associated with the
   current VPLS solution in addressing the requirements. Section 5
   offers an overview of R-VPLS and then Section 6 goes into the
   details of its components.

2. Terminology

   CE:      Customer Edge
   DHD:     Dual-homed Device
   DHN:     Dual-homed Network
   LACP:    Link Aggregation Control Protocol
   LSM:     Label Switched Multicast
   MDT:     Multicast Delivery Tree
   MP2MP:   Multipoint to Multipoint
   P2MP:    Point to Multipoint
   P2P:     Point to Point
   PE:      Provider Edge
   PoA:     Point of Attachment
   PW:      Pseudowire
   R-VPLS:  Routed VPLS

3. Requirements

   This section describes the requirements for all-active multi-homing,
   MP2MP MDT support, ease of provisioning and the new service
   interface type.

3.1. All-Active Multi-homing

3.1.1. Flow-based Load Balancing

   A customer network or a customer device can be multi-homed to a
   provider network using the IEEE link aggregation standard [802.1AX].
   In [802.1AX], the load-balancing algorithms by which a CE
   distributes traffic over the Attachment Circuits connecting to the
   PEs are quite flexible. The only requirement is for the algorithm to
   ensure in-order frame delivery for a given traffic flow. In typical
   implementations, these algorithms involve selecting an outbound link
   within the bundle based on a hash function that identifies a flow
   based on one or more of the following fields:

     i)   Layer 2: Source MAC Address, Destination MAC Address, VLAN
     ii)  Layer 3: Source IP Address, Destination IP Address
     iii) Layer 4: UDP or TCP Source Port, Destination Port
     iv)  Combinations of the above.

   A key point to note here is that [802.1AX] does not define a
   standard load-balancing algorithm for Ethernet bundles, and as such
   different implementations behave differently. As a matter of fact, a
   bundle operates correctly even in the presence of asymmetric
   load-balancing over the links. This being the case, the first
   requirement for active/active VPLS dual-homing is the ability to
   accommodate flexible flow-based load-balancing from the CE node
   based on L2, L3 and/or L4 header fields.

3.1.2. Flow-based Multi-pathing

   [PWE3-FAT-PW] defines a mechanism that allows PE nodes to exploit
   equal-cost multi-paths (ECMPs) in the MPLS core network by
   identifying traffic flows within a PW, and associating these flows
   with a Flow Label. The flows can be classified based on any
   arbitrary combination of L2, L3 and/or L4 headers. Any active/active
   VPLS dual-homing mechanism should seamlessly interoperate and
   leverage the mechanisms defined in [PWE3-FAT-PW].

3.1.3. Geo-redundant PE Nodes

   The PE nodes offering dual-homed connectivity to a CE or access
   network may be situated in the same physical location (co-located),
   or may be spread geographically (e.g. in different COs or POPs).
   The latter is desirable when offering a geo-redundant solution that
   ensures business continuity for critical applications in the case of
   power outages, natural disasters, etc. An active/active VPLS
   dual-homing mechanism should support both co-located as well as
   geo-redundant PE placement. The latter scenario often means that
   requiring a dedicated link between the PEs, for the operation of the
   dual-homing mechanism, is not appealing from a cost standpoint.
   Furthermore, the IGP cost from remote PEs to the pair of PEs in the
   dual-homed setup cannot be assumed to be the same when those latter
   PEs are geo-redundant.

3.1.4. Optimal Traffic Forwarding

   In a typical network, and considering a designated pair of PEs, it
   is common to find both single-homed as well as multi-homed CEs being
   connected to those PEs. An active/active VPLS multi-homing solution
   should support optimal forwarding of unicast traffic for all the
   following scenarios:

     i)   single-homed CE to single-homed CE
     ii)  single-homed CE to dual-homed CE
     iii) dual-homed CE to single-homed CE
     iv)  dual-homed CE to dual-homed CE

   This is especially important in the case of geo-redundant PEs, where
   having traffic forwarded from one PE to another within the same
   multi-homed group introduces additional latency, on top of the
   inefficient use of the PE node's and core nodes' switching capacity.
   A multi-homed group (also known as a multi-chassis LACP group) is a
   group of PEs supporting a multi-homed CE.

3.1.5. Flexible Redundancy Grouping Support

   In order to simplify service provisioning and activation, the VPLS
   multi-homing mechanism should allow arbitrary grouping of PE nodes
   into redundancy groups where each redundancy group represents all
   multi-homed groups that share the same group of PEs.
   This is best explained with an example: consider three PE nodes -
   PE1, PE2 and PE3. The multi-homing mechanism must allow a given PE,
   say PE1, to be part of multiple redundancy groups concurrently. For
   example, there can be a group (PE1, PE2), a group (PE1, PE3), and
   another group (PE2, PE3) where CEs could be dual-homed to any one of
   these three redundancy groups.

3.2. Multi-homed Network

   Supporting all-active multi-homing of an Ethernet network (a.k.a.
   Multi-homed Network or MHN) to several VPLS PEs poses a number of
   challenges.

   First, some resiliency mechanism needs to be in place between the
   MHN and the PEs offering multi-homing, in order to prevent the
   formation of L2 forwarding loops. Two options are possible here:
   either the PEs participate in the control plane protocol of the MHN
   (e.g. MST or ITU-T G.8032), or some auxiliary mechanism needs to run
   between the CE nodes and the PEs. The latter must be complemented
   with an interworking function, at the CE, between the auxiliary
   mechanism and the MHN's native control protocol. However, unless the
   PEs participate directly in the control protocol of the MHN, fast
   control-plane re-convergence and fault recovery cannot be
   guaranteed. Secondly, all existing Ethernet network resiliency
   mechanisms operate at best at the granularity of VLANs. Hence, any
   load-balancing would be limited to L2 flows at best if not at the
   VLAN granularity level. Depending on the applications at hand, this
   coarse flow granularity may not have enough entropy to provide
   proper link/node utilization distribution within the provider's
   network. Thirdly, an open issue remains with the handling of MHN
   partitioning: the PEs need to reliably detect the situation where
   the MHN has been partitioned and each PE needs to handle
   inbound/outbound traffic for only those customers (or hosts)
   connected to the local partition.
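
   The granularity argument above can be illustrated with a short
   sketch. It is not part of the R-VPLS specification: the Flow record,
   the pick_link helper and the choice of CRC32 as the hash are
   assumptions made purely for illustration. Hashing on the VLAN alone
   pins every flow of that VLAN to one link, whereas hashing on L2, L3
   and L4 fields gives the hash more entropy to work with while still
   keeping each individual flow on a single link, which preserves frame
   ordering.

```python
import zlib
from collections import Counter
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """Header fields a load-balancing hash may use (hypothetical record)."""
    src_mac: str
    dst_mac: str
    vlan: int
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def pick_link(flow: Flow, fields: Sequence[str], n_links: int) -> int:
    """Map a flow to a link index by hashing only the selected fields."""
    key = "|".join(str(getattr(flow, name)) for name in fields).encode()
    return zlib.crc32(key) % n_links


def distribution(flows, fields: Sequence[str], n_links: int = 2) -> Counter:
    """Count how many flows land on each link for a given field set."""
    return Counter(pick_link(flow, fields, n_links) for flow in flows)


# 64 flows in the same VLAN that differ only in their L4 source port.
# MAC and IP values come from documentation ranges.
flows = [
    Flow("00:00:5e:00:53:01", "00:00:5e:00:53:02", 100,
         "192.0.2.1", "198.51.100.1", 49152 + i, 443)
    for i in range(64)
]

per_vlan = distribution(flows, ["vlan"])        # VLAN granularity
per_flow = distribution(flows, Flow._fields)    # L2 + L3 + L4 granularity

print("VLAN granularity:    ", dict(per_vlan))
print("L2/L3/L4 granularity:", dict(per_flow))
```

   With a single VLAN, the VLAN-only hash has one possible input, so
   only one link can ever carry traffic. The per-flow hash has 64
   distinct inputs to spread over the links.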

   As described above, all-active load balancing for L3 and L4 flows is
   not feasible for MHNs. Although all-active load balancing for L2
   flows is possible, it comes at the cost of requiring the locally
   attached PEs to perform local switching for a subset of the traffic
   within the MHN - e.g., using service provider resources to perform
   intra-site traffic forwarding and switching. Therefore, all-active
   load balancing for MHNs is not considered a requirement; however,
   what is considered a requirement for MHNs is for the PEs to
   auto-detect the resiliency protocol used in an MHN and to
   auto-provision themselves to perform load balancing at the VLAN
   granularity without participating in the MHN's resiliency protocol.

3.3. Multicast Optimization with MP2MP MDT

   In certain applications, multiple multicast sources may exist for a
   given VPLS instance, and these sources are dispersed over the
   various PEs. For these applications, relying on P2MP MDTs for VPLS
   can result in an increase in the number of states in the core
   relative to the use of MP2MP MDTs by a factor of O(n), where n is
   the average number of PEs per VPLS instance. In scenarios where the
   average number of PEs per VPLS instance is large, the use of an MDT
   rooted on every PE can result in two or more orders of magnitude
   more states in the core relative to the use of MP2MP MDTs. By using
   MP2MP MDTs, it is possible to scale multicast states in the core
   better by eliminating the above O(n) factor altogether. Therefore,
   the scalability of multicast is no longer a function of the number
   of sites or number of PEs.

3.4. Ease of Provisioning Requirements

   As L2VPN technologies expand into enterprise deployments, ease of
   provisioning becomes paramount. Even though current VPLS has
   auto-discovery mechanisms which allow for single-sided provisioning,
   further simplifications are required, as outlined below:

   - Single-sided provisioning behavior must be maintained.
   - For deployments where VLAN identifiers are global across the MPLS
     network (i.e. the network is limited to a maximum of 4K services),
     it is required that the devices derive the MPLS specific
     attributes (e.g. VPN ID, BGP RT, etc.) from the VLAN identifier.
     This way, it is sufficient for the network operator to configure
     the VLAN identifier(s) on the access circuit, and all the MPLS and
     BGP parameters required for setting up the service over the core
     network would be automatically derived without any need for
     explicit configuration.
   - Implementations should revert to using default values for
     parameters as and where applicable.

3.5. New Service Interface Requirements

   [MEF] and [IEEE 802.1Q] have the following services specified:

   - Port mode: in this mode, all traffic on the port is mapped to a
     single bridge domain and a single corresponding L2VPN service
     instance. Customer VLAN transparency is guaranteed end-to-end.

   - VLAN mode: in this mode, each VLAN on the port is mapped to a
     unique bridge domain and corresponding L2VPN service instance.
     This mode allows for service multiplexing over the port and
     supports optional VLAN translation.

   - VLAN bundling: in this mode, a group of VLANs on the port are
     collectively mapped to a unique bridge domain and corresponding
     L2VPN service instance. Customer MAC addresses must be unique
     across all VLANs mapped to the same service instance.

   For each of the above services a single bridge domain is assigned
   per service instance on the PE supporting the associated service.
   For example, in the case of the port mode, a single bridge domain is
   assigned for all the ports belonging to that service instance
   regardless of the number of VLANs coming through these ports.

   It is worth noting that the term 'bridge domain' as used above
   refers to a MAC forwarding table as defined in the IEEE bridge
   model, and does not denote or imply any specific implementation.

   [RFC4762] defines two types of VPLS services based on 'unqualified
   and qualified learning', which in turn map to port mode and VLAN
   mode respectively.

   R-VPLS is required to support the above three service types plus one
   additional service type, which is primarily intended for hosted data
   center applications and is described below.

   For hosted data center interconnect applications, network operators
   require the ability to extend Ethernet VLANs over a WAN using a
   single L2VPN instance while maintaining data-plane separation
   between the various VLANs associated with that instance. This gives
   rise to a new service interface type, which will be referred to as
   the 'VLAN-aware Port-based' service interface. The characteristics
   of this service interface are as follows:

   - The service interface must provide all-to-one bundling of customer
     VLANs into a single L2VPN service instance.
   - The service interface must guarantee customer VLAN transparency
     end-to-end.
   - The service interface must maintain data-plane separation between
     the customer VLANs (i.e. create a dedicated bridge-domain per
     VLAN).
   - The service interface must not assume any a priori knowledge of
     the customer VLANs. In other words, the customer VLANs shall not
     be configured on the PE, rather the interface is configured just
     like a port-based service.

   Since this is a port-based service, customer VLAN translation is not
   allowed over this service interface.
   If VLAN translation is required, then VLAN-based service MUST be
   used.

   The main difference, in terms of service provider resource
   allocation, between this new service type and the previously defined
   three types is that the new service requires several bridge domains
   to be allocated (one per customer VLAN) per L2VPN service instance
   as opposed to a single bridge domain per L2VPN service instance.

3.6. Fast Convergence

   A key driver for multi-homing is providing protection against node
   as well as link and port failures. The R-VPLS solution should ensure
   fast convergence upon the occurrence of these failures, in order to
   minimize the disruption of traffic flowing from/to a multi-homed
   site. Here, two cases need to be distinguished depending on whether
   a device or a network is being multi-homed. This is primarily
   because a different set of convergence time characteristics can be
   guaranteed by the core network operator in each case.

   For the case of a multi-homed device with all-active forwarding, the
   convergence of site-to-core traffic upon attachment circuit or PE
   node failure is a function of how quickly the CE node can
   redistribute the traffic flows over the surviving member links of
   the multi-chassis Ethernet link aggregation group. For managed
   services, where the CE is owned by the Service Provider, the latter
   can offer convergence time guarantees for such failures, whereas for
   non-managed services the SP has no control over the CE's
   capabilities and cannot provide any guarantees. For a multi-homed
   device with all-active forwarding, the convergence of core-to-site
   traffic is a function of how quickly the protocol running between
   the PEs can detect and react to the topology change.
   The key requirement here is to have the convergence time be
   independent (to the extent possible) of the number of MAC addresses
   affected by the topology change, and the number of service instances
   emanating from the affected site. Given that all this is under the
   control of the core network operator, strict convergence time
   guarantees can be delivered by the operator.

   For the case of a multi-homed network, the convergence time of
   site-to-core traffic upon attachment circuit or PE node failures is
   a function of two components: first, how quickly the MHN's control
   protocol detects and reacts to the topology change (this may involve
   blocking/unblocking VLANs on ports as well as propagating MAC
   address flush indications); and second, the reaction time of the
   locally attached PE(s) in order to update their forwarding state as
   necessary. The first component is outside the control of the core
   network operators, therefore it is not possible for them to make any
   convergence time guarantees except under tightly controlled
   conditions. For a multi-homed network, the convergence time of
   core-to-site traffic upon failures is a function of the inter-PE
   protocol if the PEs don't participate in the MHN control protocol.
   Otherwise, the convergence time is a function of both the inter-PE
   protocol and the MHN's control protocol convergence time. In the
   latter scenario, again no guarantees can be made by the core
   operator as far as the convergence time is concerned except under
   tightly controlled conditions.

3.7. Flood Suppression

   The solution should allow the network operator to choose whether
   unknown unicast frames are to be dropped or to be flooded. This
   attribute needs to be configurable on a per service instance basis.
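
   As a rough illustration of the per-service-instance knob described
   above, the following sketch models the forwarding decision for a
   unicast frame. The ServiceInstance structure, its field names and
   the forward_unicast helper are hypothetical; the draft does not
   define a configuration model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ServiceInstance:
    """Per-service-instance state (hypothetical structure)."""
    name: str
    flood_unknown_unicast: bool = True   # operator-configurable attribute
    ports: List[str] = field(default_factory=list)
    mac_table: Dict[str, str] = field(default_factory=dict)  # MAC -> port


def forward_unicast(svc: ServiceInstance, dst_mac: str,
                    in_port: str) -> Tuple[str, List[str]]:
    """Return (action, egress ports) for a unicast frame."""
    out_port = svc.mac_table.get(dst_mac)
    if out_port is not None:
        return "forward", [out_port]
    if svc.flood_unknown_unicast:
        # Unknown destination: flood to every port except the ingress one.
        return "flood", [p for p in svc.ports if p != in_port]
    # Unknown destination and flooding disabled for this instance.
    return "drop", []


flooding = ServiceInstance("svc-a", True, ["ac1", "ac2", "ac3"])
suppressed = ServiceInstance("svc-b", False, ["ac1", "ac2", "ac3"],
                             {"00:00:5e:00:53:02": "ac2"})

print(forward_unicast(flooding, "00:00:5e:00:53:09", "ac1"))
print(forward_unicast(suppressed, "00:00:5e:00:53:09", "ac1"))
print(forward_unicast(suppressed, "00:00:5e:00:53:02", "ac1"))
```

   Keeping the flag on the service instance rather than on the port or
   the node mirrors the requirement that the behavior be selectable per
   service instance.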

   Furthermore, it is required to eliminate any unnecessary flooding of
   unicast traffic upon topology changes, especially in the case of a
   multi-homed site where the PEs have a priori knowledge of the backup
   paths for a given MAC address.

4. VPLS Issues

   This section describes issues associated with the current VPLS
   solution in meeting the above requirements. The current solution for
   VPLS, as defined in [RFC4761] and [RFC4762], relies on establishing
   a full-mesh of pseudowires among participating PEs, and data-plane
   learning for the purpose of building the MAC forwarding tables. This
   learning is performed on traffic received over both the attachment
   circuits as well as the pseudowires.

   Supporting an all-active multi-homing solution with current VPLS is
   subject to three fundamental problems: the formation of forwarding
   loops, duplicate delivery of flooded frames and MAC Forwarding Table
   instability. These problems will be described next in the context of
   the example network shown in Figure 1 below.

                       +--------------+
                       |              |
                       |              |
   +----+ AC1 +----+   |              |   +----+   +----+
   | CE1|-----|VPLS|   |              |   |VPLS|---| CE2|
   +----+\    | PE1|   |   IP/MPLS    |   | PE3|   +----+
          \   +----+   |   Network    |   +----+
           \           |              |
         AC2\ +----+   |              |
             \|VPLS|   |              |
              | PE2|   |              |
              +----+   |              |
                       +--------------+

              Figure 1: VPLS Multi-homed Network

   In the network of Figure 1, it is assumed that CE1 has both
   attachment circuits AC1 & AC2 active towards PE1 and PE2,
   respectively. This can be achieved, for example, by running a
   multi-chassis Ethernet link aggregation group from CE1 to the pair
   of PEs.

4.1. Forwarding Loops

   Consider the case where CE1 sends a unicast frame over AC1, destined
   to CE2.
   If PE1 doesn't have a forwarding entry in its MAC address table for
   CE2, it will flood the frame to all other PEs in the VPLS instance
   (namely PE3 & PE2) using either ingress replication over the
   full-mesh of pseudowires, or alternatively over an LSM tree
   [VPLS-MCAST]. When PE2 receives the flooded traffic, and assuming it
   doesn't know the destination port to CE2, it will flood the traffic
   over the ACs for the VFI in question, including AC2. Hence, a
   forwarding loop is created where CE1 receives its own traffic.

4.2. Duplicate Frame Delivery

   Examine the scenario where CE2 sends a multi-destination frame
   (unknown unicast, broadcast or multicast) to PE3. PE3 will then
   flood the frame to both PE1 & PE2, using either ingress replication
   over the pseudowire full-mesh or an LSM tree. Both PE1 and PE2 will
   receive copies of the frame, and both will forward the traffic on to
   CE1. The net result is that CE1 receives duplicate frames.

4.3. MAC Forwarding Table Instability

   Assume that both PE1 and PE2 have learnt that CE2 is reachable via
   PE3. Now, CE1 starts sending unicast traffic to CE2. Given that CE1
   has its ACs configured in an Ethernet link aggregation group, it
   will forward traffic over both ACs using some load-balancing
   technique as described in section 3.1 above. Both PE1 and PE2 will
   forward frames from CE1 to PE3. Consequently, PE3 will see the same
   MAC address for CE1 constantly moving between its pseudowire to PE1
   and its pseudowire to PE2. The MAC table entry for CE1 will keep
   flip-flopping indefinitely depending on traffic patterns. This MAC
   table instability on PE3 may lead to frame mis-ordering for traffic
   going from CE2 back to CE1.

   Shifting focus towards the requirement to support MP2MP MDT, the
   problem facing VPLS here is performing MAC learning over MP2MP MDT,
   as discussed next.

4.4. Identifying Source PE in MP2MP MDT

   In the solution described in [VPLS-MCAST], a PE must perform MAC
   learning on traffic received over an LSM MDT. To that end, the
   receiving PE must be able to identify the source PE transmitting the
   frame, in order to associate the MAC address with the P2P pseudowire
   leading back to the source. With P2MP MDT, the MDT label uniquely
   identifies the source PE. For inclusive trees, the MDT label also
   identifies the VFI; whereas, for aggregate inclusive trees, a second
   upstream-assigned label identifies the VFI.

   However, when it comes to MP2MP MDT, the MDT label identifies the
   root of the tree (which most likely is not the source PE), and the
   second label (if present) identifies the VFI. There is no known
   solution to date for dynamic label allocation among the VPLS PEs to
   identify the source PE since neither upstream nor downstream label
   assignment can work among the VPLS PEs.

   From the above, it should be clear that with the current VPLS
   solution it is not possible to support all-active multi-homing or
   MP2MP MDTs. In the sections that follow, we will explore a new
   solution that meets the requirements identified in section 3 and
   addresses the problems highlighted in this section.

5. Solution Overview: Routed VPLS (R-VPLS)

   R-VPLS follows a conceptually simple model where customer MAC
   addresses are treated as routable addresses over the MPLS core, and
   distributed using BGP. In a sense, the R-VPLS solution represents an
   evolution of VPLS where data-plane learning over pseudowires is
   replaced with control-plane based MAC distribution and learning over
   the MPLS core.

   MAC addresses are learnt in the data-plane over the access
   attachment circuits (ACs) using native Ethernet bridging
   capabilities as is the case in current VPLS.
MAC addresses learnt by 620 a PE over its ACs are advertised in BGP along with a downstream- 621 assigned MPLS label identifying the bridge-domain (this is analogous 622 to L3VPNs where the label identifies the VRF). The BGP route is 623 advertised to all other PEs in the same service instance. Remote PEs 624 receiving these BGP NLRIs install the advertised MAC addresses in 625 their forwarding tables with the associated MPLS/IP adjacency 626 information. When multiple PE nodes advertise the same MAC address 627 with the same BGP Local Preference, then the receiving PEs create 628 multiple adjacencies for the same MAC address. This allows for load- 629 balancing of Ethernet traffic among multiple disposition PEs when 630 the AC is part of a multi-chassis Link Aggregation Group. The 631 imposition PE can select one of the available adjacencies for 632 forwarding traffic based on any hashing of Layer 2, 3 or 4 fields. 633 Multicast and broadcast traffic can be forwarded using ingress 634 replication per current VPLS, or over a P2MP LSM tree leveraging the 635 model described in [VPLS-MCAST] or using a MP2MP MDT. The latter is 636 possible since no MAC address learning is performed for traffic 637 received from the core. Forwarding of unknown unicast traffic over 638 the MPLS/IP core is optional and if the default mode is set to not 639 forward it, it is still flooded over the local ACs per normal 640 bridging operations. 642 Auto-discovery in R-VPLS involves identifying the set of PEs 643 belonging to a given service instance and also discovering the set 644 of PEs that are connected to the same multi-homed site. After auto- 645 discovery is complete, an inclusive MP2MP MDT is set up per [MPLS- 646 MDT]. Optionally, a set of P2MP MDTs per [VPLS-MCAST] can be set up 647 or if ingress replication is required, a set of MP2P tunnels can be 648 used. 
The purpose of the MP2MP MDT, the set of P2MP MDTs, or the 649 set of MP2P tunnels is to transport customer multicast/ 650 broadcast frames and, optionally, customer unknown unicast frames. 651 No MAC address learning is needed for frames received over the 652 MDT(s) or the MP2P tunnels. 654 The mapping of customer Ethernet frames to a service instance for 655 qualified learning and unqualified learning is performed as in 656 VPLS. Furthermore, the setup of any additional MDT per user 657 multicast group or groups is also performed per [VPLS-MCAST]. 659 Figure 2 below shows the model of a PE participating in R-VPLS. The 660 modules in this figure will be used to explain the components of R- 661 VPLS. 663 MPLS Core 664 +-------------------------------+ 665 | +-----------+ | R-VPLS PE 666 | +---------| R-VPLS | | 667 | +----+ | Forwarder | | 668 | |BGP | +-----------+ | 669 | +----+ |... | | | Virtual Interfaces 670 | | +-----------+ | 671 | +---------| Bridge | | 672 | +-----------+ | 673 +-----------------|---|---|-----+ 674 AC1 AC2 ACn 676 CEs 678 Figure 2: R-VPLS PE Model 680 5.1. 681 MAC Learning & Forwarding in Bridge Module 683 The Bridge module within an R-VPLS PE performs basic bridging 684 operations as before and is responsible for: 686 i) Learning the source MAC address on all frames received over the 687 ACs, and dynamically building the bridge forwarding database. 689 ii) Forwarding known unicast frames to local ACs for local 690 destinations or the Virtual interface(s) for remote destinations. 692 iii) Flooding unknown unicast frames over the local ACs and 693 optionally over the Virtual interface(s). 695 iv) Flooding multicast/broadcast frames to the local ACs and to the 696 Virtual interface(s). 698 v) Informing the BGP module of all MAC addresses learnt over the 699 local ACs. Also informing the BGP module when a MAC entry ages out, 700 or is flushed due to a topology change. 702 vi) Enforcing the filtering rules described in section 7.3.
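The responsibilities listed above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the class BridgeModule, the port name "virtual", the install_remote hook and the bgp_events list are hypothetical names that do not appear in this document, and the filtering rules of item vi) are not modelled.

```python
from dataclasses import dataclass, field

VIRTUAL = "virtual"  # hypothetical name for the Virtual interface toward the R-VPLS Forwarder


def is_multi_destination(mac: str) -> bool:
    """True for broadcast/multicast MACs (group bit set in the first octet)."""
    return bool(int(mac.split(":")[0], 16) & 1)


@dataclass
class BridgeModule:
    acs: list                                        # local ACs, e.g. ["AC1", "AC2"]
    flood_unknown_to_core: bool = False              # item iii): core flooding is optional
    fdb: dict = field(default_factory=dict)          # MAC -> local AC or VIRTUAL
    bgp_events: list = field(default_factory=list)   # stand-in for calls into the BGP module

    def install_remote(self, mac: str) -> None:
        """Assumed hook: mark a MAC as reachable via the Virtual interface."""
        self.fdb[mac] = VIRTUAL

    def age_out(self, mac: str) -> None:
        """Item v): report aged-out or flushed local entries to the BGP module."""
        port = self.fdb.get(mac)
        if port is not None and port != VIRTUAL:
            del self.fdb[mac]
            self.bgp_events.append(("withdraw", mac))

    def receive_from_ac(self, ac: str, src: str, dst: str) -> list:
        """Process a frame received on a local AC; return its egress ports."""
        # Items i) and v): learn the source MAC and report new or moved entries.
        if self.fdb.get(src) != ac:
            self.fdb[src] = ac
            self.bgp_events.append(("advertise", src))
        others = [p for p in self.acs if p != ac]
        if is_multi_destination(dst):
            return others + [VIRTUAL]                # item iv)
        port = self.fdb.get(dst)
        if port is None:                             # item iii): unknown unicast
            return others + ([VIRTUAL] if self.flood_unknown_to_core else [])
        return [] if port == ac else [port]          # item ii): known unicast
```

In this model a frame is never sent back out of the AC it arrived on, and unknown unicast frames reach the Virtual interface only when flood_unknown_to_core is enabled.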
704 5.2. 705 MAC Address Distribution in BGP 707 The BGP module within an R-VPLS PE is responsible for two main 708 functions: 710 First, advertising all MAC addresses learnt over the local ACs (by 711 the Bridge module) to all remote PEs participating in the R-VPLS 712 instance in question. This is done using a new BGP NLRI as defined 713 in the next section. The BGP module should withdraw the advertised 714 NLRIs for MAC addresses as they age out, or when the bridge table is 715 flushed due to a topology change. Since no MAC address learning is 716 performed for traffic received from the MPLS core, these BGP NLRI 717 advertisements are used to build the forwarding entries for remote 718 MAC addresses reachable over the MPLS network. 720 This brings the discussion to the second function of the BGP module, 721 namely: programming entries in the forwarding table (in the R-VPLS 722 Forwarder module) using the information in the received BGP NLRIs. 723 These entries will be used for forwarding traffic over the MPLS core 724 to remotely reachable MAC addresses. Of course, the BGP module must 725 remove the forwarding entries corresponding to withdrawn NLRIs. Note 726 that these entries are not subject to timed aging (as they follow a 727 control-plane learning paradigm rather than data-plane learning). 729 6. 730 BGP Encoding 732 This section describes the new BGP Routes and Attributes that are 733 required for R-VPLS. Three new BGP Routes (NLRIs) are defined below 734 for the R-VPLS solution. All these R-VPLS NLRIs are carried in BGP 735 using BGP Multiprotocol Extensions [RFC4760] with the existing L2VPN 736 AFI but with different new SAFIs. 738 In order for two BGP speakers to exchange these NLRIs, they must use 739 BGP Capabilities Advertisement to ensure that they both are capable 740 of properly processing such NLRIs. 
This is done as specified in 741 [RFC4760], by using capability code 1 (multiprotocol BGP) with an 742 AFI of L2VPN and the corresponding SAFI for that NLRI. 744 6.1. 745 R-VPLS MAC NLRI 747 This Layer-2 BGP route is used for distribution of MAC addresses 748 over MPLS/IP network and has dual purposes: 750 1. For auto-discovery of member PEs in a given R-VPLS instance for 751 the purpose of setting up an MP2MP MDT, a set of P2MP MDTs, or 752 a set of MP2P tunnels among these PEs 753 2. For distribution of host MAC addresses to other remote PEs in a 754 given R-VPLS instance 756 +--------------------------------+ 757 | Length (1 octet) | 758 +--------------------------------+ 759 | MPLS MAC Label (nx3 octets) | 760 +--------------------------------+ 761 | RD (8 octets) | 762 +--------------------------------+ 763 | VLAN (2 octets) | 764 +--------------------------------+ 765 | MAC address (6 octets) | 766 +--------------------------------+ 768 Figure 1: R-VPLS MAC NLRI Format 770 Length: This field indicates the length in octets for this NLRI. 772 MPLS Label: This is a downstream assigned MPLS label that typically 773 identifies the R-VPLS instance on the downstream PE (this label can 774 be considered analogous to L3VPN label associated with a given VRF). 775 The downstream PE may assign more than one label per RFC 3107. If 776 this label is NULL, it means the VPN label (for this R-VPLS 777 instance) that was previously advertised as part of auto-discovery 778 MUST be used. If this label is not NULL, then it MUST be used by the 779 remote PEs for traffic forwarding destined to the associated MAC 780 address. 782 RD: This field is encoded as described in [RFC4364]. The RD MUST be 783 the RD of the R-VPLS instance that is advertising this NLRI. 785 VLAN: This field may be zero or may represent a valid VLAN ID 786 associated with the host MAC. 
If it is zero, then it means that 787 there is only one bridge domain per R-VPLS instance (the most 788 typical case) and the forwarding lookup on the egress PE should be 789 performed based on bridge-domain ID (derived from R-VPLS instance) 790 and MAC address. If this field is non-zero, then it means that there 791 can be multiple bridge domains per R-VPLS instance (for the new 792 VLAN-aware port-based service) and the forwarding lookup on the 793 egress PE should be performed based on bridge-domain ID (derived 794 from ) and MAC address. 796 MAC: This MAC address can be either a unicast or a broadcast MAC 797 address. If it is a unicast address, then it represents a host MAC 798 address being distributed for the purpose of control plane learning 799 via BGP. However, if it is a broadcast address, then it is used 800 during the auto-discovery phase of an R-VPLS instance so that an inclusive 801 MDT or a set of MP2P tunnels can be set up among participant PEs for 802 that R-VPLS instance. 804 A new SAFI, known as the R-VPLS-MAC SAFI and pending IANA assignment, will be 805 used for this NLRI. The NLRI field in the MP_REACH_NLRI/ 806 MP_UNREACH_NLRI attribute contains the R-VPLS MAC NLRI encoded as 807 specified above. 809 6.2. 810 R-VPLS RG NLRI 812 This Layer-2 BGP route is used for distribution of a common site ID 813 among member PEs of a redundancy group. For MHD scenarios, this 814 route is used for auto-discovery of member PEs connected to an MHD 815 and Designated Forwarder (DF) election among these PEs. 817 +--------------------------------+ 818 | Length (1 octet) | 819 +--------------------------------+ 820 | MPLS Label (3 octets) | 821 +--------------------------------+ 822 | RD (8 octets) | 823 +--------------------------------+ 824 | Site ID (10 octets) | 825 +--------------------------------+ 827 Figure 2: R-VPLS RG NLRI Format 829 Length: This field indicates the length in octets for this NLRI.
831 MPLS Label: This label identifies the site of origin and 832 is used for filtering purposes on egress PEs so that multi- 833 destination frames that are sourced by a site are not sent back to 834 the same site. This filtering action is commonly referred to as 835 split-horizon. When multi-destination frames are sent using a P2MP 836 MDT, this label is upstream-assigned. When multi-destination 837 frames are sent using ingress replication over a set of MP2P 838 tunnels, this label is downstream-assigned. When multi- 839 destination frames are sent using an MP2MP tunnel, this label 840 needs to be scoped uniquely within the MP2MP tunnel context. 842 RD: This field is encoded as described in [RFC4364]. The RD MUST be 843 the RD of the R-VPLS instance that is advertising this NLRI. 845 Site ID: This field uniquely represents a multi-homed site or a 846 device connected to a set of PEs. In MHD scenarios, this ID 847 consists of the CE's LAG system ID (MAC address), the CE's LAG 848 system priority, and the CE's LAG Aggregator Key. 850 A new SAFI, known as the R-VPLS-RG SAFI and pending IANA assignment, will be 851 used for this NLRI. The NLRI field in the MP_REACH_NLRI/ 852 MP_UNREACH_NLRI attribute contains the R-VPLS RG NLRI encoded as 853 specified above. 855 6.3. 856 R-VPLS MH-ID NLRI 858 This Layer-2 BGP route is used for distribution of a site ID to the 859 remote PEs that have VPNs participating in that site. This route is 860 primarily used by remote PEs for the creation of the path list for a 861 given site and load balancing of traffic destined to that site among 862 its member PEs.
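As an illustration of how a remote PE might use these site routes, the following Python sketch groups received MH-ID routes by Site ID into a path list and picks one member PE per flow. It is a sketch only: PathLists, add_mh_id_route, withdraw_mh_id_route, pick_pe and the CRC-32 flow hash are assumptions made for this example and are not defined by this document.

```python
import zlib
from collections import defaultdict


class PathLists:
    """Per-site list of PEs built from received MH-ID routes (hypothetical model)."""

    def __init__(self):
        self._sites = defaultdict(set)   # Site ID -> set of advertising PEs

    def add_mh_id_route(self, site_id: str, pe: str) -> None:
        self._sites[site_id].add(pe)

    def withdraw_mh_id_route(self, site_id: str, pe: str) -> None:
        self._sites[site_id].discard(pe)

    def path_list(self, site_id: str) -> list:
        # Sorted so that the order is stable for a given membership.
        return sorted(self._sites.get(site_id, ()))

    def pick_pe(self, site_id: str, flow_key: bytes):
        """Choose one member PE for a flow; returns None if the site has no members."""
        members = self.path_list(site_id)
        if not members:
            return None
        # Assumed hash: any stable hash over Layer 2, 3 or 4 fields would do.
        return members[zlib.crc32(flow_key) % len(members)]
```

Because the hash is stable, all frames of one flow go to the same member PE while the membership is unchanged, and a withdrawn MH-ID route removes that PE from later selections.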
864 +--------------------------------+ 865 | Length (1 octet) | 866 +--------------------------------+ 867 | MPLS Label (3 octets) | 868 +--------------------------------+ 869 | RD (8 octets) | 870 +--------------------------------+ 871 | Site ID (10 octets) | 872 +--------------------------------+ 874 Figure 3: R-VPLS MH-ID NLRI Format 876 MPLS Label: This is a downstream-assigned label that identifies the 877 Site ID (and subsequently the AC) on the disposition PE. This label 878 is used for forwarding of known unicast L2 frames in the disposition 879 PE when MPLS forwarding is used in lieu of MAC lookup. When MAC 880 lookup is used, this label MUST be set to NULL. 882 Length: This field indicates the length in octets for this NLRI. 884 RD: This field is encoded as described in [RFC4364]. The RD MUST be 885 the RD of the R-VPLS instance that is advertising this NLRI. 887 Site ID: This field uniquely represents a multi-homed site or a 888 device connected to a set of PEs. In MHD scenarios, this ID 889 consists of the CE's LAG system ID (MAC address), the CE's LAG 890 system priority, and the CE's LAG Aggregator Key. 892 A new SAFI, known as the R-VPLS-MH-ID SAFI and pending IANA assignment, will 893 be used for this NLRI. The NLRI field in the MP_REACH_NLRI/ 894 MP_UNREACH_NLRI attribute contains the R-VPLS MH-ID NLRI encoded as 895 specified above. 897 6.4. 898 BGP Route Targets 900 Each BGP R-VPLS NLRI will have one or more route-target extended 901 communities to associate an R-VPLS NLRI with a given VSI. These 902 route-targets control distribution of the R-VPLS NLRIs and thereby 903 control the formation of the overlay topology of the network 904 that constitutes a particular VPN. This document defines the 905 following route-targets for R-VPLS: 907 6.4.1. 908 VPN-RT 910 This RT includes all the PEs in a given R-VPLS service instance.
It 911 is used to distribute R-VPLS MAC NLRIs and it is analogous to the RT 912 used for a VPLS instance in [RFC4761] or [RFC4762]. 914 In data center applications where the network is limited to 915 supporting only 4K VLANs, this VPN-RT can be derived 916 automatically from the VLAN itself (e.g., the VLAN ID can be used as 917 the VPN ID). Such RT auto-derivation is applicable to both Port mode 918 and VLAN mode services. In the case of Port mode service, the default 919 VLAN for the port is used to derive the RT automatically, and in the case 920 of the VLAN mode service, the S-VLAN (service VLAN) is used to 921 derive the RT automatically. 923 6.4.2. 924 RG-RT 926 This RT is a transitive RT extended community and it includes all 927 the PEs in a given Redundancy Group, i.e. connected to the same 928 multi-homed site. It is used to distribute R-VPLS RG NLRIs. This RT 929 is derived automatically from the Site ID by encoding the 6-byte 930 system MAC address of the Site ID in this RT. In order to derive 931 this RT automatically, it is assumed that the system MAC address of 932 the CE is unique in the service provider network (e.g., the CE is a 933 managed CE or the customer does not modify the CE's system MAC 934 address). 936 Each RG-specific RT extended community is encoded as an 8-octet value 937 as follows: 939 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 940 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 941 | 0x44 | Sub-Type | RG-RT | 942 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 943 | RG-RT Cont'd | 944 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 946 6.4.3. 947 MH-RT 949 This RT is a transitive RT extended community and it includes all 950 the PEs whose R-VPLS service instances are part of the same multi- 951 homed site. It is used to distribute R-VPLS MH-ID NLRIs.
This RT is 952 derived automatically from the MH-ID by encoding the 6-byte system 953 MAC address of the MH-ID in this RT. For a given multi-homed site, 954 this RT and RG-RT correspond to the same Site ID; however, the 955 reason for having two different RTs is to have exact filtering and 956 to differentiate between filtering needed among member PEs of a 957 multi-homed site versus among member PEs of all R-VPLS instances 958 participating in a multi-homed site. The former is needed for DF 959 election in a multi-homed site, whereas the latter is needed for 960 load-balancing of the unicast traffic by the remote PEs toward the 961 multi-homed site. 963 In order to derive this RT automatically, it is assumed that the 964 system MAC address of the CE is unique in the service provider 965 network. 967 Each MH-ID-specific RT extended community is encoded as an 8-octet 968 value as follows: 970 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 971 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 972 | 0x48 | Sub-Type | MH-ID RT | 973 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 974 | MH-ID RT Cont'd | 975 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 977 7. 978 Operation 980 This section describes the detailed operation of R-VPLS. 982 7.1. 983 Auto-discovery 985 The PEs participating in a given R-VPLS instance need to 986 discover each other for the purpose of setting up the tree(s) or the 987 tunnels which will be used for the delivery of multi-destination 988 frames. To that end, every PE advertises, on a per R-VPLS instance 989 basis, an R-VPLS MAC NLRI as follows: 991 - MAC Address field is set to the broadcast MAC (FFFF.FFFF.FFFF) 992 - VLAN ID field is set to zero 993 - RD is set as described previously 994 - MPLS Label field is set to a downstream-assigned label which 995 uniquely identifies the R-VPLS service instance on the originating 996 PE.
This will be referred to as the VPN label. 998 The above NLRI is advertised along with the RT Extended Community 999 attribute corresponding to the R-VPLS service instance and the PMSI 1000 Tunnel attribute per [MCAST-BGP]. The default operation of R-VPLS is 1001 to use a unique MP2MP MDT per service instance. Therefore, in the 1002 PMSI Tunnel attribute, the Tunnel Type field is set to "mLDP MP2MP 1003 LSP" (value 7) and the MPLS Label field is set to zero. If there is 1004 a need to multiplex more than one R-VPLS instance over the same MDT, 1005 then a non-zero label value can be used in the PMSI Tunnel 1006 attribute. 1008 Optionally, the network operator may choose to use P2MP MDTs 1009 instead. If so, then the Tunnel Type field is set to "mLDP P2MP LSP" 1010 (value 2) and the MPLS label field is set as described above. 1012 If the MPLS network does not support LSM, then ingress replication 1013 is used instead. In this case, the PMSI Tunnel attribute would have 1014 the Tunnel Type field set to "Ingress Replication" (value 6) and the 1015 MPLS Label field is set to the same value as the MPLS Label field in 1016 the associated R-VPLS MAC NLRI. 1018 7.2. 1019 Setup of Multicast Tunnels 1021 In order to automate the setup of the default MP2MP MDT, the 1022 following procedure is to be followed: The first PE to come up in an 1023 R-VPLS instance advertises an R-VPLS MAC NLRI (as described in 1024 section 7.1) with the Tunnel-id field of the PMSI Tunnel attribute 1025 set to NULL. The BGP Route Reflector chooses a root (based on some 1026 policy) and re-advertises the NLRI with the PMSI Tunnel attribute 1027 modified to include the selected Tunnel-id. This advertisement is 1028 then sent to all PEs in the R-VPLS instance. 
To ensure that the 1029 original advertising PE receives the assigned Tunnel-ID, BGP Route 1030 Reflector shall modify its route advertisement procedure such that 1031 the Originator attribute shall be set to the router-id of the Route 1032 Reflector and the Next-hop attribute shall be set to the local 1033 address of the BGP session for such R-VPLS MAC NLRI announcements. 1034 Upon receiving the NLRIs with non-NULL Tunnel-id, the PEs initiate 1035 the setup of the MP2MP tunnel towards the root using the procedures 1036 in [MLDP]. 1038 If the PEs are configured to use the optional P2MP MDT instead of 1039 MP2MP MDT, then the PE itself sets the Tunnel-id field in the PMSI 1040 Tunnel attribute associated with the R-VPLS MAC NLRI described in 1041 section 7.1. 1043 7.3. 1044 Host MACs Distribution over Core 1046 Upon learning a host MAC in its bridge module, the PE advertises the 1047 newly learned MAC over MPLS core to other remote PEs using the MAC 1048 NLRI. If the MAC address is originated from a multi-homed site, then 1049 the MPLS label field in the MAC NLRI is set to NULL because the 1050 remote PEs know that they MUST use the MPLS label associated with 1051 the broadcast MAC, which is advertised during auto-discovery phase, 1052 as the VPN label. Furthermore, the MH-ID is set as part of a 1053 separate new MH-ID attribute for this MAC NLRI to indicate that this 1054 MAC is associated with that site ID. However, if the MAC address is 1055 originated from a single-homed site, then the MPLS label field in 1056 the MAC NLRI is set to the downstream assigned label representing 1057 the R-VPLS instance and the MH-ID is not set for that MAC NLRI 1058 (indicating to the remote PEs that this MAC is associated with this 1059 advertising PE only). 
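The label and MH-ID selection described above can be sketched in Python as follows. This is an illustrative sketch, not a wire encoding: MacAdvertisement, advertise_host_mac and resolve_label are hypothetical names, and Python's None stands in for the NULL label, whose numeric value this document does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MacAdvertisement:
    rd: str
    vlan: int
    mac: str
    label: Optional[int]   # None models the NULL label
    mh_id: Optional[str]   # carried in the MH-ID attribute when set


def advertise_host_mac(rd: str, mac: str, instance_label: int,
                       mh_id: Optional[str] = None, vlan: int = 0) -> MacAdvertisement:
    """Build the advertisement an originating PE would send for a learned MAC."""
    if mh_id is not None:
        # Multi-homed site: NULL label, MH-ID attribute identifies the site.
        return MacAdvertisement(rd, vlan, mac, None, mh_id)
    # Single-homed site: downstream-assigned instance label, no MH-ID.
    return MacAdvertisement(rd, vlan, mac, instance_label, None)


def resolve_label(adv: MacAdvertisement, vpn_label_from_discovery: int) -> int:
    """Label a remote PE uses: the advertised one, or the auto-discovery VPN label if NULL."""
    return adv.label if adv.label is not None else vpn_label_from_discovery
```

The resolve_label helper reflects the receiver-side rule that a NULL label means the VPN label learned from the broadcast MAC advertisement is used.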
1061 The MH-ID attribute is a new optionally transitive attribute of type 1062 [TBD] and is defined as: 1064 0 1 2 3 1065 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1066 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1067 | Type=1 | Length | | 1068 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1069 ~ ~ 1070 | One or More MH-ID 6 bytes Values | 1071 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1073 7.4. 1074 Device Multi-homing 1076 7.4.1. 1077 Special Considerations for Multi-homing 1079 In the case where a set of VPLS PEs offer flexible multi-homing for 1080 a number of CEs, special considerations are required to prevent the 1081 creation of forwarding loops and delivery of duplicate frames when 1082 forwarding multi-destination frames. 1084 Consider the example network shown in figure 3 below. In this 1085 network, it is assumed that the ACs from all CEs to their 1086 corresponding PEs are active and forwarding, i.e. all-active 1087 redundancy model. 1089 +-----+ 1090 +--------------+ | 1091 | +-----------+ PE1 | 1092 | | +----+ | 1093 | | CE1 / +-----+ 1094 | | \ 1095 | CE2 \ +-----+ 1096 | \ +---+ | 1097 | +--------+ | MPLS Core 1098 | +-----+ PE2 | 1099 | / | | 1100 +---- CE3 +-----+ 1101 \ 1102 \ +-----+ 1103 +----+ | 1104 | PE3 | 1105 | | 1106 +-----+ 1108 Figure 3: VPLS with Flexible Multi-homing 1110 Take, for instance, the scenario where CE1 transmits a broadcast 1111 frame toward PE1. PE1 will attempt to flood the frame over all its 1112 local ACs and to all remote PEs (PE2 and PE3) in the same VPLS 1113 instance. The R-VPLS solution ensures that these broadcast frames do 1114 not loop back to CE1 by way of PE2. The solution also ensures that 1115 CE2 and CE3 do not receive duplicates of the broadcast, via PE1/PE2 1116 and PE2/PE3, respectively. This is achieved by enforcing the 1117 following behavior: 1119 7.4.1.1. 
1120 Filtering Based on Site ID 1122 Every R-VPLS PE is configured with a Site ID on the AC connecting 1123 to a multi-homed CE per [VPLS-BGP-DH]. The PE forwarding a multi- 1124 destination frame tags the flooded traffic with the Site ID that 1125 identifies the originating site, so that traffic from a multi-homed 1126 CE is not re-forwarded back to that CE upon receipt from the MPLS 1127 core. This filtering action is commonly referred to as split- 1128 horizon. This tagging can be achieved by embedding a 'source label' 1129 as the end-of-stack label in the MPLS packets. The source label is 1130 set to the value of the MPLS label field in the RG NLRI for that 1131 site. This source label is matched against the Site-ID label of a 1132 given AC, for traffic received from the MPLS core. If there is a 1133 match, then traffic is filtered on that AC. If there's no match, 1134 then the traffic is allowed to egress that AC, as long as that AC is 1135 the Designated Forwarder for that site. 1137 7.4.1.2. 1138 Defining a Designated Forwarder 1140 A Designated Forwarder (DF) PE is elected for handling all multi- 1141 destination frames received from the MPLS core towards a given 1142 multi-homed device. Only the DF PE is allowed to forward traffic 1143 received from the MPLS core (over the multipoint LSP or full-mesh of 1144 PWs) towards a given MHD. The DF is elected dynamically using the 1145 procedures in [VPLS-BGP-DH]. There can be transient duplicate frames 1146 and loops. The DF election procedure to avoid transient duplicate 1147 frames and loops will be described in the future revision. 1149 7.4.2. 1150 Multi-homed Site Topology Discovery 1152 Given that one of the requirements of R-VPLS is ease of 1153 provisioning, the set of PEs connected to the same CE must discover 1154 each other automatically with minimal to no configuration. 
To that 1155 end, each PE extracts the following from the [802.1AX] LACPDUs 1156 transmitted by the CE on a given port: 1158 - CE LACP System Identifier comprised of 6 bytes MAC Address and 2 1159 bytes System Priority 1160 - CE LACP Port Key (2 bytes) 1162 The PE uses this information to construct the Site ID associated 1163 with the port, and advertises an R-VPLS RG NLRI for every unique 1164 Site ID. The NLRI is tagged with the RG-RT extended community 1165 discussed in section 6.4.2 above. Furthermore, the PE automatically 1166 enables the import of BGP routes tagged with said RT which is 1167 derived from the Site ID. This allows the PEs connected to the same 1168 CE to discover each other. 1170 As a PE discovers the other members of the RG, it starts building an 1171 ordered list based on PE identifiers (e.g. IP addresses). This list 1172 is used to select a DF and a backup DF (BDF) on a per group of VLAN 1173 basis. For example, the PE with the numerically highest (or lowest) 1174 identifier is considered the DF for a given group of VLANs for that 1175 site and the next PE in the list is considered the BDF. To that end, 1176 the range of VLANs associated with the CE must be partitioned into 1177 disjoint sets. The size of each set is a function of the total 1178 number of CE VLANs and the total number of PEs in the RG. The DF can 1179 employ any distribution function that achieves an even distribution 1180 of VLANs across RG members. The BDF takes over the VLAN set of any 1181 PE encountering either a node failure or a link/port failure causing 1182 that PE to be isolated from the multi-homed site. 1184 It should be noted that once all the PEs participating in a site 1185 have the same ordered list for that site, then VLAN groups can be 1186 assigned to each member of that list deterministically without any 1187 need to explicitly distribute VLAN IDs among the member PEs of that 1188 list. 
In other words, the DF election for a group of VLANs is a 1189 local matter and can be done deterministically. As an example, 1190 consider that the ordered list consists of m PEs (PE1, PE2, ..., 1191 PEm) and that there are n VLANs for that site (V0, V1, V2, ..., Vn-1). 1192 PE1 and PE2 can be the DF and the BDF, respectively, for all the 1193 VLANs Vi with (i mod m) = 0, for i = 0 to n-1. PE2 and PE3 can be 1194 the DF and the BDF, respectively, for all the VLANs Vi with 1195 (i mod m) = 1, and so on until the last PE in the ordered list is 1196 reached, where PEm and PE1 are the DF and the BDF, respectively, 1197 for all the VLANs Vi with (i mod m) = m-1. 1199 While the discovery of the multi-homed topology is in progress, 1200 different PEs may have inconsistent views of the network. This could 1201 lead to having duplicate packets temporarily delivered to the multi- 1202 homed CE. Procedures for preventing temporary packet duplication 1203 and/or loops will be covered in future revisions of this document. 1205 7.4.3. 1206 Dynamic Assignment of Site-ID Label 1208 In order to automate the assignment of the Site-ID label used as 1209 'source label' for the Site-ID split-horizon filtering, the 1210 following procedure is to be followed: During the multi-homed site 1211 topology discovery, the first PE to come up in a multi-homed site 1212 advertises an RG NLRI (as described in section 6.2) with the MPLS 1213 Label field set to NULL. The BGP Route Reflector chooses a label and 1214 re-advertises the RG NLRI with the MPLS Label field modified to 1215 include the selected value. This advertisement is then sent to all 1216 PEs in that multi-homed site. To ensure that the original 1217 advertising PE receives the assigned label, filtering based on the 1218 node-origin on the Route Reflector is disabled. Upon receiving the 1219 RG NLRIs with a non-NULL label, the PEs use that label as the source 1220 label for split-horizon filtering of that site.
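The following Python sketch shows one way the Route Reflector side of this procedure, and the resulting split-horizon check on an egress AC, might look. The class RouteReflector, the label pool starting at 1000, and the helper may_egress are assumptions made for this example, and None stands in for the NULL label.

```python
class RouteReflector:
    """Assigns one Site-ID label per multi-homed site (hypothetical model)."""

    def __init__(self, first_label: int = 1000):
        self._next_label = first_label   # assumed label pool
        self._labels = {}                # Site ID -> assigned Site-ID label

    def reflect_rg_route(self, site_id: str, label=None) -> int:
        """Return the label carried in the re-advertised RG NLRI for this site."""
        if site_id not in self._labels:
            if label is None:            # NULL label received: choose a value
                label = self._next_label
                self._next_label += 1
            self._labels[site_id] = label
        return self._labels[site_id]


def may_egress(source_label, ac_site_label, ac_is_df: bool) -> bool:
    """Split-horizon check for a multi-destination frame received from the core."""
    if source_label is not None and source_label == ac_site_label:
        return False      # frame originated at this AC's own site: filter it
    return ac_is_df       # otherwise only the Designated Forwarder may forward
```

Every PE of a site receives the same label from reflect_rg_route, so the comparison in may_egress needs only one label per site.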
1222 It should be noted that this procedure for dynamic assignment of 1223 the Site-ID label assigns only a single label per site (and not per site 1224 per PE), which simplifies the implementation of split-horizon 1225 filtering. Furthermore, it is independent of the multi-destination 1226 tunnel type and can be equally applied across all different tunnel 1227 types: MP2MP MDT, P2MP MDT, and MP2P ingress replication tunnels. 1229 7.4.4. 1230 Load-balancing 1232 Consider the case where a given CE is multi-homed to a set of PEs 1233 {PE1, PE2, ... PEn} over a multi-chassis LAG. For a specific MAC 1234 address M1, the CE may hash the active traffic flow to some PEi 1235 (1<=i<=n) in the set, and there could be a (possibly indefinite) 1236 lapse of time before any traffic from M1 is hashed to the other PEs 1237 in the set. In such a scenario, any remote PE in the same R-VPLS 1238 instance would have received an R-VPLS MAC NLRI for M1 only from 1239 PEi. However, it is desirable to be able to load-balance traffic 1240 from the remote PE (PEr) destined to M1 over the entire set of the 1241 multi-homed site PEs {PE1, PE2, ... PEn}. To facilitate that, R-VPLS 1242 makes use of site routes (MH-ID NLRIs) in addition to MAC routes 1243 (MAC NLRIs). All PEs that are connected to the same multi-homed CE 1244 advertise R-VPLS MH-ID NLRIs, with the CE's Site ID, to all PEs in 1245 the R-VPLS instances that said CE is part of. When any of the PEs in 1246 the RG learns a new MAC address for traffic coming from the CE, it 1247 advertises an R-VPLS MAC NLRI with the Next-Hop attribute set to the 1248 corresponding Site ID. The combination of the MAC route and site 1249 route advertisements allows all the remote PEs to build a BGP path- 1250 list comprising the set of PEs that have reachability to a given 1251 MAC address via a given multi-homed CE. The remote PEs use the Site 1252 ID in the Next-Hop attribute of the MAC NLRI to determine the list 1253 of member PEs for that site.
Furthermore, they retrieve the VPN label corresponding to the R-VPLS
instance on a given PE from the broadcast MAC NLRI previously
advertised as part of auto-discovery. From the combination of the
two, the remote PEs can create a list of label tuples corresponding
to the member PEs of that site for a given VPN:
{(Lt1, Lv1), (Lt2, Lv2), ... (Ltm, Lvm)}, where Lti and Lvi
represent the tunnel and the VPN labels, respectively, for PEi. The
remote PEs can use this path-list to perform flow-based
load-balancing for traffic destined to that given MAC address. This
works even if only a single PE within the RG learns a given MAC
address from the CE.

7.4.5.
Auto-Derivation of MH-ID RTs

The MH-ID NLRIs corresponding to a given multi-homed CE need to
reach any PE that participates in at least one of the R-VPLS
instances that said CE is part of. Therefore, the choice of the RT
Extended Community used to tag that NLRI must accommodate that. In
order to avoid any manual configuration of this RT, referred to as
MH-RT (section 6.4.3), the remote PEs need to automatically discover
its value from at least one of the PEs in the RG. This is done as
follows: Upon discovering all the connected CEs, a PE starts the
service auto-discovery procedures outlined in section 7.1 above. In
the MAC NLRI sent for discovery, the sending PE embeds the Site IDs
of all CEs that are part of the associated service instance in the
SNPA field of the Next-Hop attribute. When a remote PE receives the
MAC NLRI, it first derives the MH-RT extended communities based on
these Site IDs and then automatically starts importing MH-ID routes
tagged with these MH-RT extended community attributes.

7.4.6.
Site-ID Label for Single-Homed Sites

For a single-homed site, there should be no need to assign a site-ID
label; however, processing at the disposition PE is simpler if the
packet is encapsulated with a site-ID label with a NULL value. If a
site-ID label is not used and the packet is sourced from a
single-homed site and destined to a multi-homed site, then at the
disposition PE, a NULL label needs to be injected into the packet
for frames received over multicast MDT(s) so that the 'source label'
check can be performed on the egress AC. Furthermore, if ingress
replication is used and the use of the flow label is optional, then
it is difficult to identify the label that follows the VPN label,
that is, to discern between a flow label and a 'source label'.
Therefore, in order to avoid such complications on the disposition
PE, we mandate the use of a 'source label' with the value of NULL
for packets originating from single-homed sites.

7.4.7.
LACP State Synchronization

To support CE multi-homing with multi-chassis Ethernet bundles, the
R-VPLS PEs connected to a given CE should synchronize [802.1AX] LACP
state amongst each other. This includes at least the following
LACP-specific configuration parameters:

- System Identifier (MAC Address): uniquely identifies a LACP
  speaker.
- System Priority: determines which LACP speaker's port priorities
  are used in the Selection logic.
- Aggregator Identifier: uniquely identifies a bundle within a LACP
  speaker.
- Aggregator MAC Address: identifies the MAC address of the bundle.
- Aggregator Key: used to determine which ports can join an
  Aggregator.
- Port Number: uniquely identifies an interface within a LACP
  speaker.
- Port Key: determines the set of ports that can be bundled.
- Port Priority: determines a port's precedence level to join a
  bundle in case the number of eligible ports exceeds the maximum
  number of links allowed in a bundle.

The above information must be synchronized between the R-VPLS PEs
wishing to form a multi-chassis bundle with a given CE, in order for
the former to present a single LACP peer to that CE. This is
required for initial system bring-up and upon any configuration
change. Furthermore, the PEs must also synchronize operational
(run-time) data, in order for the LACP Selection logic
state-machines to execute. This operational data includes the
following LACP operational parameters, on a per-port basis:

- Partner System Identifier: the CE System MAC address.
- Partner System Priority: the CE LACP System Priority.
- Partner Port Number: CE's AC port number.
- Partner Port Priority: CE's AC Port Priority.
- Partner Key: CE's key for this AC.
- Partner State: CE's LACP State for the AC.
- Actor State: PE's LACP State for the AC.
- Port State: PE's AC port status.

The above state needs to be communicated between R-VPLS PEs forming
a multi-chassis bundle during initial LACP bring-up, upon any
configuration change, and upon the occurrence of a failure.

It should be noted that the above configuration and operational
state is localized in scope and is only relevant to PEs within a
given Redundancy Group, i.e., those which connect to the same
multi-homed CE over a given Ethernet bundle. Furthermore, the
communication of state changes upon failures must occur with minimal
latency, in order to minimize the switchover time and the consequent
service disruption. [PWE3-ICCP] defines a mechanism for
synchronizing LACP state, using LDP, which can be leveraged for
R-VPLS. The use of BGP for synchronization of LACP state is left for
further study.

7.5.
Frame Forwarding over MPLS Core

The VPLS Forwarder module is responsible for handling frame
transmission and reception over the MPLS core. The processing of the
frame differs depending on whether the destination is a unicast or
multicast/broadcast address. The two cases are discussed next.

7.5.1.
Unicast

For known unicast traffic, the VPLS Forwarder sends frames into the
MPLS core using the forwarding information received by BGP from
remote PEs. The frames are tagged with an LSP tunnel label and a VPN
label. If per-flow load-balancing over the MPLS core is required
between ingress and egress PEs, then a flow label is added after the
VPN label.

For unknown unicast traffic, an R-VPLS PE can optionally forward
these frames over the MPLS core; however, the default is not to
forward. If these frames are to be forwarded, then the same set of
options used for forwarding multicast/broadcast frames (as described
in the next section) is also used for forwarding these unknown
unicast frames.

7.5.2.
Multicast/Broadcast

For delivery of multi-destination frames (multicast and broadcast),
R-VPLS provides the flexibility of using a number of options:

Option 1: the R-VPLS Forwarder can perform ingress replication over
a set of MP2P tunnel LSPs.

Option 2: the R-VPLS Forwarder can use P2MP MDT per the procedures
defined in [VPLS-MCAST].

Option 3: the R-VPLS Forwarder can use MP2MP MDT per the procedures
described in section 6.4. This option is considered the default
mode.

7.6.
MPLS Forwarding at Disposition PE

The general assumption for forwarding frames to customer sites at
disposition PEs is that the packet received from the MPLS core is
terminated on the bridge module and a MAC lookup is performed to
forward the frame to the right AC.
This requires the MPLS encapsulation to carry the VPN label, which
in turn identifies the right VSI for forwarding the frame.

It is sometimes desirable to be able to forward L2 frames to the
right AC at the disposition PE without any MAC lookup (e.g., using
only MPLS forwarding). In such scenarios, the MPLS encapsulation
needs to carry a label associated with the egress AC. In vlan-mode
service, this AC label needs to be in addition to the VPN label.
Therefore, for consistency one may want to use both the AC and the
VPN labels for all types of services when doing MPLS forwarding at
the disposition PE. The VPN label is retrieved from the MAC route
during the auto-discovery phase and the Site label is retrieved from
the MH-ID route. From these labels, the remote PEs can create a list
of label tuples corresponding to the member PEs of that site for a
given VPN: {(Lt1, Lv1, Ls1), (Lt2, Lv2, Ls2), ... (Ltm, Lvm, Lsm)},
where Lti, Lvi, and Lsi represent the tunnel, the VPN, and the AC
labels, respectively, for PEi.

8.
Acknowledgements

The authors would like to acknowledge the valuable inputs received
from Pedro Marques and Robert Raszuk.

9.
Security Considerations

There are no additional security aspects beyond those of VPLS/H-VPLS
that need to be discussed here.

10.
IANA Considerations

This document requires IANA to assign a new SAFI value for L2VPN_MAC
SAFI.

11.
Intellectual Property Considerations

This document is being submitted for use in IETF standards
discussions.

12.
Normative References

[RFC4664] "Framework for Layer 2 Virtual Private Networks (L2VPNs)",
RFC4664, September 2006.

[RFC4761] "Virtual Private LAN Service (VPLS) Using BGP for
Auto-discovery and Signaling", RFC4761, January 2007.

[RFC4762] "Virtual Private LAN Service (VPLS) Using Label
Distribution Protocol (LDP) Signaling", RFC4762, January 2007.

[802.1AX] IEEE Std. 802.1AX-2008, "IEEE Standard for Local and
metropolitan area networks - Link Aggregation", IEEE Computer
Society, November 2008.

13.
Informative References

[VPLS-BGP-MH] Kothari et al., "BGP based Multi-homing in Virtual
Private LAN Service", draft-ietf-l2vpn-vpls-multihoming-00, work in
progress, November 2009.

[VPLS-MCAST] Aggarwal et al., "Multicast in VPLS",
draft-ietf-l2vpn-vpls-mcast-06.txt, work in progress, March 2010.

[PWE3-ICCP] Martini et al., "Inter-Chassis Communication Protocol
for L2VPN PE Redundancy", draft-ietf-pwe3-iccp-02.txt, work in
progress, October 2009.

[PWE3-FAT-PW] Bryant et al., "Flow Aware Transport of Pseudowires
over an MPLS PSN", draft-ietf-pwe3-fat-pw-03.txt, work in progress,
January 2010.

14.
Authors' Addresses

Ali Sajassi
Cisco
170 West Tasman Drive
San Jose, CA 95134, US
Email: sajassi@cisco.com

Samer Salam
Cisco
595 Burrard Street, Suite 2123
Vancouver, BC V7X 1J1, Canada
Email: ssalam@cisco.com

Keyur Patel
Cisco
170 West Tasman Drive
San Jose, CA 95134, US
Email: keyupate@cisco.com

Nabil Bitar
Verizon Communications
Email: nabil.n.bitar@verizon.com

Pradosh Mohapatra
Cisco
170 West Tasman Drive
San Jose, CA 95134, US
Email: pmohapat@cisco.com

Clarence Filsfils
Cisco
170 West Tasman Drive
San Jose, CA 95134, US
Email: cfilsfil@cisco.com

Sami Boutros
Cisco
170 West Tasman Drive
San Jose, CA 95134, US
Email: sboutros@cisco.com