idnits 2.17.00 (12 Aug 2021) /tmp/idnits64099/draft-merged-nvo3-vm-mobility-scheme-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. (A line matching the expected section header was found, but with an unexpected indentation: ' 8. Security Considerations' ) ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) (A line matching the expected section header was found, but with an unexpected indentation: ' 9. IANA Considerations' ) ** There are 2 instances of too long lines in the document, the longest one being 13 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 3, 2014) is 2780 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC7297' is defined on line 871, but no explicit reference was found in the text == Unused Reference: 'RFC1700' is defined on line 880, but no explicit reference was found in the text == Unused Reference: 'RFC2332' is defined on line 883, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 7297 -- Obsolete informational reference (is this intentional?): RFC 1700 (Obsoleted by RFC 3232) == Outdated reference: draft-ietf-l2vpn-evpn has been published as RFC 7432 Summary: 4 errors (**), 0 flaws (~~), 5 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 NVO3 Working Group Y. Rekhter 2 Internet Draft Juniper Networks 3 Intended status: Standards track L. Dunbar 4 Expires: April 2015 Huawei 5 R. Aggarwal 6 Arktan Inc 7 R. Shekhar 8 Juniper Networks 9 W. Henderickx 10 Alcatel-Lucent 11 L. Fang 12 Microsoft 13 A. Sajassi 14 Cisco 16 October 3, 2014 18 NVO3 VM Mobility Scheme 19 draft-merged-nvo3-vm-mobility-scheme-00.txt 21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. This document may not be modified, 28 and derivative works of it may not be created, except to publish it 29 as an RFC and to translate it into languages other than English. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF), its areas, and its working groups. 
Note that 33 other groups may also distribute working documents as Internet- 34 Drafts. 36 Internet-Drafts are draft documents valid for a maximum of six 37 months and may be updated, replaced, or obsoleted by other documents 38 at any time. It is inappropriate to use Internet-Drafts as 39 reference material or to cite them other than as "work in progress." 41 The list of current Internet-Drafts can be accessed at 42 http://www.ietf.org/ietf/1id-abstracts.txt 43 The list of Internet-Draft Shadow Directories can be accessed at 44 http://www.ietf.org/shadow.html 46 This Internet-Draft will expire on April 3, 2015. 48 Copyright Notice 50 Copyright (c) 2014 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with 58 respect to this document. Code Components extracted from this 59 document must include Simplified BSD License text as described in 60 Section 4.e of the Trust Legal Provisions and are provided without 61 warranty as described in the Simplified BSD License. 63 Abstract 65 This document describes the schemes to overcome the network-related 66 issues to achieve seamless Virtual Machine mobility in the data 67 center and between data centers. 69 Table of Contents 71 1. Introduction...................................................3 72 2. Conventions used in this document..............................4 73 3. Terminology....................................................4 74 4. Scheme to resolve VLAN-IDs usage in L2 access domains..........8 75 5. Layer 2 Extension.............................................10 76 5.1. Layer 2 Extension Problem................................10 77 5.2. 
NVA based Layer 2 Extension Solution.....................10 78 5.3. E-VPN based Layer 2 Extension Solution...................10 79 6. Optimal IP Routing............................................14 80 6.1. Preserving Policies......................................15 81 6.2. VM Default Gateway solutions.............................16 82 6.2.1. E-VPN based VM Default Gateway Solutions............16 83 6.2.1.1. E-VPN based VM Default Gateway Solution 1......17 84 6.2.1.2. E-VPN based VM Default Gateway Solution 2......18 85 6.2.2. Distributed Proxy Default Gateway Solution..........18 86 6.3. Triangular Routing.......................................19 87 6.3.1. NVA based Intra Data Center Triangular Routing Solution 88 ...........................................................19 89 6.3.2. E-VPN based Intra Data Center Triangular Routing 90 Solution...................................................20 91 7. Manageability Considerations..................................21 92 8. Security Considerations.......................................21 93 9. IANA Considerations...........................................22 94 10. Acknowledgements.............................................22 95 11. References...................................................22 96 11.1. Normative References....................................22 97 11.2. Informative References..................................22 99 1. Introduction 101 An important feature of data centers identified in [nvo3-problem] is 102 the support of Virtual Machine (VM) mobility within the data center 103 and between data centers. This document describes the schemes to 104 overcome the network-related issues to achieve seamless Virtual 105 Machine mobility in the data center and between data centers, where 106 seamless mobility is defined as the ability to move a VM from one 107 server in a data center to another server in the same or different 108 data center, while retaining the IP and MAC address of the VM. 
In 109 the context of this document the term mobility or a reference to 110 moving a VM should be considered to imply seamless mobility, unless 111 otherwise stated. 113 Note that in the scenario where a VM is moved between servers 114 located in different data centers, there are certain issues related 115 to the current state of the art of the Virtual Machine technology, 116 the bandwidth that may be available between the data centers, the 117 distance between the data centers, the ability to manage and operate 118 such VM mobility, storage-related issues (the moved VM has to have 119 access to the same virtual disk), etc. Discussion of these issues 120 is outside the scope of this document. 122 2. Conventions used in this document 124 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 125 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 126 document are to be interpreted as described in RFC-2119 [RFC2119]. 128 In this document, these words will appear with that interpretation 129 only when in ALL CAPS. Lower case uses of these words are not to be 130 interpreted as carrying RFC-2119 significance. 132 DC: Data Center 134 DCBR: Data Center Border Router 136 LAG: Link Aggregation Group 138 POD: Modular Performance Optimized Data Center. POD and Data Center 139 are used interchangeably in this document. 141 ToR: Top of Rack switch 143 VEPA: Virtual Ethernet Port Aggregator (IEEE802.1Qbg) 145 VN: Virtual Network 147 3. Terminology 149 In this document the term "Top of Rack Switch (ToR)" is used to 150 refer to a switch in a data center that is connected to the servers 151 that host VMs. A data center may have multiple ToRs. Some servers 152 may have embedded blade switches, some servers may have virtual 153 switches to interconnect the VMs, and some servers may not have any 154 embedded switches. 
When External Bridge Port Extenders (as defined 155 by 802.1BR) are used to connect the servers to the data center 156 network, the ToR switch is the Controlling Bridge. 158 Several data centers or PODs could be connected by a network. In 159 addition to providing interconnect among the data centers/PODs, such 160 a network could provide connectivity between the VMs hosted in these 161 data centers and the sites that contain hosts communicating with 162 such VMs. Each data center has one or more Data Center Border Router 163 (DCBR) that connects the data center to the network, and provides 164 (a) connectivity between VMs hosted in the data center and VMs 165 hosted in other data centers, and (b) connectivity between VMs 166 hosted in the data center and hosts communicating with these VMs. 168 The following figure illustrates the above: 170 __________ 171 ( ) 172 ( Data Center) 173 ( Interconnect )------------------------- 174 ( Network ) | 175 (__________) | 176 | | | 177 ---- ---- | 178 | | | 179 --------+--------------+--------------- ------------- 180 | | | Data | | | 181 | ------ ------ Center | | Data Center | 182 | | DCBR | | DCBR | /POD | | /POD | 183 | ------ ------ | ------------- 184 | | | | 185 | --- --- | 186 | ___|______|__ | 187 | ( ) | 188 | ( Data Center ) | 189 | ( Network ) | 190 | (___________) | 191 | | | | 192 | ---- ---- | 193 | | | | 194 | ------------ ----- | 195 | | ToR Switch | | ToR | | 196 | ------------ ----- | 197 | | | | 198 | | ---------- | ---------- | 199 | |--| Server | |--| Server | | 200 | | | vSwitch | | ---------- | 201 | | | ---- | | | 202 | | | | VM | | | ---------- | 203 | | | ----- | --| Server | | 204 | | | | VM | | ---------- | 205 | | | ----- | | 206 | | | | VM | | | 207 | | | ---- | | 208 | | ---------- | 209 | | | 210 | | ---------- | 211 | |--| Server | | 212 | | ---------- | 213 | | | 214 | | ---------- | 215 | --| Server | | 216 | ---------- | 217 | | 218 ---------------------------------------- 220 Figure 1: A 
Typical Data Center Network 222 The data centers/PODs and the network that interconnects them may be 223 either (a) under the same administrative control, or (b) controlled 224 by different administrations. 226 Consider a set of VMs that (as a matter of policy) are allowed to 227 communicate with each other, and a collection of devices that 228 interconnect these VMs. If communication among any VMs in that set 229 could be accomplished in such a way as to preserve MAC source and 230 destination addresses in the Ethernet header of the packets 231 exchanged among these VMs (as these packets traverse from their 232 sources to their destinations), we will refer to such a set of VMs as 233 a Layer 2 based Virtual Network (VN) or Closed User Group (L2-based 234 CUG). In this document, the Closed User Group and Virtual Network 235 (VN) are used interchangeably. 237 A given VM may be a member of more than one VN or L2-based VN. 239 In terms of IP address assignment this document assumes that all VMs 240 of a given L2-based VN have their IP addresses assigned out of a 241 single IP prefix. Thus, in the context of this document a single IP 242 subnet corresponds to a single L2-based VN. If a given VM is a 243 member of more than one L2-based VN, this VM would have multiple IP 244 addresses and multiple logical interfaces, one IP address and one 245 logical interface for each such VN. 247 A VM that is a member of a given L2-based VN may (as a matter of 248 policy) be allowed to communicate with VMs that belong to other L2- 249 based VNs, or with other hosts. Such communication involves IP 250 forwarding, and thus would result in changing MAC source and 251 destination addresses in the Ethernet header of the packets being 252 exchanged. 254 In this document the term "L2 physical domain" refers to a 255 collection of interconnected devices that perform forwarding based 256 on the information carried in the Ethernet header. 
A trivial L2 257 physical domain consists of just one non-virtualized server. In a 258 non-trivial L2 physical domain (domain that contains multiple 259 forwarding entities) forwarding could be provided by such layer 2 260 technologies as Spanning Tree Protocol (STP), VEPA (IEEE802.1Qbg), 261 etc. Note that any multi-chassis LAG cannot span more than one L2 262 physical domain. This document assumes that a layer 2 access domain 263 is an L2 physical domain. 265 A physical server connected to a given L2 physical domain may host 266 VMs that belong to different L2-based VNs (while each of these VNs 267 may span multiple L2 physical domains). If an L2 physical domain 268 contains servers that host VMs belonging to different L2-based VNs, 269 then enforcing L2-based VNs boundaries among these VMs within that 270 domain is accomplished by relying on Layer 2 mechanisms (e.g. 271 VLANs). 273 We say that an L2 physical domain contains a given VM (or that a 274 given VM is in a given L2 physical domain), if the server presently 275 hosting this VM is part of that domain, or the server is connected 276 to a ToR that is part of that domain. 278 We say that a given L2-based VN is present within a given data 279 center if one or more VMs that are part of that VN are presently 280 hosted by the servers located in that data center. 282 In the context of this document when we talk about VLAN-ID used by a 283 given VM, we refer to the VLAN-ID carried by the traffic that is 284 within the same L2 physical domain as the VM, and that is either 285 originated or destined to that VM - e.g., VLAN-ID only has local 286 significance within the L2 physical domain, unless it is stated 287 otherwise. 289 Some of the VM-mobility solutions described in this document are E- 290 VPN based. When using E-VPN in NVO3 environment, the NVE function is 291 on the PE node. NVE-PE is used to describe the E-VPN PE node that 292 supports the NVE function. 294 4. 
Scheme to resolve VLAN-IDs usage in L2 access domains 296 To support tens of thousands of virtual networks, the local VID 297 associated with client payload under each NVE has to be locally 298 significant. Therefore, the same L2-based VN MAY have either the 299 same or different VLAN-IDs under different NVEs. Thus when a given 300 VM moves from one non-trivial L2 physical domain to another, the 301 VLAN-ID of the traffic from/to the VM in the former may differ from 302 that in the latter, and thus cannot be assumed to stay the same. 304 For data frames that traverse the NVO3 underlay network, if the 305 ingress NVE simply adds an outer header to data frames 306 received from VMs and forwards the encapsulated data frames to the egress 307 NVE via the underlay network, the egress NVE cannot simply remove 308 the outer header and send the decapsulated data frames to the 309 attached VMs, as is done by TRILL. 311 It is possible that within a trivial L2 physical domain traffic 312 from/to VMs that are in this domain may not have VLAN-IDs at all. 314 If a given VM's Guest OS sends packets that carry VLAN-ID, then the 315 VLAN-ID used by the Guest OS may not change when the VM moves from 316 one L2 physical domain to another (this is irrespective of whether 317 L2 physical domains are trivial or non-trivial). In other words, the 318 VLAN-IDs used by a tagged VM network interface are part of the VM's 319 state and may not be changed when the VM moves from one L2 physical 320 domain to another. Therefore, it is necessary for an entity, most 321 likely the first switch (virtual or physical) to which the VM is 322 attached, to change the VLAN-ID from the value used by NVE to the 323 value expected by the VM (in contrast, a VLAN tag assigned by a 324 hypervisor for use with an untagged VM network interface can 325 change). 
If the L2 physical domain is extended to include VM tagged 326 interfaces, the hypervisor virtual switch, and the DC bridged 327 network, then special consideration described below is needed in 328 assignment of VLAN tags for the VMs, the L2 physical domain and 329 other domains into which the VM may move. 331 This document assumes that within a given non-trivial L2 physical 332 domain traffic from/to VMs that are in that domain, and belong to 333 different L2-based VNs MUST have different VLAN-IDs. 335 The above assumptions about VLAN-IDs are driven by (a) the 336 assumption that within a given L2 physical domain VLANs are used to 337 identify individual L2-based VNs, and (b) the need to overcome the 338 limitation on the number of different VLAN-IDs. 340 The NVA can assist the NVE with local VID assignment and with dynamic mapping 341 between local VIDs and global virtual network instances. An NVE needs to 342 free up a VLAN-ID when no VMs remain attached under that VLAN-ID. 343 Here is the detailed procedure: 344 . The NVE should get the specific VNID from the NVA for untagged data 345 frames arriving at each Virtual Access Point [VNo3- 346 framework 3.1.1] of an NVE. 348 Since local VLAN-IDs under each NVE are locally significant, 349 the ingress NVE should remove the local VLAN-ID attached to the 350 data frame, so that the egress NVE can always assign its own local 351 VLAN-ID to the data frame before sending the decapsulated data 352 frame to the attached VMs. 354 If, for whatever reason, it is necessary to have a local VLAN-ID 355 in the data frames before encapsulating the outer header (i.e. 356 EgressNVE-DA, IngressNVE-SA, VNID), the NVE should get the specific 357 local VLAN-ID from the NVA for those untagged data frames 358 coming to each Virtual Access Point. 360 . 
If the data frame is already tagged before reaching the NVE's 361 Virtual Access Point, the NVA can inform the first switch port 362 that is responsible for adding VLAN-ID to the untagged data 363 frames of the specific VLAN-ID to be inserted to data frames. 365 . If data frames from VMs are already tagged, the first port 366 facing the VMs has to be informed by the NVA of the new local 367 VLAN-ID to replace the VLAN-ID encoded in the data frames. 369 For data frames coming from the network side towards VMs (i.e. 370 inbound traffic towards VMs), the first switching port facing 371 VMs has to convert the VLAN-IDs encoded in the data frames to 372 the VLAN-IDs used by VMs. 374 5. Layer 2 Extension 376 5.1. Layer 2 Extension Problem 378 Consider a scenario where a VM that is a member of a given L2-based 379 VN moves from one server to another, and these two servers are in 380 different L2 physical domains, where these domains may be located in 381 the same or different data centers (or PODs). In order to enable 382 communication between this VM and other VMs of that L2-based VN, the 383 new L2 physical domain must become interconnected with the other L2 384 physical domain(s) that presently contain the rest of the VMs of 385 that VN, and the interconnect must not violate the L2-based VN 386 requirement to preserve source and destination MAC addresses in the 387 Ethernet header of the packets exchanged between this VM and other 388 members of that VN. 390 Moreover, if the previous L2 physical domain no longer contains any 391 VMs of that VN, the previous domain no longer needs to be 392 interconnected with the other L2 physical domain(s) that contain 393 the rest of the VMs of that VN. 395 Note that supporting VM mobility implies that the set of L2 physical 396 domains that contain VMs that belong to a given L2-based VN may 397 change over time (new domains added, old domains deleted). 399 We will refer to this as the "layer 2 extension problem". 
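The changing set of L2 physical domains can be illustrated with a small sketch. It is not taken from any NVO3 specification; the class name VnDomainTracker, its methods, and the VM, VN, and domain names are hypothetical and chosen only for illustration. The sketch tracks, per L2-based VN, which L2 physical domains currently contain at least one VM, and reports which domains join or leave the interconnect when a VM is placed or moved.

```python
from collections import defaultdict


class VnDomainTracker:
    """Tracks which L2 physical domains hold VMs of each L2-based VN.

    Hypothetical helper, for illustration only.
    """

    def __init__(self):
        # vn -> domain -> set of VM names currently hosted in that domain
        self._vms = defaultdict(lambda: defaultdict(set))
        # (vm, vn) -> domain; a VM may be a member of more than one VN
        self._location = {}

    def domains(self, vn):
        """Return the domains that currently contain a VM of this VN."""
        return {d for d, vms in self._vms[vn].items() if vms}

    def place(self, vm, vn, domain):
        """Place or move a VM; return (joined, left) domain sets for the VN."""
        before = self.domains(vn)
        old = self._location.get((vm, vn))
        if old is not None:
            self._vms[vn][old].discard(vm)
        self._vms[vn][domain].add(vm)
        self._location[(vm, vn)] = domain
        after = self.domains(vn)
        return after - before, before - after


tracker = VnDomainTracker()
tracker.place("vm1", "vn-a", "domain-1")
tracker.place("vm2", "vn-a", "domain-1")
# Moving vm1 adds domain-2 to the interconnect; domain-1 still hosts vm2.
joined, left = tracker.place("vm1", "vn-a", "domain-2")
# Moving the last VM out of domain-1 lets that domain drop out.
joined_last, left_last = tracker.place("vm2", "vn-a", "domain-2")
```

In this sketch a domain joins the interconnect only when it receives its first VM of the VN, and leaves only when its last VM of that VN has moved away.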
401 Note that the layer 2 extension problem is a special case of 402 maintaining connectivity in the presence of VM mobility, as the 403 former restricts communicating VMs to a single/common L2-based VN, 404 while the latter does not. 406 5.2. NVA based Layer 2 Extension Solution 408 Assume NVO3's NVA has at least the following information for each TS 409 (or VM): 410 . Inner Address: TS (host) Address family (IPv4/IPv6, MAC, 411 virtual network Identifier MPLS/VLAN, etc) 413 . Outer Address: The list of locally attached edges (NVEs); 414 normally one TS is attached to one edge, TS could also be 415 attached to 2 edges for redundancy (dual homing). One TS is 416 rarely attached to more than 2 edges, though it could be 417 possible; 419 . VN Context (VN ID and/or VN Name) 421 . Timer for NVEs to keep the entry when pushed down to or pulled 422 from NVEs. 424 . Optionally the list of interested remote edges (NVEs). This 425 information is for NVA to promptly update relevant edges (NVEs) 426 when there is any change to this TS' attachment to edges 427 (NVEs). However, this information doesn't have to be kept per 428 TS. It can be kept per VN. 430 NVA can offer services in a Push, Pull mode, or the combination of 431 the two. 433 In this solution, the NVEs are connected via underlay IP network. 434 For each VN, the NVA informs all the NVEs to which the VMs of the 435 given VN are attached. 437 When the last VM of a VN is moved out of a NVE, the NVA notifies the 438 NVE for it to remove its connectivity to the VN. When a VM of a 439 given VN is moved into a NVE for the first time (i.e. the NVE didn't 440 have any VMs belonging to this VN yet), the NVA will notify the NVE 441 for it to be connected to VN. 443 The term "NVE being connected to a VN" means that the NVE at least 444 has: 445 . the inner-outer address mapping information for all the VMs in 446 the VN or being able to pull the mapping from the NVA, 448 . 
the mapping of local VLAN-ID to the VNID used by overlay 449 header, and 451 . has the VN's default gateway IP/MAC address. 453 5.3. E-VPN based Layer 2 Extension Solution 455 This section describes a [E-VPN] based solution for the layer 2 456 extension problem, i.e. the L2 sites that contain VMs of a given L2- 457 based VN are interconnected together using E-VPN. Thus a given E- 458 VPN corresponds/associated with one or more L2-based VNs (e.g., 459 VLANs). An L2-based VN is associated with a single E-VPN Ethernet 460 Tag Identifier. 462 This section provides a brief overview of how E-VPN is used as the 463 solution for the "layer 2 extension problem". Details of E-VPN 464 operations can be found in [E-VPN]. 466 A single L2 site could be as large as the whole network within a 467 single POD or a data center, in which case the DCBRs of that 468 POD/data center, in addition to acting as IP routers for the L2- 469 based VNs present in the POD/data center, also act as PEs. In this 470 scenario E-VPN is used to handle VM migration between servers in 471 different POD/data centers and the PE nodes support the NVE 472 function. 474 A single L2 site could be as small as a single ToR with the servers 475 connected to it or virtual switch with VMs attached, in which case 476 the ToR or the virtual switch acts as a PE-NVE. In this scenario E- 477 VPN is used to handle VM migration between servers that are either 478 in the same or in different data centers. Note that even in this 479 scenario this document assumes that DCBRs, in addition to acting as 480 IP routers for the L2-based VNs present in their data center, also 481 participate in the E-VPN procedures, acting as BGP Route Reflectors 482 for the E-VPN routes originated by the ToRs acting as PE-NVEs. 
484 In the case where E-VPN is used to interconnect L2 sites in 485 different data centers, the network that interconnects DCBRs of 486 these data centers could provide either (a) only Ethernet or IP/MPLS 487 connectivity service among these DCBRs, or (b) may offer the E-VPN 488 service. In the former case DCBRs exchange E-VPN routes among 489 themselves relying only on the Ethernet or IP/MPLS connectivity 490 service provided by the network that interconnects these DCBRs. The 491 network does not directly participate in the exchange of these E-VPN 492 routes. In the latter case the routers at the edge of the network 493 may be either co-located with DCBRs, or may establish E-VPN peering 494 with DCBRs. Either way, in this case the network facilitates 495 exchange of E-VPN routes among DCBRs (as in this case DCBRs would 496 not need to exchange E-VPN routes directly with each other). 498 Please note that for the purpose of solving the layer 2 extension 499 problem the propagation scope of E-VPN routes for a given L2-based 500 VN is constrained by the scope of the PEs connected to the L2 sites 501 that presently contain VMs of that VN. This scope is controlled by 502 the Route Target of the E-VPN routes. Controlling propagation scope 503 could be further facilitated by using Route Target Constrain 504 [RFC4684]. 506 Use of E-VPN ensures that traffic among members of the same L2-based 507 VN is optimally forwarded, irrespective of whether members of that 508 VN are within the same or in different data centers/PODs. This 509 follows from the observation that E-VPN inherently enables 510 (disaggregated) forwarding at the granularity of the MAC address of 511 the VM. 513 Optimal forwarding among VMs of a given L2-based VN that are within 514 the same data center requires propagating VM MAC addresses, and 515 comes at the cost of disaggregated forwarding within a given data 516 center. 
However such disaggregated forwarding is not necessary 517 between data centers if a given L2-based VN spans multiple data 518 centers. For example when a given ToR acts as a PE-NVE, this ToR has 519 to maintain MAC advertisement routes only to the VMs within its own 520 data center (and furthermore, only to the VMs that belong to the L2- 521 based VNs whose site(s) are connected to that ToR), and then point a 522 "default" MAC route to one of the DCBRs of that data center. In 523 this scenario a DCBR of a given data center, when it receives MAC 524 advertisement routes from DCBR(s) in other data centers, does not 525 re-advertise these routes to the PE-NVEs within its own data center, 526 but just advertises a single "default" MAC advertisement route to 527 these PE-NVEs. 529 When a given VM moves to a new L2 site, if in the new site this VM 530 is the only VM from its L2-based VN, then the PE-NVE(s) connected to 531 the new site need to be provisioned with the E-VPN Instances (EVI) 532 of the E-VPN associated with this L2-based VN. Likewise, if after 533 the move the old site no longer has any VMs that are in the same L2- 534 based VN as the VM that moved, the PE-NVE(s) connected to the old 535 site need to be de-provisioned with the EVI of the E-VPN. 536 Procedures to accomplish this are outside the scope of this 537 document. 539 6. Optimal IP Routing 541 In the context of this document optimal IP routing, or just optimal 542 routing, in the presence of VM mobility could be partitioned into 543 two problems: 545 - Optimal routing of a VM's outbound traffic. This means that as a 546 given VM moves from one server to another, the VM's default 547 gateway should be in a close topological proximity to the ToR that 548 connects the server presently hosting that VM. Note that when we 549 talk about optimal routing of the VM's outbound traffic, we mean 550 traffic from that VM to the destinations that are outside of the 551 VM's L2-based VN. 
This document refers to this problem as the VM 552 default gateway problem. 553 - Optimal routing of VM's inbound traffic. This means that as a 554 given VM moves from one server to another, the (inbound) traffic 555 originated outside of the VM's L2-based VN, and destined to that 556 VM be routed via the router of the VM's L2-based VN that is in a 557 close topological proximity to the ToR that connects the server 558 presently hosting that VM, without first traversing some other 559 router of that L2-based VN (the router of the VM's L2-based VN may 560 be either DCBR or ToR itself). This is also known as avoiding 561 "triangular routing". This document refers to this problem as the 562 triangular routing problem. 564 In order to avoid the "triangular routing", routers in the Wide Area 565 Network have to be aware of which DCBRs can reach the designated VMs. 566 When VMs in a single VN are spread across many different DCBRs, all 567 individual VMs' addresses have to be visible to those routers, which 568 can dramatically increase the number of routes in those routers. 570 If a VN is spread across multiple DCBRs and all those DCBRs announce 571 the same IP prefix for the VN, there could be many issues, 572 including: 573 - Traffic could go to DCBR "A" when the target is under DCBR "B", and DCBR "A" 574 is connected to DCBR "B" via the WAN. 576 - If the majority of one VN's members are under DCBR "A" and the rest are 577 spread across X other DCBRs, will DCBR "A" have the same weight as 578 DCBR "B", "C", etc.? 580 If all those DCBRs announce the individual IP addresses that are directly 581 attached and those addresses are not segmented well, then all the VMs' IP 582 addresses have to be exposed to the WAN. So the overlay hides the VMs' IP 583 addresses from the core switches in one DC or one POD, but exposes them to the 584 WAN. There are more routers in the WAN than the number of core 585 switches in one DC/POD. 
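The trade-off between announcing one prefix per VN and announcing individual VM addresses can be made concrete with a small sketch. The addresses, the DCBR names, and the helper next_hops are made up for illustration; the sketch only assumes ordinary longest-prefix-match route selection, as a WAN router would perform it.

```python
import ipaddress


def next_hops(routes, dst):
    """Return the DCBRs advertising the longest prefix that covers dst."""
    dst = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if dst in net]
    if not matches:
        return set()
    longest = max(net.prefixlen for net, _ in matches)
    return {hop for net, hop in matches if net.prefixlen == longest}


# Hypothetical placement: VMs of one VN (192.0.2.0/24) behind two DCBRs.
vm_location = {"192.0.2.10": "DCBR-A", "192.0.2.20": "DCBR-B"}

# Case 1: every DCBR announces the same prefix for the VN.
vn_prefix = ipaddress.ip_network("192.0.2.0/24")
aggregate_routes = [(vn_prefix, "DCBR-A"), (vn_prefix, "DCBR-B")]

# Case 2: every DCBR announces the individual addresses attached to it.
host_routes = [
    (ipaddress.ip_network(ip + "/32"), dcbr) for ip, dcbr in vm_location.items()
]

# With the shared prefix, both DCBRs are equally good next hops for the VM
# behind DCBR-B, so traffic may enter via DCBR-A (triangular routing).
ambiguous = next_hops(aggregate_routes, "192.0.2.20")

# With host routes the next hop is exact, at the cost of one WAN route per VM.
exact = next_hops(host_routes, "192.0.2.20")
```

Here the shared prefix needs one route per DCBR regardless of how many VMs exist, while the host-route case needs one route per VM, which is the scaling concern raised above.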
587 The ability to deliver optimal routing (as defined above) in the 588 presence of stateful devices is outside the scope of this document. 590 6.1. Preserving Policies 592 Moving a VM from one L2 physical domain to another means (among other 593 things) that the NVE in the new domain that provides connectivity 594 between this VM and VMs in other L2 physical domains must be able to 595 implement the policies that control connectivity between this VM and 596 VMs in other L2 physical domains. In other words, the policies that 597 control connectivity between a given VM and its peers MUST NOT 598 change as the VM moves from one L2 physical domain to another. 599 Moreover, policies, if any, within the L2 physical domain that 600 contains a given VM MUST NOT preclude realization of the policies 601 that control connectivity between this VM and its peers. All of the 602 above is irrespective of whether the L2 physical domains are trivial 603 or not. 605 There could be policies guarding VMs across different VNs, with some 606 being enforced by Firewall, some enforced by NAT/AntiDDOS/IPS/IDS, 607 etc. This is less about NVE policies being maintained when VMs move 608 and more about dynamically changing the policies 609 associated with the "middleware" boxes attached to NVEs (if those 610 middle boxes are distributed). 612 6.2. VM Default Gateway solutions 614 As a VM moves to a new L2 site, the default gateway IP address of the 615 VM may not change. Further, while with cold VM mobility one may 616 assume that the VM's ARP/ND cache gets flushed once the VM moves to another 617 server, one cannot make such an assumption with hot VM mobility. 619 Thus the destination MAC address in the inter-VN/inter-subnet 620 traffic originated by that VM would not change as the VM moves to the 621 new site. Given that, how would NVE(s) connected to the new L2 site 622 be able to recognize inter-VN/inter-subnet traffic originated by 623 that VM? The following describes possible solutions. 
625 6.2.1. E-VPN based VM Default Gateway Solutions

627 The E-VPN based solutions assume that for inter-VN/inter-subnet
628 traffic between a VM and its peers outside of the VM's own data
629 center, one or more DCBRs of that data center act as fully functional
630 default gateways for that traffic.

632 Both of these solutions also assume that VLAN-aware VLAN bundling
633 mode of E-VPN is used as the default mode such that different L2-VNs
634 (different subnets) for the same tenant can be accommodated in a
635 single EVI. This facilitates provisioning since E-VPN related
636 provisioning (such as RT configuration) could be done on a
637 per-tenant basis as opposed to on a per-subnet (per L2-VN) basis. In
638 this default mode, VMs' MAC addresses are maintained on a per bridge
639 domain basis (per subnet) within the EVI; however, VMs' IP addresses
640 are maintained across all the subnets of that tenant in that EVI.
641 In scenarios where communication among VMs of different subnets
642 belonging to the same tenant is to be restricted based on some
643 policies, the VLAN mode of E-VPN should be used, with each
644 VLAN/subnet mapping to its own EVI, and E-VPN RT filtering can be
645 leveraged to enforce flexible policy-based communications among VMs
646 of different subnets for that tenant.

648 6.2.1.1. E-VPN based VM Default Gateway Solution 1

650 The first solution relies on the use of an anycast default gateway
651 IP address and an anycast default gateway MAC address.

653 If DCBRs act as PE-NVEs for an E-VPN corresponding to a given
654 L2-based VN, then these anycast addresses are configured on these
655 DCBRs. Likewise, if ToRs act as PE-NVEs, then these anycast
656 addresses are configured on these ToRs. All VMs of that L2-based VN
657 are (auto) configured with the (anycast) IP address of the default
658 gateway.
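
As an informal illustration of why anycast gateway addresses help
with hot VM mobility, the following minimal sketch models two PE-NVEs
configured with the same gateway IP and MAC. All names and address
values (PeNve, Vm, ANYCAST_GW_IP, ANYCAST_GW_MAC) are hypothetical and
are not defined by this document or by [E-VPN].

```python
from dataclasses import dataclass

# Hypothetical values, chosen only for illustration.
ANYCAST_GW_IP = "10.0.0.1"
ANYCAST_GW_MAC = "00:00:5e:00:53:01"


@dataclass
class PeNve:
    """A DCBR or ToR acting as a PE-NVE for an L2-based VN."""
    name: str
    gw_ip: str = ANYCAST_GW_IP
    gw_mac: str = ANYCAST_GW_MAC

    def arp_reply(self, target_ip):
        # Answer ARP/ND Requests for the anycast gateway IP locally.
        return self.gw_mac if target_ip == self.gw_ip else None

    def handle_frame(self, dst_mac):
        # Frames sent to the anycast gateway MAC get IP forwarding.
        return "ip-forward" if dst_mac == self.gw_mac else "l2-bridge"


@dataclass
class Vm:
    gw_ip: str
    arp_cache: dict

    def send_inter_subnet(self, nve):
        # Resolve the gateway only when the cache has no entry; a hot
        # move keeps the existing entry.
        if self.gw_ip not in self.arp_cache:
            self.arp_cache[self.gw_ip] = nve.arp_reply(self.gw_ip)
        return nve.handle_frame(self.arp_cache[self.gw_ip])


old_nve, new_nve = PeNve("ToR-1"), PeNve("ToR-2")
vm = Vm(gw_ip=ANYCAST_GW_IP, arp_cache={})
before_move = vm.send_inter_subnet(old_nve)
after_move = vm.send_inter_subnet(new_nve)  # ARP cache was not flushed
```

Because both PE-NVEs own the same gateway MAC in this sketch, the
stale ARP entry carried across the move still selects IP forwarding
at the new PE-NVE.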
660 DCBRs (or ToRs) acting as PE-NVEs use these anycast addresses as
661 follows:

663 - When a particular DCBR (or ToR) acting as a PE-NVE receives a
664 packet with the (anycast) default gateway MAC address, the DCBR (or
665 ToR) applies IP forwarding to the packet, and performs the NVE
666 function if the destination of the packet is attached to another NVE.

668 - When a particular DCBR (or ToR) acting as a PE-NVE receives an
669 ARP/ND Request for the default gateway (anycast) IP address, the
670 DCBR (or ToR) generates an ARP/ND Reply.

672 This ensures that a particular DCBR (or ToR), acting as a PE-NVE,
673 can always apply IP forwarding to the packets sent by a VM to the
674 (anycast) default gateway MAC address. It also ensures that such a
675 DCBR (or ToR) can respond to the ARP Request generated by a VM for
676 the default gateway (anycast) IP address.

678 DCBRs (or ToRs) acting as PE-NVEs must never use the anycast default
679 gateway MAC address as the source MAC address in the packets
680 originated by these DCBRs (or ToRs), and cannot use the anycast
681 default gateway IP address as the source IP address in the overlay
header.

683 Note that multiple L2-based VNs may share the same MAC address for
684 the purpose of using as the (anycast) MAC address of the default
685 gateway for these VNs.

687 If the default gateway functionality is not in NVEs (ToRs), then the
688 default gateway MAC/IP addresses need to be distributed using E-VPN
689 procedures. Note that with this approach when originating E-VPN MAC
690 advertisement routes for the MAC address of the default gateways of
691 a given L2-based VN, all these routes MUST indicate that this MAC
692 address belongs to the same Ethernet Segment Identifier (ESI).

694 6.2.1.2. E-VPN based VM Default Gateway Solution 2

696 The second solution does not require configuring the anycast default
697 gateway IP and MAC address on the PE-NVEs.
699 Each DCBR (or each ToR) that acts as a default gateway for a given
700 L2-based VN advertises in the E-VPN control plane its default
701 gateway IP and MAC address using the MAC advertisement route, and
702 indicates that such route is associated with the default gateway.
703 The MAC advertisement route MUST be advertised as per procedures in
704 [E-VPN]. The MAC address in such an advertisement MUST be set to the
705 default gateway MAC address of the DCBR (or ToR). The IP address in
706 such an advertisement MUST be set to the default gateway IP address
707 of the DCBR (or ToR). To indicate that such a route is associated
708 with a default gateway, the route MUST carry the Default Gateway
709 extended community [Default-Gateway].

711 Each PE-NVE that receives this route and imports it as per
712 procedures of [E-VPN] MUST create MAC forwarding state that enables
713 it to apply IP forwarding to the packets destined to the MAC address
714 carried in the route. The PE-NVE that receives this E-VPN route
715 follows procedures in Section 12 of [E-VPN] when replying to ARP/ND
716 Requests that it receives if such Requests are for the IP address in
717 the received E-VPN route.

719 6.2.2. Distributed Proxy Default Gateway Solution

721 In this solution, NVEs perform the function of the default gateway
722 for all the attached VMs. Those NVEs are called "Proxy Default
723 Gateways" in this document because those NVEs might not be the
724 Default Gateways explicitly configured on the attached VMs. Some of
725 those proxy default gateway NVEs might not have the complete
726 inter-subnet communications policies for the attached VNs.

728 In order to ensure that the destination MAC address in the
729 inter-VN/inter-subnet traffic originated by a VM does not change as
730 the VM moves to a different NVE, a pseudo MAC address is assigned to
731 all NVE-based Proxy Default Gateways.
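
The Proxy Default Gateway behaviour of this section can be sketched
roughly as follows. This is an illustrative model only: the class
ProxyGatewayNve, its fields, the pseudo MAC value and the dictionary
returned by forward() are assumptions made for the example, and the
lookup tables stand in for whatever the NVE learns from the NVA or via
ARP/ND discovery.

```python
from dataclasses import dataclass, field

# Hypothetical pseudo MAC shared by all NVE-based proxy gateways.
PSEUDO_GW_MAC = "00:00:5e:00:53:ff"


@dataclass
class ProxyGatewayNve:
    address: str
    # (source VN, destination VN) pairs whose policies this NVE holds
    policies: set = field(default_factory=set)
    # destination IP -> (destination VN, address of the NVE serving it)
    vm_location: dict = field(default_factory=dict)
    # source VN -> address at which the VN's real default gateway is reached
    real_gateway: dict = field(default_factory=dict)

    def arp_reply(self, vm_gateway_ip):
        # Any ARP/ND Request for a VM's default gateway IP is answered
        # with the pseudo MAC, whichever gateway IP the VM was given.
        return PSEUDO_GW_MAC

    def forward(self, dst_mac, src_vn, dst_ip):
        if dst_mac != PSEUDO_GW_MAC:
            return None  # not gateway traffic; ordinary L2 handling
        entry = self.vm_location.get(dst_ip)
        if entry is not None and (src_vn, entry[0]) in self.policies:
            dst_vn, target_nve = entry
            # Policies are present: route locally, encapsulate to the
            # target NVE with the destination VN identifier.
            return {"outer_dst": target_nve, "vn_id": dst_vn}
        # Policies are missing: hand the packet to the real default
        # gateway, keeping the source VN identifier.
        return {"outer_dst": self.real_gateway[src_vn], "vn_id": src_vn}


nve = ProxyGatewayNve(
    address="192.0.2.1",
    policies={("vn-10", "vn-20")},
    vm_location={"10.20.0.5": ("vn-20", "192.0.2.2"),
                 "10.30.0.7": ("vn-30", "192.0.2.3")},
    real_gateway={"vn-10": "192.0.2.254"},
)
routed_locally = nve.forward(PSEUDO_GW_MAC, "vn-10", "10.20.0.5")
deferred = nve.forward(PSEUDO_GW_MAC, "vn-10", "10.30.0.7")
```

In this sketch the first packet is encapsulated directly towards the
NVE serving the destination, while the second, for which no policy is
held, is sent on to the real default gateway of the source VN.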
733 When a particular NVE acting as a Proxy Default Gateway receives an
734 ARP/ND Request from the attached VMs for their default gateway IP
735 addresses, the NVE generates an ARP/ND Reply with the pseudo MAC
736 address.

738 When a particular NVE acting as a Proxy Default Gateway receives a
739 packet with the pseudo default gateway MAC address:

741 - if the NVE has all the needed policies for the source and
742 destination VNs, the NVE applies IP forwarding, i.e. forwards
743 the packet from the source VN to the destination VN, and applies the
744 NVE encapsulation function with the target NVE as destination address
745 and the destination VN identifier in the header,
746 - if the NVE doesn't have the needed policies from the source VN to
747 the destination VN, the NVE applies the NVE encapsulation function
748 with the host's real default gateway as destination address and the
749 source VN identifier in the header.

751 This solution assumes that the NVE-based proxy default gateways
752 get the mapping of hosts' default gateway IP <-> default
753 gateway MAC either from the corresponding NVA or via ARP/ND discovery.

755 6.3. Triangular Routing

757 The triangular routing solution could be partitioned into two
758 components: intra data center triangular routing solution, and inter
759 data center triangular routing solution. The former handles the
760 situation where communicating VMs are in the same data center. The
761 latter handles all other cases. This draft only describes the
762 solution for intra data center triangular routing.

764 6.3.1. NVA based Intra Data Center Triangular Routing Solution

766 To be added.

768 6.3.2. E-VPN based Intra Data Center Triangular Routing Solution

770 This solution assumes that as a PE-NVE originates MAC advertisement
771 routes, such routes, in addition to MAC addresses of the VMs, also
772 carry IP addresses of these VMs. Procedures by which a PE-NVE can
773 learn the IP address associated with a given MAC address are
774 specified in [E-VPN].
776 Consider a set of L2-based VNs, such that VMs of these VNs, as a
777 matter of policy, are allowed to communicate with each other. To
778 avoid triangular routing among such VMs that are in the same data
779 center, this document relies on the E-VPN procedures, as follows.

781 Procedures in this section assume that ToRs act as PE-NVEs, and are
782 also able to support IP forwarding functionality.

784 For a given set of L2-based VNs whose VMs are allowed to communicate
785 with each other, consider a set of E-VPN instances (EVIs) of the
786 E-VPNs associated with these VNs. We further restrict this set of EVIs
787 to only the EVIs that are within the same data center. To avoid
788 triangular routing among VMs within the same data center, E-VPN
789 routes originated by one of the EVIs within such set should be
790 imported by all other EVIs in that set, irrespective of whether
791 these other EVIs belong to the same E-VPN as the EVI that originates
792 the routes.

794 One possible way to accomplish this is to:

796 - for each set of L2-based VNs whose VMs are allowed to communicate
797 with each other, and for each data center that contains such VNs,
798 have a distinct RT (distinct RT per set, per data center),
799 - provision each EVI of the E-VPNs associated with these VNs to
800 import routes that carry this RT, and
801 - make the E-VPN routes originated by such EVIs carry this RT.
802 Note that these RTs are in addition to the RTs used to form
803 individual E-VPNs. Note also that what is described here is
804 conceptually similar to the notion of "extranets" in BGP/MPLS VPNs
805 [RFC4364].

807 When a PE-NVE imports an E-VPN route into a particular EVI, and this
808 route is associated with a VM that is not part of the L2-based VN
809 associated with the E-VPN of that EVI, the PE-NVE creates IP
810 forwarding state to forward traffic to the IP address present in the
811 NLRI of the route towards the Next Hop, as specified in the route.
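
The RT-based import and the resulting IP forwarding state can be
sketched as below. This is a simplified model under stated
assumptions, not an implementation of [E-VPN]: the MacIpRoute and Evi
classes, the RT strings and the addresses are all hypothetical, routes
are delivered by direct method calls rather than by BGP, and a newer
route simply overwrites the older one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MacIpRoute:
    """Stand-in for a MAC advertisement route carrying MAC and IP."""
    mac: str
    ip: str
    next_hop: str            # PE-NVE that originated the route
    origin_vn: str           # L2-based VN the VM belongs to
    route_targets: frozenset


@dataclass
class Evi:
    vn: str
    import_rts: set
    mac_table: dict = field(default_factory=dict)  # same-VN VMs: MAC -> next hop
    ip_table: dict = field(default_factory=dict)   # other-VN VMs: IP -> next hop

    def receive(self, route):
        if not (route.route_targets & self.import_rts):
            return  # no matching RT, so the route is not imported
        if route.origin_vn == self.vn:
            self.mac_table[route.mac] = route.next_hop
        else:
            # VM is outside this EVI's VN: create IP forwarding state
            # towards the Next Hop carried in the route.
            self.ip_table[route.ip] = route.next_hop


SET_RT = "rt-set1-dc1"  # hypothetical per-set, per-data-center RT
evi_on_tor1 = Evi(vn="vn-a", import_rts={"rt-vn-a", SET_RT})

vm_b = dict(mac="00:00:5e:00:53:0b", ip="10.2.0.11", origin_vn="vn-b",
            route_targets=frozenset({"rt-vn-b", SET_RT}))
evi_on_tor1.receive(MacIpRoute(next_hop="ToR-2", **vm_b))
hop_before_move = evi_on_tor1.ip_table["10.2.0.11"]

# VM-B moves; the PE-NVE now hosting it advertises the route.
evi_on_tor1.receive(MacIpRoute(next_hop="ToR-3", **vm_b))
hop_after_move = evi_on_tor1.ip_table["10.2.0.11"]
```

The sketch leaves out how competing advertisements for a moved VM are
ordered and withdrawn; those details are governed by the procedures
in [E-VPN].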
813 To illustrate how the above procedures avoid triangular routing,
814 consider the following example. Assume that a particular VM, VM-A,
815 is currently hosted by a server connected to a particular ToR-NVE,
816 ToR-1, and another VM, VM-B, is currently hosted by a server
817 connected to ToR-2 (NVE). Assume that VM-A and VM-B belong to
818 different L2-based VNs, and (as a matter of policy) VMs in these VNs
819 are allowed to communicate with each other. Now assume that VM-B
820 moves to another server, and this server is connected to ToR-3
821 (NVE). Assume that ToR-1, ToR-2, and ToR-3 are in the same data
822 center. While initially ToR-1 would forward data originated by VM-A
823 and destined to VM-B to ToR-2, after VM-B moves to the server
824 connected to ToR-3, using the procedures described above, ToR-1
825 would forward the data to ToR-3 (and not to ToR-2), thus avoiding
826 triangular routing.

828 Note that for the purpose of redistributing E-VPN routes among
829 multiple L2-based VNs, the above procedures limit the propagation
830 scope of routes to individual VMs to a single data center, and
831 furthermore, to only a subset of the PE-NVEs within that data center
832 - the PE-NVEs that have EVIs of the E-VPNs associated with the
833 L2-based VNs whose VMs are allowed to communicate with each other. As a
834 result, the control plane overhead needed to avoid triangular
835 routing within a data center is localized to these PE-NVEs.

837 7. Manageability Considerations

839 Several solutions described in this document depend on the presence
840 of an NVA in the data center.

842 8. Security Considerations

844 In addition to the security considerations described in
845 [nvo3-problem], it is clear that allowing VMs to migrate across data
846 centers will require more stringent security enforcement. The
847 traditional placement of security functions, e.g. firewalls, at data
848 center gateways is no longer enough.
VM mobility will require security
849 functions to enforce policies on east-west traffic among VMs.
850 When VMs move across data centers, the associated policies have to be
851 updated and enforced.

853 9. IANA Considerations

855 This document requires no IANA actions. RFC Editor: Please remove
856 this section before publication.

858 10. Acknowledgements

860 The authors would like to thank Adrian Farrel, David Black and Larry Kreeger for
861 their review and comments. The authors would also like to thank Ivan Pepelnjak for
862 his contributions to this document.

864 11. References

866 11.1. Normative References

868 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
869 Requirement Levels", BCP 14, RFC 2119, March 1997.

871 [RFC7297] Boucadair, M., "IP Connectivity Provisioning Profile",
872 RFC 7297, April 2014.

874 11.2. Informative References

876 [nvo3-problem] Narten, T., et al., "Overlays for Network
877 Virtualization", draft-ietf-nvo3-overlay-problem-statement-
878 04, July 2013.

880 [RFC1700] Reynolds, J., Postel, J., "ASSIGNED NUMBERS", RFC 1700,
881 October 1994.

883 [RFC2332] Luciani, J., et al., "NBMA Next Hop Resolution Protocol
884 (NHRP)", RFC 2332.

886 [RFC4364] Rosen, Rekhter, et al., "BGP/MPLS IP VPNs", RFC 4364,
887 February 2006.

889 [RFC4684] Marques, P., et al., "Constrained Route Distribution for
890 Border Gateway Protocol/MultiProtocol Label Switching
891 (BGP/MPLS) Internet Protocol (IP) Virtual Private Networks
892 (VPNs)", RFC 4684, November 2006.

894 [E-VPN] Aggarwal, R., et al., "BGP MPLS Based Ethernet VPN",
895 draft-ietf-l2vpn-evpn, work in progress.

897 [Default-Gateway] http://www.iana.org/assignments/bgp-extended-
898 communities

900 Authors' Addresses

902 Yakov Rekhter
903 Juniper Networks
904 1194 North Mathilda Ave.
905 Sunnyvale, CA 94089
906 Email: yakov@juniper.net

908 Linda Dunbar
909 Huawei Technologies
910 5340 Legacy Drive, Suite 175
911 Plano, TX 75024, USA
912 Email: ldunbar@huawei.com

914 Rahul Aggarwal
915 Arktan, Inc
916 Email: raggarwa_1@yahoo.com

918 Wim Henderickx
919 Alcatel-Lucent
920 Email: wim.henderickx@alcatel-lucent.com

922 Ravi Shekhar
923 Juniper Networks
924 1194 North Mathilda Ave.
925 Sunnyvale, CA 94089
926 Email: rshekhar@juniper.net

928 Luyuan Fang
929 Cisco Systems
930 111 Wood Avenue South
931 Iselin, NJ 08830
932 Email: lufang@microsoft.com

934 Ali Sajassi
935 Cisco Systems
936 Email: sajassi@cisco.com