Network Working Group                                          W. Kumari
Internet-Draft                                                    Google
Intended status: Informational                                J. Halpern
Expires: February 12, 2012                                      Ericsson
                                                         August 11, 2011

               Virtual Machine Mobility in L3 Networks
                  draft-wkumari-dcops-l3-vmmobility-00

Abstract

   This document outlines how Virtual Machine mobility can be
   accomplished in datacenter networks that are based on L3
   technologies.  It is not really intended to solve (or fully define)
   the problem, but rather to outline it at a very high level to
   determine if standardization within the IETF makes sense.
Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 12, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Author Notes
   2.  Introduction
     2.1.  Requirements notation
   3.  Terminology
   4.  Overview
   5.  IANA Considerations
   6.  Security Considerations
   7.  Privacy
   8.  Acknowledgements
   9.  Normative References
   Authors' Addresses

1.  Author Notes

   [ RFC Editor -- Please remove this section before publication! ]

   1.  Fix terminology section!
   2.  Rejigger the Introduction into an Intro and a Background!
   3.  Do we need to extend the mapping service to include
       (Customer_ID)?  This will allow the use of overlapping addresses
       by customers, but *does* limit the encapsulating technologies.
   4.  Currently I'm envisioning this as IP only.  It would be fairly
       trivial to make the query be for the MAC address instead of the
       IP.  This does lead to some interesting issues, like what do we
       do with broadcast, such as ARP?  Have the mapping server reply
       with all of the destinations and then have the source replicate
       the packet?!

2.  Introduction

   There are many ways to design and build a datacenter network (and
   the definition of what exactly a datacenter network is, is very
   vague!), but in general they can be separated into two main
   classes: Layer-2 based and Layer-3 based.

   A Layer-2 based datacenter is one in which the majority of the
   traffic is bridged (or switched) in one large, flat Layer-2 domain
   or a number of Layer-2 domains.  VLANs are often employed to
   provide customer isolation.

   A Layer-3 based datacenter is one in which much of the
   communication between hosts is routed.  In this architecture there
   are a large number of separate Layer-3 domains (for example, one
   subnet per rack), and communication between hosts in different
   subnets is routed, while communication between hosts in the same
   subnet is (obviously) bridged / switched.  While customer isolation
   can be provided through careful layout and access control lists, in
   general this architecture is better suited to a single user (or a
   small number of users), such as a single organization.
   This delineation is obviously a huge simplification, as the design
   and build-out of a datacenter has many dimensions, and most
   real-world datacenters have properties of both Layer-2 and Layer-3
   designs.

   Virtual Machines are fast gaining popularity, as they allow a
   datacenter operator to more fully leverage their hardware resources
   and, in essence, provide statistical multiplexing of compute
   resources.  By selling multiple VMs on a single physical machine,
   operators can maximize their investment, quickly allocate resources
   to customers, and potentially move VMs to other hosts when needed.

   One of the factors driving the design of datacenters is the desire
   to provide Virtual Machine mobility.  This allows an operator to
   move guest machine state from one machine to another, including all
   of the network state (for example, keeping TCP connections alive).
   This allows a datacenter operator to dynamically move guest
   machines around to better allocate resources and to take devices
   offline for maintenance without negatively impacting customers.  VM
   mobility can even be used to move running machines around to
   provide better latency - for example, an instance could be moved
   from the East Coast of the USA to Australia and back on a daily
   basis to "follow the sun".

   In many cases VM mobility requires that the source and destination
   host machines be on the same Layer-2 network, which has led to the
   formation of large Layer-2 networks containing thousands (or tens
   of thousands) of machines.  This has led to some scaling concerns,
   such as those being addressed in the ARMD Working Group.  Some
   operators are more comfortable running Layer-3 networks (and, to be
   honest, think that big Layer-2 networks are bad juju).

   This document outlines how VM mobility can be designed to work in a
   datacenter (or across datacenters) that is broken up into multiple
   Layer-3 domains.

2.1.  Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Terminology

   There is a whole industry built around these technologies, and as
   with many new industries, each vendor has its own, unique
   nomenclature for the various parts.  This is the terminology that
   we are using within this document -- it may not line up with what
   others call something, but "A rose by any other name..."

   Guest network  A virtual network connecting guest instances owned
      by a customer.  This is also referred to as a Guest LAN or
      Customer Network.

   Host Machine  A machine that "hosts" guest (virtual) machines and
      runs a Hypervisor.  This is usually a powerful server, using
      software and hardware (often referred to as a Hypervisor) to
      provide isolation between the guest machines.  The host machine
      emulates all of the functions of a "normal" machine so that,
      ideally, the guest OS is unaware that it is not running on
      dedicated hardware.

   Gateway  A device that provides access to external networks.  It
      provides services to guest machines that need to communicate
      with destinations outside their virtual network.

   Guest Machine  A "Virtual Machine" that runs on a Host Machine.

   Hypervisor  A somewhat loose term that encompasses the hardware and
      software that provides isolation between guest machines and
      emulates all of the functions of a bare metal server.  This
      usually includes such things as a virtual Network Interface Card
      (NIC), a virtual CPU (usually assisted by specialized hardware
      in the host machine's CPU), virtual memory, etc.

   Mapping Service  A service providing a mapping between guest
      machines and the host machines on which those guests are
      running.  This mapping service also provides mappings to
      Gateways that provide connectivity to devices outside the
      customer networks.
   Virtual Machine  A synonym for Guest Machine.

   Virtual Switch  A virtualized bridge created by the Hypervisor,
      bridging the virtual NICs in the virtual machines and providing
      access to the physical network.

4.  Overview

   By providing a "shim" layer within the network stack provided by
   the Hypervisor (or guest machine), we can create a virtual L2
   network connecting the machines belonging to a customer, even if
   these machines are in different L3 networks (subnets).

   When an application on a virtual machine sends a packet to a
   receiver on another virtual machine, the operating system on the
   sending VM needs to resolve the hardware address of the destination
   IP address (using ARP in IPv4 or Neighbor Discovery / Neighbor
   Solicitation in IPv6).  To do this, it generates an ARP / NS packet
   and broadcasts / multicasts it.  As with all traffic sent by the
   VM, this is handed to a virtual network card, which is simulated by
   the hypervisor (yes, some VM technologies provide direct access to
   hardware; this will be further discussed later).  The hypervisor
   examines the packet to provide access control (and similar
   functions) and then discards, rewrites, or forwards the packet onto
   the physical network.  So far this describes the current operation
   of VM networking.

   In order to provide Layer-2 connectivity between a set of virtual
   machines that run on host machines in different IP subnets (for
   example, in a Layer-3 based datacenter, or even on machines owned
   and operated by different providers), we simply build an overlay
   network connecting the host machines.

   When the VM passes the ARP / NS packet to the virtual NIC, the
   hypervisor intercepts the packet, records which VM generated the
   request, and extracts the IP address to be resolved.
It then queries
   a mapping server with the guest VM identifier and the requested
   address to determine the IP address of the host machine that hosts
   the requested destination VM, the VM identifier on that host, and
   the virtual MAC assigned to that virtual machine.  Once the source
   hypervisor receives this information it caches it, and either
   encapsulates the original ARP / NS in an encapsulation / tunneling
   mechanism (similar to GRE) or simply synthesizes a response and
   hands that back to the source VM.

   Presumably the source VM initiated resolution of the destination VM
   because it wanted to send traffic to it, so shortly after the
   source has resolved the destination it will try to send a data
   packet to it.  Once this data packet reaches the hypervisor on the
   source host machine, the hypervisor simply encapsulates the packet
   in a tunneling protocol and ships it over the IP network to the
   destination.  When the packet reaches the destination host machine,
   the packet is decapsulated, the VM ID is extracted, and the packet
   is passed up to the destination VM.  (TODO (WK): We need a
   tunneling mechanism that has a place to put the VM ID -- find one,
   extend one, or simply define a new one.)  In many ways much of this
   is similar to LISP...

   As the ability to resolve (and so send traffic to) a given machine
   requires getting the information from a mapping server,
   communication between hosts can easily be granted and revoked by
   the mapping server.  It is expected that the mapping server will
   know which VMs are owned by each customer and will, by default,
   allow access only between those VMs (and a gateway, see below), but
   if the operator so chooses it can (but probably shouldn't!) allow
   access between VMs owned by different customers, etc.
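   The grant / revoke behaviour described above can be sketched as
   follows.  This is a hypothetical illustration only: the class and
   method names, the (customer, guest IP) key layout, and the default
   same-customer access policy are assumptions made for this sketch,
   not part of any defined protocol.

```python
# Hypothetical sketch of the mapping service described above.  Names,
# the (customer_id, guest_ip) key layout, and the default policy of
# only allowing same-customer resolution are illustrative assumptions.

class MappingService:
    def __init__(self):
        # (customer_id, guest_ip) -> (host_ip, vm_id, virtual_mac)
        self._mappings = {}
        # querying customer -> set of customers whose VMs it may resolve
        self._allowed = {}

    def register(self, customer_id, guest_ip, host_ip, vm_id, vmac):
        """Record where a guest VM currently runs (updated on VM moves)."""
        self._mappings[(customer_id, guest_ip)] = (host_ip, vm_id, vmac)
        # By default a customer may resolve only its own VMs.
        self._allowed.setdefault(customer_id, {customer_id})

    def revoke(self, customer_id, guest_ip):
        """Withdraw a mapping; peers can no longer resolve this VM."""
        self._mappings.pop((customer_id, guest_ip), None)

    def resolve(self, querying_customer, owner_customer, guest_ip):
        """Return (host_ip, vm_id, virtual_mac), or None if denied/unknown."""
        if owner_customer not in self._allowed.get(querying_customer, set()):
            return None  # access not granted: resolution (and traffic) fails
        return self._mappings.get((owner_customer, guest_ip))
```

   Because resolution is the only path to a destination's host
   address, revoking a mapping (or simply never granting
   cross-customer access) is sufficient to cut off traffic.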
In addition, because the mapping
   server uses both the IP address and the VM ID to look up the
   destination information (and the traffic between VMs is
   encapsulated), overlapping customer address space is seamlessly
   handled (other than in the pathological case where operators allow
   the customers to interconnect at L2!).

   Obviously, just having a bunch of customer machines communicating
   amongst themselves isn't very useful - the customer will want to
   reach them externally, they will be serving traffic to / from the
   Internet, etc.  This functionality is provided by gateway machines
   - these machines decapsulate traffic that is destined to locations
   outside the virtual network and encapsulate traffic bound for the
   destination network, etc.

   By encapsulating the packet (for example, in a GRE packet) the
   Hypervisor can provide a virtual, transparent network to the
   receiver.  In order to obtain the information necessary to
   encapsulate the packet (for example, the IP address of the machine
   hosting the receiving VM), the sending Hypervisor queries the
   Mapping Service.  This service maps the tuple of (Customer_ID,
   Destination Address) to the host machine hosting the instance.

   For example, if guest machine GA, owned by customer CA, on host
   machine HX wishes to send a packet to guest machine GB (also owned
   by customer CA) on host machine HY, it would generate an ARP
   request (or, in IPv6 land, a Neighbor Solicitation) for GB.  The
   Hypervisor process on HX would intercept the ARP and query the
   Mapping Service for (CA, GB), which would reply with the address of
   HY (Hypervisors also cache this information).  The Hypervisor on HX
   would then encapsulate the ARP request packet in a GRE packet,
   setting the destination to be HY, and send the packet.  When the
   Hypervisor process on HY receives the packet, it would decapsulate
   the packet and hand it to the guest instance GB.
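   The HX-side behaviour in the example above can be sketched as
   follows.  This is a minimal, hypothetical illustration: the class
   name, the dict standing in for the Mapping Service, and the
   dict-based stand-in for a GRE-like header carrying a VM ID field
   are all assumptions, not anything defined by this document.

```python
# Hypothetical send-side sketch of the HX behaviour in the example
# above.  The dict-based mapping table and the GRE-like header layout
# (including a VM ID field, which the TODO above notes is needed) are
# assumptions made for illustration.

# Stand-in for the Mapping Service:
# (customer_id, guest_ip) -> (host_ip, vm_id, virtual_mac)
MAPPING = {
    ("CA", "10.0.0.3"): ("192.0.2.20", "vm-GB", "02:00:00:00:00:02"),
}

class SendingHypervisor:
    def __init__(self, customer_id, host_ip):
        self.customer = customer_id
        self.host_ip = host_ip
        self.cache = {}  # guest_ip -> mapping entry, filled on first lookup

    def encapsulate(self, dest_guest_ip, inner_frame):
        """Resolve the destination host and wrap the frame for tunneling."""
        entry = self.cache.get(dest_guest_ip)
        if entry is None:
            entry = MAPPING.get((self.customer, dest_guest_ip))
            if entry is None:
                return None  # no mapping (or access revoked): drop
            self.cache[dest_guest_ip] = entry  # cache, as described above
        dest_host_ip, vm_id, _vmac = entry
        # GRE-like encapsulation; the receiving hypervisor uses vm_id
        # to hand the decapsulated frame to the right guest.
        return {"outer_src": self.host_ip, "outer_dst": dest_host_ip,
                "vm_id": vm_id, "payload": inner_frame}
```

   A first call for a given destination consults the mapping table and
   caches the result; subsequent packets to the same guest IP are
   encapsulated without a further lookup.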
This process is transparent to GA
   and GB - as far as they are concerned, they are both connected to a
   single network.  While the above might sound like a heavyweight
   operation, the hypervisor is (in general) already examining all
   packets to provide a virtualized switch, performing access control
   functions and similar - performing the mapping functionality and
   encapsulation / decapsulation is not expected to be expensive.

   The Mapping Service contains information about all of the guest
   machines, which customer they are associated with, and routes to
   external networks.  If a guest machine sends a packet that is
   destined to an external network (such as a host on the Internet),
   the mapping server returns the address of a Gateway.

5.  IANA Considerations

   No action required.

6.  Security Considerations

7.  Privacy

   There

8.  Acknowledgements

   I would like to thank Google for 20% time.

9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US

   Email: warren@kumari.net

   Joel M. Halpern
   Ericsson

   Email: joel.halpern@ericsson.com