Routing Area Working Group                                      J. Heitz
Internet-Draft                                               K. Majumdar
Intended status: Standards Track                                   Cisco
Expires: April 25, 2019                                 October 22, 2018

 Automatic discovery and configuration of the network fabric in Massive
                           Scale Data Centers
                draft-heitz-idr-msdc-fabric-autoconf-00

Abstract

   A switching fabric in a massive scale data center can comprise many
   tens of thousands of switches and hundreds of thousands of IP hosts.
   Connecting and configuring a network of this size requires
   automation to avoid errors. Zero Touch Provisioning (ZTP) protocols
   exist, but they can configure only IP devices that are reachable by
   the ZTP agents. This document describes a method that combines BGP,
   DHCPv6 and SRv6 with ZTP to configure an entire network of devices.
   It is designed to scale well, because no networked device is
   required to know about more than its directly connected
   neighborhood.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 25, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements
   3.  Solution Overview
   4.  Solution Details
   5.  Security Considerations
   6.  IANA Considerations
   7.  Acknowledgements
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   [RFC7938] defines a massive scale data center as one that contains
   over one hundred thousand servers. It describes the advantages of
   using BGP [RFC4271] as a routing protocol in a Clos switching fabric
   that connects these servers. A fabric design that scales to one
   million servers is considered enough for the foreseeable future and
   is the design goal of this document. Of course, the design should
   also work for smaller fabrics.
   A switch fabric to connect one million servers will consist of
   between 35,000 and 130,000 switches and 1.5 million to 8 million
   links, depending on how redundantly the servers are connected to the
   fabric and the level of oversubscription in the fabric. A switch
   that needs to store, send and operate on hundreds of routes is
   clearly cheaper than one that needs to store, send and operate on
   millions of links.

   Such a network requires significant configuration on each switch and
   many cables to connect. This is an onerous task without automation.

2.  Requirements

   The solution must configure a fabric network for massive scale data
   centers.

   It must detect every wiring error. An example of a wiring error is a
   spine switch that has a different number of links into one pod than
   into another pod in a Clos fabric.

   One or multiple controllers exist to control a network. Multiple
   controllers are used for redundancy and to improve operation in
   partitioned networks.

   Any devices with equivalent functionality should be interchangeable
   without requiring configuration changes. That means if a device
   breaks, it can be replaced by any other device of equivalent
   functionality without any changes to its configuration. Even if a
   replacement device already has configuration, it should still work
   in its new position.

   A device may have configuration, but such configuration MUST NOT
   depend on the location of the device in the network. Therefore, no
   IP addresses should be pre-configured on any device, and no fabric
   tier should need to be pre-configured either.

   For scalability, a device must not be required to know how to reach
   every other device. Only a controller should be expected to know the
   entire topology.

   If two such auto-discovering/auto-configuring networks are connected
   together, the function of discovery/configuration in one network
   must not disturb this function in the other network.

   A device must accept configuration only from a well-defined set of
   controllers.

   Separate cabling for a management network must not be required.

   The network should function even if the controllers are
   disconnected. Link failure, link restoration and device failure
   should be handled. Device restoration should be handled as long as
   it does not require new configuration. A controller should be needed
   only to discover new devices and configure them into the network.

   The protocol does not need to be fast.

   A controller must be able to reach any device if there is any way at
   all to reach it, even if that is multiple hops between spine
   switches or any other path that may be disallowed in a normal Clos
   network.

   At the same time, normal traffic must remain restricted to allowable
   paths.

   The routing protocol for normal traffic must be fast and efficient.

   The network must scale to 1 million connected servers and 8 million
   links in the fabric.

3.  Solution Overview

   DHCPv6 [RFC3315] and ZTP are used to discover and configure devices
   reachable by the controller. As the controller configures devices,
   it configures them to be DHCP relay agents. This makes more devices
   reachable through the new DHCP relay agents, allowing the new
   devices to be configured. As this configuration process proceeds
   further away from the controller, it configures BGP to ensure
   reachability to all devices even if links were to fail. Reachability
   is needed in both directions: device to controller and controller to
   device. A device does not need to be able to reach every other
   device during the discovery/configuration process. Devices close to
   the controller will be used to forward packets to many more distant
   devices. These close devices should not store routes to reach all
   those more distant devices.
   A possible idea to reduce the routing table on close devices is to
   aggregate addresses of more distant devices. This is difficult and
   unreliable, because before discovery completes, the number of
   devices behind any given device is unknown. Also, if links fail, a
   large number of devices could suddenly appear behind a different
   device, making the previous addressing structure non-aggregatable
   with the new topology. The chosen method to route traffic from
   controller to device is segment routing. The controller knows the
   topology. With that knowledge, it can build a segment list to reach
   any device.

   In certain environments, it is required for devices to authenticate
   the network and for the network to authenticate devices. DHCPv6
   provides a method to authenticate in both directions using shared
   keys. TCP-AO [RFC5925] can be used to authenticate BGP sessions.
   SZTP [I-D.ietf-netconf-zerotouch] provides for authentication during
   the ZTP process.

4.  Solution Details

   Each device needs a unique identifier. This may be printed on the
   device. For easy serviceability, a device must have a single
   identifier, visible both on the outside of the device and to the
   controller. This will be the DUID in the DHCPv6 Client Identifier
   Option.

   In order to discover the topology, a controller needs to know every
   link in the topology. This means the device ID and interface ID or
   interface address at each end of every link. DHCPv6 can be used to
   obtain that information. For each link, one end of the link is the
   device that requests an address. The other end of the link is either
   the controller itself or a DHCP relay agent. The DHCP relay agent
   relays all client requests back to the controller.

   Configuration proceeds in waves. Each controller may take part in
   configuring the network. The waves of configuration propagate away
   from each controller.
   In the first wave, a controller allocates a routable IPv6 address to
   each device directly connected to the controller. These devices
   comprise the first wave. The controller will then configure each of
   these devices using a ZTP protocol, such as
   [I-D.ietf-netconf-zerotouch]. The configuration for each device will
   include the following items:

   -  A routable IPv6 address for each of its interfaces that has not
      already acquired one by DHCP.

   -  A routable IPv6 address for the loopback interface.

   -  Configuration to act as a DHCPv6 relay agent for the next wave of
      devices.

   -  Configuration for a BGP session to each of its connected
      neighbors. That BGP session will initially be down, but will
      establish once the neighbors are connected and configured.

   -  Configuration for a BGP session to the controller.

   The controller will allocate a different IP address for each
   interface of each device in the network. When the controller
   receives DHCP requests from DHCP relay agents, it will recognize the
   DHCP relay agent end of the link from the link-address field in the
   relay-forward message. The controller will note the DUID in the DHCP
   request to keep track of the device making the request. Because it
   already knows the DUID of the DHCP relay agent from its IP address,
   it can tie the two devices together by their DUIDs.

   The controller must keep track of the DUID in every DHCP request, so
   that it can recognize different interfaces on the same device. This
   is needed to detect looped cables and to prevent the controller from
   attempting to use ZTP to configure a single device through multiple
   links at the same time.

   Two devices A and B may be connected by a link and be configured at
   the same time, each through a different link. At this time, the
   controller does not yet know about the link A-B. In this case,
   neither A nor B will send a DHCP request across the link A-B.
   The interfaces on each end will not come up either, because the IP
   interface addresses will not have a common prefix. This case can be
   detected, because both A and B will send periodic
   router-advertisement messages on the link, announcing their
   interface IP addresses. The device with the lower address MUST send
   a DHCPv6 request to the other device to get a new address.

   A device SHOULD use the DHCPv6 User Class Option to identify the
   network it is attempting to reach. This is to prevent the controller
   from configuring devices that are attached to the network but are
   not part of the network to be configured. A string should be used
   that is not likely to match that of any other network that this
   network is connecting to. However, even if it matches by some small
   chance, the DHCPv6 authentication key will likely not match or the
   subsequent ZTP will fail. Inadvertently getting an IP address is not
   a terrible thing.

   The controller should allocate a different BGP AS number for each
   device. There are plenty of private 4-octet ASNs available.

   The controller will advertise its own loopback address to all the
   directly connected BGP neighbors with a community to identify it as
   a controller address. This IP address will be advertised by all
   devices to their directly connected BGP neighbors. The devices will
   use this BGP route to route back to the controller.

   Each device will announce its interface addresses, tagged with a
   community, on the BGP sessions to its directly connected neighbors.
   These routes will be re-announced only on the BGP session to the
   controller and not to directly connected neighbors. The BGP
   connections can be made to fail upon interface down or BFD down. BFD
   should only operate on the BGP sessions to directly connected
   neighbors, not on the session to the controller.
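
   The advertisement rules in the preceding paragraphs can be sketched
   as a small export filter. This is an illustration only and not part
   of the protocol: the community labels, the Route type, the
   SessionKind type and the should_advertise function are hypothetical
   names chosen for the example. The sketch assumes that a device
   originates its own interface addresses only toward directly
   connected neighbors, as the text above describes.

```python
from dataclasses import dataclass
from enum import Enum


class Community(Enum):
    # Hypothetical labels; the draft does not assign community values.
    CONTROLLER_ADDR = "controller-address"
    INTERFACE_ADDR = "interface-address"


class SessionKind(Enum):
    NEIGHBOR = "directly-connected-neighbor"
    CONTROLLER = "controller"


@dataclass(frozen=True)
class Route:
    prefix: str
    community: Community
    locally_originated: bool


def should_advertise(route: Route, session: SessionKind) -> bool:
    """Decide whether a device sends `route` on a session of this kind."""
    if route.community is Community.CONTROLLER_ADDR:
        # The controller loopback is passed on to directly connected
        # neighbors so that every device can route back to the controller.
        return session is SessionKind.NEIGHBOR
    if route.community is Community.INTERFACE_ADDR:
        if route.locally_originated:
            # A device announces its own interface addresses to neighbors.
            return session is SessionKind.NEIGHBOR
        # Interface addresses learned from a neighbor are re-announced
        # only toward the controller, never to other neighbors.
        return session is SessionKind.CONTROLLER
    # Routes without one of the two communities are outside this sketch.
    return False


own = Route("2001:db8:1::1/128", Community.INTERFACE_ADDR, True)
learned = Route("2001:db8:2::1/128", Community.INTERFACE_ADDR, False)
ctrl = Route("2001:db8::c/128", Community.CONTROLLER_ADDR, False)
```

   With this rule, a device holds the controller route plus the
   interface addresses of its immediate neighbors, while only the
   controller collects the interface addresses of the whole fabric.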

   The devices will be capable of Segment Routing over IPv6 (SRv6)
   [I-D.ietf-6man-segment-routing-header]. When a device receives an
   IPv6 packet, it will first inspect the SRv6 extension header and be
   able to forward the packet to the next segment. If there is no SRv6
   extension header or there are no more segments, then the packet
   should be for the device itself, for a directly connected neighbor
   or for a controller. If none of those match, then the device must
   drop the packet.

   The controller, knowing the topology, will be able to send a packet
   to any device in the network by building the appropriate SRv6 SID
   list. Thus a device in the network does not need to store a route
   for every other device.

   Once the controller has learnt the whole network topology, or at
   least a large recognizable part of it, it can complete the
   configuration of the network. How it does so depends on the network.
   The controller will be programmed with a description of the expected
   network and applicable constraints. As discovery proceeds, the
   controller will try to match the discovered topology with the
   programmed description. An example of a data center description is:
   "A number of pods. Each pod consists of 384 TORs and 32 spines. Each
   TOR has 32 south facing ports and 32 north facing ports. Each spine
   has 384 south facing ports and 192 north facing ports. Super-spines
   connect the pods. Some of the pods are DCI pods. The devices need
   aggregatable addresses and BGP sessions." The controller should be
   able to recognize all the switches, the servers and the DCI routers
   and match the discovered topology to the description. It should then
   create configurations for all the devices and report
   inconsistencies. How the controller does this is out of scope of
   this document.
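
   The segment-list construction mentioned earlier in this section can
   be sketched as a breadth-first search over the links the controller
   has discovered. This is a minimal sketch under stated assumptions,
   not a definitive implementation: it assumes one SID per device, an
   undirected link list keyed by device identifier, and a shortest-path
   choice. The function name build_sid_list, the device names and the
   SID values are hypothetical. The result lists SIDs in travel order;
   encoding them into the SRv6 extension header is not shown.

```python
from collections import deque


def build_sid_list(links, sids, controller, target):
    """Return the SIDs along a shortest path from controller to target.

    links: iterable of (device_a, device_b) pairs, treated as undirected.
    sids:  mapping of device identifier to its SID (assumed one per device).
    Returns None when the target is not reachable over the known links.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search, remembering how each device was first reached.
    previous = {controller: None}
    queue = deque([controller])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)

    if target not in previous:
        return None

    # Walk back from the target to the controller, then reverse.
    path = []
    node = target
    while node != controller:
        path.append(node)
        node = previous[node]
    path.reverse()
    return [sids[device] for device in path]


# Hypothetical example topology: the controller reaches D through A and B.
links = [("ctrl", "A"), ("A", "B"), ("ctrl", "C"), ("C", "B"), ("B", "D")]
sids = {
    "A": "2001:db8:0:a::",
    "B": "2001:db8:0:b::",
    "C": "2001:db8:0:c::",
    "D": "2001:db8:0:d::",
}
print(build_sid_list(links, sids, "ctrl", "D"))
```

   Because any path through the known links is acceptable to the
   search, the controller can reach a device over paths that normal
   Clos routing would disallow, which is one of the stated
   requirements.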

   When a new device joins the network, the controller will detect it,
   because it will receive a DHCP request from it, relayed by its
   neighboring DHCP relay agent.

5.  Security Considerations

   TBD

6.  IANA Considerations

   TBD

7.  Acknowledgements

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3315]  Droms, R., Ed., Bound, J., Volz, B., Lemon, T., Perkins,
              C., and M. Carney, "Dynamic Host Configuration Protocol
              for IPv6 (DHCPv6)", RFC 3315, DOI 10.17487/RFC3315,
              July 2003.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010.

8.2.  Informative References

   [I-D.ietf-6man-segment-routing-header]
              Filsfils, C., Previdi, S., Leddy, J., Matsushima, S., and
              D. Voyer, "IPv6 Segment Routing Header (SRH)",
              draft-ietf-6man-segment-routing-header-14 (work in
              progress), June 2018.

   [I-D.ietf-netconf-zerotouch]
              Watsen, K., Abrahamsson, M., and I. Farrer, "Zero Touch
              Provisioning for Networking Devices",
              draft-ietf-netconf-zerotouch-25 (work in progress),
              September 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

Authors' Addresses

   Jakob Heitz
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: jheitz@cisco.com

   Kausik Majumdar
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: kmajumda@cisco.com