idnits 2.17.00 (12 Aug 2021)

/tmp/idnits49430/draft-heitz-idr-msdc-fabric-autoconf-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 4, 2018) is 1287 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 3315 (Obsoleted by RFC 8415)

  == Outdated reference: draft-ietf-6man-segment-routing-header has been
     published as RFC 8754

  == Outdated reference: draft-ietf-netconf-zerotouch has been published as
     RFC 8572

     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Routing Area Working Group                                      J. Heitz
Internet-Draft                                               K. Majumdar
Intended status: Standards Track                               A. Lindem
Expires: May 8, 2019                                               Cisco
                                                        November 4, 2018

 Automatic discovery and configuration of the network fabric in Massive
                           Scale Data Centers
                draft-heitz-idr-msdc-fabric-autoconf-01

Abstract

   A switching fabric in a massive scale data center can comprise tens
   of thousands of switches and hundreds of thousands of IP hosts.
   Connecting and configuring a network of this size requires
   automation to avoid errors. Zero Touch Provisioning (ZTP) protocols
   exist, but they can only configure IP devices that are reachable by
   the ZTP agents. This document describes a method that combines BGP,
   DHCPv6 and SRv6 with ZTP to discover and configure an entire network
   of devices. It is designed to scale well, because no networked
   device is required to know about more than its directly connected
   neighborhood.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 8, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements
   3.  Solution Overview
   4.  Solution Details
   5.  DHCP Procedures
     5.1.  Inconsistent Endpoints
   6.  Link State Database
   7.  BGP Procedures
   8.  Segment Routing Procedures
   9.  Final Configuration
   10. Connecting a New Controller to a Network in Production
   11. Multiple Controllers
   12. Security Considerations
   13. IANA Considerations
   14. Acknowledgements
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Authors' Addresses

1.  Introduction

   RFC 7938 [RFC7938] defines a massive scale data center as one that
   contains over one hundred thousand servers. It describes the
   advantages of using BGP [RFC4271] as a routing protocol in a Clos
   switching fabric that connects these servers. A fabric design that
   scales to one million servers is considered enough for the
   foreseeable future and is the design goal of this document. Of
   course, the design should also work for smaller fabrics. A switch
   fabric to connect one million servers will consist of between 35,000
   and 130,000 switches and 1.5 million to 8 million links, depending on
   how redundantly the servers are connected to the fabric and the level
   of oversubscription in the fabric. A switch that needs to store,
   send and operate on hundreds of routes is clearly cheaper than one
   that needs to store, send and operate on millions of links.

   Such a network requires significant configuration on each switch and
   many cables to connect. This is an onerous task without automation.

2.  Requirements

   To configure a fabric network for massive scale data centers.

   To detect every cabling error. For example, a spine switch that has
   a different number of links into one pod than into another pod in a
   Clos fabric.

   Any device should be interchangeable with another device of
   equivalent functionality without requiring configuration changes.
   That means if a device breaks, it can be replaced by any other device
   of equivalent functionality without any changes to its configuration.
   Even if a replacement device already has configuration, it should
   still work in its new position.

   A device may have configuration, but such configuration MUST NOT
   depend on the location of the device in the network. Therefore, no
   IP addresses should be pre-configured on any devices. No fabric tier
   should be needed.
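
   The cabling-error example given above (a spine switch with a
   different number of links into one pod than into another) can be
   illustrated with a minimal sketch. It assumes the controller already
   holds a list of discovered links as (spine, pod) pairs; the function
   name, the data layout and the device names are hypothetical and are
   not defined by this document.

```python
from collections import Counter, defaultdict


def uneven_spines(links):
    """Return spines whose link count differs between pods.

    `links` is an iterable of (spine_id, pod_id) pairs, one pair per
    discovered link from a spine into some device of that pod
    (hypothetical layout, chosen only for this illustration).
    """
    per_spine = defaultdict(Counter)
    pods = set()
    for spine, pod in links:
        per_spine[spine][pod] += 1
        pods.add(pod)

    report = {}
    for spine, counts in per_spine.items():
        # A pod with no link from this spine counts as zero links.
        per_pod = {pod: counts[pod] for pod in sorted(pods)}
        if len(set(per_pod.values())) > 1:
            report[spine] = per_pod
    return report


links = [
    ("spine-1", "pod-A"), ("spine-1", "pod-A"),
    ("spine-1", "pod-B"), ("spine-1", "pod-B"),
    ("spine-2", "pod-A"), ("spine-2", "pod-A"),
    ("spine-2", "pod-B"),  # one cable missing into pod-B
]
print(uneven_spines(links))
```

   In this example only spine-2 is reported, because it has two links
   into pod-A but one into pod-B.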

   For scalability, no device should need to know how to reach every
   other device. Only the controller should be expected to know the
   entire topology.

   If two such auto-discovering/auto-configuring networks are connected
   together, the function of discovery/configuration in one network must
   not disturb this function in the other network.

   Separate cabling for a management network must not be required.

   The network should function even if the controllers are disconnected.
   The controller should only be needed to discover devices and to
   configure them into the network. Device and link failures and
   restoration should not require the controller. If a device is moved
   or reconnected in a way that requires reconfiguration, then the
   controller is required to discover the new topology and to change the
   configuration accordingly.

   The discovery and configuration protocol does not need to be fast.

   The controller must be able to reach any device if there is any way
   at all to reach it, even if that is multiple hops between spine
   switches or any other path that may be disallowed in a normal Clos
   network. At the same time, normal traffic must remain restricted to
   allowable paths.

   The routing protocol for normal traffic must be fast and efficient.

   The network must scale to 1 million connected servers and 8 million
   links in the fabric.

3.  Solution Overview

   DHCPv6 [RFC3315] and ZTP are used to discover and configure devices
   reachable by the controller. As the controller configures devices,
   it configures them to be DHCP relay agents. This makes more devices
   reachable through the new DHCP relay agents, allowing the new devices
   to be configured. As this configuration process proceeds further
   away from the controller, it configures BGP to ensure reachability to
   all devices even if links were to fail. For scalability, each device
   knows only its directly connected neighbors and a route to the
   controller.
   Every device can send a packet to the controller, because every
   device knows a route to the controller. To send a packet from the
   controller to a specific target device is harder, because the devices
   between the controller and the target do not know how to reach the
   target device. The controller is the only device that knows the
   topology between itself and the devices it needs to reach. To send a
   packet to a target device, the controller builds an SRv6 (Segment
   Routing v6) segment list. As each device receives a packet, it will
   place the next segment IPv6 address into the destination IPv6 address
   field and forward the packet to the next device.

   After the network discovery is complete, the controller will validate
   the discovered topology against an internal description and go back
   and configure application dependent state into the devices and/or
   report connection anomalies. An example description might be "Clos
   fabric connecting servers and DCI pods". Since a Clos fabric looks
   the same upside down, the controller needs to identify servers,
   switches and DCI routers. This is done with DHCP vendor class
   options.

   In certain environments, devices are required to authenticate the
   network and the network is required to authenticate devices. TCP-AO
   [RFC5925] can be used to authenticate BGP sessions. SZTP
   [I-D.ietf-netconf-zerotouch] provides for authentication during the
   ZTP process. NETCONF can be used over SSH as described in [RFC6242].

4.  Solution Details

   Each device needs a unique identifier. This may be printed on the
   device. For easy serviceability, a device must have a single
   identifier, visible on the outside of the device and by the
   controller. This will be the DUID in the DHCPv6 Client Identifier
   Option.

   In order to discover the topology, the controller needs to know every
   link in the topology.
   This means the device ID and interface ID or interface address at
   each end of every link. DHCPv6 can be used to obtain that
   information. For each link, one end of the link is the device that
   requests an address. The other end of the link is either the
   controller itself or a DHCP relay agent. The DHCP relay agent relays
   all client requests back to the controller.

   Configuration proceeds in waves. The wave of configuration
   propagates away from the controller. In the first wave, a controller
   allocates a routable IPv6 address to each device directly connected
   to the controller. These devices comprise the first wave. The
   controller will then configure each of these devices using a ZTP
   protocol, such as [I-D.ietf-netconf-zerotouch]. The configuration
   for each device will include the following items:

   -  A routable IPv6 address for each of its interfaces that have not
      already acquired one by DHCP.

   -  A routable IPv6 address for the loopback interface.

   -  Configuration to act as a DHCPv6 relay agent for the next wave of
      devices.

   -  Configuration for a BGP session to each of its connected
      neighbors. That BGP session will initially be down, but will
      establish once the neighbors are connected and configured. These
      sessions are single hop directly connected EBGP sessions.

   -  Configuration for a BGP session to the controller. This is a
      multi-hop EBGP session using the loopback address.

   Each BGP speaker requires an AS number and a router ID. The
   controller should allocate a different BGP AS number for each device.
   There are plenty of private 4-octet ASNs available. The value of the
   router ID is not important.

   After the first wave of devices is configured, these devices become
   DHCPv6 relay agents. They are now in a position to accept DHCPv6
   SOLICIT messages and relay them to the controller. The controller
   acts as the DHCPv6 server.
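
   The wave ordering and the per-device AS number allocation described
   above can be sketched as follows. This is only an illustration under
   stated assumptions: the adjacency map is given up front, whereas a
   real controller learns links incrementally from relayed DHCPv6
   messages; the function names and device identifiers are
   hypothetical; and the private 4-octet range 4200000000-4294967294 is
   taken from RFC 6996, which this document does not cite.

```python
# Private 4-octet AS number range (per RFC 6996; an assumption here,
# since the document only says "private 4-octet ASNs").
PRIVATE_ASN_FIRST = 4200000000
PRIVATE_ASN_LAST = 4294967294


def discovery_waves(neighbors, controller):
    """Group devices into waves by hop distance from the controller."""
    seen = {controller}
    frontier = [controller]
    waves = []
    while frontier:
        # The next wave is every not-yet-seen neighbor of the last wave.
        wave = sorted(
            {peer for dev in frontier for peer in neighbors.get(dev, ())}
            - seen
        )
        if not wave:
            break
        seen.update(wave)
        waves.append(wave)
        frontier = wave
    return waves


def allocate_asns(waves):
    """Give each device a distinct private 4-octet AS number."""
    asns = {}
    next_asn = PRIVATE_ASN_FIRST
    for wave in waves:
        for device in wave:
            if next_asn > PRIVATE_ASN_LAST:
                raise RuntimeError("private 4-octet ASN range exhausted")
            asns[device] = next_asn
            next_asn += 1
    return asns


# Hypothetical topology keyed by device identifier (e.g. the DUID).
neighbors = {
    "controller": ["duid-a", "duid-b"],
    "duid-a": ["controller", "duid-c"],
    "duid-b": ["controller", "duid-c"],
    "duid-c": ["duid-a", "duid-b", "duid-d"],
    "duid-d": ["duid-c"],
}
waves = discovery_waves(neighbors, "controller")
print(waves)
print(allocate_asns(waves))
```

   The range holds roughly 94 million values, which is far more than
   the 130,000 switches considered in the Introduction.
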
   As each wave is configured, the BGP sessions on each device ensure
   that every device has a route to the controller. In this way, each
   DHCPv6 relay agent can communicate with the controller. A DHCP
   packet relayed by a device in the second wave is not relayed again by
   a device in the first wave. The device in the second wave has an IP
   connection to the controller through which it relays the messages.

   The controller will allocate a different IP address for each
   interface for each device in the network. When the controller
   receives DHCP requests from DHCP relay agents, it will recognize the
   DHCP relay agent end of the link from the link-address field in the
   relay-forward message. The controller will note the DUID in the DHCP
   request to keep track of the device making the request. Because it
   already knows the DUID of the DHCP relay agent from its IP address,
   it can tie the two devices together by their DUID.

   The controller must keep track of the DUID in every DHCP request, so
   that it can recognize different interfaces on the same device. This
   is needed to detect looped cables and to prevent the controller from
   attempting to use ZTP to configure a single device through multiple
   links at the same time.

5.  DHCP Procedures

   When a switch acquires an IP address on an interface, it starts
   sending IPv6 Router Advertisements on that interface. It includes
   the IP address prefix in the Prefix Information Option in the Router
   Advertisement. The L bit MUST be set and the A bit MUST be clear.
   If the switch has been configured as a DHCP relay and has a BGP route
   to the controller, then it will set the M bit in the Router
   Advertisement, otherwise it clears both the M and O bits.

   If a device requires an IP address on an interface and it hears a
   Router Advertisement with the M bit set, it will send a DHCPv6
   SOLICIT message to request an IP address.
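
   The Router Advertisement flag rules above can be summarized in a
   small sketch. The class and function names are hypothetical. The
   text does not say what the O bit should be when the M bit is set;
   the sketch leaves it clear, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaFlags:
    managed: bool     # M bit in the Router Advertisement
    other: bool       # O bit in the Router Advertisement
    on_link: bool     # L bit in the Prefix Information Option
    autonomous: bool  # A bit in the Prefix Information Option


def ra_flags(is_dhcp_relay, has_route_to_controller):
    """Flags a switch advertises once it has an address on a link."""
    managed = is_dhcp_relay and has_route_to_controller
    return RaFlags(
        managed=managed,
        other=False,       # assumption: left clear in both cases
        on_link=True,      # L bit MUST be set
        autonomous=False,  # A bit MUST be clear
    )


def should_solicit(needs_address, heard):
    """A device solicits only if it needs an address and M is set."""
    return needs_address and heard.managed


ready = ra_flags(is_dhcp_relay=True, has_route_to_controller=True)
not_ready = ra_flags(is_dhcp_relay=True, has_route_to_controller=False)
print(should_solicit(True, ready))      # True
print(should_solicit(True, not_ready))  # False
```
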
   Any SOLICIT message sent must include the following items:

   -  Client Identifier Option with the DUID.

   -  User Class Option to indicate the name of the network it is
      attempting to join. This is to prevent the controller from
      configuring devices attached to the network that are not part of
      the network to be configured.

   -  Vendor Class Option to indicate the type of device.

   -  If the link is point-to-point, then the Rapid Commit Option.

   -  A single Identity Association Option. This option must be for a
      non-temporary address and must be for the address of the
      interface on which it is being sent. This allows the controller
      to learn the interface on which the DHCP client is sending the
      SOLICIT message.

   When a DHCP Relay Agent receives a SOLICIT message, it encapsulates
   it into a relay-forward message and sends it to the controller. It
   puts its loopback IP address into the source IP address field in the
   IP header of the packet.

5.1.  Inconsistent Endpoints

   Two endpoints of a link may have different IP address prefixes that
   do not overlap. This prevents IP forwarding on the link. The
   controller will never assign prefixes this way. This condition may
   occur in the following cases:

   -  The controller assigned addresses to interfaces on two devices
      via ZTP and it did not know that these devices had a link between
      them. This is a normal occurrence.

   -  Some cables were unplugged from a device under maintenance and
      then plugged back in differently.

   -  A device was removed from its location in a topology and replaced
      in another location without having its configuration erased.

   The controller can repair all these cases automatically.

   If a device has an IP address on an interface and it hears a Router
   Advertisement that includes a Prefix Information Option, the prefix
   of which is different from its own prefix, then the following
   applies.
   If the Router Advertisement does not have the M bit set, then the
   device does nothing further. The interface will not be able to send
   IP packets. If the Router Advertisement has the M bit set, then it
   will send a DHCPv6 SOLICIT message to get a new IP address. Both
   sides of a link may do this and the SOLICIT messages will cross. The
   controller will receive both of them. When it receives the second
   SOLICIT, it will recognize it as being from the other end of the same
   link and allocate the appropriate address.

6.  Link State Database

   The controller will maintain a link state database of each link it
   learns. This is conceptual and implementations may differ.

   First is the device table. Each device is associated with:

   -  DHCP DUID. The controller learns this from the DHCP SOLICIT
      message received from the device.

   -  Device type. This is learnt from the DHCP Vendor Class Option
      from the DHCP SOLICIT message received from the device. It is
      used to recognize the topology and match it with the description
      of the required topology after the complete topology is
      discovered.

   -  Loopback IP address. The controller assigns this to the device
      during ZTP. It is advertised over the BGP sessions to neighboring
      devices. When those neighbors receive it, they advertise it to
      the controller and install it. They do not advertise it to other
      neighbors. This address is used as the endpoint for the BGP
      connection between the device and the controller. When the
      device is acting as DHCP Relay Agent, this address appears in the
      source IP address field in the IP header in the relay-forward
      message.

   Next is the endpoint table. Each endpoint is associated with:

   -  Reference to the device hosting this endpoint.

   -  IAID. The controller learns this from the DHCP SOLICIT message
      received from the device.

   -  Reference to the endpoint at the other end of the link if there
      is one.

   -  Local IP address with prefix length. The controller assigns this
      address either in a DHCP REPLY message or during ZTP. When the
      device is acting as DHCP Relay Agent, this address appears in the
      link-address field in the relay-forward message. This is used as
      the endpoint of a BGP session to the neighboring device. The
      host address (/128) is advertised as a network address to the BGP
      session across the link of this endpoint. When that neighbor
      receives the route, it will not install the route, but advertise
      it to the controller only. The controller uses that route, or
      rather the lack of the route, to know when the link has failed.
      The controller knows that the link exists from the DHCP SOLICIT
      message.

7.  BGP Procedures

   The controller will advertise its own loopback address to all the
   directly connected BGP neighbors with a community to identify it as
   the controller address. This IP address will be advertised by all
   devices to their directly connected BGP neighbors. The devices will
   use this BGP route to forward packets to the controller.

   Each device will announce its interface addresses, tagged with a
   community, over the BGP sessions to its directly connected neighbors.
   These routes will be re-announced only to the BGP session to the
   controller and not to directly connected neighbors. The BGP
   connections can be made to fail upon interface down or BFD down. BFD
   should only operate on the BGP sessions to directly connected
   neighbors, not on the session to the controller.

   The controller will host one multihop BGP session with every device
   in the network. This is a lot of sessions. These sessions do not
   need to be fast. They should have long keepalive timers.

8.  Segment Routing Procedures

   The devices will be Segment Routing v6 (SRv6)
   [I-D.ietf-6man-segment-routing-header] capable.
   When a device receives an IPv6 packet with its own address in the
   destination IP address field in the IP header and there is an SRv6
   extension header with more segments, then the device will place the
   next segment into the destination IP address field and forward the
   packet to this destination. If a device cannot replace the
   destination IP address from the SID list in the forwarding hardware,
   it can punt the packet to the control plane and do it there.

   The controller, knowing the topology, will be able to send a packet
   to any device in the network by building the appropriate SRv6 SID
   list. Thus each device in the network does not need to store a route
   for every other device.

9.  Final Configuration

   Once the controller has learnt the complete network topology, or at
   least a large recognizable part of it, it can complete the
   configuration of the network. This depends on the network. The
   controller will be programmed with a description of the expected
   network and applicable constraints. As discovery proceeds, the
   controller will try to match the discovered topology with the
   programmed description. An example of a data center description is:
   "A number of pods. Each pod consists of 384 TORs and 32 spines.
   Each TOR has 32 south facing ports and 32 north facing ports. Each
   spine has 384 south facing ports and 192 north facing ports. Super-
   spines connect the pods. Some of the pods are DCI pods. The devices
   need aggregatable addresses and BGP sessions." The controller should
   be able to recognize all the switches, the servers and the DCI
   routers and match the discovered topology to the description. It
   should then create configurations for all the devices and report
   inconsistencies. How the controller does this is out of scope of
   this document.
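
   As an illustration only, the example pod description above can be
   written down as data and checked for internal consistency before it
   is compared with a discovered topology. The class and field names
   below are hypothetical; this document leaves the controller's
   matching method out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodDescription:
    tors: int
    spines: int
    tor_south_ports: int
    tor_north_ports: int
    spine_south_ports: int
    spine_north_ports: int

    def server_ports(self):
        """South facing TOR ports available in one pod."""
        return self.tors * self.tor_south_ports

    def tor_uplinks(self):
        """North facing TOR ports in one pod."""
        return self.tors * self.tor_north_ports

    def spine_downlinks(self):
        """South facing spine ports in one pod."""
        return self.spines * self.spine_south_ports

    def pod_uplinks(self):
        """North facing spine ports toward the super-spines."""
        return self.spines * self.spine_north_ports

    def is_consistent(self):
        """Every TOR uplink needs exactly one spine downlink."""
        return self.tor_uplinks() == self.spine_downlinks()


# Values taken from the example description in the text.
pod = PodDescription(
    tors=384, spines=32,
    tor_south_ports=32, tor_north_ports=32,
    spine_south_ports=384, spine_north_ports=192,
)
print(pod.tor_uplinks(), pod.spine_downlinks(), pod.is_consistent())
print(pod.server_ports(), pod.pod_uplinks())
```

   For the example values, 384 TORs with 32 north facing ports give
   12288 uplinks, which equals the 32 spines with 384 south facing
   ports, so the description is self-consistent. A discovered pod whose
   counts differ from these would be a candidate for an inconsistency
   report.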

   When a new device joins the network, the controller will detect it,
   because it will receive a DHCP request from it, relayed by its
   neighboring DHCP relay agent.

10.  Connecting a New Controller to a Network in Production

   A network can function without the controller present. The
   controller is only needed to auto-configure the network when topology
   changes occur. If a new controller is connected to a network that is
   already in production, then the controller has to discover the
   network before it can do anything else. The controller connects to a
   switch using the link-local address. The controller then uses
   NETCONF to query the configuration of the switch.

11.  Multiple Controllers

   Because the controller need only be present to automate configuration
   changes, its absence is not likely to cause a network outage. If a
   device interface is incorrectly connected, then it will just not come
   up. Thus multiple controllers are not required for redundancy. A
   single controller can be connected to multiple devices in the network
   in such a way that unreachability of large parts of the network is
   unlikely even with many failures within the network.

   Nonetheless, multiple controllers should be possible in a single
   network if they coordinate control amongst each other. Such
   coordination is out of scope of this document.

12.  Security Considerations

   When the network to be configured is used as an underlay, then it is
   only used to connect tunnel endpoints together within the network.
   The network is not accessible from outside the network. The network
   is accessible to directly connected devices. An adversary can
   connect directly to a device in the network by being plugged into a
   port of that device. This and all other threats listed in this
   section can be avoided by physical barriers to prevent access to the
   switching hardware.

   An adversary could inject packets into, or intercept packets from,
   tunnels that are being carried by the fabric. This can be avoided by
   using IPsec tunnels for all payload traffic.

   An adversary could impersonate a controller and start a NETCONF
   session. To avoid that, the real controller should use NETCONF over
   SSH to all devices.

   An adversary connected to a device in the network could send a DHCP
   SOLICIT message and get an IP address. It can then start a BGP
   session with the device it connects to. To prevent such a BGP
   session from being established, TCP-AO is recommended.

   An adversary connected to a device in the network could impersonate
   the controller and cause the device to request DHCP services from the
   adversary. To avoid damage, all DHCP services other than those
   required to implement the functionality of this document should be
   disabled. DHCP Relay agents may use DHCP message authentication as
   specified in [RFC3315]. DHCP delayed authentication has been
   deprecated, because of operational complexity in managing shared
   secret keys. Alternative methods using asymmetric keys are specified
   in [E-DHCP] and [S-DHCP6].

   An adversary that has access to the network could disrupt BGP
   sessions running in the network. To avoid that, TCP-AO is
   recommended for the BGP sessions.

13.  IANA Considerations

   TBD

14.  Acknowledgements

   The careful review and helpful suggestions of the following people
   significantly steered the direction of this document:

   Dhananjaya Rao

   Bernie Volz

   Robert Raszuk

15.  References

15.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3315]  Droms, R., Ed., Bound, J., Volz, B., Lemon, T., Perkins,
              C., and M. Carney, "Dynamic Host Configuration Protocol
              for IPv6 (DHCPv6)", RFC 3315, DOI 10.17487/RFC3315, July
              2003.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011.

15.2.  Informative References

   [E-DHCP]   Demerjian, J. and A. Serhrouchni, "DHCP Authentication
              Using Certificates", 2004.

   [I-D.ietf-6man-segment-routing-header]
              Filsfils, C., Previdi, S., Leddy, J., Matsushima, S., and
              D. Voyer, "IPv6 Segment Routing Header (SRH)", draft-ietf-
              6man-segment-routing-header-15 (work in progress), October
              2018.

   [I-D.ietf-netconf-zerotouch]
              Watsen, K., Abrahamsson, M., and I. Farrer, "Zero Touch
              Provisioning for Networking Devices", draft-ietf-netconf-
              zerotouch-25 (work in progress), September 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [S-DHCP6]  Su, Z., Ma, H., Zhang, X., and B. Zhang, "Secure DHCPv6
              that uses RSA authentication integrated with Self-
              Certified Address", 2011.

Authors' Addresses

   Jakob Heitz
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: jheitz@cisco.com

   Kausik Majumdar
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: kmajumda@cisco.com

   Acee Lindem
   Cisco
   301 Midenhall Way
   Cary, NC 27513
   USA

   Email: acee@cisco.com