idnits 2.17.00 (12 Aug 2021) /tmp/idnits41498/draft-van-beijnum-multi-mtu-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: Due to lack of neighbor discovery, with IPv4, it's necessary to use ARP to probe for non-standard MTU capabilities. This is done by simply probing with an ARP packet padded to the desired size. If a reply comes back, the neighbor supports the probed MTU size. A NODEMTU option MAY or MAY NOT be present in the last 8 bytes of the jumbo ARP message. Nodes MUST take care to include either a valid NODEMTU option or bytes that can't be mistaken for a NODEMTU option. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. 
If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 12, 2010) is 4331 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Unused Reference: 'RFC0826' is defined on line 549, but no explicit reference was found in the text == Unused Reference: 'RFC2461' is defined on line 560, but no explicit reference was found in the text == Unused Reference: 'RFC2462' is defined on line 564, but no explicit reference was found in the text == Unused Reference: 'RFC3315' is defined on line 570, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2461 (Obsoleted by RFC 4861) ** Obsolete normative reference: RFC 2462 (Obsoleted by RFC 4862) ** Obsolete normative reference: RFC 3315 (Obsoleted by RFC 8415) Summary: 3 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group I. van Beijnum 3 Internet-Draft IMDEA Networks 4 Intended status: Experimental July 12, 2010 5 Expires: January 13, 2011 7 Extensions for Multi-MTU Subnets 8 draft-van-beijnum-multi-mtu-03 10 Abstract 12 In the early days of the internet, many different link types with 13 many different maximum packet sizes were in use. For point-to-point 14 or point-to-multipoint links, there are still some other link types 15 (PPP, ATM, Packet over SONET), but multipoint subnets are now almost 16 exclusively implemented as ethernets. 
Even though the relevant 17 standards mandate a 1500 byte maximum packet size for ethernet, more 18 and more ethernet equipment is capable of handling packets bigger 19 than 1500 bytes. However, since this capability isn't standardized, 20 it is seldom used today, despite the potential performance benefits 21 of using larger packets. This document specifies mechanisms to 22 negotiate per-neighbor maximum packet sizes so that nodes on a 23 multipoint subnet may use the maximum mutually supported packet size 24 between them without being limited by nodes with smaller maximum 25 sizes on the same subnet. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on January 13, 2011. 44 Copyright Notice 46 Copyright (c) 2010 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. 
Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License.
59 Table of Contents
61 1. Introduction
62 2. Notational Conventions
63 3. Protocol messages and options
64 3.1. The ND/ARP NODEMTU option
65 3.2. The IPv6 ND padding option
66 3.3. IPv4 ethernet jumbo ARP message
67 3.4. Changes to the RA MTU option semantics
68 4. Operation
69 4.1. Managing neighbor MTUs
70 4.2. Host-to-host keepalives
71 4.3. Router-to-router keepalives
72 4.4. Host-to-router keepalives
73 4.5. Router-to-host keepalives
74 4.6. Determining the MTU
75 4.7. Probe considerations
76 4.8. Neighbor MTU garbage collection
77 5. The TCP MSS option
78 6. IANA considerations
79 7. Security considerations
80 8. Acknowledgements
81 9. References
82 9.1. Normative References
83 9.2. Informative References
84 Appendix A. Document and discussion information
85 Appendix B. About of larger packets
86 B.1. Delay and jitter
87 B.2. Path MTU Discovery problems
88 B.3. Packet loss through bit errors
89 B.4. Undetected bit errors
90 B.5. Interaction TCP congestion control
91 B.6. IEEE 802.3 compatibility
92 B.7. Conclusion
93 Author's Address
95 1. Introduction 97 Some protocols inherently generate small packets. Examples are VoIP, 98 where it's necessary to send packets frequently before much data can 99 be gathered to fill up the packet, and the DNS, where the queries are 100 inherently small and the returned results also rarely fill up a full 101 1500-byte packet. However, most data that is transferred across the 102 internet and private networks is part of long-lived sessions and 103 requires segmentation by a transport protocol, which is almost always 104 TCP. These types of data transfers can benefit from larger packets 105 in several ways: 107 1. A higher data-to-header ratio makes for fewer overhead bytes 109 2. Fewer packets means fewer per-packet operations on the source and 110 destination hosts 112 3. Fewer packets also means fewer per-packet operations in routers 113 and middleboxes 115 4. TCP performance increases with larger packet sizes 117 Even though today, the capability to use larger packets (often called 118 jumboframes) is present in a lot of ethernet hardware, this 119 capability typically isn't used because IP assumes a common MTU size 120 for all nodes connected to a link or subnet. In practice, this means 121 that using a larger MTU requires manual configuration of the non- 122 standard MTU size on all hosts and routers and possibly on switches 123 connected to a subnet. 
Also, the MTU size for a subnet is limited to 124 that of the least capable router, host or switch. 126 In the future, when hosts support packetization layer path MTU 127 discovery ([RFC4821], "Packetization Layer Path MTU Discovery") in 128 all relevant transport protocols, it will be possible to simply 129 ignore MTU limitations by sending at the maximum locally supported 130 size and determining the maximum packet size towards a correspondent 131 from acknowledgements that come back for packets of different sizes. 132 However, [RFC4821] must be implemented in every transport protocol, 133 and problems arise in the case where hosts implementing [RFC4821] 134 interact with hosts that don't implement this mechanism, but do use a 135 larger than standard MTU. 137 This document provides a set of mechanisms that allow the use of 138 larger packets between nodes that support them and that interact well 139 with both manually configured non-standard MTUs and expected future 140 [RFC4821] operation with larger MTUs. This is done using several new 141 options and messages for both IPv6 and IPv4: 143 1. A neighbor discovery option that allows nodes to inform their 144 neighbors of the maximum packet sizes they are prepared to 145 receive 147 2. An extension to the ARP packet format that allows nodes to inform 148 their neighbors of the maximum packet sizes they are prepared to 149 receive 151 3. A probe/verification message that allows nodes to determine 152 whether jumboframes can be received successfully by the next hop 154 Appendix B discusses several potential issues with larger packets, 155 such as head-of-line blocking delays, path MTU discovery black holes 156 and the strength of the CRC32 with increasing packet sizes. 158 2. Notational Conventions 160 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 161 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 162 document are to be interpreted as described in [RFC2119]. 
164 Note that this specification is not standards track, and as such, 165 can't overrule existing specifications. Whenever [RFC2119] language 166 is used, this must be interpreted within the context of this 167 specification: while the specification as a whole is optional and 168 non-standard, whenever it is implemented, such an implementation can 169 only function properly when all MUSTs are observed. 171 3. Protocol messages and options 173 3.1. The ND/ARP NODEMTU option 175 All MTU values are 32-bit unsigned integers in network byte order. 176 All other values are also unsigned and in network byte order. 177 Throughout this document, the term "MTU" is used to denote the maximum 178 packet size that can be sent or received. The term "MRU" (maximum 179 receive unit) is not used. The "standard MTU" or "standard maximum 180 size" refers to the MTU size specified in the IP-over-... or IPv6- 181 over-... document for the link used, which would be 1500 for 182 ethernet. 184 The MTU size and three flags are exchanged as an IPv6 neighbor 185 discovery option. The new option, as well as the MTU value it 186 advertises, are named "NODEMTU". For IPv4 operation, the NODEMTU 187 option is appended to ARP messages, with optional padding between the 188 ARP message and the MTU option. Upon reception of ARP messages, the 189 receiving node checks whether the ARP message is 8 or more bytes 190 longer than a standard ARP message. If so, the NODEMTU option is 191 ignored if the Type and Length fields contain values other than the 192 ones listed below, or if the MTU is smaller than the standard value 193 for the link type. 
195                          1                   2                   3
196      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
197     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
198     |     Type      |    Length     |R|L|        Reserved         |A|
199     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
200     |                            NODEMTU                            |
201     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

203     Type: TBD

205     Length: 1

207     R (router): Set to 1 if the node is a router, set to 0 if the node
208     is a host or routing functionality is currently disabled.

210     L (large packet detect): Set to 1 if the node is capable of
211     determining the largest size of packets recently received from a
212     link address, set to 0 if the node requires explicit probe
213     messages.

215     Reserved: Set to 0 on transmission, MUST be ignored on reception.

217     A (acknowledgment): Set to 1 if the node received a packet larger
218     than the interface MTU from the node this packet is addressed to
219     in the last 10 seconds.

221     NODEMTU: The maximum packet size the node wishes to receive on this
222     interface at this time.

224     When a node's interface speed changes, it MAY advertise adjusted per-
225     neighbor MTUs, but it SHOULD remain prepared to receive packets of
226     the maximum size indicated to neighbors previously (if this maximum
227     size is larger than the newly adjusted one).

229  3.2.  The IPv6 ND padding option

231     The format of the neighbor discovery padding option is as follows:

233      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
234     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
235     |     Type      |    Length     |           Reserved            |
236     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
237     |                            Padding                            |
238     ~                                                               ~
239     |                                                               |
240     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

242     Type: TBD

244     Length: see below.

246     Reserved: set to 0 on transmission, ignored on reception.

248     Padding: 0 or more all-zero bytes.
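As an illustration of the 8-byte NODEMTU option layout from Section 3.1, the following is a minimal sketch of packing and parsing the option. It is not part of the specification. The option Type is still TBD, so the value 0xFD below is a hypothetical placeholder, and the function and key names are invented for this example. The sketch assumes the R flag is the most significant bit of the third byte and the A flag the least significant bit of the fourth byte, as the bit diagram suggests. The decoder applies the ignore rule stated for ARP reception (wrong Type or Length, or an MTU below the standard value).

```python
import struct

# Hypothetical placeholder: the draft lists the option Type as "TBD".
NODEMTU_TYPE = 0xFD
NODEMTU_LENGTH = 1      # "Length: 1" in Section 3.1; the option is 8 bytes
STANDARD_MTU = 1500     # standard MTU for ethernet

FLAG_R = 0x8000         # router flag, first bit after the Length field
FLAG_L = 0x4000         # large packet detect flag
FLAG_A = 0x0001         # acknowledgment flag, last bit of the first word


def encode_nodemtu(mtu, router=False, large_detect=False, ack=False):
    """Pack Type, Length, the flag/reserved bits and the 32-bit NODEMTU."""
    flags = 0
    if router:
        flags |= FLAG_R
    if large_detect:
        flags |= FLAG_L
    if ack:
        flags |= FLAG_A
    # "!" selects network byte order: two bytes, one 16-bit and one 32-bit field.
    return struct.pack("!BBHI", NODEMTU_TYPE, NODEMTU_LENGTH, flags, mtu)


def decode_nodemtu(option, standard_mtu=STANDARD_MTU):
    """Return the option fields as a dict, or None if the option is ignored."""
    if len(option) != 8:
        return None
    opt_type, length, flags, mtu = struct.unpack("!BBHI", option)
    if opt_type != NODEMTU_TYPE or length != NODEMTU_LENGTH:
        return None
    if mtu < standard_mtu:
        return None
    return {
        "router": bool(flags & FLAG_R),
        "large_detect": bool(flags & FLAG_L),
        "ack": bool(flags & FLAG_A),   # reserved bits are not inspected
        "mtu": mtu,
    }
```

For IPv4, the same 8 bytes would be the trailing bytes of the (possibly padded) ARP message, so a receiver could call decode_nodemtu on the last 8 bytes of any ARP message that is at least 8 bytes longer than a standard one.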
250 There are two possible ways to determine the value of the length 251 field in the padding option: 253 1. Set it to 0. Since the option is in fact larger than 0, this 254 means that nodes that don't implement the option will silently 255 discard the packet. Setting the length to 0 makes it possible to 256 have packets with the padding option that aren't a multiple of 8 257 bytes long. Since there is now no way to determine where the 258 next option begins, if the length is set to 0, the padding option 259 MUST be the last option. 261 2. If the intended packet length allows a valid value for the length 262 field, the length field MAY be set to that value. The node MAY 263 reduce the size of the intended packet to accommodate the 264 requirement that the size field is a multiple of 8 bytes. I.e., 265 if the intended packet size is 4470 bytes with 40 and 24 bytes 266 for the IPv6 and neighbor solicitation headers, respectively, the 267 padding option would have to be 4406 bytes long, which can't be 268 expressed in the length field. The node may choose to use a 269 packet size of 4464 instead, which results in a length field 270 value of 550. This of course means that subsequent data packets 271 MUST be no larger than 4464 bytes. 273 Nodes that support probing MUST support reception of both types of 274 probes, but MAY be limited to generating only one type. 276 Since presumably, some equipment may react badly to a large number of 277 out-of-spec packets, it's important that nodes limit the number of 278 oversized packets to destinations that aren't yet known to be capable 279 of receiving them. An upper limit would be to allow only 5 280 unacknowledged oversized packets per 300 second period. 282 3.3. IPv4 ethernet jumbo ARP message 284 Due to lack of neighbor discovery, with IPv4, it's necessary to use 285 ARP to probe for non-standard MTU capabilities. This is done by 286 simply probing with an ARP packet padded to the desired size. 
If a 287 reply comes back, the neighbor supports the probed MTU size. A 288 NODEMTU option MAY or MAY NOT be present in the last 8 bytes of the 289 jumbo ARP message. Nodes MUST take care to include either a valid 290 NODEMTU option or bytes that can't be mistaken for a NODEMTU option. 292 3.4. Changes to the RA MTU option semantics 294 There may be an MTU option in IPv6 router advertisements. When this 295 option is present, hosts MUST use the value in this option only as a 296 replacement for the standard link MTU size. So for multicast packets 297 and packets sent to nodes for which there is no known NODEMTU, the 298 value in the MTU option is used as the maximum packet size. But if a 299 NODEMTU is known for a node on the link, the NODEMTU is used, NOT the 300 value in the RA MTU option. 302 4. Operation 304 Basic operation is as follows: nodes advertise their interface MTU in 305 a neighbor discovery option or in ARP messages. So for communication 306 between two nodes implementing this specification, each knows what 307 the maximum packet size is that the other node supports, so the 308 minimum of the local and the remote MTU is used when sending packets. 310 Unfortunately, there is the complication that layer 2 devices 311 (switches/bridges) may have a smaller maximum packet size, so packets 312 larger than the standard maximum size may be lost. In order to avoid 313 this issue, this memo specifies a "trust, but verify" approach: 314 whenever packets larger than the standard size are supported between 315 two nodes, each periodically verifies that the other is still capable 316 of receiving packets of the negotiated size. It does this by sending 317 a probe message of the maximum negotiated size, followed by a verification message. If the 318 verification fails, the node sends a new neighbor advertisement with 319 a reduced MTU size. 321 A number of optimizations reduce the amount of signaling traffic 322 where possible. Most of the optimizations are optional. 
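The size selection described in Section 3.4 and in the basic operation above can be sketched as follows. This is an illustrative sketch only: the function and parameter names are invented for the example, and it deliberately ignores the probing and verification state that the rest of Section 4 layers on top of this choice.

```python
STANDARD_MTU = 1500  # standard MTU for ethernet


def max_packet_size(local_mtu, neighbor_nodemtu=None, ra_mtu=None,
                    multicast=False, standard_mtu=STANDARD_MTU):
    """Pick the maximum packet size to use towards one destination.

    local_mtu        -- the MTU this node advertises in its own NODEMTU option
    neighbor_nodemtu -- NODEMTU learned from the neighbor, or None if unknown
    ra_mtu           -- value of the RA MTU option, or None if absent
    multicast        -- True when the destination is a multicast address
    """
    # The RA MTU option only replaces the standard link MTU.
    fallback = ra_mtu if ra_mtu is not None else standard_mtu
    if multicast or neighbor_nodemtu is None:
        return fallback
    # A known NODEMTU takes precedence over the RA MTU option; the sender
    # uses the minimum of the local and the remote value.
    return min(local_mtu, neighbor_nodemtu)
```

For example, a node with a local MTU of 9000 talking to a neighbor that advertised 4470 would use 4470, while multicast traffic on the same link would stay at the RA MTU value or, without an RA MTU option, at the standard 1500 bytes.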
324 In the case of a host implementing packetization layer path MTU 325 discovery [RFC4821] for all transport protocols that can generate 326 packets larger than the standard size, the use of outgoing probe and 327 verification messages is unnecessary. However, such a host MUST 328 still process incoming probe and verification messages. 330 The first optimization is for senders that have the capability to 331 determine whether they sent packets that are larger than the standard 332 size, to only send a probe message and a verification message when a 333 data packet larger than the standard size was sent recently. 335 A further optimization may be applied when both the sender and the 336 receiver have the capability to determine whether they sent/received 337 data packets that are larger than the standard size. If both the 338 sender and the receiver have this capability, as indicated by flags 339 in the neighbor discovery option, no explicit MTU probe messages are 340 sent, just a verification message. 342 A final optimization applies between a host and a router. In that 343 case, the host may assume the responsibility for probing with large 344 packets in both directions. This reduces the control channel 345 processing on the router, and allows for the possibility to forgo 346 probing when there are no active transport sessions that are capable 347 of generating larger than standard packets. 349 4.1. Managing neighbor MTUs 351 The following does not apply to hosts that support [RFC4821] or a 352 similar mechanism for all transport protocols that can send larger 353 than standard packets. For instance, if a host implements [RFC4821] 354 for TCP and limits UDP packets and packets using other transport 355 protocols to 1500 bytes on its ethernet interface, the host is not 356 required to perform any probing and per-destination path MTUs can be 357 maintained at the TCP level. It must still respond to incoming 358 probes. Routers are never exempt from what follows. 
360 Along with neighbors' link addresses, a node caches an MTU value for 361 each neighbor. This value starts out being undefined. Whenever a 362 packet must be sent (this includes packets that are forwarded), the 363 node consults the neighbor MTU cache. If the cached value is 364 undefined, it applies the interface MTU that is in effect on MTU- 365 related actions such as fragmentation or the generation of "too big" 366 messages. 368 Whenever a value is entered into the neighbor MTU cache, this value 369 is marked as "tentative" and the node MUST start a clock that times 370 out after 500 milliseconds. After 500 milliseconds (or less, 371 depending on the implementation), if the value in the MTU cache is 372 still tentative, it reverts back to being undefined. Values enter 373 the neighbor cache after receiving the NODEMTU option in ND or ARP 374 messages. In this case, the cache is initialized with the minimum of 375 the local and remote NODEMTU values and the clock is started. If no 376 such option is present in ND or ARP messages, the node may insert the 377 standard MTU value + 1 rounded up to the nearest multiple of 8. In 378 this case, the clock is also started. If timer resources are not 379 available, the neighbor MTU value remains undefined. 381 At this point, the node sends a probe message that is the size of the 382 value in the neighbor MTU cache. If this probe message is answered, 383 the neighbor's MTU value in the neighbor MTU cache is marked as 384 "valid" and the timer is stopped. 386 If the probe wasn't answered, or probing started from just above the 387 standard MTU, after some time (such as 30 seconds) a new probe MAY be 388 sent. For unanswered probes, new probes are smaller; for answered 389 probes the new probe is larger. If the new probe is answered, the 390 size of the probe is entered in the neighbor MTU cache as a "valid" 391 value. 
Probing MAY continue for several iterations, but implementers 392 are encouraged to limit probing rather than exhaustively search for 393 the exact supported neighbor MTU value. 395 4.2. Host-to-host keepalives 397 When hosts have active communication sessions with other hosts on the 398 same subnet, they send periodic probes to determine whether large 399 packets continue to be received. Hosts MUST NOT send probes when 400 there are no active communication sessions and SHOULD NOT send probes 401 when there are no active communication sessions that support larger 402 than standard packet sizes. For instance, if the host only supports 403 larger-than-standard packet sizes over TCP, and there are no TCP 404 sessions where the remote host indicated that it supports larger- 405 than-standard packet sizes through the MSS option, probing SHOULD NOT 406 be performed. 408 The probe interval is randomized between 8 and 10 seconds. A host 409 SHOULD NOT send probes if it has not sent any packets larger than the 410 interface MTU size during the previous probe interval. Probes are 411 ARP or neighbor solicitation messages padded to the cached neighbor 412 MTU size. A timer is initialized to 21 seconds (or a slightly larger 413 value, with a maximum of 35 seconds) when a probe is sent. The timer 414 is NOT reinitialized when new probes are sent. The timer is stopped 415 when a probe is answered by an ARP reply or neighbor advertisement. 416 This message does not have to be padded. If the timer is not stopped 417 by an incoming probe reply and it expires, the neighbor MTU cache is 418 cleared and becomes undefined. The next probe is sent no earlier 419 than 8 seconds after the last ARP or neighbor advertisement from the 420 neighbor has been received. 422 The following is optional: 424 If the neighbor included the L flag set to 1 in its NODEMTU option, 425 the host MAY send probes as regular ARP or neighbor solicitation 426 packets, without padding. 
In this case, ARP replies or neighbor 427 advertisements are only considered valid probe replies when they have 428 a NODEMTU option with the A flag set to 1. 430 If the host detected that it received a packet larger than the 431 interface MTU in the last 7 seconds, it MAY send an unsolicited probe 432 reply, which consists of an ARP reply or a neighbor advertisement. 434 Replies to probes SHOULD and unsolicited probe replies MUST have a 435 NODEMTU option with the A bit set. 437 4.3. Router-to-router keepalives 439 The probe interval is randomized between 8 and 10 seconds. A router 440 SHOULD NOT send probes if it has not sent any packets larger than the 441 interface MTU size during the previous probe interval. Probes are 442 ARP or neighbor solicitation messages padded to the cached neighbor 443 MTU size. A timer is initialized to 21 seconds (or a slightly larger 444 value, with a maximum of 35 seconds) when a probe is sent. The timer 445 is NOT reinitialized when new probes are sent. The timer is stopped 446 when a probe is answered by an ARP reply or neighbor advertisement. 447 This message MUST NOT be padded. If the timer is not stopped by an 448 incoming probe reply and it expires, the neighbor MTU cache is 449 cleared and becomes undefined. 451 4.4. Host-to-router keepalives 453 The probe interval is randomized between 8 and 10 seconds. Probes 454 are ARP or neighbor solicitation messages padded to the cached 455 neighbor MTU size. The destination IP address is the host's own IP 456 address, the link address is the router's link address. A timer is 457 initialized to 21 seconds (or a slightly larger value, with a maximum 458 of 35 seconds) when a probe is sent. The timer is NOT reinitialized 459 when new probes are sent. The timer is stopped when a probe is 460 answered by an ARP reply or neighbor advertisement. This message 461 MUST be the padded message originally sent by the host itself. 
If 462 the timer is not stopped by an incoming probe reply and it expires, 463 the neighbor MTU cache is cleared and becomes undefined. Also, a 464 neighbor advertisement or ARP reply is sent with a NODEMTU option 465 that contains the current interface MTU. After some time, such as 15 466 minutes, the host MAY attempt probing for larger than standard MTU 467 sizes again. 469 4.5. Router-to-host keepalives 471 Routers do not send keepalives to hosts. Routers MUST adjust their 472 cached neighbor MTU value based on the NODEMTU option in unsolicited 473 neighbor advertisements or ARP replies. 475 4.6. Determining the MTU 477 Nodes SHOULD NOT blindly advertise the maximum MTU that their 478 hardware is capable of. On slow links, a large MTU can easily reduce 479 performance. In general, hosts SHOULD limit the MTU they advertise 480 and impose on packets they send to the standard MTU size on links 481 operating at speeds of 50 Mbps or slower. On links operating at 482 speeds of 500 Mbps and higher, MTUs of 9000 bytes or even as large as 483 64 kilobytes presumably won't cause problems. Between 50 and 500 484 Mbps, larger than standard MTUs SHOULD be used with care. For 485 instance, by limiting MTUs to 9000 bytes and only on full duplex 486 links with low bit error rates (which would exclude wireless links). 488 4.7. Probe considerations 490 In cases where the neighbor's MTU was advertised in a NODEMTU option, 491 it makes sense to try with this size or the local MTU, whichever is 492 smaller. If that probe fails or the neighbor's MTU is unknown, the 493 best choice for a probe size would be the smallest possible non- 494 standard MTU. This could be the StandardMTU + 1, or a slightly 495 larger value that represents the first larger size that is actually 496 useful, such as 1508 or 1520 for ethernet. Failure at this size 497 wastes relatively little bandwidth and indicates that further probes 498 are unnecessary. 
If this probe is successful, further choices for 499 the probe size may be common MTU sizes such as 1508, 1530, 1536, 500 1546, 1998, 2000, 2018, 4464, 4470, 8092, 8192, 9000, 9176, 9180, 501 9216, 16384, 17976, 64000 and 65280 bytes. (These values were 502 observed in vendor documentation and hands-on experience.) 504 A further consideration is that there is little value in sending many 505 probes to discover a few extra bytes of MTU, and using multiples of 8 506 bytes may streamline copying of data and make IPv6 probing easier. 507 So values to test if the NODEMTU fails but 1504 succeeds could be 508 1992, 4464, 8088, 9000, 16384 and 64000. 510 Probes MUST be sent as unicast. 512 4.8. Neighbor MTU garbage collection 514 The MTU size for a neighbor is garbage collected along with a 515 neighbor's link address in accordance with regular ARP and neighbor 516 discovery timeouts. Additionally, a neighbor's MTU size is reset to 517 unknown after dead neighbor detection declares a neighbor "dead". 519 5. The TCP MSS option 521 Hosts SHOULD advertise the maximum MTU size they are prepared to use 522 on a link in the TCP MSS value, even during times when probing has 523 failed: should larger neighbor MTUs be established later, it will not 524 be possible to adjust the MSS for ongoing sessions. 526 6. IANA considerations 528 IANA is requested to assign two neighbor discovery option type 529 values. 531 [TO BE REMOVED: This registration should take place at the following 532 location: http://www.iana.org/assignments/icmpv6-parameters] 534 7. Security considerations 536 Generating false neighbor discovery and ARP packets with large MTUs 537 may lead to a denial-of-service condition, just like the advertisement 538 of other false link parameters. 540 8. Acknowledgements 542 This document benefited from feedback by Dave Thaler, Jari Arkko, Joe 543 Touch and others. 545 9. References 547 9.1. 
Normative References 549 [RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or 550 converting network protocol addresses to 48.bit Ethernet 551 address for transmission on Ethernet hardware", STD 37, 552 RFC 826, November 1982. 554 [RFC0894] Hornig, C., "Standard for the transmission of IP datagrams 555 over Ethernet networks", STD 41, RFC 894, April 1984. 557 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 558 Requirement Levels", BCP 14, RFC 2119, March 1997. 560 [RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor 561 Discovery for IP Version 6 (IPv6)", RFC 2461, 562 December 1998. 564 [RFC2462] Thomson, S. and T. Narten, "IPv6 Stateless Address 565 Autoconfiguration", RFC 2462, December 1998. 567 [RFC2464] Crawford, M., "Transmission of IPv6 Packets over Ethernet 568 Networks", RFC 2464, December 1998. 570 [RFC3315] Droms, R., Bound, J., Volz, B., Lemon, T., Perkins, C., 571 and M. Carney, "Dynamic Host Configuration Protocol for 572 IPv6 (DHCPv6)", RFC 3315, July 2003. 574 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 575 Discovery", RFC 4821, March 2007. 577 9.2. Informative References 579 [CRC] Jain, R., "Error Characteristics of Fiber Distributed Data 580 Interface (FDDI), IEEE Transactions on Communications", 581 August 1990. 583 Appendix A. Document and discussion information 585 The latest version of this document will always be available at 586 http://www.muada.com/drafts/. Please direct questions and comments 587 to the int-area mailinglist or directly to the author. 589 Appendix B. About of larger packets 591 Although often desirable, the use of larger packets isn't universally 592 advantageous for the following reasons: 594 1. Increased delay and jitter 596 2. Increased reliance on path MTU discovery 598 3. Increased packet loss through bit errors 600 4. Increased risk of undetected bit errors 602 B.1. 
Delay and jitter 604 An low-bandwidth links, the additional time it takes to transmit 605 larger packets may lead to unacceptable delays. For instance, 606 transmitting a 9000-byte packet takes 7.23 milliseconds at 10 Mbps, 607 while transmitting a 1500-byte packet takes only 1.23 ms. Once 608 transmission of a packet has started, additional traffic must wait 609 for the transmission to finish, so a larger maximum packet size 610 immediately leads to a higher worst-case head-of-line blocking delay, 611 and thus, to a bigger difference between the best and worst cases 612 (jitter). The increase in average delay depends on the number of 613 packets that are buffered, the average packet size and the queuing 614 strategy in use. Buffer sizes vary greatly between implementations, 615 from only a few buffers in some switches and on low-speed interfaces 616 in routers, to hundreds of megabytes of buffer space on 10 Gbps 617 interfaces in some routers. 619 If we assume that the delays involved with 1500-byte packets on 100 620 Mbps ethernet are acceptable for most, if not all, applications, then 621 the conclusion must be that 15000-byte packets on 1 Gbps ethernet 622 should also be acceptable, as the delay is the same. At 10 Gbps 623 ethernet, much larger packet sizes could be accommodated without 624 adverse impact on delay-sensitive applications. At below 100 Mbps, 625 larger packet sizes are probably not advisable. 627 B.2. Path MTU Discovery problems 629 PMTUD issues arise when routers can't fragment packets in transit 630 because the DF bit is set or because the packet is IPv6, but the 631 packet is too large to be forwarded over the next link, and the 632 resulting "packet too big" ICMP messages from the router don't make 633 it back to the sending host. If there is a PMTUD black hole, this 634 will typically happen when there is an MTU bottleneck somewhere in 635 the middle of the path. 
   If the MTU bottleneck is located at either
   end, the TCP MSS (maximum segment size) option makes sure that TCP
   packets conform to the smallest MTU in the path. PMTUD problems are
   of course possible with non-TCP protocols, but this is rare in
   practice because non-TCP protocols are generally not capable of
   adjusting their packet size on the fly and therefore use more
   conservative packet sizes which won't trigger PMTUD issues.

   Taking the delay and jitter issues to heart, maximum packet sizes
   should be larger for faster links and smaller for slower links. This
   means that in the majority of cases, the MTU bottleneck will tend to
   be at, or close to, one of the ends of a path, rather than somewhere
   in the middle: in today's internet, the core of the network is
   quite fast, while users usually connect to the core at lower speeds.

   A crucial difference between PMTUD problems that result from MTUs
   smaller than the de facto standard 1500 bytes and PMTUD problems that
   result from MTUs larger than 1500 bytes is that in the latter case,
   only the party that's actually using the non-standard MTU is
   affected. This puts the potential problems, the potential benefits
   and the ability to solve any resulting problems in the same place:
   it's always possible to revert to a 1500-byte MTU if PMTUD problems
   can't be resolved otherwise.

   Considering the above and the work that's going on in the IETF to
   resolve PMTUD issues as they exist today, increasing MTUs where
   desired doesn't involve undue risks.

B.3.  Packet loss through bit errors

   All transmission media are subject to bit errors. In many cases, a
   bit error leads to a CRC failure, after which the packet is lost. In
   other cases, packets are retransmitted a number of times, but if
   error conditions are severe, packets may still be lost because an
   error occurred at every try.
   Using larger packets means that the
   chance of a packet being lost due to errors increases. And when a
   packet is lost, more data has to be retransmitted.

   Both per-packet overhead and loss through errors reduce the amount of
   usable data transferred. The optimum tradeoff is reached when both
   types of loss are equal. If we make the simplifying assumption that
   the relationship between the bit error rate of a medium and the
   resulting number of lost packets is linear with packet size for
   reasonable bit error rates, the optimum packet size is computed as
   follows:

      packet size = sqrt( overhead bytes / bit error rate )

   Here the packet size is in bytes, so the error rate has to be
   expressed per byte, which is 8 times the per-bit BER.

   According to this, the optimum packet size is one or more orders of
   magnitude larger than what's commonly used today. For instance, the
   maximum BER for 1000BASE-T is 10^-10, which implies an optimum packet
   size of 312250 bytes with ethernet framing and IP overhead. (This
   figure corresponds to 78 bytes of overhead and an error rate of
   8 * 10^-10 per byte.)

B.4.  Undetected bit errors

   Nearly all link layers employ some kind of checksum to detect bit
   errors so that packets with errors can be discarded. In the case of
   ethernet, this is a frame check sequence in the form of a 32-bit CRC.
   Assuming a strong frame check sequence algorithm, a 32-bit checksum
   suggests that there is a 1 in 2^32 chance that a packet with one or
   more bit errors in it has the same checksum as the original packet,
   so the bit errors go undetected and data is corrupted. However,
   according to [CRC] the CRC-32 that's used for FDDI and ethernet has
   the property that packets between 376 and 11454 bytes long
   (inclusive) have a Hamming distance of 3. (Smaller packets have a
   larger Hamming distance, larger packets a smaller Hamming distance.)
   As a result, all errors where only one or two bits are flipped will
   be detected, because they can't result in the same CRC as the
   original packet.
   The probability of a packet having
   undetected bit errors can be approximated as follows for a 32-bit
   CRC:

      PER = (PL * BER) ^ H / 2^32

   Where PER is the packet error rate, BER is the bit error rate, PL is
   the packet length in bits and H is the Hamming distance. Another
   consideration is the impact of packet length on a multi-packet
   transmission of a given size. This would be:

      TER = (transmission length / PL) * PER

   So

      TER = transmission length * PL ^ (H - 1) * BER ^ H / 2^32

   Where TER is the transmission error rate.

   In the case of the ethernet FCS and a Hamming distance of 3 for a
   large range of packet sizes, this means that the risk of undetected
   errors goes up with the square of the packet length, but goes down
   with the third power of the bit error rate. This suggests that for a
   given acceptable risk of undetected errors, a maximum packet size can
   be calculated from the expected bit error rate. It also suggests
   that given the low bit error rates mandated for gigabit ethernet,
   packet sizes of up to 11454 bytes should be acceptable.

   Additionally, unlike properties such as the packet length, the frame
   check sequence can be made dependent on the physical media, so it
   should be possible to define a stronger FCS in future ethernet
   standards, or to negotiate a stronger FCS between two stations on a
   point-to-point ethernet link (i.e., a host and a switch or a router
   and a switch).

B.5.  Interaction with TCP congestion control

   TCP throughput is inversely proportional to the square root of the
   packet loss probability. Using larger and thus fewer packets is
   therefore a competitive advantage. Larger packets increase
   burstiness, which can be problematic in some circumstances. Larger
   packets also allow TCP to ramp up its transmission speed faster,
   which is helpful on fast links, where large packets will be more
   common.
   In general, it
   would seem advantageous for an individual user to use larger packets,
   but under some circumstances, users using smaller packets may be put
   at a slight disadvantage.

B.6.  IEEE 802.3 compatibility

   According to the IEEE 802.3 standard, the field following the
   ethernet addresses is a length field. However, [RFC0894] uses this
   field as a type field. Ambiguity is largely avoided by numbering
   type codes above 2048. The mechanisms described in this memo only
   apply to the standard [RFC0894] and [RFC2464] encapsulation of IPv4
   and IPv6 in ethernet, not to possible encapsulations of IPv4 or IPv6
   in IEEE 802.3/IEEE 802.2 frames, so there is no change to the current
   use of the ethernet length/type field.

B.7.  Conclusion

   Larger packets aren't universally desirable. The factors that play
   into the decision to use larger packets include:

   o  A link's bit error rate

   o  The number of bits per symbol on a link and hence the likelihood
      of multiple bit errors in a single packet

   o  The strength of the frame check sequence

   o  The link speed

   o  The number of buffers

   o  Queuing strategy

   o  Number of sessions on shared links and paths

   This means that choosing a good maximum packet size is, initially at
   least, the responsibility of hardware builders, and may also be of
   interest to ISPs.

Author's Address

   Iljitsch van Beijnum
   IMDEA Networks
   Avda. del Mar Mediterraneo, 22
   Leganes, Madrid 28918
   Spain

   Email: iljitsch@muada.com