idnits 2.17.00 (12 Aug 2021)

/tmp/idnits65492/draft-van-beijnum-multi-mtu-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 725.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 736.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 743.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 749.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 29, 2007) is 5379 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2464' is mentioned on line 494, but not defined

  == Unused Reference: 'RFC2119' is defined on line 684, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2462' is defined on line 691, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2461 (Obsoleted by RFC 4861)

  ** Obsolete normative reference: RFC 2462 (Obsoleted by RFC 4862)

     Summary: 5 errors (**), 0 flaws (~~), 6 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Network Working Group                                     I. van Beijnum
3    Internet-Draft                                                Consultant
4    Expires: February 29, 2008                               August 29, 2007

6                     IPv6 Extensions for Multi-MTU Subnets
7                        draft-van-beijnum-multi-mtu-01

9    Status of this Memo

11   By submitting this Internet-Draft, each author represents that any
12   applicable patent or other IPR claims of which he or she is aware
13   have been or will be disclosed, and any of which he or she becomes
14   aware will be disclosed, in accordance with Section 6 of BCP 79.
16   Internet-Drafts are working documents of the Internet Engineering
17   Task Force (IETF), its areas, and its working groups. Note that
18   other groups may also distribute working documents as
19   Internet-Drafts.

21   Internet-Drafts are draft documents valid for a maximum of six months
22   and may be updated, replaced, or obsoleted by other documents at any
23   time. It is inappropriate to use Internet-Drafts as reference
24   material or to cite them other than as "work in progress."

26   The list of current Internet-Drafts can be accessed at
27   http://www.ietf.org/ietf/1id-abstracts.txt.

29   The list of Internet-Draft Shadow Directories can be accessed at
30   http://www.ietf.org/shadow.html.

32   This Internet-Draft will expire on February 28, 2008.

34   Copyright Notice

36   Copyright (C) The IETF Trust (2007).

38   Abstract

40   In the early days of the internet, many different link types with many
41   different maximum packet sizes were in use. For point-to-point or
42   point-to-multipoint links, there are still some other link types (PPP,
43   ATM, Packet over SONET), but shared subnets are almost exclusively
44   implemented as ethernets. Even though the relevant standards mandate a
45   1500 octet maximum packet size for ethernet, more and more ethernet
46   equipment is capable of handling packets bigger than 1500 octets.
47   However, since this capability isn't standardized, it's seldom used
48   today, despite the potential performance benefits of using larger
49   packets. This document specifies a mechanism to negotiate per-neighbor
50   maximum packet sizes so that nodes on a shared subnet may use the
51   maximum mutually supported packet size between them without being
52   limited by nodes with smaller maximum sizes on the same subnet.

54   1 Introduction

56   Some protocols inherently generate small packets.
Examples are VoIP,
57   where it's necessary to send packets frequently before much data can
58   be gathered to fill up the packet, and the DNS, where the queries are
59   inherently small and the returned results also rarely fill up a full
60   1500-octet packet. However, most data that is transferred across the
61   internet and private networks is at least several kilobytes in size
62   (often much larger) and requires segmentation by TCP or another
63   transport protocol. These types of data transfer can benefit from
64   larger packets in several ways:

66   1. A higher data-to-header ratio makes for fewer overhead bytes

68   2. Fewer packets means fewer per-packet operations on the source and
69      destination hosts

71   3. Fewer packets also means fewer per-packet operations in routers and
72      middleboxes

74   4. TCP performance tends to increase with larger packet sizes

76   Even though the capability to use larger packets (often called
77   jumbo frames) is present in a lot of ethernet hardware today, this
78   capability isn't used because IP assumes a common MTU size for all
79   nodes connected to a link or subnet. In practice, this means that
80   using a larger MTU requires manual configuration of the
81   non-standard MTU size on all hosts and routers and possibly on
82   switches. Also, the MTU size for a subnet is limited to that of
83   the least capable router, host or switch.

85   This document proposes to end this situation using several new
86   options and messages:

88   1. An additional router advertisement MTU option to limit higher
89      maximum packet sizes

91   2. A neighbor discovery option that allows nodes to inform their
92      neighbors of the maximum packet size they support

94   3. A neighbor discovery option for padding messages to make them
95      suitable for probing a neighbor's MTU and link-layer MTU
96      limitations
97   4.
Padding for ARP messages to make them suitable for probing a
98      neighbor's MTU and link-layer MTU limitations

100  2 Terminology

102  Local MTU:
103     The maximum packet size considered usable on an interface,
104     based on the physical MTU, the MTU advertised by routers and
105     administrative settings.

107  MTU:
108     Maximum Transmission Unit. This is the maximum IP packet size in
109     octets supported on a link, towards a neighbor or towards a remote
110     correspondent. In some cases, the term MRU (maximum receive unit)
111     would be more appropriate, but for consistency, the term MTU is
112     used throughout this document.

114  Neighbor MTU:
115     The maximum packet size that may be used towards a given
116     on-link neighbor.

118  Node:
119     A host or router running IPv4 or IPv6.

121  Oversized packet:
122     A packet exceeding the size defined in the relevant
123     IPv6-over-... or IP-over-... RFC.

125  Physical MTU:
126     The MTU reported by the driver for an interface when operating at
127     a given link speed.

129  Probe:
130     An ARP or neighbor solicitation packet of a specific (oversized)
131     size sent for the purpose of determining whether a neighbor can
132     successfully receive packets of this size sent by the local node.

134  3 Disadvantages of larger packets

136  Although often desirable, the use of larger packets isn't universally
137  advantageous for the following reasons:

139  1. Increased delay and jitter
140  2. Increased reliance on path MTU discovery
141  3. Increased packet loss through bit errors
142  4. Increased risk of undetected bit errors
143  3.1 Delay and jitter

145  On low-bandwidth links, the additional time it takes to transmit
146  larger packets may lead to unacceptable delays. For instance,
147  transmitting a 9000-octet packet takes 7.23 milliseconds at 10 Mbps,
148  while transmitting a 1500-octet packet takes only 1.23 ms.
Once
149  transmission of a packet has started, additional traffic must wait for
150  the transmission to finish, so a larger maximum packet size
151  immediately leads to a higher worst-case head-of-line blocking delay,
152  and as such, to a bigger difference between the best and worst cases
153  (jitter). The increase in average delay depends on the number of
154  packets that are buffered, the average packet size and the queuing
155  strategy in use. Buffer sizes vary greatly, but assuming 40 buffers
156  (not uncommon) leads to the following results:

158  Speed           500     1500     4500     9000    16384    65535

160  10 Mbps       17.22    49.21   145.22   289.22   525.50  2098.34
161  100 Mbps       1.72     4.92    14.52    28.92    52.55   209.83
162  1 Gbps         0.17     0.49     1.45     2.89     5.26    20.98
163  10 Gbps        0.02     0.05     0.15     0.29     0.53     2.10

165  In milliseconds, by packet size in octets, counting 38 additional
166  octets of ethernet overhead per packet.

168  If we assume that the delays involved with 1500-octet packets on 100
169  Mbps ethernet are acceptable for most, if not all, applications, then
170  the conclusion must be that 9000-octet packets on 1 Gbps ethernet
171  should also be acceptable. At 10 Gbps ethernet, much larger packet
172  sizes could be accommodated without adverse impact on delay-sensitive
173  applications. Below 100 Mbps, larger packet sizes are probably not
174  advisable.

176  3.2 Path MTU Discovery problems

178  PMTUD issues arise when routers can't fragment packets in transit
179  because the DF bit is set or because the packet is IPv6, but the
180  packet is too large to be forwarded over the next link, and the
181  resulting "packet too big" ICMP messages from the router don't make it
182  back to the sending host. This will typically happen when there is an
183  MTU bottleneck somewhere in the middle of the path. If the MTU
184  bottleneck is located at either end, the TCP MSS (maximum segment
185  size) option makes sure that TCP packets conform to the limited MTU.
186  PMTUD problems are of course possible with non-TCP protocols, but this
187  is rare in practice.

189  Taking the delay and jitter issues to heart, maximum packet sizes
190  should be larger for faster links. This means that in the majority of
191  cases, the MTU bottleneck will tend to be at one of the ends of a
192  path, rather than somewhere in the middle.

194  A crucial difference between PMTUD problems that result from MTUs
195  smaller than the standard 1500 octets and PMTUD problems that result
196  from MTUs larger than the standard 1500 octets is that in the latter
197  case, only a party that's actually using the non-standard MTU is
198  affected. This puts potential problems and potential benefits in the
199  same place so it's always possible to revert to a 1500-octet MTU if
200  PMTUD problems can't be resolved otherwise.

202  Considering the above and the work that's going on in the IETF to
203  resolve PMTUD issues as they exist today, increasing MTUs
204  where desired doesn't involve undue risks.

206  3.3 Packet loss through bit errors

208  All transmission media are subject to bit errors. In many cases, a bit
209  error leads to a CRC failure, after which the packet is lost. In other
210  cases, packets are retransmitted a number of times, but if error
211  conditions are severe, packets may still be lost because an error
212  occurred at every try. Using larger packets means that the chance of a
213  packet being lost due to errors increases. And when a packet is lost,
214  more data has to be retransmitted.

216  Both per-packet overhead and loss through errors reduce the amount of
217  usable data transferred. The optimum tradeoff is reached when both
218  types of loss are equal.
If we make the simplifying assumption that
219  the relationship between the bit error rate of a medium and the
220  resulting number of lost packets is linear with packet size, the
221  optimum packet size is computed as follows:

223  packet size in bits = sqrt(overhead in bits / bit error rate)

225  For IPv6 in ethernet framing, with 14 octets of ethernet header, 40
226  octets of IPv6 header, 20 octets of TCP header and 32 bits of ethernet
227  CRC the total number of octets transmitted is 1538 while the useful
228  data is 1440. (The preamble and inter frame gap are not relevant for
229  error rate purposes.) 78 octets of overhead would result in a
230  1518-octet frame length for a bit error rate of 10^-5.3.

232  Note that the minimum BER for 1000BASE-T is 10^-10, which implies an
233  optimum packet size of 312250 octets.

235  In practice, it's better to err on the side of smaller packets and
236  lower packet loss to avoid triggering TCP congestion mechanisms.
237  However, it's obvious that current maximum packet sizes are far below
238  the optimum size with respect to optimum throughput.

240  3.4 Undetected bit errors

242  Nearly all link layers employ some kind of checksum to detect bit
243  errors so that packets with errors can be discarded. In the case of
244  ethernet, this is a frame check sequence in the form of a 32-bit CRC.
245  The error detecting properties of the CRC are twofold: the minimum
246  Hamming distance and the statistical unlikeliness of two packets
247  resulting in the same CRC. Depending on the size of the packet, there
248  is a minimum Hamming distance between two possible packets that result
249  in the same CRC. For ethernet packets between 376 and 11454 octets
250  long (inclusive), the Hamming distance is 3 [CRC]. So all packets
251  where transmission errors resulted in one or two flipped bits are
252  detected.
If 3 or more bits are flipped, most errors are caught
253  because only in very few cases does the new bit pattern result in the
254  same CRC as the old bit pattern. In theory, the chance of two
255  packets having the same CRC-32 is 1 in 2^32, but this assumes the
256  CRC is as strong as it possibly could be.

258  It has been suggested that increasing packet lengths reduce the
259  effectiveness of the CRC-32. For the statistical aspect of the CRC,
260  this isn't true. Again, assuming a linear relationship between the
261  likelihood of bit errors in a packet and the bit error rate, doubling
262  the packet size means doubling the chance of a given number of bit
263  errors in the packet. In turn, this doubles the chance of a packet
264  with bit errors going undetected by the CRC. However, because the
265  packet is twice as long, only half the number of packets is required
266  to transmit any given amount of data. These aspects cancel each other
267  out so the probability of undetected errors occurring in any given
268  data transfer doesn't vary with packet size when only considering the
269  statistical properties of the CRC.

271  Obviously, choosing a packet size that leads to a reduced Hamming
272  distance greatly increases the risk of undetected bit errors. However,
273  even choosing a larger packet size with a Hamming distance of 3 leads
274  to a reduction in error detection strength.
The likelihood of a packet
275  having enough bit errors to satisfy a given Hamming distance (packet
276  error rate) and then generate the same CRC is:

278  PER = (packet length in bits * BER) ^ H / 2^32

280  The likelihood of a packet with enough bit errors to meet the Hamming
281  distance and then generate an identical CRC in a transmission of a
282  certain number of bits is:

284  TER = transmission length / packet length * PER

286  In other words:

288  TER = transmission length * packet length ^ (H - 1) * BER ^ H / 2^32
289  (Hence the irrelevance of the packet length for a Hamming distance of
290  1.)

292  For a 400 GB (approximately one hour) transmission over 1000BASE-T
293  with a BER of 10^-10 and a 1518-octet ethernet frame length this
294  means:

296  TER = 3.44*10^12 * 12144 ^ 2 * (10^-10) ^ 3 / 2^32 = 1.18*10^-19

298  For 11454-octet packets this becomes:

300  TER = 3.44*10^12 * 91632 ^ 2 * (10^-10) ^ 3 / 2^32 = 6.73*10^-18

302  Please note that this is almost 12 orders of magnitude better than the
303  naive assumption of a Hamming distance of 1 suggests for standard
304  1518-octet ethernet frames:

306  TER = 3.44*10^12 * 12144 ^ 0 * (10^-10) ^ 1 / 2^32 = 8.01*10^-8

308  So the strength of the CRC, assuming a Hamming distance of 3, goes
309  down with the square of the factor by which the packet length is
310  increased. And it goes down with the third power of any increase of
311  the bit error rate. However, this discussion is largely academic
312  because of the assumption that bit errors happen in isolation. For
313  instance, 1000BASE-T transmits two bits per symbol over four wire
314  pairs, so bit errors are much more likely to (at least) happen in
315  pairs rather than isolated.

317  Also, it should be possible to implement stronger frame check
318  sequences for newer versions of ethernet. Unlike the packet length,
319  the FCS is something switches can change when interconnecting
320  different types of ethernet without harming interoperability.
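The PER and TER definitions in section 3.4 can be checked numerically. The following is a minimal sketch in Python that evaluates them as written, with TER computed as the number of packets in the transmission times PER. The function names are illustrative only and are not part of the draft; the inputs (3.44*10^12 bits, a BER of 10^-10, a Hamming distance of 3) are the ones used in the text.

```python
def packet_error_rate(packet_bits, ber, hamming_distance):
    """PER = (packet length in bits * BER) ^ H / 2^32."""
    return (packet_bits * ber) ** hamming_distance / 2 ** 32


def transmission_error_rate(transmission_bits, packet_bits, ber, hamming_distance):
    """TER = transmission length / packet length * PER."""
    packets_sent = transmission_bits / packet_bits
    return packets_sent * packet_error_rate(packet_bits, ber, hamming_distance)


if __name__ == "__main__":
    TRANSMISSION_BITS = 3.44e12  # 400 GB expressed in bits, as in the text
    BER = 1e-10                  # bit error rate assumed in the text
    for frame_octets in (1518, 11454):
        ter = transmission_error_rate(TRANSMISSION_BITS, frame_octets * 8, BER, 3)
        print(f"{frame_octets:>6}-octet frames: TER = {ter:.2e}")
```

Because PER grows with the third power of the packet length while the number of packets only shrinks linearly, the result scales with the square of the packet length, which is the dependency the section describes.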
322  3.5 Conclusion

324  Larger packets aren't universally desirable. The factors that enter
325  into the decision to use larger packets include:

327  - A link's bit error rate
328  - The number of bits per symbol on a link and hence the likelihood of
329    multiple bit errors in a single packet
330  - The strength of the Frame Check Sequence
331  - The link speed
332  - The number of buffers
333  - Queuing strategy

335  This means that choosing a good maximum packet size is, initially at
336  least, the responsibility of hardware vendors. On top of that, robust
337  mechanisms must be available to operators to further limit maximum
338  packet sizes where appropriate.

340  4 The protocol mechanisms

342  The basic idea is that nodes are free to negotiate larger MTUs with
343  neighbors on a subnet. However, to avoid problems, probe packets
344  are sent first before larger packets are used for actual traffic,
345  and routers may inform hosts of MTU limitations that should be
346  observed for three common ranges of link speeds. The rationale for
347  having different MTU limitations for different link speeds is that
348  it's common for devices operating at the link layer to support
349  larger MTUs if they support and/or operate at higher link speeds.
350  E.g., a LAN could consist of a gigabit ethernet switch with jumbo
351  frame capabilities connected to a 10/100 Mbps ethernet switch which
352  doesn't support jumbo frames. By limiting the use of oversized
353  packets to nodes operating at 1000 Mbps, the 10/100 Mbps switch
354  isn't exposed to oversized packets which would result in error
355  conditions and use up unnecessary bandwidth. Additionally, it may
356  be desirable to limit packet sizes at lower speeds even if a large
357  MTU is supported for QoS purposes.

359  Additionally, routers send out two flags.
One is intended to signal
360  hosts to be conservative in the number of probes they transmit to
361  avoid triggering undesired behavior by link-layer devices seeing a
362  large number of out-of-spec packets. The other flag suppresses
363  probing for compatibility with the existing practice where all
364  nodes on a subnet are administratively configured with a
365  non-standard MTU.

367  Probing consists of sending a large neighbor discovery or ARP
368  packet to a neighbor. If the neighbor sends a reply, it managed to
369  successfully receive the probe so the per-neighbor MTU for this
370  neighbor can be set to the size of the probe packet and data
371  packets of that size can now be sent.

373  4.1 The multi-MTU router advertisement option

375  Routers use this option to inform hosts on connected subnets about the
376  maximum allowed MTU for three ranges of link speeds.

378                       1                   2                   3
379   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
380  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
381  |     Type      |    Length     |C|N|      Reserved       | Pri |
382  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
383  |                          MAXMTU1000                           |
384  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
385  |                           MAXMTU100                           |
386  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
387  |                           MAXMTU10                            |
388  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

390  Type: TBD

392  Length:
393     1 or 2. A length of more than 2 indicates a future extension with
394     additional fields and MUST NOT be treated as an error; the
395     additional fields MUST be ignored.

397  C:
398     "Conservative" flag: when set, nodes should reduce the number of
399     large packets sent by using conservative timings and probing
400     algorithms, if possible avoiding sending more than one
401     unsuccessful probe per 60 seconds. When the flag is cleared,
402     nodes may send several oversized packets per second when
403     probing.
405  N:
406     "No probe" flag: when set to 0, hosts MUST probe before using
407     oversized packets towards a neighbor. When set to 1, hosts MUST
408     NOT send probes and instead use the relevant MAXMTU field as
409     their MTU. If MAXMTU is larger than the physical MTU, an error is
        logged.

411  Reserved: 0 on transmission, ignored on reception.

413  Pri:
414     Priority. Values have the following meaning:

416     000: Vendor default
417     001: Local override of 000
418     010: Site default
419     011: Local override of 010
420     100: Subnet default
421     101: Local override of 100
422     110: Per-node setting
423     111: Local override of 110

425     Vendors may only use priority 000 in default configurations.
426     Site-wide administrative settings may only use 000 and 010.

428     Subnet-specific administrative settings may use 000, 010 or 110,
429     but not 001, 011, 101 or 111.

431  MAXMTU1000:
432     The maximum packet size allowed on a link operating at a speed
433     of 300 Mbps or more. Packets larger than this value SHOULD NOT
434     be sent over the link in question. The MAXMTU1000 MUST be at
435     least the MTU size specified in the relevant IPv6-over-... RFC.
436     A value of 0 means that the MTU size is undefined and no
437     maximum size is enforced for this link speed.

439  MAXMTU100:
440     The maximum packet size allowed on a link operating at a speed
441     of 30 to 299 Mbps and links operating at an unknown speed if
442     that speed can be 30 Mbps or higher. Packets larger than
443     this value SHOULD NOT be sent over the link in question. The
444     MAXMTU100 MUST be at least the MTU size specified in the
445     relevant IPv6-over-... RFC. A value of 0 means that the MTU
446     size is undefined and no maximum size is enforced for this link
447     speed.

449  MAXMTU10:
450     The maximum packet size allowed on a link operating at a speed
451     of less than 30 Mbps. Packets larger than this value SHOULD NOT
452     be sent over the link in question. The MAXMTU10 MUST be at
453     least the MTU size specified in the relevant IPv6-over-... RFC.
454     A value of 0 means that the MTU size is undefined and no
455     maximum size is enforced for this link speed.

457  When MAXMTU1000, MAXMTU100 and MAXMTU10 all contain the same value,
458  it is allowed to omit MAXMTU100 and MAXMTU10 so the option has a
459  length of 1 (8 octets) rather than 2 (16 octets). The receiver of
460  the option should treat the shorter option the same as a full-length
461  option where the three MAXMTU fields all contain the value from
462  MAXMTU1000.

464  Hosts are expected to recover the multi-MTU options from the router
465  advertisements of at least the router they select as a default router,
466  but it's encouraged (not required) to recover options from multiple
467  routers. The same option, or data constituting the same information,
468  may be learned from other sources, such as local configuration and/or
469  DHCPv6. Hosts SHOULD use the MAXMTU value relevant for the link
470  speed the interface is currently operating at from the option or
471  equivalent information with the highest priority value. If the
472  relevant MAXMTU field is unspecified (zero) in the option or
473  information with the highest priority, the field from the option
474  or information with the next highest priority is considered, and
475  so on. If no information is available because no option or
476  equivalent is available, or the relevant MAXMTU field never has a
477  non-zero value, the host SHOULD use its physical MTU as the
478  MAXMTU.

480  When a node's interface speed changes, it MAY reinitiate
481  negotiation of per-neighbor MTUs, but it SHOULD remain prepared to
482  receive packets of the maximum size indicated to neighbors
483  previously.

485  Devices not acting as IPv6 routers that need to inform hosts on the
486  local subnet of MTU limitations MAY send out a router advertisement
487  with a Router Lifetime of 0 [RFC2461] and the pertinent information
488  in a multi-MTU option.
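The option layout and the priority-based selection described in section 4.1 can be illustrated with a short sketch. This is not a definitive implementation: the bit positions (C and N as the two high-order bits following the Length octet, Pri as the three low-order bits of the first 32-bit word) are one reading of the option diagram, an unknown link speed is represented as None and treated as "may be 30 Mbps or higher", and the names MultiMtuInfo, parse_multi_mtu_option, relevant_maxmtu and effective_maxmtu are hypothetical. The option type is TBD in the draft, so the parser does not check it.

```python
import struct
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MultiMtuInfo:
    conservative: bool  # C flag
    no_probe: bool      # N flag
    priority: int       # Pri field, 0..7
    maxmtu1000: int
    maxmtu100: int
    maxmtu10: int


def parse_multi_mtu_option(data: bytes) -> MultiMtuInfo:
    """Decode one multi-MTU option; the bit positions are an assumption."""
    if len(data) < 8:
        raise ValueError("option shorter than 8 octets")
    _opt_type, length, flags = struct.unpack_from("!BBH", data, 0)
    if length < 1 or len(data) < 8 * length:
        raise ValueError("bad length field")
    (maxmtu1000,) = struct.unpack_from("!I", data, 4)
    if length == 1:
        # Short form: all three fields carry the MAXMTU1000 value.
        maxmtu100 = maxmtu10 = maxmtu1000
    else:
        # Length 2 or more: fields beyond the first 16 octets are ignored.
        maxmtu100, maxmtu10 = struct.unpack_from("!II", data, 8)
    return MultiMtuInfo(
        conservative=bool(flags & 0x8000),
        no_probe=bool(flags & 0x4000),
        priority=flags & 0x0007,
        maxmtu1000=maxmtu1000,
        maxmtu100=maxmtu100,
        maxmtu10=maxmtu10,
    )


def relevant_maxmtu(info: MultiMtuInfo, speed_mbps: Optional[float]) -> int:
    """Pick the MAXMTU field that applies to the current link speed."""
    if speed_mbps is None or 30 <= speed_mbps < 300:
        return info.maxmtu100
    if speed_mbps >= 300:
        return info.maxmtu1000
    return info.maxmtu10


def effective_maxmtu(infos: Iterable[MultiMtuInfo],
                     speed_mbps: Optional[float],
                     physical_mtu: int) -> int:
    """Use the highest-priority non-zero field, else the physical MTU."""
    for info in sorted(infos, key=lambda i: i.priority, reverse=True):
        value = relevant_maxmtu(info, speed_mbps)
        if value != 0:
            return value
    return physical_mtu
```

For example, given a site default (priority 010) with MAXMTU1000 of 4470 and a subnet default (priority 100) with MAXMTU1000 of 9000, a host on a gigabit interface would select 9000 under this sketch; if every applicable field is zero it falls back to the physical MTU.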
490  4.2 Changes to the RA MTU option semantics

492  Hosts are currently supposed to ignore an MTU of more than 1500 in
493  the MTU option in router advertisements on ethernet links
494  [RFC2464]. This makes it impossible to use an MTU larger than 1500
495  octets for multicast packets. In order to lift this limitation,
496  routers and hosts that implement multi-MTU subnets may advertise
497  and accept, respectively, an MTU option with an MTU larger than
498  1500. Hosts should use the minimum of the MAXMTU for their link
499  speed and the MTU in the RA MTU option for the transmission of
500  multicast packets.

502  Note that advertising an MTU option larger than 1500 can only work on
503  subnets where all the hosts implement multi-MTU subnets.

505  4.3 The IPv6 neighbor discovery MTU and padding options

507  A node that implements the multi-MTU subnet capability SHOULD
508  include an MTU option in both neighbor solicitation and neighbor
509  advertisement messages [RFC2461]. A node MAY omit the option if the
510  use of a larger MTU isn't desired at that time or if the MTU it would
511  advertise is equal to or lower than the MTU that would otherwise be
512  used. However, there is no requirement to omit the option depending on
513  the value of the different MTU variables as the receiver must
514  implement the logic required to determine which MTU to use anyway.

516  The format of the neighbor discovery MTU option is as follows:

518   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
519  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
520  |     Type      |    Length     |           Reserved            |
521  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
522  |                              MTU                              |
523  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

525  Type: TBD
526  Length: 1

528  Reserved: set to 0 on transmission, ignored on reception.

530  MTU:
531     The maximum packet size in octets that the node is prepared to
532     receive. The minimum valid value is 1280.
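As an illustration of the neighbor discovery MTU option just described, the sketch below encodes and decodes its 8 octets in Python. The option type is TBD in the draft, so HYPOTHETICAL_ND_MTU_OPTION_TYPE is a placeholder chosen only for this example, and the function names are likewise assumptions rather than anything the draft defines.

```python
import struct

# Placeholder only: the draft lists the option type as TBD.
HYPOTHETICAL_ND_MTU_OPTION_TYPE = 201
MIN_VALID_MTU = 1280  # minimum valid value stated for the MTU field


def build_nd_mtu_option(mtu: int,
                        opt_type: int = HYPOTHETICAL_ND_MTU_OPTION_TYPE) -> bytes:
    """Encode Type, Length = 1, Reserved = 0 and the 32-bit MTU."""
    if mtu < MIN_VALID_MTU:
        raise ValueError("MTU below the minimum valid value of 1280")
    return struct.pack("!BBHI", opt_type, 1, 0, mtu)


def parse_nd_mtu_option(data: bytes) -> int:
    """Return the advertised MTU; the Reserved field is ignored on reception."""
    if len(data) < 8:
        raise ValueError("option shorter than 8 octets")
    _opt_type, length, _reserved, mtu = struct.unpack_from("!BBHI", data, 0)
    if length != 1:
        raise ValueError("unexpected length for the MTU option")
    if mtu < MIN_VALID_MTU:
        raise ValueError("MTU below the minimum valid value of 1280")
    return mtu
```

A node prepared to receive 9000-octet packets would, under these assumptions, send the eight octets produced by build_nd_mtu_option(9000), and a receiver would recover 9000 with parse_nd_mtu_option.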
534  The format of the neighbor discovery padding option is as follows:

536   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
537  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
538  |     Type      |    Length     |R|          Reserved           |
539  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
540  |                            Padding                            |
541  ~                                                               ~
542  |                                                               |
543  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

545  Type: TBD

547  Length: see below.

549  R: reply flag.

551  Reserved: set to 0 on transmission, ignored on reception.

553  Padding: 0 or more all-zero octets.

555  The MTU option is included in all neighbor advertisement and
556  neighbor solicitation messages.

558  Reception of a neighbor solicitation or a neighbor advertisement
559  from a neighbor for which no per-neighbor MTU is known
560  triggers, in addition to the normal response if it's a neighbor
561  solicitation, the sending of a neighbor solicitation message with
562  the MTU and padding options in it. The size of this message may
563  vary between the IPv6-over-... size + 1 for the link and the
564  minimum of the relevant MAXMTU, the physical MTU and the neighbor's
565  MTU as advertised in the MTU option of the packet received. See
566  below for considerations about the packet sizes to choose. The
567  padding option is used to bring the neighbor solicitation message
568  to this size. The padding option MUST be the last option in the
569  packet.

571  There are two possible ways to determine the value of the length
572  field:

574  1. Set it to 0. As the "length" field in options has a granularity
575     of 8 octets, and given the behavior of nodes when they receive a
576     neighbor solicitation packet which has a total length that
577     doesn't match the length of the packet contents, an option
578     length of 0 is used to make sure that hosts that don't
579     understand the padding option will silently discard the packet.

581  2.
If the intended packet length allows a valid value for the
582     length field, the length field MAY be set to that value. The
583     node MAY reduce the size of the intended packet to accommodate
584     the requirement that the length field is a multiple of 8 octets.
585     For example, if the intended packet size is 4470 octets with 40
586     and 24 octets for the IPv6 and neighbor solicitation headers,
587     respectively, the padding option would have to be 4406 octets
588     long, which can't be expressed in the length field. The node may
589     choose to use a packet size of 4464 instead, which results in a
590     length field value of 550.

592  A neighbor solicitation message with the padding option is always
593  sent in addition to a regular neighbor solicitation message, rather
594  than in place of one.

596  When a node receives a neighbor solicitation message with the
597  padding option, it stops evaluating options when it reaches the
598  padding option and returns a regular neighbor advertisement
599  message, which includes the MTU option with the R flag set to 1.
600  Whenever the neighbor advertisement is not the result of receiving
601  a neighbor solicitation with a padding option, the R flag is set to
602  0.

604  When a node receives a neighbor advertisement message, it must
605  determine whether the message is in reaction to a locally sent
606  neighbor solicitation with the padding option or not. If the MTU
607  option is included in the message received, an R flag of 1
608  indicates that it is indeed a reply. In the absence of the MTU
609  option the node must use heuristics relating to the timing of the
610  messages it sent with and without the option, and the reception of
611  the current message. If the message was a reply, the node sets the
612  neighbor MTU to the size of the neighbor solicitation message that
613  was replied to.
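The arithmetic behind the second way of setting the padding option's length field, given earlier in this section, can be sketched as follows. This is only an illustration under stated assumptions: the padding option is taken to be the only option in the probe, the header sizes default to the 40 and 24 octets used in the text's example, the function name padded_probe_size is hypothetical, and the sketch does not check whether the resulting value fits the option's length field on the wire.

```python
def padded_probe_size(intended_size: int, header_octets: int = 40 + 24):
    """Return (packet_size, length_field) for a padded probe.

    The padding option must be a whole number of 8-octet units, so the
    intended packet size is rounded down to the nearest size that allows this.
    """
    padding_octets = intended_size - header_octets
    if padding_octets < 8:
        raise ValueError("intended size leaves no room for a padding option")
    length_field = padding_octets // 8           # option length in 8-octet units
    packet_size = header_octets + 8 * length_field
    return packet_size, length_field


if __name__ == "__main__":
    # The example from the text: a 4470-octet target becomes 4464 octets.
    print(padded_probe_size(4470))
```

A probe size that already leaves a multiple of 8 octets for padding is returned unchanged; any other size is reduced by at most 7 octets.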
615  If no reply is received after some time, either the neighbor is
616  incapable of receiving packets of the size that was used, or a
617  device operating at the link layer was incapable of forwarding the
618  frame. (Incidental packet loss is also a possibility.) In order to
619  determine a workable MTU even in the presence of unknown
620  limitations, a node may repeat sending a solicitation with the
621  padding option. However, since some equipment may presumably react
622  badly to a large number of out-of-spec packets, it's important that
623  nodes adjust their behavior in the presence of the C (conservative)
624  flag in router advertisements.

626  The above allows for two strategies in determining a neighbor's
627  MTU: the node can depend on the presence of the mechanisms
628  described in this document, including setting the padding option
629  length field to 0, or it can try to interoperate with nodes that do
630  have the capability of using larger packet sizes, but don't
631  implement any of the mechanisms described. In that case, the
632  padding option must conform to [RFC2461] and care must be taken to
633  avoid overly aggressive probing of nodes that do not support larger
634  packets.

636  Nodes MUST support reception of both types of probes, but MAY be
637  limited to generating only one type.

639  4.4 IPv4 ethernet jumbo ARP message

641  Due to lack of neighbor discovery, with IPv4, it's necessary to use
642  ARP to probe for non-standard MTU capabilities. This is done by
643  simply probing with an ARP packet padded to the desired size. If a
644  reply comes back, the neighbor supports the probed MTU size.

646  4.5 Probe considerations

648  In cases where the neighbor's MTU was advertised in an MTU option,
649  it makes sense to try with this size. If that probe fails or the
650  neighbor's MTU is unknown, the best choice for a probe size would
651  be the smallest possible non-standard MTU. This could be the
652  IPv6-over-...
RFC's MTU size + 1, or a slightly larger value that 653 represents the first larger size that is actually useful, such as 654 1508 or 1520 for ethernet. Failure at this size wastes relatively 655 little bandwidth and indicates that further probes are unnecessary. 656 If this probe is successful, further choices for the probe size may 657 be common MTU sizes such as 1508, 1530, 1536, 1546, 1998, 2000, 658 2018, 4464, 4470, 8092, 8192, 9000, 9176, 9180, 9216, 17976, 64000 659 and 65280 octets. 661 There is no requirement that a node tries a number of probes of 662 different sizes; only that before oversized packets are sent, a 663 reply for a probe of that size or larger MUST have been received 664 from the neighbor in question, unless the N flag is set to 1. A 665 simple strategy that would be appropriate when the C flag is set to 666 1, but may also be used otherwise, would be to initially send just 667 one probe sized at the local MTU value, and if unsuccessful, only 668 send a second probe when a probe from the neighbor is received. The 669 second probe is made the same size as the neighbor's probe. 671 Probes MUST be sent as unicast. 673 4.6 Neighbor MTU garbage collection 675 The MTU size for a neighbor is garbage collected along with a 676 neighbor's link address in accordance with regular ARP and neighbor 677 discovery timeouts. Additionally, a neighbor's MTU size is reset to 678 unknown after dead neighbor detection declares a neighbor "dead". 680 5 References 682 5.1 Normative References 684 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 685 Requirement Levels", BCP 14, RFC 2119, March 1997. 687 [RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor 688 Discovery for IP Version 6 (IPv6)", RFC 2461, 689 December 1998. 691 [RFC2462] Thomson, S. and T. Narten, "IPv6 Stateless Address 692 Autoconfiguration", RFC 2462, December 1998. 
694  5.2  Informative References

696     [CRC]      Jain, R., "Error Characteristics of Fiber Distributed
697                Data Interface (FDDI)", IEEE Transactions on
698                Communications, August 1990.

700  6  Document and Author Information

702     This document expires February, 2008. The latest version will always
703     be available at http://www.muada.com/drafts/. Please direct questions
704     and comments to the ipv6 or int area mailing lists or directly to the
705     author:

707     Iljitsch van Beijnum

709     Email: iljitsch@muada.com

711  Full Copyright Statement

713     Copyright (C) The IETF Trust (2007).

715     This document is subject to the rights, licenses and restrictions
716     contained in BCP 78, and except as set forth therein, the authors
717     retain all their rights.

719     This document and the information contained herein are provided on an
720     "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
721     OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
722     THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
723     OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
724     THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
725     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

727  Intellectual Property

729     The IETF takes no position regarding the validity or scope of any
730     Intellectual Property Rights or other rights that might be claimed to
731     pertain to the implementation or use of the technology described in
732     this document or the extent to which any license under such rights
733     might or might not be available; nor does it represent that it has
734     made any independent effort to identify any such rights. Information
735     on the procedures with respect to rights in RFC documents can be
736     found in BCP 78 and BCP 79.
738     Copies of IPR disclosures made to the IETF Secretariat and any
739     assurances of licenses to be made available, or the result of an
740     attempt made to obtain a general license or permission for the use of
741     such proprietary rights by implementers or users of this
742     specification can be obtained from the IETF on-line IPR repository at
743     http://www.ietf.org/ipr.

745     The IETF invites any interested party to bring to its attention any
746     copyrights, patents or patent applications, or other proprietary
747     rights that may cover technology that may be required to implement
748     this standard. Please address the information to the IETF at
749     ietf-ipr@ietf.org.

751  Acknowledgment

753     Funding for the RFC Editor function is provided by the IETF
754     Administrative Support Activity (IASA).