idnits 2.17.00 (12 Aug 2021)

/tmp/idnits53113/draft-vasilenko-v6ops-ipv6-oversized-analysis-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Introduction section.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 280: '... [VxLAN] section 4.3 also uses the approach: "it is RECOMMENDED that...'
     RFC 2119 keyword, line 423: '... [VxLAN] section 4.3 is strict: "VTEPs MUST NOT fragment VXLAN...'
     RFC 2119 keyword, line 426: '... [NVO3] section 4.4.4 is strict too: "It is strongly RECOMMENDED that...'
     RFC 2119 keyword, line 533: '... [VxLAN] section 4.3 proposes to use PMTUD: "Path MTU discovery MAY...'
     RFC 2119 keyword, line 535: '... [NVO3] section 4.4.4 assumes PMTUD too: "It is strongly RECOMMENDED...'
     (1 more instance...)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 19, 2021) is 427 days in the past. Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------
1 IPv6 Operations (v6ops) Working Group                       E. Vasilenko
2 Internet Draft                                                   X. Xiao
3 Intended status: Informational                       Huawei Technologies
4 Expires: September 2021                                      D. Khaustov
5                                                               Rostelecom
6                                                           March 19, 2021

8                    IPv6 Oversized Packets Analysis
9           draft-vasilenko-v6ops-ipv6-oversized-analysis-00

11 Abstract

13 The IETF has many new initiatives relying on IPv6 Enhanced Headers
14 added in transit: SRv6, SFC, BIERv6, iOAM. Additionally, some recent
15 developments are overlays (SRv6, VxLAN) over IPv6. Both could create
16 oversized packets that need to be dealt with. This document analyzes
17 available standards for the resolution of oversized packet drops.

19 Status of this Memo

21 This Internet-Draft is submitted in full conformance with the
22 provisions of BCP 78 and BCP 79.

24 Internet-Drafts are working documents of the Internet Engineering
25 Task Force (IETF). Note that other groups may also distribute
26 working documents as Internet-Drafts. The list of current
27 Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

29 Internet-Drafts are draft documents valid for a maximum of six
30 months and may be updated, replaced, or obsoleted by other documents
31 at any time. It is inappropriate to use Internet-Drafts as
32 reference material or to cite them other than as "work in progress."

34 This Internet-Draft will expire on September 2021.

36 Copyright Notice

38 Copyright (c) 2021 IETF Trust and the persons identified as the
39 document authors. All rights reserved.

41 This document is subject to BCP 78 and the IETF Trust's Legal
42 Provisions Relating to IETF Documents
43 (http://trustee.ietf.org/license-info) in effect on the date of
44 publication of this document. Please review these documents
45 carefully, as they describe your rights and restrictions with
46 respect to this document. Code Components extracted from this
47 document must include Simplified BSD License text as described in
48 Section 4.e of the Trust Legal Provisions and are provided without
49 warranty as described in the Simplified BSD License.

51 Table of Contents

53 1. Terminology and pre-requisite
54 2. Problem statement
55 3. Solutions
56    3.1. Provision links with big enough MTU
57    3.2. Frugal usage of Extension Headers
58    3.3. Fragmentation and reassembly at the tunnel ends
59    3.4. PMTUD by original packet source
60    3.5. Packetization Layer MTU Discovery
61 4. Conclusion
62 5. Security Considerations
63 6. IANA Considerations
64 7. References
65    7.1. Normative References
66    7.2. Informative References
67 8. Acknowledgments

69 1. Terminology and pre-requisite

71 We assume good knowledge of, or frequent reference to, [PMTUD] and
72 [IPv6 Tunneling]. Terminology is inherited from [PMTUD].

74 Link MTU - the maximum transmission unit, i.e., maximum packet size
75   in octets that can be conveyed over a link.

77 Path MTU (PMTU) - the minimum link MTU of all links in a path
78   between a source node and a destination node.

80 Path MTU Discovery (PMTUD) - the process by which a node learns the
81   PMTU of a path.

83 EMTU_S - Effective MTU for sending; used by upper-layer protocols to
84   limit the size of IP packets they queue for sending.

86 EMTU_R - Effective MTU for receiving; the largest packet that can be
87   reassembled at the receiver.

89 Packetization Layer - the layer of the network stack that segments
90   data into packets.

92 PLPMTUD - Packetization Layer Path MTU Discovery, the method of
93   detecting path MTU at the packetization layer, which is an
94   extension of classical PMTU Discovery.

96 PTB (Packet Too Big) message - an ICMPv6 message reporting that an
97   IPv6 packet is too large to forward through some link.

99 MSS - the TCP Maximum Segment Size, the maximum payload size
100   available to the TCP layer. This is typically the Path MTU
101   minus the size of the IP and TCP headers.

103 2. Problem statement

105 IPv6 is strict regarding fragmentation - it must not be done in
106 transit (section 4.5 of [IPv6]).

108 IPv6 has seen rapid development in recent years. A lot of additional
109 functionality has been added, primarily by adding options to
110 Extension Headers and/or using overlay encapsulation. All of the
111 above expand the packet size. This could lead to oversized packets
112 that would be dropped on some links.

114 Massive parallelism in traffic delivery is an additional challenge
115 that developed in the last 10 years: ECMP on one hop could reach 16 (or
116 even more), which creates the end-to-end possibility for 64k paths
117 on just 5 hops (example from a big production network). Different
118 paths could have a different set of Extension Headers and, as a
119 result, a different PMTU. PMTU is effectively becoming dynamic: we can
120 never know how many additional headers will be added at a
121 particular time to a particular packet on a particular path.

123 The old classical PMTUD problems are still with us: filtered ICMPv6
124 messages, drops related to Extension Headers before the next-hop MTU has
125 been evaluated (no Packet Too Big message sent).

127 The standards contain two important numbers that we need for our
128 discussion:

130 o [IPv6] section 5 requires that every link have an MTU of
131   1280 octets or greater (2^10+2^8 - it probably explains the
132   choice of this size)

134 o [IPv6] requires a minimum EMTU_R (reassembly buffer) of 1500
135   octets. An upper-layer protocol or application that depends on
136   IPv6 fragmentation to send packets larger than the MTU of a path
137   should not send packets larger than 1500 octets unless it has the
138   assurance that the destination is capable of reassembling packets
139   of that larger size

141 The [IPv6] architecture offers only one solution for the PMTU
142 problem - decrease the packet size at the original source. It is
143 workable down to the minimum size of an IPv6 packet (1280B). The
144 typical transit link had an MTU not much bigger than 1500B for a long
145 time; only the space for a few additional MPLS labels was reserved.
146 The remaining 220B (1500B - 1280B) could be considered as guaranteed
147 for additional functionality in Enhanced and Encapsulation headers. It
148 could be enough for the next decade if we take some precautions - see
149 the discussion below.

151 [Huston-2016] investigated a different topic, but it has
152 good statistics related to MTU drops up to 1500B that show a 5%
153 drop for an MTU as small as 1455B. Additionally, [Huston-2016]
154 found a big drop spike (69% of all drops!) at 1480B; the 20B gap is
155 presumably due to IPv6 encapsulation into IPv4. As you can see, 1500B
156 is not always available now, probably because of the different forms
157 of tunneling. Hence, we do not have 220B for additional headers in
158 all situations. We could be reasonably optimistic that this type of
159 tunneling will disappear in the long term. Our optimistic approach
160 is that we expect 220B to be available in most situations. It is
161 still possible to adopt a more pessimistic estimate (200B? 175B?)
162 and keep it in mind while reading the rest of the document.

164 The hungriest protocol known is SRv6, which could add 40B of IPv6
165 underlay tunnel header (called "outer IP header" in [SRH]), 16B of
166 the SRH header itself, and additionally up to 10 IPv6 addresses in the
167 SID stack (potentially even more). It is already 216B - very close
168 to the 220B optimistic limit. It makes the introduction of any
169 additional functionality without rigorous expansion of all links to
170 a bigger MTU quite challenging.

172 Initial SRv6 implementations that exceeded the safe limit of 220B are
173 the reason for recent activities in MTU problem research. We see
174 many recent efforts to improve Path MTU Discovery (which are
175 mentioned in this document) - let us find the rationale behind them.

177 3. Solutions

179 There is a low probability that the Internet community would agree
180 to decrease the minimal IPv6 packet size (1280B). The minimal buffer for
181 packet reassembly (1500B) could potentially be increased in new
182 standard updates, but the transition would then be a problem,
183 because this limitation is programmed into billions of hosts - it
184 would take a long time to be sure that no old implementations
185 remain.

187 There is no good solution for the problem of headers bloating above
188 220B for hosts. We need to keep headers below the 220B limit.
189 Fortunately, we are still far from this problem - very limited
190 additional functionality is implemented directly on the hosts (like
191 [PMTU by HbH] or APN6). This problem should be looked at again in 5
192 years; it may be that in the future we will have to increase the
193 default EMTU_R on all hosts to make room for new
194 functionality.

196 It is possible to partially alleviate the MTU problem in some
197 network zones where all transit nodes have a big enough MTU. Transit
198 nodes should delete enhanced headers before packets leave the
199 "high MTU network zone". Leakage of a big header to a host could
200 overflow the EMTU_R buffer. The majority of RFCs recommend that carriers
201 delete additional headers before forwarding traffic to the client -
202 this practice should be strictly followed.

204 The SPRING working group is actively developing a compressed version
205 of SRv6 that should leave space for other functionality, even on
206 current transit routers that sometimes do not support much above
207 1500B.

209 All solutions for avoiding packet drops caused by oversized
210 packets could be classified into 4 classes. They are examined one by
211 one.

213 3.1. Provision links with big enough MTU

215 The MTU supported by host links is typically 1500 Bytes.
216 Backbone link MTU could be up to 9000 Bytes on modern hardware.
217 PMTUD is not needed in an ideal world.

219 Reality is not that good:

221 o Some old devices still support just a few additional MPLS labels
222   above 1500B on Ethernet. It was historically a problem to cross
223   1536B because the IEEE specification for 802.3 assumes that a bigger
224   number in the Length field means the Type of the payload.
225 o We could have middleboxes that will not support an MTU much bigger
226   than 1500B for a long time.
227 o Ethernet is very mature now with respect to big MTU support,
228   but that could be a challenge for other link-layer technologies
229   (for example WiFi, satellite links, radio links, etc.).
230 o Packet links could be rented from a third party - no possibility to
231   change the MTU.
232 o Big MTU influences buffer size - see below.
233 o The majority of vendors set the default MTU to 1500B (with
234   variations on what is counted inside MTU). It is time-consuming
235   to change the MTU, as it should be coordinated at least on one
236   link.
237 o Some hosts (especially for storage traffic in Data Centers) could
238   use 2500B or 9000B MTU that challenges the possibility of having
239   a bigger MTU in the backbone.
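
The constraints in the list above can be turned into a simple headroom check. The sketch below is only illustrative: the function names are hypothetical, the 216B worst-case SRv6 overhead is the figure quoted in the problem statement, and the old device that carries only three 4-byte MPLS labels above 1500B is an assumption made for the example.

```python
MPLS_LABEL_BYTES = 4         # size of one MPLS label stack entry
SRV6_WORST_CASE_BYTES = 216  # worst-case SRv6 overhead quoted in section 2


def headroom(link_mtu: int, host_mtu: int, overhead: int) -> int:
    """Bytes left on a link after a full-size host packet plus added headers."""
    return link_mtu - host_mtu - overhead


def fits(link_mtu: int, host_mtu: int, overhead: int) -> bool:
    """True when a full-size host packet with the overhead is not oversized."""
    return headroom(link_mtu, host_mtu, overhead) >= 0


# Assumption: an old device that supports only three MPLS labels above 1500B.
old_device_mtu = 1500 + 3 * MPLS_LABEL_BYTES

print(fits(old_device_mtu, 1500, SRV6_WORST_CASE_BYTES))  # False: 1512 < 1716
print(fits(9000, 1500, SRV6_WORST_CASE_BYTES))            # True
print(fits(9000, 9000, SRV6_WORST_CASE_BYTES))            # False: jumbo hosts use up the margin
```

The last call mirrors the final bullet: once hosts themselves send 9000B packets, a 9000B backbone has no margin left for any additional header.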

241 Cost-optimized equipment architecture (especially used for switches,
242 but applicable for many routers as well) would not split packets in
243 the buffer memory. So a small packet would occupy the bigger buffer
244 space reserved for a packet of maximum MTU. This limitation
245 effectively decreases the potential number of packets that could be
246 buffered. Most host packets are still limited to 1500B size. MTU
247 9000B would just lead to wasting buffer memory about 6:1 in the
248 worst case. Buffer memory could be up to 30% of the router cost. It
249 is not acceptable to increase buffer memory cost 6 times. Hence, in
250 many cases, it does not make sense to increase MTU to the maximum
251 supported by the switch or router. One should always check with the
252 vendor the impact of using a big MTU on buffering for the particular
253 product. MTU should be increased to a number that is bigger than
254 the maximum MTU expected from hosts + the size of all possible
255 network overhead + underlay IPv6 header (if present).
256 There is some potential to use 9000B as the primary packet size in
257 DC and cross-DC environments.

259 [MTU issues in Tunneling] section 3.3 discusses the opposite
260 solution: decrease MTU on links to hosts to be sure that a host
261 would always generate small enough packets for the backbone. This
262 solution was possible for small tunnel overhead, but now we are
263 talking about the situation when the 220B margin is not enough.

265 [L3VPN] and [EVPN] do attach an additional label and could create
266 oversized packets. Still, the MPLS header cannot point to the
267 original MPLS router that attached the service label.
268 Additionally, a VPN IP packet could use private address space or no IP
269 address at all (for EVPN). It blocks the possibility to properly
270 organize the PMTUD process. Hence, [L3VPN] and [EVPN] have been
271 developed under the assumption that all MTUs on the path would be
272 expanded by at least the 8 bytes that are needed for services over the
273 MPLS data plane.
274 We have the recent [Generic Fragmentation] that may permit fragmentation
275 for MPLS services, but it is still an individual draft.

277 [Pseudowire Fragmentation] is the rare case when fragmentation is
278 available over MPLS for one type of service.

280 [VxLAN] section 4.3 also uses the approach: "it is RECOMMENDED that
281 the MTUs across the physical network infrastructure be set to a
282 value that accommodates the larger frame size due to the
283 encapsulation".

285 Packet drop statistics and big activity in IETF prove that the PMTUD
286 problem persists.

288 "Raise MTU on transit" is the best solution, if it is available.

290 3.2. Frugal usage of Extension Headers

292 Some new functionality (especially source routing with a big SID
293 stack) could decrease header size without a big loss of
294 functionality (for example, by using loose node designation in the SID
295 stack). Some functionality (like iFIT or iOAM) could be completely
296 omitted in situations that would lead to packet drop. It is
297 effectively "the tradeoff of functionality for PMTU control".

299 The important point here is that the transit node attaching an
300 additional header should be aware of all MTUs along the assumed
301 packet path to predict how big a packet is still acceptable.

303 [PMTUD] is readily available for tunneling interfaces - the tunnel
304 source should be aware of the PMTU of the tunnel (by PTB feedback
305 messages). But we have cases when it is not enough:

307 o An SDN controller (or management system in general) could assist in
308   provisioning of extension headers (including iFIT, iOAM, BIER)
309   and encapsulation headers (SRv6, VxLAN) - there should be a way to
310   report MTUs to the Controller.
311 o Some new protocols (iOAM, iFIT, APN6, BIERv6) do not have a
312   sub-interface structure on transit nodes where PMTU could be stored.
313   It is not a good idea in general to keep additional state
314   information in the backbone.

316 o ICMPv6 PTB would be directed to the transit control plane only in
317   the case of problems inside the tunnel. PTB messages from outside
318   of the tunnel would be directed to the source node. It is
319   difficult to snoop PTB on transit nodes.

321 Hence, we see many initiatives to collect and manage MTU by many
322 popular protocols for routing and traffic engineering: [PMTU by
323 ISIS], [PMTU by BGP-LS], [PMTU by PCEP], [PMTU by SR-Policy].

325 Moreover, these protocol extensions would become even more useful in
326 the future when it will not be possible to squeeze all extension
327 headers into 220B anymore. Frugal attachment of new headers on
328 transit nodes would increase the need for awareness of PMTU - it
329 should stimulate MTU collection by all other popular protocols
330 (OSPF, normal BGP on peering borders).

332 This approach has a fundamental problem: full knowledge about all
333 MTUs in the domain does not help to estimate the real path of a
334 packet, because of the massive ECMP used by many networks (at least by
335 all Carriers). Non-routing protocols do not have a proper engine to
336 estimate traffic paths and predict PMTU either. Moreover, if
337 L2 ECMP is used or some links are rented from another carrier, it
338 will again be impossible to predict the exact path and the PMTU.

340 The second problem of this approach could be classified as "chicken
341 and egg". We already have a much better solution for MTU drop -
342 increase MTU (see the previous section). We are looking for other
343 solutions only because upgrading equipment (to a bigger MTU) is not
344 possible for some reason. But the introduction of new protocols would
345 also demand an equipment upgrade, thus making frugal headers meaningless.
346 However, an upgrade of the control plane should be cheaper than an
347 upgrade of the data plane, if the vendor supports such an approach.

349 Hence, the solution discussed in this section has only limited
350 applicability.

352 3.3. Fragmentation and reassembly at the tunnel ends

354 The tunnel source behaves like a host with respect to the tunnel
355 header. It is possible to properly adjust PMTU for the tunnel by
356 [PMTUD], so it is potentially possible to fragment all packets
357 bigger than PMTU.

359 [IP Encapsulation] is the earliest standard for IP-in-IP
360 encapsulation. Section 5.1 discusses that it is possible to fragment
361 IP packets before tunnel encapsulation, so there is no need to
362 reassemble packets on the other tunnel end - reassembly could happen on
363 the destination host. It does not have additional cost implications
364 on tunnel ends. This approach did work for IPv4 in the case of the
365 "don't fragment" bit cleared. It fully contradicts IPv6 architecture,
366 which does not permit fragmenting packets in transit - no standard
367 has risked proposing such a solution for IPv6.

369 Some standards do propose IPv6 fragmentation (primarily for packets
370 1280B and below), but fragmentation is recommended after
371 encapsulation. It would lead to packet reassembly on the other tunnel
372 end to hide (from the destination host) the fact of transit
373 fragmentation. It does minimize IPv6 architecture disruption.

375 Many standards discussed below ([MPLS Encapsulation], [L2TPv3],
376 [VxLAN], [NVO3]) do not mention that packets of 1280B and below
377 should be fragmented. This inaccuracy did not create any problem in
378 real production networks because we typically have 220B for all
379 headers - it is big enough for many tunnels nested into each other.
380 The situation could change in the next years because of Enhanced
381 Headers expansion by different functions. It could create pressure
382 to return to many mature standards and clarify the situation: what
383 to do when a 1280B packet cannot go through the tunnel.

385 Fragmentation has a few issues that make it unpopular:

387 o Fragmentation could double buffer requirements (we assume a split
388   into only 2 fragments). We could ignore the small additional buffer
389   requirements for packets that may be lost and need to wait some
390   time before reassembly; the Internet is not productive anyway
391   after a few percent of packet drops. The buffer memory is
392   about 30% of the router cost. A 30% cost increase would not be
393   accepted by the majority of owners. However, some middleboxes
394   already have enough buffer memory that could be reused for packet
395   reassembly.
396 o In general, IPv6 standards do not approve fragmentation in
397   transit (except the recent draft [IP Tunnels] - see
398   below). [PMTUD] section 5.1: "packetization layers are encouraged
399   to avoid sending messages that will require fragmentation".
400   We discuss in this section some situations when tunnel
401   fragmentation is inevitable.
402 o [Fragile Fragmentation] has a good collection of all problems
403   related to fragmentation (in addition to the above: breaks ECMP,
404   stateful processing, policy routing, and has many security attack
405   vectors). [Fragile Fragmentation] strongly recommends avoiding
406   fragmentation, but does not deprecate it yet.

408 The primary RFC for tunneling is [IPv6 Tunneling] - it is the oldest
409 standard that was later reused by many other standards (including
410 the latest SRH). It permits fragmentation only for the case when the
411 original packet is already minimal (1280B or less) - see section
412 7.1. It mandates dropping the packet and signaling ICMPv6 PTB to the
413 source (request to decrease the PMTU size at the source) for all
414 other cases.
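
The rule summarized in the paragraph above can be sketched as a small decision function at the tunnel entry point. This is a minimal illustration, not text from [IPv6 Tunneling]: the names `Decision` and `tunnel_ingress` are hypothetical, and reporting `max(1280, tunnel PMTU - overhead)` in the PTB message is our reading of section 7.1, stated here as an assumption.

```python
from dataclasses import dataclass
from typing import Optional

IPV6_MIN_MTU = 1280  # minimum IPv6 link MTU in octets


@dataclass
class Decision:
    action: str                    # "forward", "fragment", or "drop_and_ptb"
    ptb_mtu: Optional[int] = None  # MTU reported to the source when dropping


def tunnel_ingress(original_len: int, tunnel_pmtu: int, overhead: int) -> Decision:
    """Decide what the tunnel source does with one original packet."""
    if original_len + overhead <= tunnel_pmtu:
        return Decision("forward")
    if original_len <= IPV6_MIN_MTU:
        # The source cannot be asked to go below 1280B, so the tunnel
        # packet is fragmented after encapsulation.
        return Decision("fragment")
    # Assumption: the reported MTU never goes below the IPv6 minimum.
    return Decision("drop_and_ptb", max(IPV6_MIN_MTU, tunnel_pmtu - overhead))


print(tunnel_ingress(1400, 1500, 40))  # fits: forward
print(tunnel_ingress(1500, 1500, 40))  # drop, PTB reports 1460
print(tunnel_ingress(1280, 1300, 40))  # minimal packet: fragment in the tunnel
```

Only the third case needs reassembly at the other tunnel end, which is why the buffer cost discussed above is limited when tunnel overhead stays within the 220B margin.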

416 [MPLS Encapsulation] Section 5.1 has the name: "Preventing
417 Fragmentation and Reassembly". It does stress again: "IPv6
418 intermediate nodes do not perform fragmentation in any event".

420 [L2TPv3] section 4.1.4 has a similar comment: "Note that IPv6 does
421 not support "in-flight" fragmentation of data packets".

423 [VxLAN] section 4.3 is strict: "VTEPs MUST NOT fragment VXLAN
424 packets."

426 [NVO3] section 4.4.4 is strict too: "It is strongly RECOMMENDED that
427 Path MTU Discovery ([PMTUD]) be used to prevent or minimize
428 fragmentation."

430 [IPv6 GRE] section 3.3 does recommend fragmentation only for packets
431 that are less than 1280B.

433 The most recent draft for all types of tunnels is [IP Tunnels]. It
434 is already referenced by many IETF documents. It is complicated because
435 it covers all use cases (any IP over any IP in any situation), but the
436 net result is: a much bigger part of the traffic is proposed to be
437 fragmented in the tunnel. Section 3.3: "The path between ingress
438 and egress interfaces has a path MTU, but the endpoints can exchange
439 messages as large as can be reassembled at the destination (egress
440 interface), i.e., the EMTU_R of the egress interface".
441 A short explanation of the proposed functionality: the original host
442 would try to transmit the biggest flows (by volume) at the maximum PMTU,
443 which the tunnel source would not try to correct by PTB messages up to
444 1500B. Hence, the tunnel source would not have any option except to
445 fragment. The principal problem here is the absence of PTB messages
446 for packet sizes between the real PMTU and the statically set
447 EMTU_R.
448 Let's see how it has been formulated in more detail.
449 [IP Tunnels] introduces a new variable "Tunnel MTU" that should not
450 change as a result of PMTUD. The procedure to change "Tunnel MTU" is
451 out of the draft discussion - it is pushed to specifications of
452 particular tunnels in the last paragraph of section 4.2.2. Moreover,
453 it is even assumed that PLPMTUD could be used on the router for
454 "Tunnel MTU" discovery because this parameter is considered to sit
455 above the network layer (like the transport layer on the host). A
456 separate section 4.2.3 is dedicated to the explanation that the newly
457 introduced "Tunnel MTU" cannot be adjusted dynamically. There is a
458 recommendation for the default "Tunnel MTU": typical host EMTU_R
459 (1500B) minus tunnel outer headers overhead. A good question could
460 be: if it is so difficult to manage "Tunnel MTU" dynamically, then
461 why was this variable introduced?
462 The real MTU of the tunnel is renamed to MAP (maximum atomic
463 packet); MAP should be corrected by PMTUD feedback from inside the
464 tunnel.
465 Section 4.2.2 states that everything up to "Tunnel MTU" should be
466 accepted into the tunnel and one long packet (with inner and outer
467 headers) should be created. Then it should be split into fragments
468 below MAP size.
469 Initially, "Tunnel MTU" and MAP could be manually synchronized by
470 the administrator (with the difference in tunnel overhead). But any
471 additional overhead on the tunnel path (nested tunnel, smallest
472 Enhanced Header) would result in PMTUD that decreases MAP, but would
473 not change "Tunnel MTU". It would turn on fragmentation for all bulk
474 traffic. This situation is quite probable now (see [Huston-2016] on
475 really available MTU on the Internet) and it would be even more
476 probable in the future when many additional extension headers would
477 be used. Hence, the requirement in section 5.3.1 "do NOT try to
478 deprecate fragmentation" is indeed important.
479 Section 3.6 has the same approach as all other standards to the
480 question when fragmentation should happen: "this document assumes
481 that only outer fragmentation is viable because it is the only
482 approach that works for both IPv4 datagrams with DF=1 and IPv6".
483 A considerable increase in fragmentation is proposed for the reasons
484 of academic purity: the router part of the router should behave as a
485 router, the host part of the router should behave as a host without
486 any deviations.
487 Additional fragmentation would create all of the problems discussed
488 in [Fragile Fragmentation] and substantially increase the cost of
489 tunnel endpoints. There is a high probability that draft [IP
490 Tunnels] would be rejected by the market for cost reasons.

492 It is worth recalling that fragmentation is not a universal
493 solution for oversized packets, because it is not possible for
494 non-tunneling cases (BIERv6, iFIT, iOAM, APN6). It would be a very bad
495 idea to fragment packets intercepted from the general traffic flow.

497 Additionally, we should point out that the drop rate for fragmented
498 packets on the Internet is still very high (20-40% and increased over
499 the last years) - see [Huston-2020]. Some other researchers report
500 even more (50-55%).

502 Fragmentation is the least probable solution for oversized packet
503 drops.

505 3.4. PMTUD by original packet source

507 [PMTUD] is essential in IPv6 architecture, because IPv6 does not
508 have fragmentation in transit. We could see recommendations in many
509 RFCs not to block ICMPv6 PTB completely (it could be rate-limited -
510 see [ICMPv6] section 2.4). [DPLPMTUD] section 1.1 has a very good
511 collection of reasons why a PTB message may not be delivered to the
512 source - it is used as justification to augment PMTUD by [DPLPMTUD].

514 We should not see this problem for all non-tunneling protocols in
515 the majority of environments. ICMPv6 PTB should be delivered to the
516 packet source; the packet source would dynamically decrease PMTU to
517 adapt to new realities. PMTU could change dynamically because some
518 transit nodes could introduce an additional extension header ad hoc or
519 ECMP could switch a flow to a different path.
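
The source-side behavior described above can be sketched as a per-destination PMTU cache. This is a simplified illustration under stated assumptions: the class name `PmtuCache` and its methods are hypothetical, PMTU aging and re-probing for a larger value are omitted, and the two guards (never go below 1280B, never increase on a PTB) reflect our understanding of [PMTUD].

```python
IPV6_MIN_MTU = 1280  # minimum IPv6 link MTU in octets


class PmtuCache:
    """Hypothetical per-destination PMTU estimates kept by a packet source."""

    def __init__(self, first_hop_mtu: int = 1500) -> None:
        self.first_hop_mtu = first_hop_mtu
        self._pmtu = {}  # destination address -> current PMTU estimate

    def get(self, dst: str) -> int:
        """Current estimate; starts at the MTU of the first-hop link."""
        return self._pmtu.get(dst, self.first_hop_mtu)

    def on_packet_too_big(self, dst: str, reported_mtu: int) -> int:
        """Apply an ICMPv6 PTB message and return the new estimate."""
        candidate = max(reported_mtu, IPV6_MIN_MTU)  # never below the minimum
        if candidate < self.get(dst):                # a PTB never increases PMTU
            self._pmtu[dst] = candidate
        return self.get(dst)


cache = PmtuCache()
print(cache.on_packet_too_big("2001:db8::1", 1400))  # 1400
print(cache.on_packet_too_big("2001:db8::1", 1450))  # still 1400
print(cache.on_packet_too_big("2001:db8::1", 1000))  # clamped to 1280
print(cache.get("2001:db8::2"))                      # 1500, no PTB seen yet
```

Keeping the estimate per destination (or per flow) matters here because, as noted above, ECMP or an ad hoc extension header can give different flows different effective PMTUs.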
521 [IPv6 Tunneling] mandates to relay ICMPv6 PTB by tunnel ends for 522 ICMPv6 messages received from the inside tunnel. [IPv6 Tunneling] 523 does not use "relay" terminology, but section 8 explains in detail 524 how to reconstruct and retransmit ICMP messages to the original 525 packet source (delete all tunnel-related information). 526 [MTU issues in Tunneling] section 3.2 discusses the same approach. 527 [L2TPv3] section 4.1.4 refers to the [IPv6 Tunneling]. We could 528 assume it as the request for PTB messages relay too. 529 [SRH] section 5.4 confirms full adherence to ICMPv6 PTB relay 530 approach: "For IP packets encapsulated in an outer IPv6 header, ICMP 531 error handling is as defined in [IPv6 Tunneling]". 533 [VxLAN] section 4.3 proposes to use PMTUD: "Path MTU discovery MAY 534 be used to address this requirement as well". 535 [NVO3] section 4.4.4 assumes PMTUD too: "It is strongly RECOMMENDED 536 that Path MTU Discovery ([PMTUD]) be used to prevent or minimize 537 fragmentation". 538 [IPSec] section 8.2.1 requests that PMTU should be maintained for 539 tunnel and signaled to real packet source as soon as any new packet 540 would arrive. 541 [IPv6 GRE] section 3.3 clearly instructs developers to drop the 542 oversized packets and send PTB for packets bigger than tunnel MTU. 543 The method of PMTU detection is fully IPv6 compliant: "the GMTU is 544 equal to the PMTU associated with the path between the GRE ingress 545 and the GRE egress, minus the GRE overhead". 546 [MPLS Encapsulation] section 5.1 specifies the same approach: tunnel 547 head-end should use [PMTUD] to understand tunnel MTU, then "the 548 packet will have to be discarded, but the tunnel head should send 549 the IP source of the discarded packet the proper ICMP error 550 message". 552 [VxLAN], [NVO3], [IPSec], [IPv6 GRE], and [MPLS Encapsulation] do 553 not request for tunnel endpoint to relay PTB messages. 
PMTUD should 554 be used to set proper MTU for the tunnel, then subsequent packet 555 could trigger PTB message to packet source. It would create an 556 additional round trip delay compared to the original [IPv6 557 Tunneling] relay approach for the first PTB message. This small 558 deficiency could be partially explained by the desire of many 559 standards to be universal for IPv6 as well as IPv4. As a reminder, 560 IPv4 may not have enough information in the ICMP message to properly 561 reconstruct a relay message (64bits of source packet by RFC 792). 563 [IP Tunnels] is the only draft that contradicts to [IPv6 Tunneling] 564 (and every other protocol based on top) - it does clearly prohibit 565 relay PTB messages. It states in section 3.3: "When such messages 566 (PTB) arrive at the ingress interface ("ingress interface" is the 567 tunnel interface in this draft), they may affect the properties of 568 that interface (e.g., its MTU), but they should never directly cause 569 new ICMPs in the outer network". This idea is generalized in section 570 5.1 as "ICMP messages MUST NOT be generated by the tunnel (as a 571 link)". The motivation assumed in the draft is to fully mimic host 572 behavior on the router virtual (tunnel) interface, because the host 573 would not retranslate PTB messages. 575 We see that "Flow Label" is gaining popularity. [IPv6 Tunneling] and 576 [ICMPv6] do not have strong recommendations for "Flow Label" - it 577 was not the important topic at that time. The only small improvement 578 that makes sense to do for [IPv6 Tunneling] is to recommend coping 579 "Flow Label" from source packet to tunnel packet and from source 580 packet to ICMPv6 PTB message. It would permit to properly load 581 balance PTB messages to the same path as original traffic - see the 582 problem [ICMPv6 PTB in ECMP] about hash-based load balancing between 583 many hosts. 
Copying "Flow Label" to the PTB message would contradict neither the
IPv6 architecture nor any RFC, so it is not mandatory to develop a
special standard update for it.

[MTU issues in Tunneling] section 3.2 has a concern that, in the
case of Lawful Intercept, additional encapsulation could produce PTB
messages that would reveal the interception to the monitored host.
It is not a very realistic concern, because the PMTU could change
for many other reasons (especially with the proliferation of new
protocols). If it is still a concern, then it makes sense to use
another solution for this case: a bigger MTU (better) or even
fragmentation.

[MTU issues in Tunneling] section 3.2 raises the question of the
applicability of "MSS Clamping". The transit node could snoop the
transport layer and change the MSS exchanged between nodes. This
"hack" is not recommended because it breaks the layered model of
IETF or OSI.

[PMTUD] is the only mechanism that is universal for all cases and
fully compliant with the IPv6 architecture. Vendors just need to use
it, despite some challenges in relaying PTB messages on tunnel ends.
Moreover, it makes sense to standardize the relay of PTB messages on
tunnel ends: it would shorten PMTUD on the original traffic source
by one round trip time.
[IPv6] states: "It is strongly recommended that IPv6 nodes implement
Path MTU Discovery [PMTUD]".

3.5. Packetization Layer MTU Discovery

[PLPMTUD] and [DPLPMTUD] have been greatly developed in recent
years. The Packetization Layer (UDP/TCP) (1) has much more
visibility (it could see the size of transport layer buffers); (2)
could operate in the absence of ICMPv6 PTB messages (too much
filtering); (3) could be very granular (per-flow). It does have its
use cases.
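
As an illustration of point (2), the following is a minimal sketch
of how a packetization layer could find the PMTU by probing alone,
without any ICMPv6 PTB message. It is a simplification and not the
search procedure specified in [PLPMTUD] or [DPLPMTUD]: the path is
simulated, lost probes are reported immediately instead of by
timeout, and the names make_path and search_pmtu are hypothetical.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU every IPv6 path must support


def make_path(hidden_pmtu):
    """Return a hypothetical probe function for a simulated path.

    A probe is acknowledged only if it fits the hidden path MTU.
    Otherwise it is silently lost, as if ICMPv6 PTB were filtered.
    """
    def probe(size):
        return size <= hidden_pmtu
    return probe


def search_pmtu(probe, local_mtu):
    """Binary-search the largest probe size that is acknowledged.

    Returns (pmtu, probes_sent). In a real network each probe costs
    about one round trip, or a timeout when the probe is lost.
    """
    low, high = IPV6_MIN_MTU, local_mtu  # low is assumed to always work
    probes_sent = 0
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop terminates
        probes_sent += 1
        if probe(mid):
            low = mid       # probe acknowledged: PMTU is at least mid
        else:
            high = mid - 1  # probe lost: PMTU is below mid
    return low, probes_sent


if __name__ == "__main__":
    pmtu, probes = search_pmtu(make_path(1400), local_mtu=1500)
    print(pmtu, probes)
```

In this sketch, a search between 1280 and 1500 bytes needs at most
eight probes, whereas classic PMTUD learns the same value from a
single PTB message when that message is delivered.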

However, PLPMTUD/DPLPMTUD have their restrictions, as they: (1) are
not universal for all transport protocols; (2) need more resources
from the host; (3) make it challenging to share PMTU information
between applications; (4) need many more round trip times to find a
suitable PMTU; (5) do not work well on congested paths (it is
difficult to understand the reason for packet loss).

Hence, PLPMTUD is not a replacement for PMTUD; both are needed. As a
reminder from [PLPMTUD]: "Packetization Layer Path MTU Discovery
(PLPMTUD) is most efficient when used in conjunction with the
ICMP-based Path MTU Discovery".

PLPMTUD could act as a replacement for PMTUD in the worst-case
scenario (ICMP is filtered). It would lead to a decrease of the PMTU
on the original host too. PLPMTUD could be considered a redundancy
mechanism for PMTUD.

Strictly speaking, [PMTU by HbH] is a network layer mechanism, not a
packetization layer one. It is mentioned in this section because its
usage is very similar to PLPMTUD; [PMTU by HbH] could be considered
to some degree an extension to PLPMTUD. It is not expected to change
the principal conclusions of this document.

4. Conclusion

It is better not to have a problem with oversized packets in the
first place. One should upgrade all links to a bigger MTU, if
possible.

A host could have an MTU as big as that of a transit node, so it
will never be possible to deprecate PMTUD. It is important to follow
the recommendations of [PMTUD] and [IPv6 Tunneling] for ICMPv6 PTB
message delivery to the original traffic source. Tunnel sources
should perform the relay function to make sure that the original
traffic source gets the PTB message faster.

The temporary 220B limit for all headers pushes us toward a frugal
implementation of new extension headers. This limit will be
alleviated after all backbone links are upgraded to an MTU much
bigger than 1500B.
Additional protocols that collect MTU information could help to
attach additional headers frugally during the transition period.
This is true for all new protocols: SRv6, SFC, BIERv6, iFIT, iOAM,
APN6.

[PLPMTUD] and [DPLPMTUD] are not a replacement for [PMTUD], but
could help in some scenarios.

Fragmentation is not at all a solution for oversized packet drops.

5. Security Considerations

[PMTUD], [PLPMTUD], [DPLPMTUD], and [Fragile Fragmentation] discuss
some attack vectors. This document does not introduce additional
security vulnerabilities.

6. IANA Considerations

This document has no request to IANA.

7. References

7.1. Normative References

[IPv6] S. Deering, R. Hinden, "Internet Protocol, Version 6 (IPv6)
      Specification", RFC 8200, DOI 10.17487/RFC8200, July 2017.

[ICMPv6] A. Conta, S. Deering, M. Gupta, "Internet Control Message
      Protocol (ICMPv6) for the Internet Protocol Version 6
      (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443,
      March 2006.

[PMTUD] J. McCann, S. Deering, J. Mogul, R. Hinden, "Path MTU
      Discovery for IP version 6", RFC 8201, DOI
      10.17487/RFC8201, July 2017.

[IPv6 Tunneling] A. Conta, S. Deering, "Generic Packet Tunneling in
      IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
      December 1998.

[ICMPv6 PTB in ECMP] M. Byerly, M. Hite, J. Jaeggli, "Close
      Encounters of the ICMP Type 2 Kind", RFC 7690, DOI
      10.17487/RFC7690, January 2016.

[MTU issues in Tunneling] P. Savola, "MTU and Fragmentation Issues
      with In-the-Network Tunneling", RFC 4459, DOI
      10.17487/RFC4459, April 2006.

[IP Tunnels] J. Touch, M. Townsley, "IP Tunnels in the Internet
      Architecture", draft-ietf-intarea-tunnels-10 (work in
      progress), September 2019.

[IP Encapsulation] C. Perkins, "IP Encapsulation within IP", RFC
      2003, DOI 10.17487/RFC2003, October 1996.

[IPSec] S. Kent, K.
      Seo, "Security Architecture for the Internet
      Protocol", RFC 4301, DOI 10.17487/RFC4301, December 2005.

[IPv6 GRE] C. Pignataro, R. Bonica, S. Krishnan, "IPv6 Support for
      Generic Routing Encapsulation (GRE)", RFC 7676, DOI
      10.17487/RFC7676, October 2015.

[MPLS Encapsulation] T. Worster, Y. Rekhter, E. Rosen,
      "Encapsulating MPLS in IP or Generic Routing Encapsulation
      (GRE)", RFC 4023, DOI 10.17487/RFC4023, March 2005.

[L2TPv3] J. Lau, M. Townsley, I. Goyret, "Layer Two Tunneling
      Protocol - Version 3 (L2TPv3)", RFC 3931, DOI
      10.17487/RFC3931, March 2005.

[VxLAN] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T.
      Sridhar, M. Bursell, C. Wright, "Virtual eXtensible Local
      Area Network (VXLAN): A Framework for Overlaying
      Virtualized Layer 2 Networks over Layer 3 Networks", RFC
      7348, DOI 10.17487/RFC7348, August 2014.

[NVO3] J. Gross, I. Ganga, T. Sridhar, "Geneve: Generic Network
      Virtualization Encapsulation", RFC 8926, DOI
      10.17487/RFC8926, November 2020.

[L3VPN] E. Rosen, Y. Rekhter, "BGP/MPLS IP Virtual Private Networks
      (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

[EVPN] A. Sajassi, R. Aggarwal, N. Bitar, A. Isaac, J. Uttaro, J.
      Drake, W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC
      7432, DOI 10.17487/RFC7432, February 2015.

[Huston-2020] Huston, G., "Measurement of IPv6 Extension Header
      Support", NPS/CAIDA 2020 Virtual IPv6 Workshop, 2020.

[Huston-2016] Huston, G., "Fragmenting IPv6", Blog Post, 2016.

[Fragile Fragmentation] R. Bonica, F. Baker, G. Huston, R. Hinden,
      O. Troan, F. Gont, "IP Fragmentation Considered Fragile",
      RFC 8900, DOI 10.17487/RFC8900, September 2020.

7.2. Informative References

[PLPMTUD] M. Mathis, J. Heffner, "Packetization Layer Path MTU
      Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

[DPLPMTUD] G. Fairhurst, T. Jones, M.
      Tuexen, I. Ruengeler, T.
      Voelker, "Packetization Layer Path MTU Discovery for
      Datagram Transports", RFC 8899, DOI 10.17487/RFC8899,
      September 2020.

[SRH] C. Filsfils, D. Dukes, S. Previdi, J. Leddy, S. Matsushima, D.
      Voyer, "IPv6 Segment Routing Header (SRH)", RFC 8754, DOI
      10.17487/RFC8754, March 2020.

[PMTU by HbH] R. Hinden, G. Fairhurst, "IPv6 Minimum Path MTU
      Hop-by-Hop Option", draft-hinden-6man-mtu-option-02 (work in
      progress), July 2019.

[PMTU by ISIS] Z. Hu, Y. Zhu, Z. Li, L. Dai, "IS-IS Extensions for
      Path MTU", draft-hu-lsr-isis-path-mtu-00 (work in
      progress), June 2018.

[PMTU by PCEP] S. Peng, C. Li, L. Han, "Support for Path MTU (PMTU)
      in the Path Computation Element (PCE) communication
      Protocol (PCEP)", draft-li-pce-pcep-pmtu-03 (work in
      progress), October 2020.

[PMTU by BGP-LS] Y. Zhu, Z. Hu, G. Yan, J. Yao, "BGP-LS Extensions
      for Advertising Path MTU",
      draft-zhu-idr-bgp-ls-path-mtu-05 (work in progress),
      November 2020.

[PMTU by SR-Policy] C. Li, Y. Zhu, A. Sawaf, Z. Li, "Segment Routing
      Path MTU in BGP", draft-li-idr-sr-policy-path-mtu-03 (work
      in progress), November 2019.

[Generic Fragmentation] Z. Zhang, R. Bonica, K. Kompella, "Generic
      Transport Functions",
      draft-zzhang-tsvwg-generic-transport-functions-00 (work in
      progress), November 2020.

[Pseudowire Fragmentation] A. Malis, M. Townsley, "Pseudowire
      Emulation Edge-to-Edge (PWE3) Fragmentation and
      Reassembly", RFC 4623, DOI 10.17487/RFC4623, August 2006.

8.
Acknowledgments

Thanks to the v6ops working group for the problem discussion.

Authors' Addresses

Eduard Vasilenko
Huawei Technologies
17/4 Krylatskaya st, Moscow, Russia 121614

Email: Vasilenko.Eduard@huawei.com

Xiao Xipeng
Huawei Technologies
205 Hansaallee, 40549 Dusseldorf, Germany

Email: Xipengxiao@huawei.com

Dmitriy Khaustov
Rostelecom
13/2 Nikoloyamskaya st, Moscow, Russia 109240

Email: Dmitriy.Khaustov@rt.ru