idnits 2.17.00 (12 Aug 2021) /tmp/idnits6042/draft-stein-pwe3-pwbonding-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 470. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 481. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 488. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 494. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 177: '... MUST use the PW control word [2]. However, as we shall see in the...' RFC 2119 keyword, line 270: '...y implementation MUST support the roun...' RFC 2119 keyword, line 274: '...wn the first leaky bucket mode MUST be...' RFC 2119 keyword, line 276: '... MAY be used....' RFC 2119 keyword, line 350: '...net header there MAY be an MPLS header...' (1 more instance...) 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 2, 2008) is 4941 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 4447 (ref. '3') (Obsoleted by RFC 8077) == Outdated reference: A later version (-03) exists of draft-bryant-filsfils-fat-pw-02 Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 PWE3 Y(J). Stein 3 Internet-Draft I. Mendelsohn 4 Intended status: Standards Track R. Insler 5 Expires: May 6, 2009 RAD Data Communications 6 November 2, 2008 8 PW Bonding 9 draft-stein-pwe3-pwbonding-01.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 
18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on May 6, 2009. 36 Copyright Notice 38 Copyright (C) The IETF Trust (2008). 40 Abstract 42 There are times when pseudowires must be transported over physical 43 links with limited bandwidth. We shall use the term "bonding" (also 44 variously known as inverse multiplexing, link aggregation, trunking, 45 teaming, etc.) to mean an efficient mechanism for separating the PW 46 traffic over several links. Unlike load balancing and equal cost 47 multipath, bonding makes no assumption that the PW traffic can be 48 decomposed into distinguishable flows, and thus bonding requires 49 delay compensation and packet reordering. Furthermore, PW bonding 50 can optionally track bandwidth constraints in order to minimize 51 packet loss. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 56 2. PW Bonding mechanism . . . . . . . . . . . . . . . . . . . . . 5 57 3. PW Dynamic Bandwidth Allocation . . . . . . . . . . . . . . . 6 58 4. Protocol Extensions . . . . . . . . . . . . . . . . . . . . . 7 59 5. Partial Path PW Bonding . . . . . . . . . . . . . . . . . . . 8 60 6. Applicability . . . . . . . . . . . . . . . . . . . . . . . . 9 61 7. Security Considerations . . . . . . . . . . . . . . . . . . . 10 62 8. 
IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 63 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 10 64 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 10 65 10.1. Normative References . . . . . . . . . . . . . . . . . . 10 66 10.2. Informative References . . . . . . . . . . . . . . . . . 11 67 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11 68 Intellectual Property and Copyright Statements . . . . . . . . . . 12 70 1. Introduction 72 Inverse multiplexing is any mechanism for transporting a single high 73 capacity traffic flow over multiple lower capacity paths. Inverse 74 multiplexing is also known as bonding, link load balancing, link 75 aggregation, trunking, teaming, concatenation, and multipath. In the 76 context of pseudowires we will use the term bonding. 78 Bonding has been defined for many transport technologies (and often 79 more than one mechanism has been developed for a single technology) 80 including TDM (contiguous and virtual concatenation VCAT), ATM (ATM 81 Forum's IMA and ITU's G.998.1 multi-pair bonding), Ethernet (802.3 82 link aggregation LAG and EFM PME aggregation), xDSL (the previous two 83 and G.998.3 time domain inverse multiplexing TDIM), PPP (MLPPP), and 84 in the context of IP transport, equal cost multipath (ECMP). 86 Regardless of the transport infrastructure, all bonding mechanisms 87 must confront a fundamental problem, namely that the constituent 88 paths will in general have different (and not necessarily constant) 89 propagation delays. Thus a mechanism must be employed to ensure in- 90 order delivery of the data units. Two solutions have been proposed 91 for this problem, namely performing differential delay compensation, 92 and decomposing the input into mutually distinct flows.
Methods 93 using the former solution (e.g., VCAT, TDIM) buffer the data from 94 each path at egress (e.g., VCAT buffers up to 1/2 second), and 95 introduce protocol elements to synchronize the paths before 96 recombining them. Methods using the latter solution (LAG, ECMP) skirt 97 the problem by consistently mapping data units from a given flow onto 98 the same constituent path, assuming that there is only the need to 99 maintain order inside each flow, and not across flows. 101 Methods employing differential delay compensation tend to be more 102 complex and to require large buffers, but are universally applicable. 103 Methods decomposing the input into flows depend on the existence of 104 such flows and on sniffing the input for their identification. Thus if 105 the input is a single large flow, or if it is not possible to 106 identify flows (e.g., due to lower layer encryption), or if it is 107 undesirably complex to do so, these methods may not be applicable. 109 Furthermore, methods decomposing the input into flows tacitly assume 110 that the hashing of flow identifiers onto tunnels results in fair 111 distribution of traffic. This is generally a good assumption when 112 there are a very large number of independent flows. Incorrect 113 distribution causes some underlying paths to become congested and 114 drop packets, while others are relatively underutilized. With direct 115 inverse multiplexing using differential delay compensation, one can 116 ensure fairness, and can in fact adapt to underlying paths with 117 unequal and even time varying capacity. 119 In the context of pseudowires a decomposition mechanism has been 120 previously proposed [5]. The present draft proposes a PW bonding 121 mechanism based on direct inverse multiplexing with differential 122 delay compensation. In particular, the proposed mechanism may be 123 used when PWs are supported by DSL links. 125 The simplest scenario for PW-bonding is depicted in Figure 1.
Here 126 the entire PW is transported edge to edge over separate PW 127 components, each inside a distinct transport tunnel. A somewhat more 128 complex scenario is partial path bonding, as depicted in Figure 2, 129 where only a portion of the PW path is bandwidth restricted. Here 130 only the PW components are shown, and not the tunnels into which they 131 are placed. Here it is required to separate the PW into components 132 in separate tunnels at some point inside the network. However, since 133 the P device where this happens is not PW aware, the PW components must 134 still be defined by the ingress PE.
136           +--------+                        +--------+
137           |   PE   |                        |   PE   |
138           |        |        tunnel 1        |        |
139           |        X========================X        |
140           |        |     PW component 1     |        |
141           |        X------------------------X        |
142           |        |                        |        |
143           |        X========================X        |
144           |        |                        |        |
145      AC   |        |                        |        |   AC
146    -------o        |                        |        o-------
147           |        |                        |        |
148           |        |        tunnel 2        |        |
149           |        X========================X        |
150           |        |     PW component 2     |        |
151           |        X------------------------X        |
152           |        |                        |        |
153           |        X========================X        |
154           |        |                        |        |
155           +--------+                        +--------+
157 Figure 1. edge-to-edge PW bonding - 2 PW components in tunnels
158          +------+          +-----+                +------+
159          |  PE  |          |  P  |                |  PE  |
160          |      |          |     |  PW component  |      |
161          |      |          |     X================X      |
162          |      |          |     |                |      |
163      AC  |      |          |     |                |      |  AC
164    ------o      |    PW    |     |  PW component  |      o------
165          |      X==========X     X================X      |
166          |      |          |     |                |      |
167          |      |          |     |                |      |
168          |      |          |     |  PW component  |      |
169          |      X          |     X================X      |
170          |      |          |     |                |      |
171          +------+          +-----+                +------+
173 Figure 2. partial path PW bonding - 3 PW components
175 Each PW component will normally receive a distinct PW label, and thus 176 seem to the network to be a distinct PW. Furthermore, PW components 177 MUST use the PW control word [2]. However, as we shall see in the 178 next section, the sequence number generation and processing is 179 different for PW components than for true PWs. 181 2.
PW Bonding mechanism 183 As discussed in the previous section, at the egress PE the traffic 184 from each PW component is buffered, and the protocol is responsible 185 for ensuring that packets constituting the PW are reassembled in 186 correct order. This is accomplished by mandating use of the PW 187 control word, and sharing a single sequence number series across all 188 PW components making up the PW. The sequence numbers are used by the 189 egress PE to ensure proper ordering. The idea is depicted in 190 Figure 3, for the simple case of edge-to-edge bonding. Here eight 191 packets are divided amongst three PW components by the ingress PE, 192 according to a bandwidth allocation algorithm to be described later. 193 Due to different link latencies, the packets arrive at the egress out 194 of order, but are easily reordered by the egress PE by observing the 195 sequence number.
197                   +------+          +---------------+
198                   |  PE  |          |      PE       |
199                   |      |  1 2 7   |               |
200                   |      X==========X               |
201                   |      |          |               |
202    1 2 3 4 5 6 7 8|      |          |1 3 2 4 5 7 6 8|1 2 3 4 5 6 7 8
203    ---------------o      |  3 4 8   |               o---------------
204          PW       |      X==========X               |
205                   |      |          |               |
206                   |      |          |               |
207                   |      |  5 6     |               |
208                   |      X==========X               |
209                   |      |          |               |
210                   +------+          +---------------+
212 Figure 3. Use of sequence numbers to ensure correct packet ordering
214 In order to enable reordering, the egress PE must allocate sufficient 215 buffer memory to sustain the largest expected differential delay. 216 The differential delay is added to the latencies of all packets, 217 making the effective latency equal to that of the slowest PW 218 component. 220 3. PW Dynamic Bandwidth Allocation 222 In the simplest case, all packets to be sent over the various PW 223 components are of the same size, and all PW components support the 224 same data rates.
For this case (but only for this case), a simple 225 round-robin algorithm for distributing the packets onto PW components 226 is optimal in the sense that it minimizes the probability of packet 227 loss due to buffer exhaustion. 229 The simple round-robin algorithm is not optimal when the packets are 230 not all of the same size, or when the PW components do not all 231 support the same data rate, or both. In such cases we need to fairly 232 distribute data bytes over the components in such fashion as to 233 minimize the probability that a packet will be dropped due to over- 234 run of a component's buffer. While the packet sizes are always known 235 before transmission, the states of the buffers are usually unknown, 236 and in some cases the supported data rates may be unknown. The 237 following discussion will be for the edge-to-edge component case; the 238 partial path case is similar, but requires separate consideration of 239 the two directions. 241 If the packet size is not constant, and the component rates are 242 known, but we have no further information (e.g., we do not know the 243 size of the buffers, nor do we have feedback from the egress PE on 244 the actual fill states) the best algorithm for an ingress PE is based 245 on a leaky bucket scheme. In this scheme the ingress PE maintains, 246 for each PW component, a variable Bn that approximately tracks the 247 fill state of the egress PE's buffer for this component. The 248 variable Bn is continually decreased at a rate equal to the data rate 249 of the component n, but always remains non-negative. Each time a 250 packet is sent over PW component n, its size in bytes is added to Bn. 251 When a new packet needs to be sent, the ingress PE sends it on the PW 252 component with minimal Bn. This algorithm can also be used when it 253 can be assumed that the component rates are equal, or approximately 254 so.
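The first leaky bucket mode described above can be sketched as follows. This is an illustrative sketch only, not part of the draft: the class name LeakyBucketDistributor, the method name select, the choice of bytes per second as the rate unit, and the use of caller-supplied timestamps are all assumptions made for the example. Ties between equal Bn values are broken in favor of the lowest component index, which the draft does not specify.

```python
class LeakyBucketDistributor:
    """Hypothetical ingress-PE helper: one leaky bucket Bn per PW component."""

    def __init__(self, rates_bytes_per_s):
        # Assumed: component data rates are known, in bytes per second.
        self.rates = list(rates_bytes_per_s)
        self.fill = [0.0] * len(self.rates)   # the variables Bn
        self.last = 0.0                       # time of the last update

    def _drain(self, now):
        # Decrease each Bn at its component's data rate, never below zero.
        elapsed = max(0.0, now - self.last)
        self.fill = [max(0.0, b - r * elapsed)
                     for b, r in zip(self.fill, self.rates)]
        self.last = now

    def select(self, size_bytes, now):
        """Return the index of the component on which to send the packet."""
        self._drain(now)
        # Choose the component with minimal Bn (lowest index wins a tie).
        n = min(range(len(self.fill)), key=self.fill.__getitem__)
        # Account for the packet by adding its size in bytes to Bn.
        self.fill[n] += size_bytes
        return n


# Usage: two components, the first twice as fast as the second.
dist = LeakyBucketDistributor([2000.0, 1000.0])
first = dist.select(100, now=0.0)    # both buckets empty, component 0 chosen
second = dist.select(100, now=0.0)   # component 1 now has the smaller Bn
```

With equal rates and equal packet sizes arriving back to back, this selection degenerates to round-robin, which is consistent with the statement that the algorithm can also be used when the component rates are approximately equal.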
256 If in addition to packet size and PW component data rates, the 257 ingress PE knows the buffer size used for differential compensation, 258 a similar, but somewhat better, algorithm can be used. When deciding 259 over which component to send the packet, rather than choosing the 260 minimal Bn, the ingress PE chooses the maximal Bn to which the packet 261 size can be added without overflowing the given buffer size. In 262 practice some extra margin must be applied in order to account for 263 PDV. 265 Finally, if the egress PE can send information on the actual state of 266 its buffers back to the ingress PE, then an algorithm that uses these 267 buffer states instead of the approximated leaky bucket ones can be 268 employed. 270 Any implementation MUST support the round-robin method, and SHOULD 271 support the first leaky bucket mode. Control protocol extensions are 272 needed to enable communication from egress back to ingress of the 273 additional information needed to support more optimal modes. If the 274 rates can be accurately known the first leaky bucket mode MUST be 275 used, and if further information is available then other mechanisms 276 MAY be used. 278 4. Protocol Extensions 280 In order to set up the PW components using the PWE3 control protocol 281 [3] a single PWid or generalized PWid is assigned to the logical PW, 282 and additional PWids or generalized PWids are allocated for the PW 283 components. All PW components are assigned an identical group ID, in 284 order to indicate their relationship, and to enable easy withdrawal 285 of the logical PW. First the logical PW is set up using a label 286 mapping message containing the interface parameters, and a new 287 "bonding" sub-TLV containing the group ID. Subsequently the PW 288 components are configured. Each PW component is assigned to a 289 distinct transport tunnel by mechanisms not specified here.
291 Attachment circuit faults are signaled via PW status messages 292 associated with the PWid or generalized PWid of the logical PW. PW 293 component faults and capacity indicators are sent via status messages 294 per PW component PWid or generalized PWid. 296 Enhancements to the PWE3 control protocol are needed in order to 297 associate PW components with distinct labels in distinct tunnels to a 298 single logical PW, and to communicate component capacity and status 299 information. The format of these LDP extensions will be detailed in 300 the next version of this draft. 302 Standard VCCV mechanisms [4] may be used independently for each PW 303 component, and the resulting connectivity information may be used by 304 the ingress PE in the process of distributing traffic over PW 305 components. VCCV for the partial path scenario is for further study. 307 5. Partial Path PW Bonding 309 When only a portion of the PW's path suffers from bandwidth 310 constriction, the partial path bonding scenario depicted in Figure 2 311 is used. As for the regular bonding case, the ingress PE decomposes 312 the input into multiple PW components, and performs the same 313 algorithm to decide into which component to send a given packet. For 314 those portions of the network where a single tunnel can support the 315 entire service bandwidth, the PW components may all be placed in 316 the same transport tunnel. For constricted bandwidth segments, each 317 PW component must be placed in a distinct tunnel. The distinct 318 transport tunnels are merged into the single tunnel using label 319 merging, per section 3.26.2 of [1]. 321 Another case of practical interest is when the bandwidth is 322 restricted in a non-MPLS access network, and the PE terminating the 323 MPLS cannot inverse multiplex the traffic onto low capacity links 324 based on PW labels alone.
This case arises for a DSLAM terminating 325 MPLS (or a PE terminating MPLS upstream from the DSLAM) and 326 forwarding to customers solely based on Ethernet MAC address (and 327 possibly VLAN ID). For such a case a double PW encapsulation may be 328 used. Through the core network we tunnel an Ethernet PW, which 329 itself carries the bonded PW components (which may be of any type 330 supported by PWE encapsulations), see Figure 4.
332    +--------------------+
333    |  MPLS label stack  |
334    +--------------------+
335    | exterior PW label  |
336    +--------------------+
337    |  Ethernet header   |
338    +--------------------+
339    | interior PW label  |
340    +--------------------+
341    |    control word    |
342    +--------------------+
343    |      payload       |
344    +--------------------+
346 Figure 4. packet format for DSL partial path scenario
348 The DSLAM (or PE immediately upstream from the DSLAM) terminates the 349 MPLS and exterior PW protocols, thus exposing the Ethernet header. 350 Under the Ethernet header there MAY be an MPLS header (which the CE 351 negotiates with the immediately upstream PE), and there MUST be an 352 interior PW label (which the CE negotiates with the remote CE or PE). 353 Based purely on the Ethernet addressing the DSLAM distributes the 354 traffic over multiple DSL links following the partition crafted by 355 the ingress PE. All of these DSL links terminate on a single CE 356 device, which terminates the Ethernet and exposes the interior PW labels 357 and the sequence numbers in the control word. Using these sequence 358 numbers the CE can thus piece together the original traffic stream. 360 6. Applicability 362 PW bonding is a useful mechanism when the bandwidth of available 363 physical links is insufficient to carry the user traffic, but several 364 links can be dedicated. Unlike load balancing and equal cost 365 multipath mechanisms, PW bonding makes no assumption that the PW 366 traffic can be decomposed into distinguishable flows.
It is fully 367 applicable for non-IP or encrypted traffic. By using mechanisms 368 described above, PW bonding can approach full utilization of the 369 aggregate link bandwidth. 371 PW bonding involves delay compensation and packet reordering, and 372 thus requires allocation of sufficient memory at the egress PE. The 373 amount of memory needed is proportional to the link speed and to the 374 difference in propagation delay between the fastest and slowest 375 links. Thus PW bonding is most applicable when the link speeds are 376 low (e.g., supported by DSL lines), and the delay differences are 377 small. 379 Only the PEs need to know that the PW components are not full PWs 380 (the only difference being the sequence number processing). Thus PW 381 bonding requires changes only to the PEs and does not require any 382 changes to the intervening PSN. 384 7. Security Considerations 386 PW bonding does not introduce security considerations above those 387 present for regular PWs. In particular, attacks based on sequence 388 number manipulation are of concern. For partial path cases where CE 389 devices participate in the PWE signaling, authentication is required. 391 8. IANA Considerations 393 Required extensions to the PWE3 control protocol, including the sub- 394 TLV type code for the PW component label, and new PW status codes, 395 will be detailed in the next version of this draft. 397 9. Acknowledgments 399 The authors would like to thank Gabriel Zigelboim for fruitful 400 discussions on optimal dynamic allocation mechanisms. 402 10. References 404 10.1. Normative References 406 [1] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol Label 407 Switching Architecture", RFC 3031, January 2001. 409 [2] Bryant, S., Swallow, G., Martini, L., and D. McPherson, 410 "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for Use 411 over an MPLS PSN", RFC 4385, February 2006. 413 [3] Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G. 
Heron, 414 "Pseudowire Setup and Maintenance Using the Label Distribution 415 Protocol (LDP)", RFC 4447, April 2006. 417 [4] Nadeau, T. and C. Pignataro, "Pseudowire Virtual Circuit 418 Connectivity Verification (VCCV): A Control Channel for 419 Pseudowires", RFC 5085, December 2007. 421 10.2. Informative References 423 [5] Bryant, S., Filsfils, C., and U. Drafz, "Load Balancing Fat MPLS 424 Pseudowires", draft-bryant-filsfils-fat-pw-02 (work in 425 progress), July 2008. 427 Authors' Addresses 429 Yaakov (Jonathan) Stein 430 RAD Data Communications 431 24 Raoul Wallenberg St., Bldg C 432 Tel Aviv 69719 433 ISRAEL 435 Phone: +972 3 645-5389 436 Email: yaakov_s@rad.com 438 Itai Mendelsohn 439 RAD Data Communications 440 24 Raoul Wallenberg St., Bldg C 441 Tel Aviv 69719 442 ISRAEL 444 Phone: +972 3 645-5761 445 Email: itai_m@rad.com 447 Ron Insler 448 RAD Data Communications 449 24 Raoul Wallenberg St., Bldg C 450 Tel Aviv 69719 451 ISRAEL 453 Phone: +972 3 645-5445 454 Email: ron_i@rad.com 456 Full Copyright Statement 458 Copyright (C) The IETF Trust (2008). 460 This document is subject to the rights, licenses and restrictions 461 contained in BCP 78, and except as set forth therein, the authors 462 retain all their rights. 464 This document and the information contained herein are provided on an 465 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 466 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 467 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 468 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 469 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 470 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
472 Intellectual Property 474 The IETF takes no position regarding the validity or scope of any 475 Intellectual Property Rights or other rights that might be claimed to 476 pertain to the implementation or use of the technology described in 477 this document or the extent to which any license under such rights 478 might or might not be available; nor does it represent that it has 479 made any independent effort to identify any such rights. Information 480 on the procedures with respect to rights in RFC documents can be 481 found in BCP 78 and BCP 79. 483 Copies of IPR disclosures made to the IETF Secretariat and any 484 assurances of licenses to be made available, or the result of an 485 attempt made to obtain a general license or permission for the use of 486 such proprietary rights by implementers or users of this 487 specification can be obtained from the IETF on-line IPR repository at 488 http://www.ietf.org/ipr. 490 The IETF invites any interested party to bring to its attention any 491 copyrights, patents or patent applications, or other proprietary 492 rights that may cover technology that may be required to implement 493 this standard. Please address the information to the IETF at 494 ietf-ipr@ietf.org. 496 Acknowledgment 498 Funding for the RFC Editor function is provided by the IETF 499 Administrative Support Activity (IASA).