MPLS Working Group                        Vishal Sharma (Metanoia, Inc.)
Informational Track                  Fiffi Hellstrand (Nortel Networks)
Expires: November 2002                          Ben Mack-Crane (Tellabs)
                                                          Srinivas Makam
                                            Ken Owens (Erlang Technology)
                                   Changcheng Huang (Carleton University)
                                               Jon Weil (Nortel Networks)
                                                    Loa Anderson (Utfors)
                                         Bilel Jamoussi (Nortel Networks)
                                                     Brad Cain (Storigen)
                                            Angela Chiu (Celion Networks)

                                                                May 2002

                   Framework for MPLS-based Recovery

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

Multi-protocol label switching (MPLS) integrates the label-swapping
forwarding paradigm with network layer routing. To deliver reliable
service, MPLS requires a set of procedures to provide protection of
the traffic carried on different paths. This requires that the label
switched routers (LSRs) support fault detection, fault notification,
and fault recovery mechanisms, and that MPLS signaling support the
configuration of recovery. With these objectives in mind, this
document specifies a framework for MPLS-based recovery.

Table of Contents

1. Introduction
1.1. Background
1.2. Motivation for MPLS-Based Recovery
1.3. Objectives/Goals
2. Overview
2.1. Recovery Models
2.1.1 Rerouting
2.1.2 Protection Switching
2.2. The Recovery Cycles
2.2.1 MPLS Recovery Cycle Model
2.2.2 MPLS Reversion Cycle Model
2.2.3 Dynamic Re-routing Cycle Model
2.3. Definitions and Terminology
2.3.1 General Recovery Terminology
2.3.2 Failure Terminology
2.4. Abbreviations
3. MPLS-based Recovery Principles
3.1. Configuration of Recovery
3.2. Initiation of Path Setup
3.3. Initiation of Resource Allocation
3.4. Scope of Recovery
3.4.1 Topology
3.4.1.1 Local Repair
3.4.1.2 Global Repair
3.4.1.3 Alternate Egress Repair
3.4.1.4 Multi-Layer Repair
3.4.1.5 Concatenated Protection Domains
3.4.2 Path Mapping
3.4.3 Bypass Tunnels
3.4.4 Recovery Granularity
3.4.4.1 Selective Traffic Recovery
3.4.4.2 Bundling
3.4.5 Recovery Path Resource Use
3.5. Fault Detection
3.6. Fault Notification
3.7. Switch-Over Operation
3.7.1 Recovery Trigger
3.7.2 Recovery Action
3.8. Post Recovery Operation
3.8.1 Fixed Protection Counterparts
3.8.1.1 Revertive Mode
3.8.1.2 Non-revertive Mode
3.8.2 Dynamic Protection Counterparts
3.8.3 Restoration and Notification
3.8.4 Reverting to Preferred Path (or Controlled Rearrangement)
3.9. Performance
4. MPLS Recovery Features
5. Comparison Criteria
6. Security Considerations
7. Intellectual Property Considerations
8. Acknowledgements
9. Authors' Addresses
10. References

1. Introduction

This memo describes a framework for MPLS-based recovery. We provide a
detailed taxonomy of recovery terminology, and discuss the motivation
for, the objectives of, and the requirements for MPLS-based recovery.
We outline principles for MPLS-based recovery, and also provide
comparison criteria that may serve as a basis for comparing and
evaluating different recovery schemes.

At points in the document, we provide some thoughts about the
operation or viability of certain recovery objectives. These should
be viewed as the opinions of the authors, and not the consolidated
views of the IETF.

1.1. Background

Network routing deployed today is focused primarily on connectivity,
and typically supports only one class of service, the best effort
class.
Multi-protocol label switching [1], on the other hand, by integrating
forwarding based on label swapping of a link-local label with network
layer routing, allows flexibility in the delivery of new routing
services. MPLS allows the use of media-specific forwarding mechanisms
such as label swapping. This enables sophisticated features such as
quality of service (QoS) and traffic engineering [2] to be
implemented more effectively. An important component of providing
QoS, however, is the ability to transport data reliably and
efficiently. Although the current routing algorithms are robust and
survivable, the amount of time they take to recover from a fault can
be significant, on the order of several seconds or minutes, causing
disruption of service for some applications in the interim. This is
unacceptable in situations where the aim is to provide a highly
reliable service, with recovery times on the order of seconds down to
tens of milliseconds.

MPLS recovery is motivated by the notion that there are limitations
to improving the recovery times of current routing algorithms;
additional improvement can be obtained by augmenting these algorithms
with MPLS recovery mechanisms [3]. Since MPLS is a possible
technology of choice in future IP-based transport networks, it is
useful that MPLS be able to provide protection and restoration of
traffic. MPLS may facilitate the convergence of network functionality
on a common control and management plane. Further, a protection
priority could be used as a differentiating mechanism for premium
services that require high reliability. The remainder of this
document provides a framework for MPLS-based recovery. It is focused
at a conceptual level and is meant to address motivation, objectives,
and requirements.
Issues of mechanism, policy, routing plans, and characteristics of
traffic carried by recovery paths are beyond the scope of this
document.

1.2. Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based recovery) is
useful for a number of reasons. The most important is its ability to
increase network reliability by enabling a faster response to faults
than is possible with traditional Layer 3 (or IP layer) approaches
alone, while still providing the visibility of the network afforded
by Layer 3. Furthermore, a protection mechanism using MPLS could
enable IP traffic to be placed directly over WDM optical channels and
provide a recovery option without an intervening SONET layer. This
would facilitate the construction of IP-over-WDM networks that
require a fast recovery capability.

The need for MPLS-based recovery arises because of the following:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network
that needs to support recovery times smaller than the convergence
times of IP routing protocols.

II. Layer 0 (for example, optical layer) or Layer 1 (for example,
SONET) mechanisms may make wasteful use of resources.

III. The granularity at which the lower layers may be able to protect
traffic may be too coarse for traffic that is switched using
MPLS-based mechanisms.

IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher
layer operations. Thus, while they may provide, for example, link
protection, they cannot easily provide node protection or protection
of traffic transported at Layer 3. Further, this may prevent the
lower layers from providing restoration based on the traffic's needs:
for example, fast restoration for traffic that needs it, and slower
restoration (with possibly more optimal use of resources) for traffic
that does not require fast restoration.
In networks where the latter class of traffic is dominant, providing
fast restoration to all classes of traffic may not be cost effective
from a service provider's perspective.

V. MPLS has desirable attributes when applied to the purpose of
recovery for connectionless networks; specifically, that an LSP is
source routed, so a forwarding path for recovery can be "pinned" and
is not affected by transient instability in SPF routing brought on by
failure scenarios.

VI. Interoperability of protection mechanisms between routers/LSRs
from different vendors in IP or MPLS networks is desired, to enable
recovery mechanisms to work in a multivendor environment and to
enable the transition of certain protected services to an MPLS core.

1.3. Objectives/Goals

The following are some important goals for MPLS-based recovery.

Ia. MPLS-based recovery mechanisms may be subject to the traffic
engineering goal of optimal use of resources.

Ib. MPLS-based recovery mechanisms should aim to facilitate
restoration times that are sufficiently fast for the end-user
application, that is, times that better match the end-user
application's requirements. In some cases, this may be as short as
tens of milliseconds.

We observe that Ia and Ib are conflicting objectives, and a trade-off
exists between them. The optimal choice depends on the end-user
application's sensitivity to restoration time, the cost impact of
introducing restoration in the network, and the end-user
application's sensitivity to cost.

II. MPLS-based recovery should aim to maximize network reliability
and availability. MPLS-based recovery of traffic should aim to
minimize the number of single points of failure in the MPLS protected
domain.

III. MPLS-based recovery should aim to enhance the reliability of the
protected traffic while minimally or predictably degrading the
traffic carried by the diverted resources.

IV. MPLS-based recovery techniques should aim to be applicable for
protection of traffic at various granularities. For example, it
should be possible to specify MPLS-based recovery for a portion of
the traffic on an individual path, for all traffic on an individual
path, or for all traffic on a group of paths. Note that "path" is
used as a general term and includes the notion of a link, IP route,
or LSP.

V. MPLS-based recovery techniques may be applicable for an entire
end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery mechanisms should aim to take into
consideration the recovery actions of lower layers. MPLS-based
mechanisms should not trigger lower layer protection switching.

VII. MPLS-based recovery mechanisms should aim to minimize the loss
of data and packet reordering during recovery operations. (The
current MPLS specification itself has no explicit requirement on
reordering.)

VIII. MPLS-based recovery mechanisms should aim to minimize the state
overhead incurred for each recovery path maintained.

IX. MPLS-based recovery mechanisms should aim to preserve the
constraints on traffic after switchover, if desired. That is, if
desired, the recovery path should meet the resource requirements of,
and achieve the same performance characteristics as, the working
path.

We observe that some of the above are conflicting goals, and real
deployment will often involve engineering compromises based on a
variety of factors such as cost, end-user application requirements,
network efficiency, and revenue considerations. Thus, these goals are
subject to trade-offs based on the above considerations.

2. Overview

There are several options for providing protection of traffic. The
most generic requirement is the specification of whether recovery
should be via Layer 3 (or IP) rerouting or via MPLS protection
switching or rerouting actions.

Generally, network operators aim to provide the fastest and best
protection mechanism that can be provided at a reasonable cost. The
higher the level of protection, the more resources are consumed;
therefore, it is expected that network operators will offer a
spectrum of service levels. MPLS-based recovery should give operators
the flexibility to select the recovery mechanism, choose the
granularity at which traffic is protected, and choose the specific
types of traffic that are protected, giving them more control over
that trade-off. With MPLS-based recovery, it is possible to provide
different levels of protection for different classes of service,
based on their service requirements. For example, using approaches
outlined below, a Virtual Leased Line (VLL) service or real-time
applications like Voice over IP (VoIP) may be supported using
link/node protection together with pre-established, pre-reserved path
protection. Best effort traffic, on the other hand, may use path
protection that is established on demand, or may simply rely on IP
reroute or higher layer recovery mechanisms. As another example of
their range of application, MPLS-based recovery strategies may be
used to protect traffic not originally flowing on label switched
paths, such as IP traffic that is normally routed hop-by-hop, as well
as traffic forwarded on label switched paths.

2.1. Recovery Models

There are two basic models for path recovery: rerouting and
protection switching.

Protection switching and rerouting, as defined below, may be used
together.
For example, protection switching to a recovery path may be used for
rapid restoration of connectivity, while rerouting determines a new
optimal network configuration, rearranging paths as needed at a later
time.

2.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path
segments on demand for restoring traffic after the occurrence of a
fault. The new paths may be based upon fault information, network
routing policies, pre-defined configurations, and network topology
information. Thus, upon detecting a fault, paths or path segments to
bypass the fault are established using signaling.

Once the network routing algorithms have converged after a fault, it
may be preferable, in some cases, to reoptimize the network by
performing a reroute based on the current state of the network and
network policies. This is discussed further in Section 3.8.

In terms of the principles defined in Section 3, reroute recovery
employs paths established on demand with resources reserved on
demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery
path or path segment, based upon network routing policies, the
restoration requirements of the traffic on the working path, and
administrative considerations. The recovery path may or may not be
link and node disjoint with the working path. However, if the
recovery path shares sources of failure with the working path, the
overall reliability of the construct is degraded. When a fault is
detected, the protected traffic is switched over to the recovery
path(s) and restored.

In terms of the principles in Section 3, protection switching employs
pre-established recovery paths and, if resource reservation is
required on the recovery path, pre-reserved resources. The various
sub-types of protection switching are detailed in Section 3.4 of this
document.

2.2. The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the
MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first
cycle detects a fault and restores traffic onto MPLS-based recovery
paths. If the recovery path is non-optimal, the cycle may be followed
by either of the two latter cycles to achieve an optimized network
again. The reversion cycle applies to explicitly routed traffic that
does not rely on any dynamic routing protocols to converge. The
dynamic re-routing cycle applies to traffic that is forwarded based
on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1.
Definitions and a key to abbreviations follow.

 --Network Impairment
|    --Fault Detected
|   |    --Start of Notification
|   |   |    --Start of Recovery Operation
|   |   |   |    --Recovery Operation Complete
|   |   |   |   |    --Path Traffic Restored
|   |   |   |   |   |
v   v   v   v   v   v
----------------------------------------------------------------
| T1 | T2 | T3 | T4 | T5 |

Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

T1 Fault Detection Time
T2 Hold-off Time
T3 Notification Time
T4 Recovery Operation Time
T5 Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the
moment the fault is detected by MPLS-based recovery mechanisms. This
time may be highly dependent on lower layer protocols.

Hold-Off Time

The configured waiting time between the detection of a fault and
taking MPLS-based recovery action, to allow time for lower layer
protection to take effect. The Hold-off Time may be zero.
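The hold-off behavior can be sketched as a simple timer loop that
polls whether the fault has been cleared by a lower layer before
committing to MPLS-based recovery. The sketch below is illustrative
only; the function and parameter names are our own assumptions, not
drawn from this framework or any MPLS specification.

```python
import time

# Illustrative sketch (names are assumptions, not from any MPLS
# specification): defer the MPLS-based recovery action after fault
# detection, giving lower layer protection a chance to act first.

def run_holdoff(fault_cleared, holdoff_seconds, poll_interval=0.01):
    """Wait up to holdoff_seconds. Return True if MPLS recovery is
    still needed when the hold-off expires (the fault persists),
    False if a lower layer cleared the fault during the interval."""
    deadline = time.monotonic() + holdoff_seconds
    while time.monotonic() < deadline:
        if fault_cleared():
            return False  # lower layer protection took effect
        time.sleep(poll_interval)
    return True  # hold-off expired; trigger MPLS-based recovery

# With a zero hold-off time, MPLS recovery is triggered immediately.
assert run_holdoff(lambda: False, holdoff_seconds=0) is True
```

A zero hold-off time, as the definition above permits, simply makes
the loop fall through to the MPLS recovery action at once.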
Note: The Hold-Off Time may occur after the Notification Time
interval if the node responsible for the switchover, the Path Switch
LSR (PSL), rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by the
LSR detecting the fault and the time at which the Path Switch LSR
(PSL) begins the recovery operation. This is zero if the PSL detects
the fault itself or infers a fault from such events as an adjacency
failure.

Note: If the PSL detects the fault itself, there still may be a
Hold-Off Time period between detection and the start of the recovery
operation.

Recovery Operation Time

The time between the first and last recovery actions. This may
include message exchanges between the PSL and PML to coordinate
recovery actions.

Traffic Restoration Time

The time between the last recovery action and the time that the
traffic (if present) is completely recovered. This interval is
intended to account for the time required for traffic to once again
arrive at the point in the network that experienced disrupted or
degraded service due to the occurrence of the fault (e.g., the PML).
This time may depend on the location of the fault, the recovery
mechanism, and the propagation delay along the recovery path.

2.2.2 MPLS Reversion Cycle Model

Protection switching in revertive mode requires the traffic to be
switched back to a preferred path when the fault on that path is
cleared. The MPLS reversion cycle model is illustrated in Figure 2.
Note that the cycle shown below comes after the recovery cycle shown
in Figure 1.
 --Network Impairment Repaired
|    --Fault Cleared
|   |    --Path Available
|   |   |    --Start of Reversion Operation
|   |   |   |    --Reversion Operation Complete
|   |   |   |   |    --Traffic Restored on Preferred Path
|   |   |   |   |   |
v   v   v   v   v   v
-----------------------------------------------------------------
| T7 | T8 | T9 | T10 | T11 |

Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.

T7  Fault Clearing Time
T8  Wait-to-Restore Time
T9  Notification Time
T10 Reversion Operation Time
T11 Traffic Restoration Time

Note that time T6 (not shown above) is the time for which the network
impairment is not repaired and traffic is flowing on the recovery
path.

Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time that
MPLS-based mechanisms learn that the fault has been cleared. This
time may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and
MPLS-based recovery action(s). Waiting time may be needed to ensure
that the path is stable and to avoid flapping in cases where a fault
is intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time
interval if the PSL is configured to wait.

Notification Time

The time between initiation of a fault recovery signal (FRS) by the
LSR clearing the fault and the time at which the path switch LSR
begins the reversion operation. This is zero if the PSL clears the
fault itself.

Note: If the PSL clears the fault itself, there still may be a
Wait-to-Restore Time period between fault clearing and the start of
the reversion operation.

Reversion Operation Time

The time between the first and last reversion actions.
This may include message exchanges between the PSL and PML to
coordinate reversion actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic
(if present) is completely restored on the preferred path. This
interval is expected to be quite small, since both paths are working
and care may be taken to limit the traffic disruption (e.g., using
"make before break" techniques and synchronous switch-over).

In practice, the only interesting times in the reversion cycle are
the Wait-to-Restore Time and the Traffic Restoration Time (or some
other measure of traffic disruption). Given that both paths are
available, there is no need for rapid operation, and a
well-controlled switch-back with minimal disruption is desirable.

2.2.3 Dynamic Re-routing Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state
after a network impairment has occurred. A re-optimized network is
achieved after the routing protocols have converged and the traffic
is moved from a recovery path to a (possibly) new working path. The
steps involved in this mode are illustrated in Figure 3.

Note that the cycle shown below may be overlaid on the recovery cycle
shown in Figure 1, the reversion cycle shown in Figure 2, or both (in
the event that both the recovery cycle and the reversion cycle take
place before the routing protocols converge). After the convergence
of the routing protocols, it may be determined (based on on-line
algorithms or off-line traffic engineering tools, network
configuration, or a variety of other possible criteria) that there is
a better route for the working path.
 --Network Enters a Semi-stable State after an Impairment
|    --Dynamic Routing Protocols Converge
|   |    --Initiate Setup of New Working Path between PSL and PML
|   |   |    --Switchover Operation Complete
|   |   |   |    --Traffic Moved to New Working Path
|   |   |   |   |
v   v   v   v   v
-----------------------------------------------------------------
| T12 | T13 | T14 | T15 |

Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.

T12 Network Route Convergence Time
T13 Hold-down Time (optional)
T14 Switchover Operation Time
T15 Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for
the network routing protocols to converge and for the network to
reach a stable state.

Hold-down Time

We define the hold-down period as a bounded time for which a recovery
path must be used. In some scenarios, it may be difficult to
determine whether the working path is stable. In these cases, a
hold-down time may be used to prevent excessive flapping of traffic
between a working and a recovery path.

Switchover Operation Time

The time between the first and last switchover actions. This may
include message exchanges between the PSL and PML to coordinate the
switchover actions.

As an example of the recovery cycle, we present the sequence of
events that occurs after a network impairment, when a protection
switch is followed by dynamic rerouting:

I.    A link or path fault occurs.
II.   Signaling (FIS) is initiated for the detected fault.
III.  The FIS arrives at the PSL.
IV.   The PSL initiates a protection switch to a pre-configured
      recovery path.
V.    The PSL switches the traffic over from the working path to the
      recovery path.
VI.   The network enters a semi-stable state.
VII.  Dynamic routing protocols converge after the fault, and a new
      working path is calculated (based, for example, on some of the
      criteria mentioned in Section 2.1.1).
VIII. A new working path is established between the PSL and the PML
      (assuming the PSL and PML have not changed).
IX.   Traffic is switched over to the new working path.

2.3. Definitions and Terminology

This document assumes the terminology given in [1] and, in addition,
introduces the following new terms.

2.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are
created dynamically after the detection of a fault on the working
path. In other words, a recovery mechanism in which the recovery path
is not pre-established.

Protection Switching

A recovery mechanism in which the recovery path or path segments are
created prior to the detection of a fault on the working path. In
other words, a recovery mechanism in which the recovery path is
pre-established.

Working Path

The protected path that carries traffic before the occurrence of a
fault. The working path exists between a PSL and a PML. The working
path can be of different kinds: a hop-by-hop routed path, a trunk, a
link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a
fault. In other words, the path onto which the traffic is directed by
the recovery mechanism. The recovery path is established by MPLS
means. The recovery path can either be an equivalent recovery path,
ensuring no reduction in quality of service, or a limited recovery
path, which does not guarantee the same quality of service (or some
other criterion of performance) as the working path.
A 616 limited recovery path is not expected to be used for an extended 617 period of time. 619 Synonyms for a recovery path are: back-up path, alternative path, and 620 protection path. 622 Protection Counterpart 624 The "other" path when discussing pre-planned protection switching 625 schemes. The protection counterpart for the working path is the 626 recovery path and vice-versa. 628 Path Group (PG) 630 A logical bundling of multiple working paths, each of which is routed 631 identically between a Path Switch LSR and a Path Merge LSR. 633 Protected Path Group (PPG) 635 A path group that requires protection. 637 Protected Traffic Portion (PTP) 639 The portion of the traffic on an individual path that requires 640 protection. For example, code points in the EXP bits of the shim 641 header may identify a protected portion. 643 Path Switch LSR (PSL) 645 The PSL is responsible for switching or replicating the traffic 646 between the working path and the recovery path. 648 Path Merge LSR (PML) 650 An LSR that is responsible for receiving the recovery path traffic, 651 and either merges the traffic back onto the working path, or, if it 652 is itself the destination, passes the traffic on to the higher layer 653 protocols. 655 Intermediate LSR 657 An LSR on a working or recovery path that is neither a PSL nor a PML 658 for that path. 660 Bypass Tunnel 662 A path that serves to back up a set of working paths using the label 663 stacking approach [1]. The working paths and the bypass tunnel must 664 all share the same path switch LSR (PSL) and path merge LSR 665 (PML). 667 Switch-Over 668 The process of switching traffic from the path on which it is 669 currently flowing onto one or more alternate path(s). This may involve 670 moving traffic from a working path onto one or more recovery paths, 671 or may involve moving traffic from a recovery path(s) onto a more 672 optimal working path(s).
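The switch-over operation just defined (and its counterpart, switching traffic back) can be sketched as follows. This is a hypothetical illustration only; the class and attribute names are not defined by this framework and no particular implementation is mandated.

```python
# Hypothetical sketch of switch-over/switch-back at a Path Switch LSR (PSL).
# All names are illustrative; the framework mandates no implementation.

class PathSwitchLSR:
    def __init__(self, working_path, recovery_path):
        self.working_path = working_path    # protected path carrying traffic
        self.recovery_path = recovery_path  # pre-established or on-demand path
        self.active = working_path          # path currently carrying traffic

    def switch_over(self):
        """Move traffic from the working path onto the recovery path."""
        self.active = self.recovery_path

    def switch_back(self):
        """Return traffic from the recovery path to the working path."""
        self.active = self.working_path

psl = PathSwitchLSR(working_path="LSP-A", recovery_path="LSP-B")
psl.switch_over()
assert psl.active == "LSP-B"
psl.switch_back()
assert psl.active == "LSP-A"
```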
674 Switch-Back 676 The process of returning the traffic from one or more recovery paths 677 back to the working path(s). 679 Revertive Mode 681 A recovery mode in which traffic is automatically switched back from 682 the recovery path to the original working path upon the restoration 683 of the working path to a fault-free condition. This assumes a failed 684 working path does not automatically surrender resources to the 685 network. 687 Non-revertive Mode 689 A recovery mode in which traffic is not automatically switched back 690 to the original working path after this path is restored to a fault- 691 free condition. (Depending on the configuration, the original working 692 path may, upon moving to a fault-free condition, become the recovery 693 path, or it may be used for new working traffic, and be no longer 694 associated with its original recovery path). 696 MPLS Protection Domain 698 The set of LSRs over which a working path and its corresponding 699 recovery path are routed. 701 MPLS Protection Plan 703 The set of all LSP protection paths and the mapping from working to 704 protection paths deployed in an MPLS protection domain at a given 705 time. 707 Liveness Message 709 A message exchanged periodically between two adjacent LSRs that 710 serves as a link probing mechanism. It provides an integrity check of 711 the forward and the backward directions of the link between the two 712 LSRs as well as a check of neighbor aliveness. 714 Path Continuity Test 716 A test that verifies the integrity and continuity of a path or path 717 segment. The details of such a test are beyond the scope of this 718 draft. (This could be accomplished, for example, by transmitting a 719 control message along the same links and nodes as the data traffic or 720 similarly could be measured by the absence of traffic and by 721 providing feedback.) 
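The liveness message mechanism above can be illustrated with a small sketch in which a neighbor is declared down after a number of consecutive missed liveness intervals. The threshold of three intervals is an assumed example value, not something specified by this framework.

```python
# Hypothetical liveness-message monitor between two adjacent LSRs.
# The miss threshold of 3 intervals is an assumed value, not a framework one.

class LivenessMonitor:
    def __init__(self, miss_threshold=3):
        self.miss_threshold = miss_threshold
        self.consecutive_misses = 0

    def on_interval(self, message_received):
        """Call once per liveness interval; returns True while the link
        to the neighbor is still considered alive."""
        if message_received:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses < self.miss_threshold

mon = LivenessMonitor()
assert mon.on_interval(True)        # liveness message received
assert mon.on_interval(False)       # one miss: still considered alive
assert mon.on_interval(False)       # two misses: still considered alive
assert not mon.on_interval(False)   # three misses: neighbor declared down
```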
723 2.3.2 Failure Terminology 725 Path Failure (PF) 726 Path failure is a fault detected by MPLS-based recovery mechanisms, 727 defined as the failure of the liveness message test or a path 728 continuity test, indicating that path connectivity is lost. 730 Path Degraded (PD) 731 Path degraded is a fault detected by MPLS-based recovery mechanisms 732 that indicates that the quality of the path is unacceptable. 734 Link Failure (LF) 735 A lower layer fault indicating that link continuity is lost. This may 736 be communicated to the MPLS-based recovery mechanisms by the lower 737 layer. 739 Link Degraded (LD) 740 A lower layer indication to MPLS-based recovery mechanisms that the 741 link is performing below an acceptable level. 743 Fault Indication Signal (FIS) 744 A signal that indicates that a fault along a path has occurred. It is 745 relayed by each intermediate LSR to its upstream or downstream 746 neighbor, until it reaches an LSR that is set up to perform MPLS 747 recovery. The FIS is transmitted periodically by the node/nodes 748 closest to the point of failure, for some configurable length of 749 time. 751 Fault Recovery Signal (FRS) 752 A signal that indicates that a fault along a working path has been 753 repaired. Again, like the FIS, it is relayed by each intermediate LSR 754 to its upstream or downstream neighbor, until it reaches the LSR that 755 performs recovery of the original path. The FRS is transmitted 756 periodically by the node/nodes closest to the point of failure, for 757 some configurable length of time. 759 2.4. Abbreviations 761 FIS: Fault Indication Signal. 762 FRS: Fault Recovery Signal. 763 LD: Link Degraded. 764 LF: Link Failure. 765 PD: Path Degraded. 766 PF: Path Failure. 767 PML: Path Merge LSR. 768 PG: Path Group. 769 PPG: Protected Path Group. 770 PTP: Protected Traffic Portion. 771 PSL: Path Switch LSR. 773 3.
MPLS-based Recovery Principles 775 MPLS-based recovery refers to the ability to effect quick and 776 complete restoration of traffic affected by a fault in an MPLS- 777 enabled network. The fault may be detected on the IP layer or in 778 lower layers over which IP traffic is transported. The fastest MPLS 779 recovery is assumed to be achieved with protection switching, with an 780 MPLS LSR switch-over completion time that is comparable, or 781 equivalent, to the 50 ms switch-over completion time of the 782 SONET layer. This section provides a discussion of the concepts and 783 principles of MPLS-based recovery. The concepts are presented in 784 terms of atomic or primitive terms that may be combined to specify 785 recovery approaches. We do not make any assumptions about the 786 underlying layer 1 or layer 2 transport mechanisms or their recovery 787 mechanisms. 789 3.1. Configuration of Recovery 791 An LSR may support any or all of the following recovery options: 793 Default-recovery (No MPLS-based recovery enabled): 794 Traffic on the working path is recovered only via Layer 3 or IP 795 rerouting or by some lower layer mechanism such as SONET APS. This 796 is equivalent to having no MPLS-based recovery. This option may be 797 used for low priority traffic or for traffic that is recovered in 798 another way (for example, load-shared traffic on parallel working 799 paths may be automatically recovered upon a fault along one of the 800 working paths by distributing it among the remaining working paths). 802 Recoverable (MPLS-based recovery enabled): 803 The working path is recovered using one or more recovery paths, 804 either via rerouting or via protection switching. 806 3.2. Initiation of Path Setup 808 There are three options for the initiation of the recovery path 809 setup. The active and recovery paths may be established by using 810 either RSVP-TE [4][5] or CR-LDP [6]. 812 Pre-established: 814 This is the same as the protection switching option.
Here a recovery 815 path(s) is established prior to any failure on the working path. The 816 path selection can either be determined by a centralized 817 administrative tool, or chosen based on some algorithm implemented at 818 the PSL and possibly intermediate nodes. To guard against the 819 situation in which the pre-established recovery path fails before or at 820 the same time as the working path, the recovery path should have 821 secondary configuration options as explained in Section 3.3 below. 823 Pre-Qualified: 825 A pre-established path need not be created; it may be pre-qualified. 826 A pre-qualified recovery path is not created expressly for protecting 827 the working path, but instead is a path created for other purposes 828 that is designated as a recovery path after determining that it is an 829 acceptable alternative for carrying the working path traffic. 830 Variants include the case where an optical path or trail is 831 configured, but no switches are set. 833 Established-on-Demand: 835 This is the same as the rerouting option. Here, a recovery path is 836 established after a failure on its working path has been detected and 837 notified to the PSL. 839 3.3. Initiation of Resource Allocation 841 A recovery path may support the same traffic contract as the working 842 path, or it may not. We will distinguish these two situations by 843 using different additive terms. If the recovery path is capable of 844 replacing the working path without degrading service, it will be 845 called an equivalent recovery path. If the recovery path lacks the 846 resources (or resource reservations) to replace the working path 847 without degrading service, it will be called a limited recovery path. 848 Based on this, there are two options for the initiation of resource 849 allocation: 851 Pre-reserved: 853 This option applies only to protection switching.
Here a pre- 854 established recovery path reserves required resources on all hops 855 along its route during its establishment. Although the reserved 856 resources (e.g., bandwidth and/or buffers) at each node cannot be 857 used to admit more working paths, they are available to be used by 858 all traffic that is present at the node before a failure occurs. 860 Reserved-on-Demand: 862 This option may apply either to rerouting or to protection switching. 863 Here a recovery path reserves the required resources after a failure 864 on the working path has been detected and notified to the PSL and 865 before the traffic on the working path is switched over to the 866 recovery path. 868 Note that under both the options above, depending on the amount of 869 resources reserved on the recovery path, it could either be an 870 equivalent recovery path or a limited recovery path. 872 3.4. Scope of Recovery 874 3.4.1 Topology 875 3.4.1.1 Local Repair 877 The intent of local repair is to protect against a link or neighbor 878 node fault and to minimize the amount of time required for failure 879 propagation. In local repair (also known as local recovery), the node 880 immediately upstream of the fault is the one to initiate recovery 881 (either rerouting or protection switching). Local repair can be of 882 two types: 884 Link Recovery/Restoration 886 In this case, the recovery path may be configured to route around a 887 certain link deemed to be unreliable. If protection switching is 888 used, several recovery paths may be configured for one working path, 889 depending on the specific faulty link that each protects against. 891 Alternatively, if rerouting is used, upon the occurrence of a fault 892 on the specified link, each path is rebuilt such that it detours 893 around the faulty link. 894 In this case, the recovery path need only be disjoint from its 895 working path at a particular link on the working path, and may have 896 overlapping segments with the working path.
Traffic on the working 897 path is switched over to an alternate path at the upstream LSR that 898 connects to the failed link. This method is potentially the fastest 899 to perform the switchover, and can be effective in situations where 900 certain path components are much more unreliable than others. 902 Node Recovery/Restoration 904 In this case, the recovery path may be configured to route around a 905 neighbor node deemed to be unreliable. Thus the recovery path is 906 disjoint from the working path only at a particular node and at links 907 associated with the working path at that node. Once again, the 908 traffic on the primary path is switched over to the recovery path at 909 the upstream LSR that directly connects to the failed node, and the 910 recovery path shares overlapping portions with the working path. 912 3.4.1.2 Global Repair 914 The intent of global repair is to protect against any link or node 915 fault on a path or on a segment of a path, with the obvious exception 916 of faults occurring at the ingress node of the protected path 917 segment. In global repair the PSL is usually distant from the failure 918 and needs to be notified by a FIS. 919 End-to-end path recovery/restoration also falls under global repair. 920 In many cases, the recovery path can be made completely link and node 921 disjoint with its working path. This has the advantage of protecting 922 against all link and node fault(s) on the working path (end-to-end 923 path or path segment). 924 However, it may, in some cases, be slower than local repair since the 925 fault notification message must now travel to the PSL to trigger the 926 recovery action. 928 3.4.1.3 Alternate Egress Repair 930 It is possible to restore service without specifically recovering the 931 faulted path. 932 For example, for best effort IP service it is possible to select a 933 recovery path that has a different egress point from the working path 934 (i.e., there is no PML).
The recovery path egress must simply be a 935 router that is acceptable for forwarding the FEC carried by the 936 working path (without creating loops). In an engineering context, 937 specific alternative FEC/LSP mappings with alternate egresses can be 938 formed. 940 This may simplify enhancing the reliability of implicitly constructed 941 MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate 942 recovery paths simply by requiring that they be link and node disjoint 943 with the immediate downstream LSR of the working path. 945 3.4.1.4 Multi-Layer Repair 947 Multi-layer repair broadens the network designer's tool set for those 948 cases where multiple network layers can be managed together to 949 achieve overall network goals. Specific criteria for determining 950 when multi-layer repair is appropriate are beyond the scope of this 951 draft. 953 3.4.1.5 Concatenated Protection Domains 955 A given service may cross multiple networks and these may employ 956 different recovery mechanisms. It is possible to concatenate 957 protection domains so that service recovery can be provided end-to- 958 end. It is considered that the recovery mechanisms in different 959 domains may operate autonomously, and that multiple points of 960 attachment may be used between domains (to ensure there is no single 961 point of failure). Alternate egress repair requires management of 962 concatenated domains in that an explicit MPLS point of failure (the 963 PML) is by definition excluded. Details of concatenated protection 964 domains are beyond the scope of this draft. 966 3.4.2 Path Mapping 968 Path mapping refers to the methods of mapping traffic from a faulty 969 working path onto the recovery path. There are several options for 970 this, as described below. Note that the options below should be 971 viewed as atomic terms that only describe how the working and 972 protection paths are mapped to each other.
The issues of resource 973 reservation along these paths, and how switchover is actually 974 performed lead to the more commonly used composite terms, such as 1+1 975 and 1:1 protection, which were described in Section 2.1. 977 1-to-1 Protection 979 In 1-to-1 protection the working path has a designated recovery path 980 that is only to be used to recover that specific working path. 982 n-to-1 Protection 984 In n-to-1 protection, up to n working paths are protected using only 985 one recovery path. If the intent is to protect against any single 986 fault on any of the working paths, the n working paths should be 987 diversely routed between the same PSL and PML. In some cases, 988 handshaking between PSL and PML may be required to complete the 989 recovery, the details of which are beyond the scope of this draft. 991 n-to-m Protection 993 In n-to-m protection, up to n working paths are protected using m 994 recovery paths. Once again, if the intent is to protect against any 995 single fault on any of the n working paths, the n working paths and 996 the m recovery paths should be diversely routed between the same PSL 997 and PML. In some cases, handshaking between PSL and PML may be 998 required to complete the recovery, the details of which are beyond 999 the scope of this draft. n-to-m protection is for further study. 1001 Split Path Protection 1003 In split path protection, multiple recovery paths are allowed to 1004 carry the traffic of a working path based on a certain configurable 1005 load splitting ratio. This is especially useful when no single 1006 recovery path can be found that can carry the entire traffic of the 1007 working path in case of a fault. Split path protection may require 1008 handshaking between the PSL and the PML(s), and may require the 1009 PML(s) to correlate the traffic arriving on multiple recovery paths 1010 with the working path. 
Although this is an attractive option, the 1011 details of split path protection are beyond the scope of this draft, 1012 and are for further study. 1014 3.4.3 Bypass Tunnels 1016 It may be convenient, in some cases, to create a "bypass tunnel" for 1017 a PPG between a PSL and PML, thereby allowing multiple recovery paths 1018 to be transparent to intervening LSRs [2]. In this case, one LSP 1019 (the tunnel) is established between the PSL and PML following an 1020 acceptable route and a number of recovery paths are supported through 1021 the tunnel via label stacking. A bypass tunnel can be used with any 1022 of the path mapping options discussed in the previous section. 1024 As with recovery paths, the bypass tunnel may or may not have 1025 resource reservations sufficient to provide recovery without service 1026 degradation. It is possible that the bypass tunnel may have 1027 sufficient resources to recover some number of working paths, but not 1028 all at the same time. If the number of recovery paths carrying 1029 traffic in the tunnel at any given time is restricted, this is 1030 similar to the n-to-1 or n-to-m protection cases mentioned in Section 1031 3.4.2. 1033 3.4.4 Recovery Granularity 1035 Another dimension of recovery considers the amount of traffic 1036 requiring protection. This may range from a fraction of a path to a 1037 bundle of paths. 1039 3.4.4.1 Selective Traffic Recovery 1041 This option allows for the protection of a fraction of traffic within 1042 the same path. The portion of the traffic on an individual path that 1043 requires protection is called a protected traffic portion (PTP). A 1044 single path may carry different classes of traffic, with different 1045 protection requirements. The protected portion of this traffic may be 1046 identified by its class, as, for example, via the EXP bits in the MPLS 1047 shim header or via the priority bit in the ATM header.
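As a concrete illustration of identifying a protected traffic portion by class, the sketch below extracts the 3-bit EXP field from a 32-bit MPLS label stack entry (20-bit label, 3-bit EXP, 1-bit S, 8-bit TTL, per the MPLS shim header encoding). The set of protected code points is purely an assumed example of operator policy, not a value defined by this framework.

```python
# Hypothetical sketch: classify the protected traffic portion (PTP) of a
# path by the EXP bits of the MPLS shim header. A 32-bit label stack entry
# carries a 20-bit label, 3-bit EXP, 1-bit S (bottom of stack), 8-bit TTL.

PROTECTED_EXP = {5, 6, 7}  # assumed example code points (operator policy)

def exp_bits(entry):
    """Extract the 3-bit EXP field from a 32-bit label stack entry."""
    return (entry >> 9) & 0x7

def is_protected(entry):
    """True if the packet belongs to the protected traffic portion."""
    return exp_bits(entry) in PROTECTED_EXP

# Label 17, EXP 6, S=1, TTL 64:
entry = (17 << 12) | (6 << 9) | (1 << 8) | 64
assert exp_bits(entry) == 6
assert is_protected(entry)
assert not is_protected((17 << 12) | (0 << 9) | (1 << 8) | 64)
```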
1049 3.4.4.2 Bundling 1051 Bundling is a technique used to group multiple working paths together 1052 in order to recover them simultaneously. The logical bundling of 1053 multiple working paths requiring protection, each of which is routed 1054 identically between a PSL and a PML, is called a protected path group 1055 (PPG). When a fault occurs on the working path carrying the PPG, the 1056 PPG as a whole can be protected either by being switched to a bypass 1057 tunnel or by being switched to a recovery path. 1059 3.4.5 Recovery Path Resource Use 1061 In the case of pre-reserved recovery paths, there is the question of 1062 what use these resources may be put to when the recovery path is not 1063 in use. There are the following options: 1065 Dedicated-resource: 1066 If the recovery path resources are dedicated, they may not be used 1067 for anything except carrying the working traffic. For example, in 1068 the case of 1+1 protection, the working traffic is always carried on 1069 the recovery path. Even if the recovery path is not always carrying 1070 the working traffic, it may not be possible or desirable to allow 1071 other traffic to use these resources. 1073 Extra-traffic-allowed: 1074 If the recovery path only carries the working traffic when the 1075 working path fails, then it is possible to allow extra traffic to use 1076 the reserved resources at other times. Extra traffic is, by 1077 definition, traffic that can be displaced (without violating service 1078 agreements) whenever the recovery path resources are needed for 1079 carrying the working path traffic. 1081 Shared-resource: 1082 A shared recovery resource is reserved for use by multiple primary 1083 resources that (according to SRLGs) are not expected to fail 1084 simultaneously. 1086 3.5. Fault Detection 1088 MPLS recovery is initiated after the detection of either a lower 1089 layer fault or a fault at the IP layer or in the operation of MPLS- 1090 based mechanisms.
We consider four classes of impairments: Path 1091 Failure, Path Degraded, Link Failure, and Link Degraded. 1093 Path Failure (PF) is a fault that indicates to an MPLS-based recovery 1094 scheme that the connectivity of the path is lost. This may be 1095 detected by a path continuity test between the PSL and PML. Some, 1096 and perhaps the most common, path failures may be detected using a 1097 link probing mechanism between neighbor LSRs. An example of a probing 1098 mechanism is a liveness message that is exchanged periodically along 1099 the working path between peer LSRs [3]. For either a link probing 1100 mechanism or path continuity test to be effective, the test message 1101 must be guaranteed to follow the same route as the working or 1102 recovery path, over the segment being tested. In addition, the path 1103 continuity test must take the path merge points into consideration. 1104 In the case of a bi-directional link implemented as two 1105 unidirectional links, path failure could mean that either one or both 1106 unidirectional links are damaged. 1108 Path Degraded (PD) is a fault that indicates to MPLS-based recovery 1109 schemes/mechanisms that the path has connectivity, but that the 1110 quality of the connection is unacceptable. This may be detected by a 1111 path performance monitoring mechanism, or some other mechanism for 1112 determining the error rate on the path or some portion of the path. 1113 One example, local to the LSR, is excessive discarding of 1114 packets at an interface, due, for example, to label mismatch or TTL 1115 errors. 1117 Link Failure (LF) is an indication from a lower layer that the link 1118 over which the path is carried has failed. If the lower layer 1119 supports detection and reporting of this fault (that is, any fault 1120 that indicates link failure, e.g., SONET LOS), this may be used by the 1121 MPLS recovery mechanism.
In some cases, using LF indications may 1122 provide faster fault detection than using only MPLS-based fault 1123 detection mechanisms. 1125 Link Degraded (LD) is an indication from a lower layer that the link 1126 over which the path is carried is performing below an acceptable 1127 level. If the lower layer supports detection and reporting of this 1128 fault, it may be used by the MPLS recovery mechanism. In some cases, 1129 using LD indications may provide faster fault detection than using 1130 only MPLS-based fault detection mechanisms. 1132 3.6. Fault Notification 1134 MPLS-based recovery relies on rapid and reliable notification of 1135 faults. Once a fault is detected, the node that detected the fault 1136 must determine if the fault is severe enough to require path 1137 recovery. If the node is not capable of initiating direct action 1138 (e.g., as a PSL), the node should send out a notification of the fault 1139 by transmitting a FIS to those of its upstream LSRs that were sending 1140 traffic on the working path that is affected by the fault. This 1141 notification is relayed hop-by-hop by each subsequent LSR to its 1142 upstream neighbor, until it eventually reaches a PSL. A PSL is the 1143 only LSR that can terminate the FIS and initiate a protection switch 1144 of the working path to a recovery path. 1146 Since the FIS is a control message, it should be transmitted with 1147 high priority to ensure that it propagates rapidly towards the 1148 affected PSL(s). Depending on how fault notification is configured in 1149 the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2 1150 or Layer 3 packet [3]. The use of a Layer 2-based notification 1151 requires a direct Layer 2 path to the PSL. An example of a FIS could 1152 be the liveness message sent by a downstream LSR to its upstream 1153 neighbor, with an optional fault notification field set, or it can be 1154 implicitly denoted by a teardown message.
Alternatively, it could be 1155 a separate fault notification packet. The intermediate LSR should 1156 identify which of its incoming links (upstream LSRs) to propagate the 1157 FIS on. In the case of 1+1 protection, the FIS should also be sent 1158 downstream to the PML where the recovery action is taken. 1160 3.7. Switch-Over Operation 1162 3.7.1 Recovery Trigger 1164 The activation of an MPLS protection switch following the detection 1165 or notification of a fault requires a trigger mechanism at the PSL. 1166 MPLS protection switching may be initiated due to automatic inputs or 1167 external commands. The automatic activation of an MPLS protection 1168 switch results from a response to defect or fault conditions 1169 detected at the PSL or to fault notifications received at the PSL. It 1170 is possible that the fault detection and trigger mechanisms may be 1171 combined, as is the case when a PF, PD, LF, or LD is detected at a 1172 PSL and triggers a protection switch to the recovery path. In most 1173 cases, however, the detection and trigger mechanisms are distinct, 1174 involving the detection of a fault at some intermediate LSR followed by 1175 the propagation of a fault notification back to the PSL via the FIS, 1176 which serves as the protection switch trigger at the PSL. MPLS 1177 protection switching in response to external commands results when 1178 the operator initiates a protection switch by a command to a PSL (or 1179 alternatively by a configuration command to an intermediate LSR, 1180 which transmits the FIS towards the PSL). 1182 Note that the PF fault applies to hard failures (fiber cuts, 1183 transmitter failures, or LSR fabric failures), as does the LF fault, 1184 with the difference that the LF is a lower layer impairment that may 1185 be communicated to MPLS-based recovery mechanisms. The PD (or LD) 1186 fault, on the other hand, applies to soft defects (excessive errors 1187 due to noise on the link, for instance).
The PD (or LD) results in a 1188 fault declaration only when the percentage of lost packets exceeds a 1189 given threshold, which is provisioned and may be set based on the 1190 service level agreement(s) in effect between a service provider and a 1191 customer. 1193 3.7.2 Recovery Action 1195 After a fault is detected or a FIS is received by the PSL, the recovery 1196 action involves either a rerouting or protection switching operation. 1197 In both scenarios, the next hop label forwarding entry for a recovery 1198 path is bound to the working path. 1200 3.8. Post Recovery Operation 1202 When traffic is flowing on the recovery path, a decision can be made 1203 as to whether to let the traffic remain on the recovery path, considering 1204 it as a new working path, or to switch it back to the old working path 1205 or over to a new one. This post recovery operation has two styles, one where the 1206 protection counterparts, i.e. the working and recovery path, are 1207 fixed or "pinned" to their routes and one in which the PSL or other 1208 network entity with real-time knowledge of the failure dynamically 1209 performs re-establishment or controlled rearrangement of the paths 1210 comprising the protected service. 1212 3.8.1 Fixed Protection Counterparts 1214 For fixed protection counterparts, the PSL will be pre-configured with 1215 the appropriate behavior to take when the original fixed path is 1216 restored to service. The choices are revertive and non-revertive 1217 mode. The choice will typically depend on the relative costs of the 1218 working and protection paths, and the tolerance of the service to the 1219 effects of switching paths yet again. These protection modes indicate 1220 whether or not there is a preferred path for the protected traffic. 1222 3.8.1.1 Revertive Mode 1224 If the working path is always the preferred path, this path will be 1225 used whenever it is available.
Thus, in the event of a fault on this 1226 path, its resources will not be reclaimed by the 1227 network. If the working path has a fault, traffic is switched to the 1228 recovery path. In the revertive mode of operation, when the 1229 preferred path is restored the traffic is automatically switched back 1230 to it. 1232 There are a number of implications to pinned working and recovery 1233 paths: 1234 - upon failure, once traffic has moved to the recovery path, the traffic is 1235 unprotected until the path defect in the original 1236 working path is repaired and that path is restored to service. 1237 - upon failure, once traffic has moved to the recovery path, the resources 1238 associated with the original path remain reserved. 1240 3.8.1.2 Non-revertive Mode 1242 In the non-revertive mode of operation, there is no preferred path, or 1243 it may be desirable to minimize further disruption of the service 1244 brought on by a revertive switching operation. A switch-back to the 1245 original working path is not desired or not possible since the 1246 original path may no longer exist after the occurrence of a fault on 1247 that path. 1248 If there is a fault on the working path, traffic is switched to the 1249 recovery path. When or if the faulty path (the original working 1250 path) is restored, it may become the recovery path (either by 1251 configuration, or, if desired, by management actions). 1253 In the non-revertive mode of operation, the working traffic may or 1254 may not be restored to a new optimal working path or to the original 1255 working path.
This is because it might be useful, in some 1256 cases, to either: (a) administratively perform a protection switch 1257 back to the original working path after gaining further assurances 1258 about the integrity of the path, (b) continue operation on 1259 the recovery path, or (c) move the traffic to a new optimal 1260 working path that is calculated based on network topology and 1261 network policies. 1263 3.8.2 Dynamic Protection Counterparts 1265 For dynamic protection counterparts, when the traffic is switched over 1266 to a recovery path, the association between the original working path 1267 and the recovery path may no longer exist, since the original path 1268 itself may no longer exist after the fault. Instead, when the network 1269 reaches a stable state following routing convergence, the recovery 1270 path may be switched over to a different preferred path, chosen either 1271 by optimization based on the new network topology and associated 1272 information or based on pre-configured information. 1274 Dynamic protection counterparts assume that upon failure, the PSL or 1275 other network entity will establish new working paths if another 1276 switch-over is to be performed. 1278 3.8.3 Restoration and Notification 1280 MPLS restoration deals with returning the working traffic from the 1281 recovery path to the original or a new working path. Reversion is 1282 performed by the PSL either upon receiving notification, via the FRS, 1283 that the working path is repaired, or upon receiving notification 1284 that a new working path is established. 1286 For fixed counterparts in revertive mode, an LSR that detected the 1287 fault on the working path also detects the restoration of the working 1288 path. If the working path had experienced an LF defect, the LSR 1289 detects a return to normal operation via the receipt of a liveness 1290 message from its peer.
If the working path had experienced a LD 1291 defect at an LSR interface, the LSR could detect a return to normal 1292 operation via the resumption of error-free packet reception on that 1293 interface. Alternatively, a lower layer that no longer detects a LF 1294 defect may inform the MPLS-based recovery mechanisms at the LSR that 1295 the link to its peer LSR is operational. 1296 The LSR then transmits FRS to its upstream LSR(s) that were 1297 transmitting traffic on the working path. At the point the PSL 1298 receives the FRS, it switches the working traffic back to the 1299 original working path. 1301 A similar scheme is for dynamic counterparts where e.g. an update of 1302 topology and/or network convergence may trigger installation or setup 1303 of new working paths and may send notification to the PSL to perform 1304 a switch over. 1306 We note that if there is a way to transmit fault information back 1307 along a recovery path towards a PSL and if the recovery path is an 1308 equivalent working path, it is possible for the working path and its 1309 recovery path to exchange roles once the original working path is 1310 repaired following a fault. This is because, in that case, the 1311 recovery path effectively becomes the working path, and the restored 1312 working path functions as a recovery path for the original recovery 1313 path. This is important, since it affords the benefits of non- 1314 revertive switch operation outlined in Section 3.8.1, without leaving 1315 the recovery path unprotected. 1317 3.8.4 Reverting to Preferred Path (or Controlled Rearrangement) 1319 In the revertive mode, a "make before break" restoration switching 1320 can be used, which is less disruptive than performing protection 1321 switching upon the occurrence of network impairments. This will 1322 minimize both packet loss and packet reordering. 
The controlled 1323 rearrangement of paths can also be used to satisfy traffic 1324 engineering requirements for load balancing across an MPLS domain. 1326 3.9. Performance 1328 Resource/performance requirements for recovery paths should be 1329 specified in terms of the following attributes: 1331 I. Resource class attribute: 1332 Equivalent Recovery Class: The recovery path has the same resource 1333 reservations and performance guarantees as the working path. In other 1334 words, the recovery path meets the same SLAs as the working path. 1335 Limited Recovery Class: The recovery path does not have the same 1336 resource reservations and performance guarantees as the working path. 1338 A. Lower Class: The recovery path has lower resource requirements or 1339 less stringent performance requirements than the working path. 1341 B. Best Effort Class: The recovery path is best effort. 1343 II. Priority Attribute: 1344 The recovery path has a priority attribute just like the working path 1345 (i.e., the priority attribute of the associated traffic trunks). It 1346 can have the same priority as the working path or lower priority. 1348 III. Preemption Attribute: 1349 The recovery path can have the same preemption attribute as the 1350 working path or a lower one. 1352 4. MPLS Recovery Features 1354 The following features are desirable from an operational point of 1355 view: 1357 I. It is desirable that MPLS recovery provides an option to identify 1358 protection groups (PPGs) and protection portions (PTPs). 1360 II. Each PSL should be capable of performing MPLS recovery upon the 1361 detection of the impairments or upon receipt of notifications of 1362 impairments. 1364 III. A MPLS recovery method should not preclude manual protection 1365 switching commands. 
This implies that it should be possible, via administrative commands,
to transfer traffic from a working path to a recovery path, or to
transfer traffic from a recovery path back to a working path once the
working path becomes operational following a fault.

IV. A PSL may be capable of performing either a switch-back to the
original working path after the fault is corrected, or a switch-over
to a new working path upon the discovery or establishment of a more
optimal working path.

V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects the merged segment, all the
paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered.

5. Comparison Criteria

Possible criteria to use for comparison of MPLS-based recovery
schemes are as follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault. Recovery Time is
the sum of the Fault Detection Time, Hold-off Time, Notification
Time, Recovery Operation Time, and Traffic Restoration Time. In
other words, it is the time between the failure of a node or link in
the network and the time at which a recovery path is installed and
traffic starts flowing on it.

Full Restoration Time

We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of, or have been engineered sufficiently to,
handle traffic in recovery scenarios. Note that this time may or may
not be different from the "Recovery Time", depending on whether
equivalent or limited recovery paths are used.
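Since Recovery Time is defined above as a simple sum of its component
times, a recovery-time budget can be tallied directly. The sketch
below illustrates this; all numeric values are hypothetical and are
not taken from this framework:

```python
# Hypothetical recovery-time budget, following the definition of
# Recovery Time as the sum of its component times. All values are
# illustrative only, not normative.

components_ms = {
    "fault_detection": 10.0,     # time to detect the fault
    "hold_off": 50.0,            # configured wait before acting
    "notification": 5.0,         # fault notification reaching the PSL
    "recovery_operation": 2.0,   # switch-over action at the PSL
    "traffic_restoration": 3.0,  # until traffic flows on the recovery path
}

recovery_time_ms = sum(components_ms.values())
print(f"Recovery Time: {recovery_time_ms:.1f} ms")

# A rough upper bound on traffic lost during switch-over: the traffic
# that would have been carried on the working path during the outage.
link_rate_bps = 1_000_000_000  # 1 Gb/s link, illustrative
lost_bits = link_rate_bps * recovery_time_ms / 1000.0
print(f"Worst-case loss at 1 Gb/s: {lost_bits / 8 / 1e6:.2f} MB")
```

With these illustrative numbers, the hold-off time dominates the
budget, which is why schemes are often compared on detection and
notification delay rather than on the switch-over operation itself.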
Setup Vulnerability

The amount of time that a working path, or a set of working paths,
is left unprotected during such tasks as recovery path computation
and recovery path setup may be used to compare schemes. The nature
of this vulnerability should also be taken into account: end-to-end
schemes correlate the vulnerability with working paths, local repair
schemes have a topological correlation that cuts across working
paths, and network-plan approaches have a correlation that impacts
the entire network.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will depend on the traffic
characteristics of the network. However, it may also depend on the
particular protection plan selection algorithms, as well as on the
signaling and re-routing methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to encompass a spectrum of
"packet survivability", which may range from "relative" to
"absolute". Relative survivability may mean that a packet is on an
equal footing with other traffic of, for example, the same diff-serv
code point (DSCP) in contending for the resources of the portion of
the network that survives the failure. Absolute survivability may
mean that the survivability of the protected traffic has explicit
guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets. The action of
putting traffic back on preferred paths might also cause packet
re-ordering.
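The re-ordering criterion can be made concrete with a toy simulation
(latencies below are hypothetical, not part of the framework): if the
recovery path has a lower delay than the working path, packets
switched onto it can overtake earlier packets still in flight on the
working path.

```python
# Toy illustration of re-ordering at switch-over: packets sent before
# the switch traverse a slower working path; packets sent after it
# traverse a faster recovery path and may overtake them. All latencies
# are hypothetical.

WORKING_LATENCY = 10.0   # ms, working path one-way delay
RECOVERY_LATENCY = 4.0   # ms, recovery path one-way delay
SWITCH_TIME = 5.0        # ms, instant at which the PSL switches over

send_times = [float(t) for t in range(10)]  # one packet per ms, seq 0..9

arrivals = []
for seq, t in enumerate(send_times):
    latency = WORKING_LATENCY if t < SWITCH_TIME else RECOVERY_LATENCY
    arrivals.append((t + latency, seq))

# Sequence numbers in order of arrival at the egress:
arrival_order = [seq for _, seq in sorted(arrivals)]
print(arrival_order)

reordered = arrival_order != sorted(arrival_order)
print("re-ordering observed:", reordered)
```

With these numbers, packet 5 (the first sent on the recovery path)
arrives before packets 0 through 4, so the egress sees out-of-order
delivery until the in-flight working-path packets drain. The same
effect in reverse can occur when traffic is reverted to the preferred
path.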
State Overhead

As the number of recovery paths in a protection plan grows, the
state required to maintain them also grows. Schemes may require
differing numbers of paths to maintain a given level of coverage,
and the state required may also depend on the particular scheme used
to recover. In many cases the state overhead will be proportional to
the number of recovery paths.

Loss

Recovery schemes may introduce a certain amount of packet loss
during switch-over to a recovery path. For schemes that introduce
loss during recovery, this loss can be estimated by evaluating
recovery times in proportion to the link speed.

In the case of a link or node failure, a certain amount of packet
loss is inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I. Fault Types: Recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node
faults into account.

II. Number of Concurrent Faults: Depending on the layout of recovery
paths in the protection plan, it may be possible to recover from
multiple concurrent fault scenarios.

III. Number of Recovery Paths: For a given fault, there may be one
or more recovery paths.

IV. Percentage of Coverage: Depending on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into the percentage of link faults and the
percentage of node faults covered.

V. The number of protected paths may affect how fast the total set
of paths affected by a fault can be recovered. The ratio of
protected paths is n/N, where n is the number of protected paths and
N is the total number of paths.

6. Security Considerations

The MPLS recovery that is specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

7. Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed
in regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

8. Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on earlier versions of this document; in
particular, Bora Akyol, Dave Allan, Neil Harrison, and Dave
Danenberg, whose suggestions and comments were very helpful in
revising it.

The editors would like to give very special thanks to Curtis
Villamizar for his careful and extremely thorough reading of the
document, and for taking the time to provide numerous suggestions
that were very helpful in our latest revision, and to Seyhan
Civanlar, who provided initial input on the rerouting section.

9. Authors' Addresses

Vishal Sharma
Metanoia, Inc.
305 Elan Village Ln., Unit 121
San Jose, CA 95134
Phone: (408) 955-0910
v.sharma@ieee.org

Fiffi Hellstrand
Nortel Networks
St Eriksgatan 115
PO Box 6701
113 85 Stockholm, Sweden
Phone: +46 8 5088 3687
Fiffi@nortelnetworks.com

Ben Mack-Crane
Tellabs Operations, Inc.
4951 Indiana Avenue
Lisle, IL 60532
Phone: (630) 512-7255
Ben.Mack-Crane@tellabs.com

Srinivas Makam
Smakam60540@yahoo.com

Ken Owens
Erlang Technology, Inc.
345 Marshall Ave., Suite 300
St. Louis, MO 63119
Phone: (314) 918-1579
keno@erlangtech.com

Changcheng Huang
Carleton University
Minto Center, Rm. 3082
1125 Colonel By Drive
Ottawa, Ontario K1S 5B6, Canada
Phone: (613) 520-2600 x2477
Changcheng.Huang@sce.carleton.ca

Jon Weil
Nortel Networks
Harlow Laboratories
London Road
Harlow, Essex CM17 9NA, UK
Phone: +44 (0)1279 403935
jonweil@nortelnetworks.com

Brad Cain
Storigen Systems
650 Suffolk Street
Lowell, MA 01854
Phone: (978) 323-4454
bcain@storigen.com

Loa Andersson
Utfors AB
Råsundavägen 12, Box 525
169 29 Solna, Sweden
Phone: +46 8 5270 5038
loa.andersson@utfors.se

Bilel Jamoussi
Nortel Networks
3 Federal Street, BL3-03
Billerica, MA 01821, USA
Phone: (978) 288-4506
jamoussi@nortelnetworks.com

Angela Chiu
Celion Networks, Inc.
One Shiela Drive, Suite 2
Tinton Falls, NJ 07724
Phone: (732) 345-3441
angela.chiu@celion.com

10. References

[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
    Switching Architecture", RFC 3031, January 2001.

[2] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
    J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
    September 1999.

[3] Huang, C., Sharma, V., Owens, K., and Makam, V., "Building
    Reliable MPLS Networks Using a Path Protection Mechanism", IEEE
    Commun. Mag., Vol. 40, Issue 3, March 2002, pp. 156-162.

[4] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
    ReSerVation Protocol (RSVP) -- Version 1 Functional
    Specification", RFC 2205, September 1997.

[5] Awduche, D., et al., "RSVP-TE: Extensions to RSVP for LSP
    Tunnels", RFC 3209, December 2001.

[6] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
    RFC 3212, January 2002.