MPLS Working Group                                         Vishal Sharma
Informational Track                                        Metanoia, Inc.
Expires: January 2002
                                                          Ben Mack-Crane
                                                          Srinivas Makam
                                                Tellabs Operations, Inc.

                                                               Ken Owens
                                                 Erlang Technology, Inc.

                                                        Changcheng Huang
                                                     Carleton University

                                                        Fiffi Hellstrand
                                                                Jon Weil
                                                           Loa Andersson
                                                          Bilel Jamoussi
                                                         Nortel Networks

                                                               Brad Cain
                                                         Cereva Networks

                                                         Seyhan Civanlar
                                                          Lemur Networks

                                                             Angela Chiu
                                                    Celion Networks, Inc.

                                                               July 2001

                   Framework for MPLS-based Recovery

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

Multi-protocol label switching (MPLS) [1] integrates the label swapping
forwarding paradigm with network layer routing. To deliver reliable
service, MPLS requires a set of procedures to provide protection of the
traffic carried on different paths. This requires that the label
switched routers (LSRs) support fault detection, fault notification,
and fault recovery mechanisms, and that MPLS signaling [2], [3], [4],
[5], [6], [7] support the configuration of recovery.
With these objectives in mind, this document specifies a framework for
MPLS-based recovery.

Table of Contents

1.      Introduction
1.1.    Background
1.2.    Motivation for MPLS-Based Recovery
1.3.    Objectives/Goals
2.      Overview
2.1.    Recovery Models
2.1.1   Rerouting
2.1.2   Protection Switching
2.2.    The Recovery Cycles
2.2.1   MPLS Recovery Cycle Model
2.2.2   MPLS Reversion Cycle Model
2.2.3   Dynamic Re-routing Cycle Model
2.3.    Definitions and Terminology
2.3.1   General Recovery Terminology
2.3.2   Failure Terminology
2.4.    Abbreviations
3.      MPLS-based Recovery Principles
3.1.    Configuration of Recovery
3.2.    Initiation of Path Setup
3.3.    Initiation of Resource Allocation
3.4.    Scope of Recovery
3.4.1   Topology
3.4.1.1 Local Repair
3.4.1.2 Global Repair
3.4.1.3 Alternate Egress Repair
3.4.1.4 Multi-Layer Repair
3.4.1.5 Concatenated Protection Domains
3.4.2   Path Mapping
3.4.3   Bypass Tunnels
3.4.4   Recovery Granularity
3.4.4.1 Selective Traffic Recovery
3.4.4.2 Bundling
3.4.5   Recovery Path Resource Use
3.5.    Fault Detection
3.6.    Fault Notification
3.7.    Switch-Over Operation
3.7.1   Recovery Trigger
3.7.2   Recovery Action
3.8.    Post Recovery Operation
3.8.1   Fixed Protection Counterparts
3.8.1.1 Revertive Mode
3.8.1.2 Non-revertive Mode
3.8.2   Dynamic Protection Counterparts
3.8.3   Restoration and Notification
3.8.4   Reverting to Preferred Path (or Controlled Rearrangement)
3.9.    Performance
4.      MPLS Recovery Features
5.      Comparison Criteria
6.      Security Considerations
7.      Intellectual Property Considerations
8.      Acknowledgements
9.      Authors' Addresses
10.     References

1. Introduction

This memo describes a framework for MPLS-based recovery. We provide a
detailed taxonomy of recovery terminology, and discuss the motivation
for, the objectives of, and the requirements for MPLS-based recovery.
We outline principles for MPLS-based recovery, and also provide
comparison criteria that may serve as a basis for comparing and
evaluating different recovery schemes.

1.1. Background

Network routing deployed today is focused primarily on connectivity,
and typically supports only one class of service, the best-effort
class. Multi-protocol label switching, on the other hand, by
integrating forwarding based on label swapping of a link-local label
with network layer routing, allows flexibility in the delivery of new
routing services. MPLS allows the use of media-specific forwarding
mechanisms such as label swapping. This enables more sophisticated
features, such as quality-of-service (QoS) and traffic engineering [8],
to be implemented more effectively. An important component of providing
QoS, however, is the ability to transport data reliably and
efficiently. Although the current routing algorithms are very robust
and survivable, the amount of time they take to recover from a fault
can be significant, on the order of several seconds or minutes, causing
serious disruption of service for some applications in the interim.
This is unacceptable to many organizations that aim to provide a highly
reliable service, and thus require recovery times on the order of
seconds down to tens of milliseconds.

MPLS recovery may be motivated by the notion that there are inherent
limitations to improving the recovery times of current routing
algorithms. Additional improvement not obtainable by other means can be
obtained by augmenting these algorithms with MPLS recovery mechanisms.
Since MPLS is likely to be the technology of choice in the future
IP-based transport network, it is useful that MPLS be able to provide
protection and restoration of traffic. MPLS may facilitate the
convergence of network functionality on a common control and management
plane. Further, a protection priority could be used as a
differentiating mechanism for premium services that require high
reliability. The remainder of this document provides a framework for
MPLS-based recovery. It is focused at a conceptual level and is meant
to address motivation, objectives, and requirements. Issues of
mechanism, policy, routing plans, and characteristics of traffic
carried by recovery paths are beyond the scope of this document.

1.2. Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based recovery) is useful
for a number of reasons. The most important is its ability to increase
network reliability by enabling a faster response to faults than is
possible with traditional Layer 3 (or IP layer) approaches alone, while
still providing the visibility of the network afforded by Layer 3.
Furthermore, a protection mechanism using MPLS could enable IP traffic
to be carried directly over WDM optical channels and provide a recovery
option without an intervening SONET layer. This would facilitate the
construction of IP-over-WDM networks that require fast recovery.
The need for MPLS-based recovery arises for the following reasons:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network that
needs to support high reliability/availability.

II. Layer 0 (for example, optical layer) or Layer 1 (for example,
SONET) mechanisms may not be deployed in topologies that meet carriers'
protection goals. Restoration at these layers may also be a wasteful
use of resources.

III. The granularity at which the lower layers may be able to protect
traffic may be too coarse for traffic that is switched using MPLS-based
mechanisms.

IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher
layer operations. Thus, while they may provide, for example, link
protection, they cannot easily provide node protection or protection of
traffic transported at Layer 3. Further, this may prevent the lower
layers from providing fast restoration for traffic that needs it, while
providing slower restoration (with possibly more optimal use of
resources) for traffic that does not require fast restoration. In
networks where the latter class of traffic is dominant, providing fast
restoration to all classes of traffic may not be cost effective from a
service provider's perspective.

V. MPLS has desirable attributes when applied to the purpose of
recovery for connectionless networks; specifically, that an LSP is
source routed, so a forwarding path for recovery can be "pinned" and is
not affected by transient instability in SPF routing brought on by
failure scenarios.

Furthermore, there is a need for open standards:

VI. Establishing interoperability of protection mechanisms between
routers/LSRs from different vendors in IP or MPLS networks is desired
to enable recovery mechanisms to work in a multivendor environment, and
to enable the transition of certain protected services to an MPLS core.

1.3. Objectives/Goals

The following are some important goals for MPLS-based recovery.

Ia. MPLS-based recovery mechanisms may be subject to the traffic
engineering goal of optimal use of resources.

Ib. MPLS-based recovery mechanisms should aim to facilitate restoration
times that are sufficiently fast for the end-user application, that is,
times that better match the end-user application's requirements. In
some cases, this may be as short as tens of milliseconds.

We observe that Ia and Ib are conflicting objectives, and a trade-off
exists between them. The optimal choice depends on the sensitivity of
the end-user application to restoration time, the cost impact of
introducing restoration in the network, and the end-user application's
sensitivity to cost.

II. MPLS-based recovery should aim to maximize network reliability and
availability. MPLS-based recovery of traffic should aim to minimize the
number of single points of failure in the MPLS protected domain.

III. MPLS-based recovery should aim to enhance the reliability of the
protected traffic while minimally or predictably degrading the traffic
carried by the diverted resources.

IV. MPLS-based recovery techniques should aim to be applicable for
protection of traffic at various granularities. For example, it should
be possible to specify MPLS-based recovery for a portion of the traffic
on an individual path, for all traffic on an individual path, or for
all traffic on a group of paths. Note that "path" is used as a general
term and includes the notion of a link, IP route, or LSP.

V. MPLS-based recovery techniques may be applicable for an entire
end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery mechanisms should aim to take into
consideration the recovery actions of lower layers. MPLS-based
mechanisms should not trigger lower layer protection switching.

VII. MPLS-based recovery mechanisms should aim to minimize the loss of
data and packet reordering during recovery operations. (The current
MPLS specification itself has no explicit requirement on reordering.)

VIII. MPLS-based recovery mechanisms should aim to minimize the state
overhead incurred for each recovery path maintained.

IX. MPLS-based recovery mechanisms should aim to preserve the
constraints on traffic after switchover, if desired. That is, if
desired, the recovery path should meet the resource requirements of,
and achieve the same performance characteristics as, the working path.

We observe that some of the above are conflicting goals, and real
deployment will often involve engineering compromises based on a
variety of factors such as cost, end-user application requirements,
network efficiency, and revenue considerations. Thus, these goals are
subject to trade-offs based on the above considerations.

2. Overview

There are several options for providing protection of traffic using
MPLS. The most generic requirement is the specification of whether
recovery should be via Layer 3 (or IP) rerouting or via MPLS protection
switching or rerouting actions.

Generally, network operators aim to provide the fastest and best
protection mechanism that can be offered at a reasonable cost. The
higher the level of protection, the more resources are consumed, so it
is expected that network operators will offer a spectrum of service
levels. MPLS-based recovery should give operators the flexibility to
select the recovery mechanism, choose the granularity at which traffic
is protected, and choose the specific types of traffic that are
protected, giving them more control over that trade-off.
With MPLS-based recovery, it can be possible to provide different
levels of protection for different classes of service, based on their
service requirements. For example, using approaches outlined below, a
Virtual Leased Line (VLL) service or real-time applications like Voice
over IP (VoIP) may be supported using link/node protection together
with pre-established, pre-reserved path protection. Best-effort
traffic, on the other hand, may use established-on-demand path
protection or simply rely on IP reroute or higher layer recovery
mechanisms. As another example of their range of application,
MPLS-based recovery strategies may be used to protect traffic not
originally flowing on label switched paths, such as IP traffic that is
normally routed hop-by-hop, as well as traffic forwarded on label
switched paths.

2.1. Recovery Models

There are two basic models for path recovery: rerouting and protection
switching.

Protection switching and rerouting, as defined below, may be used
together. For example, protection switching to a recovery path may be
used for rapid restoration of connectivity while rerouting determines a
new optimal network configuration, rearranging paths as needed at a
later time.

2.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path
segments on demand for restoring traffic after the occurrence of a
fault. The new paths may be based upon fault information, network
routing policies, pre-defined configurations, and network topology
information. Thus, upon detecting a fault, paths or path segments to
bypass the fault are established using signaling. Reroute mechanisms
are inherently slower than protection switching mechanisms, since more
must be done following the detection of a fault.
However, reroute mechanisms are simpler and more frugal, as no
resources are committed until after the fault occurs and the location
of the fault is known.

Once the network routing algorithms have converged after a fault, it
may be preferable, in some cases, to reoptimize the network by
performing a reroute based on the current state of the network and
network policies. This is discussed further in Section 3.8.

In terms of the principles defined in Section 3, reroute recovery
employs paths established on demand with resources reserved on demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery path
or path segment, based upon network routing policies, the restoration
requirements of the traffic on the working path, and administrative
considerations. The recovery path may or may not be link and node
disjoint with the working path [9], [14]. However, if the recovery path
shares sources of failure with the working path, the overall
reliability of the construct is degraded. When a fault is detected, the
protected traffic is switched over to the recovery path(s) and
restored.

In terms of the principles in Section 3, protection switching employs
pre-established recovery paths and, if resource reservation is required
on the recovery path, pre-reserved resources. The various sub-types of
protection switching are detailed in Section 3.4 of this document.

2.2. The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the
MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first cycle
detects a fault and restores traffic onto MPLS-based recovery paths. If
the recovery path is non-optimal, the cycle may be followed by either
of the latter two to achieve an optimized network again.
The reversion cycle applies to explicitly routed traffic that does not
rely on any dynamic routing protocols being converged. The dynamic
re-routing cycle applies to traffic that is forwarded based on
hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1. Definitions
and a key to abbreviations follow.

   --Network Impairment
   |     --Fault Detected
   |     |     --Start of Notification
   |     |     |     --Start of Recovery Operation
   |     |     |     |     --Recovery Operation Complete
   |     |     |     |     |     --Path Traffic Restored
   |     |     |     |     |     |
   v     v     v     v     v     v
   -------------------------------------
   | T1  | T2  | T3  | T4  | T5  |

           Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

   T1   Fault Detection Time
   T2   Hold-off Time
   T3   Notification Time
   T4   Recovery Operation Time
   T5   Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the moment
the fault is detected by MPLS-based recovery mechanisms. This time may
be highly dependent on lower layer protocols.

Hold-off Time

The configured waiting time between the detection of a fault and the
taking of MPLS-based recovery action, to allow time for lower layer
protection to take effect. The Hold-off Time may be zero.

Note: The Hold-off Time may occur after the Notification Time interval
if the node responsible for the switchover, the Path Switch LSR (PSL),
rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by the
LSR detecting the fault and the time at which the Path Switch LSR (PSL)
begins the recovery operation.
This is zero if the PSL detects the fault itself or infers a fault from
such events as an adjacency failure.

Note: If the PSL detects the fault itself, there still may be a
Hold-off Time period between detection and the start of the recovery
operation.

Recovery Operation Time

The time between the first and last recovery actions. This may include
message exchanges between the PSL and PML to coordinate recovery
actions.

Traffic Restoration Time

The time between the last recovery action and the time that the traffic
(if present) is completely recovered. This interval is intended to
account for the time required for traffic to once again arrive at the
point in the network that experienced disrupted or degraded service due
to the occurrence of the fault (e.g., the PML). This time may depend on
the location of the fault, the recovery mechanism, and the propagation
delay along the recovery path.

2.2.2 MPLS Reversion Cycle Model

Protection switching in revertive mode requires the traffic to be
switched back to a preferred path when the fault on that path is
cleared. The MPLS reversion cycle model is illustrated in Figure 2.
Note that the cycle shown below comes after the recovery cycle shown in
Figure 1.

   --Network Impairment Repaired
   |     --Fault Cleared
   |     |     --Path Available
   |     |     |     --Start of Reversion Operation
   |     |     |     |     --Reversion Operation Complete
   |     |     |     |     |     --Traffic Restored on Preferred Path
   |     |     |     |     |     |
   v     v     v     v     v     v
   -------------------------------------
   | T7  | T8  | T9  | T10 | T11 |

           Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.
   T7    Fault Clearing Time
   T8    Wait-to-Restore Time
   T9    Notification Time
   T10   Reversion Operation Time
   T11   Traffic Restoration Time

Note that time T6 (not shown above) is the time during which the
network impairment is not repaired and traffic is flowing on the
recovery path.

Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time that
MPLS-based mechanisms learn that the fault has been cleared. This time
may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and the
MPLS-based recovery action(s). Waiting time may be needed to ensure the
path is stable and to avoid flapping in cases where a fault is
intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time
interval if the PSL is configured to wait.

Notification Time

The time between initiation of a fault recovery signal (FRS) by the LSR
clearing the fault and the time at which the Path Switch LSR begins the
reversion operation. This is zero if the PSL clears the fault itself.

Note: If the PSL clears the fault itself, there still may be a
Wait-to-Restore Time period between fault clearing and the start of the
reversion operation.

Reversion Operation Time

The time between the first and last reversion actions. This may include
message exchanges between the PSL and PML to coordinate reversion
actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic
(if present) is completely restored on the preferred path. This
interval is expected to be quite small, since both paths are working
and care may be taken to limit the traffic disruption (e.g., using
"make before break" techniques and synchronous switch-over).
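The cycle models above decompose the total service disruption into
consecutive, additive intervals. As a small illustrative sketch (not
part of this framework; all timer values below are hypothetical
examples, not taken from this document), the total disruption for a
recovery cycle can be estimated by summing its intervals:

```python
# Illustrative only: the T1..T5 intervals of the MPLS recovery cycle
# (Figure 1) are consecutive, so total disruption is their sum.
# All numeric values are hypothetical examples.

RECOVERY_CYCLE = {
    "T1 fault detection":     0.010,  # seconds
    "T2 hold-off":            0.000,  # zero when no lower-layer protection
    "T3 notification":        0.005,  # zero if the PSL detects the fault
    "T4 recovery operation":  0.020,
    "T5 traffic restoration": 0.015,
}

def total_disruption(intervals):
    """Sum a cycle's timing intervals (in seconds)."""
    return sum(intervals.values())

print(f"total disruption: {total_disruption(RECOVERY_CYCLE) * 1000:.0f} ms")
# prints "total disruption: 50 ms"
```

The same additive treatment applies to the reversion cycle's T7..T11
intervals, although, as noted below, speed matters less there.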
In practice, the only interesting times in the reversion cycle are the
Wait-to-Restore Time and the Traffic Restoration Time (or some other
measure of traffic disruption). Given that both paths are available,
there is no need for rapid operation, and a well-controlled switch-back
with minimal disruption is desirable.

2.2.3 Dynamic Re-routing Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state after
a network impairment has occurred. A re-optimized network is achieved
after the routing protocols have converged, and the traffic is moved
from a recovery path to a (possibly) new working path. The steps
involved in this mode are illustrated in Figure 3.

Note that the cycle shown below may be overlaid on the recovery cycle
shown in Figure 1, on the reversion cycle shown in Figure 2, or on
both. The latter occurs when both the recovery cycle and the reversion
cycle take place before the routing protocols converge, and it is
determined after convergence (based on on-line algorithms or off-line
traffic engineering tools, network configuration, or a variety of other
possible criteria) that there is a better route for the working path.

   --Network Enters a Semi-stable State after an Impairment
   |     --Dynamic Routing Protocols Converge
   |     |     --Initiate Setup of New Working Path between PSL and PML
   |     |     |     --Switchover Operation Complete
   |     |     |     |     --Traffic Moved to New Working Path
   |     |     |     |     |
   v     v     v     v     v
   -------------------------------
   | T12 | T13 | T14 | T15 |

           Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.
   T12   Network Route Convergence Time
   T13   Hold-down Time (optional)
   T14   Switchover Operation Time
   T15   Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for the
network routing protocols to converge and for the network to reach a
stable state.

Hold-down Time

We define the hold-down period as a bounded time for which a recovery
path must be used. In some scenarios it may be difficult to determine
whether the working path is stable. In these cases a hold-down time may
be used to prevent excessive flapping of traffic between a working and
a recovery path.

Switchover Operation Time

The time between the first and last switchover actions. This may
include message exchanges between the PSL and PML to coordinate the
switchover actions.

As an example of the recovery cycle, we present the sequence of events
that occurs after a network impairment when a protection switch is
followed by dynamic rerouting:

I.    A link or path fault occurs.
II.   Signaling (FIS) is initiated for the detected fault.
III.  The FIS arrives at the PSL.
IV.   The PSL initiates a protection switch to a pre-configured
      recovery path.
V.    The PSL switches the traffic over from the working path to the
      recovery path.
VI.   The network enters a semi-stable state.
VII.  The dynamic routing protocols converge after the fault, and a new
      working path is calculated (based, for example, on some of the
      criteria mentioned earlier in Section 2.1.1).
VIII. A new working path is established between the PSL and the PML
      (assuming that the PSL and PML have not changed).
IX.   Traffic is switched over to the new working path.

2.3. Definitions and Terminology

This document assumes the terminology given in [1] and, in addition,
introduces the following new terms.
2.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are
created dynamically after the detection of a fault on the working
path. In other words, a recovery mechanism in which the recovery path
is not pre-established.

Protection Switching

A recovery mechanism in which the recovery path or path segments are
created prior to the detection of a fault on the working path. In
other words, a recovery mechanism in which the recovery path is
pre-established.

Working Path

The protected path that carries traffic before the occurrence of a
fault. The working path exists between a PSL and PML. The working
path can be of different kinds: a hop-by-hop routed path, a trunk, a
link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a
fault. In other words, the path on which the traffic is directed by
the recovery mechanism. The recovery path is established by MPLS
means. The recovery path can either be an equivalent recovery path,
ensuring no reduction in quality of service, or a limited recovery
path, which does not guarantee the same quality of service (or some
other criteria of performance) as the working path. A limited
recovery path is not expected to be used for an extended period of
time.

Synonyms for a recovery path are back-up path, alternative path, and
protection path.

Protection Counterpart

The "other" path when discussing pre-planned protection switching
schemes. The protection counterpart for the working path is the
recovery path and vice-versa.

Path Group (PG)

A logical bundling of multiple working paths, each of which is routed
identically between a Path Switch LSR and a Path Merge LSR.
Protected Path Group (PPG)

A path group that requires protection.

Protected Traffic Portion (PTP)

The portion of the traffic on an individual path that requires
protection. For example, code points in the EXP bits of the shim
header may identify a protected portion.

Path Switch LSR (PSL)

An LSR that is responsible for switching or replicating the traffic
between the working path and the recovery path.

Path Merge LSR (PML)

An LSR that is responsible for receiving the recovery path traffic
and either merging the traffic back onto the working path or, if it
is itself the destination, passing the traffic on to the higher layer
protocols.

Intermediate LSR

An LSR on a working or recovery path that is neither a PSL nor a PML
for that path.

Bypass Tunnel

A path that serves to back up a set of working paths using the label
stacking approach [1]. The working paths and the bypass tunnel must
all share the same path switch LSR (PSL) and path merge LSR (PML).

Switch-Over

The process of switching the traffic from the path on which it is
currently flowing onto one or more alternate path(s). This may
involve moving traffic from a working path onto one or more recovery
paths, or moving traffic from a recovery path(s) onto a more optimal
working path(s).

Switch-Back

The process of returning the traffic from one or more recovery paths
back to the working path(s).

Revertive Mode

A recovery mode in which traffic is automatically switched back from
the recovery path to the original working path upon the restoration
of the working path to a fault-free condition. This assumes a failed
working path does not automatically surrender resources to the
network.
Non-revertive Mode

A recovery mode in which traffic is not automatically switched back
to the original working path after this path is restored to a
fault-free condition. (Depending on the configuration, the original
working path may, upon moving to a fault-free condition, become the
recovery path, or it may be used for new working traffic and be no
longer associated with its original recovery path.)

MPLS Protection Domain

The set of LSRs over which a working path and its corresponding
recovery path are routed.

MPLS Protection Plan

The set of all LSP protection paths and the mapping from working to
protection paths deployed in an MPLS protection domain at a given
time.

Liveness Message

A message exchanged periodically between two adjacent LSRs that
serves as a link probing mechanism. It provides an integrity check of
the forward and the backward directions of the link between the two
LSRs as well as a check of neighbor aliveness.

Path Continuity Test

A test that verifies the integrity and continuity of a path or path
segment. The details of such a test are beyond the scope of this
draft. (This could be accomplished, for example, by transmitting a
control message along the same links and nodes as the data traffic,
or, similarly, by detecting the absence of traffic and providing
feedback.)

2.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms,
defined as the failure of the liveness message test or of a path
continuity test, indicating that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms
that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost.
This may be communicated to the MPLS-based recovery mechanisms by the
lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the
link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is
relayed by each intermediate LSR to its upstream or downstream
neighbor, until it reaches an LSR that is set up to perform MPLS
recovery. The FIS is transmitted periodically by the node/nodes
closest to the point of failure, for some configurable length of
time.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been
repaired. Like the FIS, it is relayed by each intermediate LSR to its
upstream or downstream neighbor, until it reaches the LSR that
performs recovery of the original path. The FRS is transmitted
periodically by the node/nodes closest to the point of failure, for
some configurable length of time.

2.4. Abbreviations

FIS: Fault Indication Signal.
FRS: Fault Recovery Signal.
LD:  Link Degraded.
LF:  Link Failure.
PD:  Path Degraded.
PF:  Path Failure.
PML: Path Merge LSR.
PG:  Path Group.
PPG: Protected Path Group.
PTP: Protected Traffic Portion.
PSL: Path Switch LSR.

3. MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and
complete restoration of traffic affected by a fault in an
MPLS-enabled network. The fault may be detected at the IP layer or in
lower layers over which IP traffic is transported. The fastest MPLS
recovery is assumed to be achieved with protection switching, with an
MPLS LSR switch completion time comparable or equivalent to the 50 ms
switch-over completion time of the SONET layer.
This section provides a discussion of the concepts and principles of
MPLS-based recovery. The concepts are presented in terms of atomic or
primitive terms that may be combined to specify recovery approaches.
We do not make any assumptions about the underlying layer 1 or layer
2 transport mechanisms or their recovery mechanisms.

3.1. Configuration of Recovery

An LSR may support any or all of the following recovery options:

Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP
rerouting or by some lower layer mechanism such as SONET APS. This
is equivalent to having no MPLS-based recovery. This option may be
used for low priority traffic or for traffic that is recovered in
another way (for example, load-shared traffic on parallel working
paths may be automatically recovered upon a fault along one of the
working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled):
The working path is recovered using one or more recovery paths,
either via rerouting or via protection switching.

3.2. Initiation of Path Setup

There are three options for the initiation of the recovery path
setup.

Pre-established:

This is the same as the protection switching option. Here a recovery
path(s) is established prior to any failure on the working path. The
path selection can either be determined by an administrative
centralized tool, or chosen based on some algorithm implemented at
the PSL and possibly intermediate nodes. To guard against the
situation in which the pre-established recovery path fails before or
at the same time as the working path, the recovery path should have
secondary configuration options as explained in Section 3.3 below.

Pre-qualified:

A pre-established path need not be created; it may instead be
pre-qualified.
A pre-qualified recovery path is not created expressly for protecting
the working path, but instead is a path created for other purposes
that is designated as a recovery path after determination that it is
an acceptable alternative for carrying the working path traffic.
Variants include the case where an optical path or trail is
configured, but no switches are set.

Established-on-Demand:

This is the same as the rerouting option. Here, a recovery path is
established after a failure on its working path has been detected and
notified to the PSL.

3.3. Initiation of Resource Allocation

A recovery path may support the same traffic contract as the working
path, or it may not. We will distinguish these two situations by
using different additive terms. If the recovery path is capable of
replacing the working path without degrading service, it will be
called an equivalent recovery path. If the recovery path lacks the
resources (or resource reservations) to replace the working path
without degrading service, it will be called a limited recovery path.
Based on this, there are two options for the initiation of resource
allocation:

Pre-reserved:

This option applies only to protection switching. Here a
pre-established recovery path reserves the required resources on all
hops along its route during its establishment. Although the reserved
resources (e.g., bandwidth and/or buffers) at each node cannot be
used to admit more working paths, they are available to be used by
all traffic that is present at the node before a failure occurs.

Reserved-on-Demand:

This option may apply either to rerouting or to protection switching.
Here a recovery path reserves the required resources after a failure
on the working path has been detected and notified to the PSL and
before the traffic on the working path is switched over to the
recovery path.
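The difference between the two options can be illustrated with a toy
model. This is a sketch only; the class and method names below are
hypothetical and not defined by this framework. A pre-reserved
recovery path claims its resources at establishment time, while a
reserved-on-demand path claims them only after the PSL has been
notified of a fault on the working path.

```python
# Toy model of the two resource-allocation options. All names here
# (Node, RecoveryPath, on_fault_notified) are illustrative only.

class Node:
    def __init__(self, capacity):
        self.capacity = capacity   # total bandwidth at this hop
        self.reserved = 0

    def reserve(self, bw):
        if self.reserved + bw > self.capacity:
            raise ValueError("insufficient resources")
        self.reserved += bw

class RecoveryPath:
    def __init__(self, hops, bw, pre_reserved):
        self.hops, self.bw = hops, bw
        self.active = False
        if pre_reserved:           # Pre-reserved: claim resources now
            self._reserve()

    def _reserve(self):
        for hop in self.hops:
            hop.reserve(self.bw)
        self.active = True

    def on_fault_notified(self):
        # Reserved-on-Demand: claim resources only after the PSL
        # learns of a fault on the working path.
        if not self.active:
            self._reserve()

hops = [Node(capacity=10), Node(capacity=10)]
on_demand = RecoveryPath(hops, bw=4, pre_reserved=False)
assert hops[0].reserved == 0       # nothing reserved before the fault
on_demand.on_fault_notified()
assert hops[0].reserved == 4       # resources claimed after notification
```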
Note that under both the options above, depending on the amount of
resources reserved on the recovery path, it could either be an
equivalent recovery path or a limited recovery path.

3.4. Scope of Recovery

3.4.1 Topology

3.4.1.1 Local Repair

The intent of local repair is to protect against a link or neighbor
node fault and to minimize the amount of time required for failure
propagation. In local repair (also known as local recovery [10] [9]),
the node immediately upstream of the fault is the one to initiate
recovery (either rerouting or protection switching). Local repair can
be of two types:

Link Recovery/Restoration

In this case, the recovery path may be configured to route around a
certain link deemed to be unreliable. If protection switching is
used, several recovery paths may be configured for one working path,
depending on the specific faulty link that each protects against.
Alternatively, if rerouting is used, upon the occurrence of a fault
on the specified link each path is rebuilt such that it detours
around the faulty link.

In this case, the recovery path need only be disjoint from its
working path at a particular link on the working path, and may have
overlapping segments with the working path. Traffic on the working
path is switched over to an alternate path at the upstream LSR that
connects to the failed link. This method is potentially the fastest
to perform the switchover, and can be effective in situations where
certain path components are much more unreliable than others.

Node Recovery/Restoration

In this case, the recovery path may be configured to route around a
neighbor node deemed to be unreliable. Thus the recovery path is
disjoint from the working path only at a particular node and at links
associated with the working path at that node.
Once again, the traffic on the primary path is switched over to the
recovery path at the upstream LSR that directly connects to the
failed node, and the recovery path shares overlapping portions with
the working path.

3.4.1.2 Global Repair

The intent of global repair is to protect against any link or node
fault on a path or on a segment of a path, with the obvious exception
of the faults occurring at the ingress node of the protected path
segment. In global repair the PSL is usually distant from the failure
and needs to be notified by a FIS. Global repair also applies to
end-to-end path recovery/restoration. In many cases, the recovery
path can be made completely link and node disjoint with its working
path. This has the advantage of protecting against all link and node
fault(s) on the working path (end-to-end path or path segment).
However, it is in some cases slower than local repair since it takes
longer for the fault notification message to reach the PSL and
trigger the recovery action.

3.4.1.3 Alternate Egress Repair

It is possible to restore service without specifically recovering the
faulted path. For example, for best effort IP service it is possible
to select a recovery path that has a different egress point from the
working path (i.e., there is no PML). The recovery path egress must
simply be a router that is acceptable for forwarding the FEC carried
by the working path (without creating loops). In an engineering
context, specific alternative FEC/LSP mappings with alternate
egresses can be formed.

This may simplify enhancing the reliability of implicitly constructed
MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate
recovery paths simply by requiring that they be link and node
disjoint from the immediate downstream LSR of the working path.
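The notification-distance difference between local repair (Section
3.4.1.1) and global repair (Section 3.4.1.2) can be sketched as a
small calculation. The function name and path representation below
are illustrative assumptions, not part of this framework: local
repair switches at the node just upstream of the fault, so no FIS
relay is needed, while global repair must relay the FIS hop-by-hop
back to the PSL.

```python
# Illustrative comparison of how far the fault notification must
# travel for local versus global repair on a linear working path.

def fis_hops(path, failed_link, repair="global"):
    """Number of upstream hops the fault notification traverses.

    path        -- list of LSR names, PSL first (hypothetical format)
    failed_link -- (upstream_lsr, downstream_lsr) tuple on the path
    """
    upstream = path.index(failed_link[0])
    if repair == "local":
        return 0           # detecting node is also the switching node
    return upstream        # FIS relayed hop-by-hop back to the PSL

path = ["PSL", "A", "B", "C", "PML"]
assert fis_hops(path, ("B", "C"), repair="local") == 0
assert fis_hops(path, ("B", "C"), repair="global") == 2
```

This is one reason local repair is potentially faster: its recovery
trigger does not wait on FIS propagation, at the cost of per-link (or
per-node) recovery path configuration.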
3.4.1.4 Multi-Layer Repair

Multi-layer repair broadens the network designer's tool set for those
cases where multiple network layers can be managed together to
achieve overall network goals. Specific criteria for determining
when multi-layer repair is appropriate are beyond the scope of this
draft.

3.4.1.5 Concatenated Protection Domains

A given service may cross multiple networks, and these may employ
different recovery mechanisms. It is possible to concatenate
protection domains so that service recovery can be provided
end-to-end. It is considered that the recovery mechanisms in
different domains may operate autonomously, and that multiple points
of attachment may be used between domains (to ensure there is no
single point of failure). Alternate egress repair requires management
of concatenated domains in that an explicit MPLS point of failure
(the PML) is by definition excluded. Details of concatenated
protection domains are beyond the scope of this draft.

3.4.2 Path Mapping

Path mapping refers to the methods of mapping traffic from a faulty
working path on to the recovery path. There are several options for
this, as described below. Note that the options below should be
viewed as atomic terms that only describe how the working and
protection paths are mapped to each other. The issues of resource
reservation along these paths, and how switchover is actually
performed, lead to the more commonly used composite terms, such as
1+1 and 1:1 protection, which were described in Section 2.1.

1-to-1 Protection

In 1-to-1 protection, the working path has a designated recovery path
that is only to be used to recover that specific working path.

n-to-1 Protection

In n-to-1 protection, up to n working paths are protected using only
one recovery path.
If the intent is to protect against any single fault on any of the
working paths, the n working paths should be diversely routed between
the same PSL and PML. In some cases, handshaking between PSL and PML
may be required to complete the recovery, the details of which are
beyond the scope of this draft.

n-to-m Protection

In n-to-m protection, up to n working paths are protected using m
recovery paths. Once again, if the intent is to protect against any
single fault on any of the n working paths, the n working paths and
the m recovery paths should be diversely routed between the same PSL
and PML. In some cases, handshaking between PSL and PML may be
required to complete the recovery, the details of which are beyond
the scope of this draft. n-to-m protection is for further study.

Split Path Protection

In split path protection, multiple recovery paths are allowed to
carry the traffic of a working path based on a certain configurable
load splitting ratio. This is especially useful when no single
recovery path can be found that can carry the entire traffic of the
working path in case of a fault. Split path protection may require
handshaking between the PSL and the PML(s), and may require the
PML(s) to correlate the traffic arriving on multiple recovery paths
with the working path. Although this is an attractive option, the
details of split path protection are beyond the scope of this draft,
and are for further study.

3.4.3 Bypass Tunnels

It may be convenient, in some cases, to create a "bypass tunnel" for
a PPG between a PSL and PML, thereby allowing multiple recovery paths
to be transparent to intervening LSRs [8]. In this case, one LSP
(the tunnel) is established between the PSL and PML following an
acceptable route, and a number of recovery paths are supported
through the tunnel via label stacking.
A bypass tunnel can be used with any of the path mapping options
discussed in the previous section.

As with recovery paths, the bypass tunnel may or may not have
resource reservations sufficient to provide recovery without service
degradation. It is possible that the bypass tunnel may have
sufficient resources to recover some number of working paths, but not
all at the same time. If the number of recovery paths carrying
traffic in the tunnel at any given time is restricted, this is
similar to the n-to-1 or n-to-m protection cases mentioned in Section
3.4.2.

3.4.4 Recovery Granularity

Another dimension of recovery considers the amount of traffic
requiring protection. This may range from a fraction of a path to a
bundle of paths.

3.4.4.1 Selective Traffic Recovery

This option allows for the protection of a fraction of traffic within
the same path. The portion of the traffic on an individual path that
requires protection is called a protected traffic portion (PTP). A
single path may carry different classes of traffic, with different
protection requirements. The protected portion of this traffic may be
identified by its class, as for example, via the EXP bits in the MPLS
shim header or via the priority bit in the ATM header.

3.4.4.2 Bundling

Bundling is a technique used to group multiple working paths together
in order to recover them simultaneously. The logical bundling of
multiple working paths requiring protection, each of which is routed
identically between a PSL and a PML, is called a protected path group
(PPG). When a fault occurs on the working path carrying the PPG, the
PPG as a whole can be protected either by being switched to a bypass
tunnel or by being switched to a recovery path.
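The capacity constraint on bypass tunnels noted in Section 3.4.3 can
be sketched as a simple admission check. This is a hypothetical
model, not an API defined by this framework: a tunnel reserved with
less bandwidth than the sum of the working paths it backs up can
recover some of those paths after a fault, but not all at once.

```python
# Sketch of a bypass tunnel with limited recovery resources. The
# BypassTunnel class and its admit() method are illustrative only.

class BypassTunnel:
    def __init__(self, capacity):
        self.capacity = capacity   # bandwidth reserved for recovery
        self.carrying = {}         # recovery paths currently in the tunnel

    def admit(self, path_id, bw):
        """Try to carry a recovery path through the tunnel after a fault."""
        in_use = sum(self.carrying.values())
        if in_use + bw > self.capacity:
            return False           # protected in principle, but not now
        self.carrying[path_id] = bw
        return True

tunnel = BypassTunnel(capacity=10)
assert tunnel.admit("lsp-1", bw=6)
assert tunnel.admit("lsp-2", bw=4)
assert not tunnel.admit("lsp-3", bw=2)   # tunnel capacity exhausted
```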
3.4.5 Recovery Path Resource Use

In the case of pre-reserved recovery paths, there is the question of
what use these resources may be put to when the recovery path is not
in use. There are three options:

Dedicated-resource:
If the recovery path resources are dedicated, they may not be used
for anything except carrying the working traffic. For example, in
the case of 1+1 protection, the working traffic is always carried on
the recovery path. Even if the recovery path is not always carrying
the working traffic, it may not be possible or desirable to allow
other traffic to use these resources.

Extra-traffic-allowed:
If the recovery path only carries the working traffic when the
working path fails, then it is possible to allow extra traffic to use
the reserved resources at other times. Extra traffic is, by
definition, traffic that can be displaced (without violating service
agreements) whenever the recovery path resources are needed for
carrying the working path traffic.

Shared-resource:
A shared recovery resource is dedicated for use by multiple primary
resources that (according to SRLGs) are not expected to fail
simultaneously. Determining which resources can be shared can be
accomplished by offline analysis or by techniques described in [14].

3.5. Fault Detection

MPLS recovery is initiated after the detection of either a lower
layer fault or a fault at the IP layer or in the operation of
MPLS-based mechanisms. We consider four classes of impairments: Path
Failure, Path Degraded, Link Failure, and Link Degraded.

Path Failure (PF) is a fault that indicates to an MPLS-based recovery
scheme that the connectivity of the path is lost. This may be
detected by a path continuity test between the PSL and PML.
Some, and perhaps the most common, path failures may be detected
using a link probing mechanism between neighbor LSRs. An example of a
probing mechanism is a liveness message that is exchanged
periodically along the working path between peer LSRs. For either a
link probing mechanism or a path continuity test to be effective, the
test message must be guaranteed to follow the same route as the
working or recovery path, over the segment being tested. In addition,
the path continuity test must take the path merge points into
consideration. In the case of a bi-directional link implemented as
two unidirectional links, path failure could mean that either one or
both unidirectional links are damaged.

Path Degraded (PD) is a fault that indicates to MPLS-based recovery
schemes/mechanisms that the path has connectivity, but that the
quality of the connection is unacceptable. This may be detected by a
path performance monitoring mechanism, or some other mechanism for
determining the error rate on the path or some portion of the path.

A degraded condition may also be local to the LSR, consisting of
excessive discarding of packets at an interface, either due to label
mismatch or due to TTL errors, for example.

Link Failure (LF) is an indication from a lower layer that the link
over which the path is carried has failed. If the lower layer
supports detection and reporting of this fault (that is, any fault
that indicates link failure, e.g., SONET LOS), this may be used by
the MPLS recovery mechanism. In some cases, using LF indications may
provide faster fault detection than using only MPLS-based fault
detection mechanisms.

Link Degraded (LD) is an indication from a lower layer that the link
over which the path is carried is performing below an acceptable
level.
If the lower layer supports detection and reporting of this fault, it
may be used by the MPLS recovery mechanism. In some cases, using LD
indications may provide faster fault detection than using only
MPLS-based fault detection mechanisms.

3.6. Fault Notification

MPLS-based recovery relies on rapid and reliable notification of
faults. Once a fault is detected, the node that detected the fault
must determine if the fault is severe enough to require path
recovery. If the node is not capable of initiating direct action
(e.g., as a PSL), the node should send out a notification of the
fault by transmitting a FIS to those of its upstream LSRs that were
sending traffic on the working path that is affected by the fault.
This notification is relayed hop-by-hop by each subsequent LSR to its
upstream neighbor, until it eventually reaches a PSL. A PSL is the
only LSR that can terminate the FIS and initiate a protection switch
of the working path to a recovery path.

Since the FIS is a control message, it should be transmitted with
high priority to ensure that it propagates rapidly towards the
affected PSL(s). Depending on how fault notification is configured in
the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2
or Layer 3 packet [11]. The use of a Layer 2-based notification
requires a Layer 2 path direct to the PSL. An example of a FIS could
be the liveness message sent by a downstream LSR to its upstream
neighbor, with an optional fault notification field set, or the FIS
can be implicitly denoted by a teardown message. Alternatively, it
could be a separate fault notification packet. The intermediate LSR
should identify which of its incoming links (upstream LSRs) to
propagate the FIS on. In the case of 1+1 protection, the FIS should
also be sent downstream to the PML, where the recovery action is
taken.

3.7.
Switch-Over Operation

3.7.1 Recovery Trigger

The activation of an MPLS protection switch following the detection
or notification of a fault requires a trigger mechanism at the PSL.

MPLS protection switching may be initiated due to automatic inputs or
external commands. The automatic activation of an MPLS protection
switch results from a response to defect or fault conditions detected
at the PSL or to fault notifications received at the PSL. It is
possible that the fault detection and trigger mechanisms may be
combined, as is the case when a PF, PD, LF, or LD is detected at a
PSL and triggers a protection switch to the recovery path. In most
cases, however, the detection and trigger mechanisms are distinct,
involving the detection of a fault at some intermediate LSR followed
by the propagation of a fault notification back to the PSL via the
FIS, which serves as the protection switch trigger at the PSL. MPLS
protection switching in response to external commands results when
the operator initiates a protection switch by a command to a PSL (or
alternatively by a configuration command to an intermediate LSR,
which transmits the FIS towards the PSL).

Note that the PF fault applies to hard failures (fiber cuts,
transmitter failures, or LSR fabric failures), as does the LF fault,
with the difference that the LF is a lower layer impairment that may
be communicated to MPLS-based recovery mechanisms. The PD (or LD)
fault, on the other hand, applies to soft defects (excessive errors
due to noise on the link, for instance). The PD (or LD) results in a
fault declaration only when the percentage of lost packets exceeds a
given threshold, which is provisioned and may be set based on the
service level agreement(s) in effect between a service provider and a
customer.
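The threshold rule for declaring a PD (or LD) fault can be sketched
in a few lines. The function name and the 5% default below are
illustrative assumptions; in practice the threshold is provisioned
from the applicable service level agreement(s).

```python
# Minimal sketch of the provisioned-threshold rule for soft defects:
# degradation is declared only when the fraction of lost packets
# exceeds the configured threshold. Names and values are hypothetical.

def degraded(sent, lost, loss_threshold=0.05):
    """Return True when packet loss exceeds the provisioned threshold."""
    if sent == 0:
        return False               # no traffic, no loss measurement
    return lost / sent > loss_threshold

assert not degraded(sent=1000, lost=30)   # 3% loss: below threshold
assert degraded(sent=1000, lost=80)       # 8% loss: PD fault declared
```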
3.7.2 Recovery Action

After a fault is detected or a FIS is received by the PSL, the
recovery action involves either a rerouting or a protection switching
operation. In both scenarios, the next hop label forwarding entry for
a recovery path is bound to the working path.

3.8. Post Recovery Operation

When traffic is flowing on the recovery path, a decision can be made
either to let the traffic remain on the recovery path and consider it
a new working path, or to switch it back to the old working path or
to a new working path. This post recovery operation has two styles:
one in which the protection counterparts, i.e., the working and
recovery paths, are fixed or "pinned" to their routes, and one in
which the PSL or other network entity with real-time knowledge of the
failure dynamically performs re-establishment or controlled
rearrangement of the paths comprising the protected service.

3.8.1 Fixed Protection Counterparts

For fixed protection counterparts, the PSL will be pre-configured
with the appropriate behavior to take when the original fixed path is
restored to service. The choices are revertive and non-revertive
mode. The choice will typically depend on the relative costs of the
working and protection paths, and the tolerance of the service to the
effects of switching paths yet again. These protection modes indicate
whether or not there is a preferred path for the protected traffic.

3.8.1.1 Revertive Mode

If the working path is always the preferred path, this path will be
used whenever it is available. Thus, in the event of a fault on this
path, its unused resources will not be reclaimed by the network. If
the working path has a fault, traffic is switched to the recovery
path. In the revertive mode of operation, when the preferred path is
restored the traffic is automatically switched back to it.
There are a number of implications to pinned working and recovery
paths:
- upon failure, once traffic is moved to the recovery path, it is
  unprotected until the defect in the original working path is
  repaired and that path is restored to service.
- upon failure, once traffic is moved to the recovery path, the
  resources associated with the original path remain reserved.

3.8.1.2 Non-revertive Mode

In the non-revertive mode of operation, there is no preferred path,
or it may be desirable to minimize further disruption of the service
brought on by a revertive switching operation. A switch-back to the
original working path is not desired or not possible, since the
original path may no longer exist after the occurrence of a fault on
that path. If there is a fault on the working path, traffic is
switched to the recovery path. When or if the faulty path (the
original working path) is restored, it may become the recovery path
(either by configuration, or, if desired, by management actions).

In the non-revertive mode of operation, the working traffic may or
may not eventually be restored to a new optimal working path or to
the original working path. This is because, in some cases, it might
be useful to: (a) administratively perform a protection switch back
to the original working path after gaining further assurances about
the integrity of the path, (b) continue operation on the recovery
path, or (c) move the traffic to a new optimal working path that is
calculated based on network topology and network policies.
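The behavioral difference between the two fixed-counterpart modes can
be sketched as a small state model. The class and method names below
are illustrative, not defined by this framework: in revertive mode the
repaired working path automatically reclaims the traffic, while in
non-revertive mode the traffic stays on the recovery path.

```python
# Sketch of revertive versus non-revertive post-recovery behavior at
# the PSL. ProtectionDomain and its methods are hypothetical names.

class ProtectionDomain:
    def __init__(self, revertive):
        self.revertive = revertive
        self.carrying = "working"

    def on_fault(self):
        self.carrying = "recovery"      # switch-over at the PSL

    def on_working_path_restored(self):
        if self.revertive:
            self.carrying = "working"   # automatic switch-back

rev = ProtectionDomain(revertive=True)
rev.on_fault()
rev.on_working_path_restored()
assert rev.carrying == "working"

nonrev = ProtectionDomain(revertive=False)
nonrev.on_fault()
nonrev.on_working_path_restored()
assert nonrev.carrying == "recovery"    # traffic remains on recovery path
```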
3.8.2 Dynamic Protection Counterparts

For dynamic protection counterparts, when the traffic is switched
over to a recovery path, the association between the original working
path and the recovery path may no longer exist, since the original
path itself may no longer exist after the fault. Instead, when the
network reaches a stable state following routing convergence, the
traffic on the recovery path may be switched over to a different
preferred path, selected either by optimization based on the new
network topology and associated information, or based on
pre-configured information.

Dynamic protection counterparts assume that, upon failure, the PSL or
another network entity will establish new working paths if a further
switchover is to be performed.

3.8.3 Restoration and Notification

MPLS restoration deals with returning the working traffic from the
recovery path to the original or a new working path. Reversion is
performed by the PSL either upon receiving notification, via FRS,
that the working path is repaired, or upon receiving notification
that a new working path has been established.

For fixed counterparts in revertive mode, the LSR that detected the
fault on the working path also detects the restoration of the working
path. If the working path had experienced an LF defect, the LSR
detects the return to normal operation via the receipt of a liveness
message from its peer. If the working path had experienced an LD
defect at an LSR interface, the LSR can detect the return to normal
operation via the resumption of error-free packet reception on that
interface. Alternatively, a lower layer that no longer detects an LF
defect may inform the MPLS-based recovery mechanisms at the LSR that
the link to its peer LSR is operational. The LSR then transmits FRS
to its upstream LSR(s) that were transmitting traffic on the working
path.
At the point the PSL receives the FRS, it switches the working
traffic back to the original working path.

A similar scheme applies to dynamic counterparts, where, for example,
a topology update and/or network convergence may trigger the
installation or setup of new working paths and a notification to the
PSL to perform a switchover.

We note that if there is a way to transmit fault information back
along a recovery path towards a PSL, and if the recovery path is an
equivalent working path, it is possible for the working path and its
recovery path to exchange roles once the original working path is
repaired following a fault. In that case, the recovery path
effectively becomes the working path, and the restored working path
functions as a recovery path for the original recovery path. This is
important, since it affords the benefits of the non-revertive mode of
operation outlined in Section 3.8.1.2 without leaving the recovery
path unprotected.

3.8.4 Reverting to Preferred Path (or Controlled Rearrangement)

In the revertive mode, a "make before break" restoration switch-back
can be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This minimizes
both packet loss and packet reordering. The controlled rearrangement
of paths can also be used to satisfy traffic engineering requirements
for load balancing across an MPLS domain.

3.9. Performance

Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:

I. Resource class attribute:

Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In other
words, the recovery path meets the same SLAs as the working path.
Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working path.

A. Lower Class: The recovery path has lower resource requirements or
less stringent performance requirements than the working path.

B. Best Effort Class: The recovery path is best effort.

II. Priority Attribute:
The recovery path has a priority attribute, just like the working
path (i.e., the priority attribute of the associated traffic trunks).
It can have the same priority as the working path or a lower one.

III. Preemption Attribute:
The recovery path can have the same preemption attribute as the
working path or a lower one.

4. MPLS Recovery Features

The following features are desirable from an operational point of
view:

I. It is highly desirable that MPLS recovery provide an option to
identify protection groups (PPGs) and protection portions (PTPs).

II. Each PSL should be capable of performing MPLS recovery upon the
detection of impairments or upon receipt of notifications of
impairments.

III. An MPLS recovery method should not preclude manual protection
switching commands. This implies that it should be possible, via
administrative commands, to transfer traffic from a working path to a
recovery path, or to transfer traffic from a recovery path to a
working path once the working path becomes operational following a
fault.

IV. A PSL may be capable of performing either a switch-back to the
original working path after the fault is corrected, or a switchover
to a new working path upon the discovery or establishment of a more
optimal working path.

V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects a merged segment, all the paths
sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
affected by the fault should be recovered.

5. Comparison Criteria

Possible criteria for comparing MPLS-based recovery schemes are as
follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic to be flowing) after a fault. Recovery time
is the sum of the fault detection time, hold-off time, notification
time, recovery operation time, and traffic restoration time. In
other words, it is the time from the failure of a node or link in the
network to the time a recovery path is installed and traffic starts
flowing on it.

Full Restoration Time

We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of, or have been engineered sufficiently for,
handling traffic in recovery scenarios. Note that this time may or
may not differ from the recovery time, depending on whether
equivalent or limited recovery paths are used.

Setup Vulnerability

The amount of time that a working path, or a set of working paths, is
left unprotected during tasks such as recovery path computation and
recovery path setup may be used to compare schemes. The nature of
this vulnerability should also be taken into account: end-to-end
schemes correlate the vulnerability with individual working paths,
local repair schemes have a topological correlation that cuts across
working paths, and network-planning approaches have a correlation
that impacts the entire network.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will depend on the traffic
characteristics of the network.
However, it may also depend on the particular protection plan
selection algorithms, as well as on the signaling and re-routing
methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to span a spectrum of "packet
survivability", ranging from "relative" to "absolute". Relative
survivability may mean that a packet is on an equal footing with
other traffic of, for example, the same diff-serv code point (DSCP)
in contending for the surviving network resources. Absolute
survivability may mean that the survivability of the protected
traffic has explicit guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets. The action of
putting traffic back on preferred paths might also cause packet
re-ordering.

State Overhead

As the number of recovery paths in a protection plan grows, the state
required to maintain them also grows. Schemes may require differing
numbers of paths to maintain certain levels of coverage, and the
state required may also depend on the particular recovery scheme
used. In many cases, the state overhead will be proportional to the
number of recovery paths.

Loss

Recovery schemes may introduce a certain amount of packet loss during
switchover to a recovery path. For schemes that introduce loss during
recovery, the loss can be estimated from the recovery time and the
link speed. In the case of link or node failure, some packet loss is
inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I.
Fault Types: Recovery schemes may account for link faults only, for
both node and link faults, or also for degraded service. For example,
a scheme may require more recovery paths in order to take node faults
into account.

II. Number of concurrent faults: Depending on the layout of the
recovery paths in the protection plan, it may be possible to recover
from multiple fault scenarios.

III. Number of recovery paths: For a given fault, there may be one or
more recovery paths.

IV. Percentage of coverage: Depending on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into the percentage of link faults and the
percentage of node faults.

V. The number of protected paths may affect how fast the total set of
paths affected by a fault can be recovered. The ratio of protected
paths is n/N, where n is the number of protected paths and N is the
total number of paths.

6. Security Considerations

The MPLS recovery framework specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

7. Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

8. Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on earlier versions of this draft; in particular,
Bora Akyol, Dave Allan, Neil Harrison, and Dave Danenberg, whose
suggestions and comments were very helpful in revising the document.

The editors would like to give very special thanks to Curtis
Villamizar for his careful and extremely thorough reading of the
document, and for taking the time to provide numerous suggestions,
which were very helpful in our latest revision of the document.
9. Authors' Addresses

Vishal Sharma                        Ben Mack-Crane
Metanoia, Inc.                       Tellabs Operations, Inc.
335 Elan Village Ln., Unit 203       4951 Indiana Avenue
San Jose, CA 95134                   Lisle, IL 60532
Phone: 408-943-1794                  Phone: 630-512-7255
v.sharma@ieee.org                    Ben.Mack-Crane@tellabs.com

Srinivas Makam                       Ken Owens
Tellabs Operations, Inc.             Erlang Technology, Inc.
Lisle, IL 60532                      St. Louis, MO 63119
Phone: 630-512-7217                  Phone: 314-918-1579
Srinivas.Makam@tellabs.com           keno@erlangtech.com

Changcheng Huang                     Fiffi Hellstrand
Dept. of Systems & Computer Engg.    Nortel Networks
Carleton University                  St Eriksgatan 115
Minto Center, Rm. 3082               PO Box 6701
1125 Colonial By Drive               113 85 Stockholm, Sweden
Ottawa, Ontario K1S 5B6, Canada      Phone: +46 8 5088 3687
Phone: 613 520-2600 x2477            Fiffi@nortelnetworks.com
Changcheng.Huang@sce.carleton.ca

Jon Weil                             Brad Cain
Nortel Networks                      Cereva Networks
Harlow Laboratories, London Road     3 Network Drive
Harlow, Essex CM17 9NA, UK           Marlborough, MA 01752
Phone: +44 (0)1279 403935            Phone: 508-787-5000
jonweil@nortelnetworks.com           bcain@cereva.com

Loa Andersson                        Bilel Jamoussi
Utfors AB                            Nortel Networks
Råsundavägen 12, Box 525             3 Federal Street, BL3-03
169 29 Solna, Sweden                 Billerica, MA 01821, USA
Phone: +46 8 5270 5038               Phone: (978) 288-4506
loa.andersson@utfors.se              jamoussi@nortelnetworks.com

Seyhan Civanlar                      Angela Chiu
Lemur Networks, Inc.                 Celion Networks, Inc.
135 West 20th Street, 5th Floor      One Shiela Drive, Suite 2
New York, NY 10011                   Tinton Falls, NJ 07724
Phone: 212-367-7676                  Phone: (732) 345-3441
scivanlar@lemurnetworks.com          angela.chiu@celion.com

10. References

[1]  Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
     Switching Architecture", RFC 3031, January 2001.
[2]  Andersson, L., Doolan, P., Feldman, N., Fredette, A., and
     Thomas, B., "LDP Specification", RFC 3036, January 2001.

[3]  Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement
     for Extensions to RSVP for LSP-Tunnels", RFC 3210, December
     2001.

[4]  Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
     RFC 3212, January 2002.

[5]  Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
     ReSerVation Protocol (RSVP) -- Version 1 Functional
     Specification", RFC 2205, September 1997.

[6]  Awduche, D., et al., "RSVP-TE: Extensions to RSVP for LSP
     Tunnels", RFC 3209, December 2001.

[7]  Hellstrand, F., and Andersson, L., "Extensions to RSVP-TE and
     CR-LDP for setup of pre-established LSP Tunnels", Internet
     Draft, draft-hellstrand-mpls-recovery-merge-01.txt, Work in
     Progress, November 2000.

[8]  Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
     J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
     September 1999.

[9]  Kini, S., Lakshman, T. V., and Villamizar, C., "Reservation
     Protocol with Traffic Engineering Extensions: Extension for
     Label Switched Path Restoration", Internet Draft, draft-kini-
     rsvp-lsp-restoration-00.txt, Work in Progress, November 2000.

[10] Haskin, D., and Krishnan, R., "A Method for Setting an
     Alternative Label Switched Path to Handle Fast Reroute",
     Internet Draft, draft-haskin-mpls-fast-reroute-05.txt, Work in
     Progress, November 2000.

[11] Owens, K., Makam, V., Sharma, V., Mack-Crane, B., and Huang, C.,
     "A Path Protection/Restoration Mechanism for MPLS Networks",
     Internet Draft, draft-chang-mpls-path-protection-03.txt, Work in
     Progress, July 2001.
[14] Kini, S., Kodialam, M., Sengupta, S., and Villamizar, C.,
     "Shared Backup Label Switched Path Restoration", Internet Draft,
     draft-kini-restoration-shared-backup-01.txt, Work in Progress,
     May 2001.