IETF Draft                                             Vishal Sharma
Multi-Protocol Label Switching                        Ben Mack-Crane
Expires: May 2001                                     Srinivas Makam
                                                           Ken Owens
                                            Tellabs Operations, Inc.

                                                    Changcheng Huang
                                                 Carleton University

                                                    Fiffi Hellstrand
                                                            Jon Weil
                                                       Loa Andersson
                                                      Bilel Jamoussi
                                                     Nortel Networks

                                                           Brad Cain
                                               Mirror Image Internet

                                                     Seyhan Civanlar
                                                     Coreon Networks

                                                         Angela Chiu
                                                           AT&T Labs

                                                       November 2000

                 Framework for MPLS-based Recovery

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

Multi-protocol label switching (MPLS) [1] integrates the label
swapping forwarding paradigm with network layer routing.
To deliver reliable service, MPLS requires a set of procedures to
provide protection of the traffic carried on different paths. This
requires that the label switched routers (LSRs) support fault
detection, fault notification, and fault recovery mechanisms, and
that MPLS signaling [2] [3] [4] [5] [6] support the configuration of
recovery. With these objectives in mind, this document specifies a
framework for MPLS-based recovery.

Table of Contents

1.0 Introduction
    1.1 Background
    1.2 Motivation for MPLS-Based Recovery
    1.3 Objectives/Goals
2.0 Overview
    2.1 Recovery Models
    2.2 Recovery Cycles
        2.2.1 MPLS Recovery Cycle Model
        2.2.2 MPLS Reversion Cycle Model
        2.2.3 Dynamic Rerouting Cycle Model
    2.3 Definitions and Terminology
    2.4 Abbreviations
3.0 MPLS Recovery Principles
    3.1 Configuration of Recovery
    3.2 Initiation of Path Setup
    3.3 Initiation of Resource Allocation
    3.4 Scope of Recovery
        3.4.1 Topology
            3.4.1.1 Local Repair
            3.4.1.2 Global Repair
            3.4.1.3 Alternate Egress Repair
            3.4.1.4 Multi-Layer Repair
            3.4.1.5 Concatenated Protection Domains
        3.4.2 Path Mapping
        3.4.3 Bypass Tunnels
        3.4.4 Recovery Granularity
            3.4.4.1 Selective Traffic Recovery
            3.4.4.2 Bundling
        3.4.5 Recovery Path Resource Use
    3.5 Fault Detection
    3.6 Fault Notification
    3.7 Switch Over Operation
        3.7.1 Recovery Trigger
        3.7.2 Recovery Action
    3.8 Switch Back Operation
        3.8.1 Fixed Protection Counterparts
        3.8.2 Dynamic Protection Counterparts
        3.8.3 Restoration and Notification
        3.8.4 Reverting to Preferred Path
    3.9 Performance
4.0 Recovery Requirements
5.0 MPLS Recovery Options
6.0 Comparison Criteria
7.0 Security Considerations
8.0 Intellectual Property Considerations
9.0 Acknowledgements
10.0 Author's Addresses
11.0 References

1.0 Introduction

This memo describes a framework for MPLS-based recovery. We provide a
detailed taxonomy of recovery terminology, and discuss the motivation
for, the objectives of, and the requirements for MPLS-based recovery.
We outline principles for MPLS-based recovery, and also provide
comparison criteria that may serve as a basis for comparing and
evaluating different recovery schemes.

1.1 Background

Network routing deployed today is focused primarily on connectivity,
and typically supports only one class of service, the best-effort
class. Multi-protocol label switching, on the other hand, by
integrating forwarding based on label swapping of a link-local label
with network layer routing, allows flexibility in the delivery of new
routing services. MPLS allows the use of media-specific forwarding
mechanisms such as label swapping. This enables more sophisticated
features, such as quality of service (QoS) and traffic engineering
[7], to be implemented more effectively. An important component of
providing QoS, however, is the ability to transport data reliably and
efficiently. Although current routing algorithms are very robust and
survivable, the amount of time they take to recover from a fault can
be significant, on the order of several seconds or minutes, causing
serious disruption of service for some applications in the interim.
This is unacceptable to many organizations that aim to provide a
highly reliable service, and thus require recovery times on the order
of tens of milliseconds, as specified, for example, in the GR-253
specification for SONET.

MPLS recovery may be motivated by the notion that there are inherent
limitations to improving the recovery times of current routing
algorithms.
Additional improvement, not obtainable by other means, can be gained
by augmenting these algorithms with MPLS recovery mechanisms. Since
MPLS is likely to be the technology of choice in the future IP-based
transport network, it is useful for MPLS to be able to provide
protection and restoration of traffic. MPLS may facilitate the
convergence of network functionality onto a common control and
management plane. Further, a protection priority could be used as a
differentiating mechanism for premium services that require high
reliability. The remainder of this document provides a framework for
MPLS-based recovery. It is focused at a conceptual level and is meant
to address motivation, objectives, and requirements. Issues of
mechanism, policy, routing plans, and characteristics of traffic
carried by recovery paths are beyond the scope of this document.

1.2 Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based recovery) is
useful for a number of reasons. The most important is its ability to
increase network reliability by enabling a faster response to faults
than is possible with traditional Layer 3 (or IP layer) approaches
alone, while still providing the visibility of the network afforded
by Layer 3. Furthermore, a protection mechanism using MPLS could
enable IP traffic to be put directly over WDM optical channels,
without an intervening SONET layer. This would facilitate the
construction of IP-over-WDM networks.

The need for MPLS-based recovery arises for the following reasons:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network
that needs to support high reliability/availability.

II. Layer 0 (for example, optical layer) or Layer 1 (for example,
SONET) mechanisms may not be deployed in topologies that meet
carriers' protection goals.

III.
The granularity at which the lower layers may be able to protect
traffic may be too coarse for traffic that is switched using
MPLS-based mechanisms.

IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher
layer operations. Thus, while they may provide, for example, link
protection, they cannot easily provide node protection or protection
of traffic transported at Layer 3.

V. MPLS has desirable attributes when applied to recovery in
connectionless networks. Specifically, an LSP is source routed, so a
forwarding path for recovery can be "pinned" and is not affected by
transient instability in SPF routing brought on by failure scenarios.

Furthermore, there is a need for open standards:

VI. Establishing interoperability of protection mechanisms between
routers/LSRs from different vendors in IP or MPLS networks is
urgently required to enable the adoption of MPLS as a viable core
transport and traffic engineering technology.

1.3 Objectives/Goals

We lay down the following objectives for MPLS-based recovery.

I. MPLS-based recovery mechanisms should facilitate fast (tens of
milliseconds) recovery times.

II. MPLS-based recovery should maximize network reliability and
availability. MPLS-based recovery of traffic should minimize the
number of single points of failure in the MPLS protected domain.

III. MPLS-based recovery should enhance the reliability of the
protected traffic while minimally or predictably degrading the
traffic carried by the diverted resources.

IV. MPLS-based recovery techniques should be applicable for
protection of traffic at various granularities. For example, it
should be possible to specify MPLS-based recovery for a portion of
the traffic on an individual path, for all traffic on an individual
path, or for all traffic on a group of paths.
Note that "path" is used here as a general term, and includes the
notion of a link, IP route, or LSP.

V. MPLS-based recovery techniques may be applicable for an entire
end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery actions should not adversely affect other
network operations.

VII. MPLS-based recovery actions in one MPLS protection domain
(defined in Section 2.3) should not adversely affect the recovery
actions in other MPLS protection domains.

VIII. MPLS-based recovery mechanisms should be able to take into
consideration the recovery actions of lower layers.

IX. MPLS-based recovery actions should avoid network-layering
violations. That is, defects in MPLS-based mechanisms should not
trigger lower layer protection switching.

X. MPLS-based recovery mechanisms should minimize the loss of data
and packet reordering during recovery operations. (The current MPLS
specification itself has no explicit requirement on reordering.)

XI. MPLS-based recovery mechanisms should minimize the state overhead
incurred for each recovery path maintained.

XII. MPLS-based recovery mechanisms should be able to preserve the
constraints on traffic after switchover, if desired. That is, if
desired, the recovery path should meet the resource requirements of,
and achieve the same performance characteristics as, the working
path.

2.0 Overview

There are several options for providing protection of traffic using
MPLS. The most generic requirement is the specification of whether
recovery should be via Layer 3 (or IP) rerouting or via MPLS
protection switching or rerouting actions.

Generally, network operators aim to provide the fastest and best
protection mechanism that can be offered at a reasonable cost. The
higher the level of protection, the more resources are consumed.
Therefore, it is expected that network operators will offer a
spectrum of service levels. MPLS-based recovery should give operators
the flexibility to select the recovery mechanism, to choose the
granularity at which traffic is protected, and to choose the specific
types of traffic that are protected, giving them more control over
that tradeoff. With MPLS-based recovery, it is possible to provide
different levels of protection for different classes of service,
based on their service requirements. For example, using approaches
outlined below, a VLL service that supports real-time applications
like VoIP may be supported using link/node protection together with
pre-established, pre-reserved path protection, while best-effort
traffic may use established-on-demand path protection or simply rely
on IP reroute or higher layer recovery mechanisms. As another example
of their range of application, MPLS-based recovery strategies may be
used to protect traffic not originally flowing on label switched
paths, such as IP traffic that is normally routed hop-by-hop, as well
as traffic forwarded on label switched paths.

2.1 Recovery Models

There are two basic models for path recovery: rerouting and
protection switching.

Protection switching and rerouting, as defined below, may be used
together. For example, protection switching to a recovery path may be
used for rapid restoration of connectivity, while rerouting
determines a new optimal network configuration, rearranging paths as
needed at a later time [8] [9].

2.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path
segments on demand for restoring traffic after the occurrence of a
fault. The new paths may be based upon fault information, network
routing policies, pre-defined configurations, and network topology
information.
Thus, upon detecting a fault, paths or path segments to bypass the
fault are established using signaling. Reroute mechanisms are
inherently slower than protection switching mechanisms, since more
must be done following the detection of a fault. However, reroute
mechanisms are simpler and more frugal, as no resources are committed
until after the fault occurs and the location of the fault is known.

Once the network routing algorithms have converged after a fault, it
may be preferable, in some cases, to reoptimize the network by
performing a reroute based on the current state of the network and
network policies. This is discussed further in Section 3.8.

In terms of the principles defined in Section 3, reroute recovery
employs paths established on demand with resources reserved on
demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery
path or path segment, based upon network routing policies, the
restoration requirements of the traffic on the working path, and
administrative considerations. The recovery path may or may not be
link- and node-disjoint with the working path [10]. However, if the
recovery path shares sources of failure with the working path, the
overall reliability of the construct is degraded. When a fault is
detected, the protected traffic is switched over to the recovery
path(s) and restored.

In terms of the principles in Section 3, protection switching employs
pre-established recovery paths and, if resource reservation is
required on the recovery path, pre-reserved resources.

2.1.2.1 Subtypes of Protection Switching

The resources (bandwidth, buffers, processing) on the recovery path
may be used to carry either a copy of the working path traffic or
extra traffic that is displaced when a protection switch occurs. This
leads to two subtypes of protection switching.
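Before distinguishing the subtypes, the essential property of
protection switching can be sketched in a few lines of code. This is
an illustrative model only; the class and method names below are
assumptions made for the sketch, not anything defined by this
framework:

```python
class ProtectedLSP:
    """Sketch of the state a Path Switch LSR might keep for one
    protected path (illustrative; this framework defines behavior,
    not an API)."""

    def __init__(self, working: str, recovery: str) -> None:
        self.working = working    # pre-established working path
        self.recovery = recovery  # pre-established recovery path
        self.active = working     # traffic normally flows on the working path

    def on_fault_detected(self, path: str) -> None:
        # Protection switch: no path computation or signaling happens
        # here, because the recovery path already exists.
        if path == self.working:
            self.active = self.recovery

lsp = ProtectedLSP(working="LSP-W", recovery="LSP-R")
lsp.on_fault_detected("LSP-W")
assert lsp.active == "LSP-R"
```

The point of the sketch is that `on_fault_detected` only selects an
existing path; under rerouting, by contrast, the recovery path would
have to be established at that step, which is why rerouting is
inherently slower.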
In 1+1 ("one plus one") protection, the resources (bandwidth,
buffers, processing capacity) on the recovery path are fully
reserved, and carry the same traffic as the working path. Selection
between the traffic on the working and recovery paths is made at the
path merge LSR (PML). In effect, the PSL function is reduced to
establishing the working and recovery paths and performing a simple
replication function; the recovery intelligence is delegated to the
PML.

In 1:1 ("one for one") protection, the resources (if any) allocated
on the recovery path are fully available to preemptible low-priority
traffic, except when the recovery path is in use due to a fault on
the working path. In other words, in 1:1 protection, the protected
traffic normally travels only on the working path, and is switched to
the recovery path only when the working path has a fault. Once the
protection switch is initiated, the low-priority traffic being
carried on the recovery path may be displaced by the protected
traffic. This method affords a way to make efficient use of the
recovery path resources.

This concept can be extended to 1:n ("one for n") and m:n ("m for n")
protection.

2.2 Recovery Cycles

There are three defined recovery cycles: the MPLS recovery cycle, the
MPLS reversion cycle, and the dynamic rerouting cycle. The first
cycle detects a fault and restores traffic onto MPLS-based recovery
paths. If the recovery path is non-optimal, this cycle may be
followed by either of the latter two to return the network to an
optimized state. The reversion cycle applies to explicitly routed
traffic that does not rely on any dynamic routing protocol to
converge. The dynamic rerouting cycle applies to traffic that is
forwarded based on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1.
Definitions and a key to abbreviations follow.

   --Network Impairment
   |    --Fault Detected
   |    |    --Start of Notification
   |    |    |    --Start of Recovery Operation
   |    |    |    |    --Recovery Operation Complete
   |    |    |    |    |    --Path Traffic Restored
   |    |    |    |    |    |
   v    v    v    v    v    v
   ----------------------------------------------------------------
   | T1 | T2 | T3 | T4 | T5 |

               Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

   T1  Fault Detection Time
   T2  Hold-off Time
   T3  Notification Time
   T4  Recovery Operation Time
   T5  Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the
moment the fault is detected by MPLS-based recovery mechanisms. This
time may be highly dependent on lower layer protocols.

Hold-off Time

The configured waiting time between the detection of a fault and the
taking of MPLS-based recovery action, to allow time for lower layer
protection to take effect. The Hold-off Time may be zero.

Note: The Hold-off Time may occur after the Notification Time
interval if the node responsible for the switchover, the Path Switch
LSR (PSL), rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by the
LSR detecting the fault and the time at which the Path Switch LSR
(PSL) begins the recovery operation. This is zero if the PSL detects
the fault itself or infers a fault from an event such as an adjacency
failure.

Note: If the PSL detects the fault itself, there still may be a
Hold-off Time period between detection and the start of the recovery
operation.

Recovery Operation Time

The time between the first and last recovery actions.
This may include message exchanges between the PSL and PML to
coordinate recovery actions.

Traffic Restoration Time

The time between the last recovery action and the time that the
traffic (if present) is completely recovered. This interval is
intended to account for the time required for traffic to once again
arrive at the point in the network that experienced disrupted or
degraded service due to the occurrence of the fault (e.g., the PML).
This time may depend on the location of the fault, the recovery
mechanism, and the propagation delay along the recovery path.

2.2.2 MPLS Reversion Cycle Model

Protection switching in revertive mode requires the traffic to be
switched back to a preferred path when the fault on that path is
cleared. The MPLS reversion cycle model is illustrated in Figure 2.
Note that the cycle shown below comes after the recovery cycle shown
in Figure 1.

   --Network Impairment Repaired
   |    --Fault Cleared
   |    |    --Path Available
   |    |    |    --Start of Reversion Operation
   |    |    |    |    --Reversion Operation Complete
   |    |    |    |    |    --Traffic Restored on Preferred Path
   |    |    |    |    |    |
   v    v    v    v    v    v
   -----------------------------------------------------------------
   | T7 | T8 | T9 | T10| T11|

               Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.

   T7   Fault Clearing Time
   T8   Wait-to-Restore Time
   T9   Notification Time
   T10  Reversion Operation Time
   T11  Traffic Restoration Time

Note that time T6 (not shown above) is the time during which the
network impairment remains unrepaired and traffic is flowing on the
recovery path.

Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time that
MPLS-based mechanisms learn that the fault has been cleared.
This time may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and the
MPLS-based recovery action(s). A waiting time may be needed to ensure
that the path is stable and to avoid flapping in cases where a fault
is intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time
interval if the PSL is configured to wait.

Notification Time

The time between initiation of a fault recovery signal (FRS) by the
LSR clearing the fault and the time at which the path switch LSR
begins the reversion operation. This is zero if the PSL clears the
fault itself.

Note: If the PSL clears the fault itself, there still may be a
Wait-to-Restore Time period between fault clearing and the start of
the reversion operation.

Reversion Operation Time

The time between the first and last reversion actions. This may
include message exchanges between the PSL and PML to coordinate
reversion actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic
(if present) is completely restored on the preferred path. This
interval is expected to be quite small, since both paths are working
and care may be taken to limit the traffic disruption (e.g., by using
"make before break" techniques and synchronous switch-over).

In practice, the only interesting times in the reversion cycle are
the Wait-to-Restore Time and the Traffic Restoration Time (or some
other measure of traffic disruption). Given that both paths are
available, there is no need for rapid operation, and a
well-controlled switch-back with minimal disruption is desirable.

2.2.3 Dynamic Rerouting Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state
after a network impairment has occurred.
A reoptimized network is achieved after the routing protocols have
converged and the traffic is moved from a recovery path to a
(possibly) new working path. The steps involved in this mode are
illustrated in Figure 3.

Note that the cycle shown below may be overlaid on the recovery cycle
shown in Figure 1, on the reversion cycle shown in Figure 2, or on
both. The latter occurs when both the recovery cycle and the
reversion cycle take place before the routing protocols converge,
and, after the routing protocols have converged, it is determined
(based on on-line algorithms or off-line traffic engineering tools,
network configuration, or a variety of other possible criteria) that
there is a better route for the working path.

   --Network Enters a Semi-stable State after an Impairment
   |    --Dynamic Routing Protocols Converge
   |    |    --Initiate Setup of New Working Path between PSL and PML
   |    |    |    --Switchover Operation Complete
   |    |    |    |    --Traffic Moved to New Working Path
   |    |    |    |    |
   v    v    v    v    v
   -----------------------------------------------------------------
   | T12 | T13 | T14 | T15 |

              Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.

   T12  Network Route Convergence Time
   T13  Hold-down Time (optional)
   T14  Switchover Operation Time
   T15  Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for
the network routing protocols to converge and for the network to
reach a stable state.

Hold-down Time

We define the hold-down period as a bounded time for which a recovery
path must be used. In some scenarios it may be difficult to determine
whether the working path is stable. In these cases, a hold-down time
may be used to prevent excessive flapping of traffic between a
working and a recovery path.
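The role of the hold-down period can be sketched with a simple timer.
This is an illustrative sketch only; the framework defines neither a
timer API nor any particular duration, so the names and the 30-second
value below are assumptions:

```python
class HolddownTimer:
    """Illustrative hold-down logic: traffic must stay on the recovery
    path for a bounded period, so an unstable working path cannot
    cause traffic to flap back and forth."""

    def __init__(self, holddown_s: float) -> None:
        self.holddown_s = holddown_s
        self.switched_at = None   # when traffic moved to the recovery path

    def start(self, now: float) -> None:
        # Called when the protection switch to the recovery path completes.
        self.switched_at = now

    def may_switch_back(self, now: float) -> bool:
        # Switching to a (possibly new) working path is allowed only
        # after the bounded hold-down period has elapsed.
        return (self.switched_at is not None
                and now - self.switched_at >= self.holddown_s)

timer = HolddownTimer(holddown_s=30.0)
timer.start(now=0.0)
assert not timer.may_switch_back(now=10.0)  # still inside hold-down
assert timer.may_switch_back(now=30.0)      # bounded period elapsed
```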
Switchover Operation Time

The time between the first and last switchover actions. This may
include message exchanges between the PSL and PML to coordinate the
switchover actions.

As an example, we present the sequence of events that occurs after a
network impairment, when a protection switch is followed by dynamic
rerouting:

I. A link or path fault occurs.
II. Signaling (FIS) is initiated for the detected fault.
III. The FIS arrives at the PSL.
IV. The PSL initiates a protection switch to a pre-configured
recovery path.
V. The PSL switches the traffic over from the working path to the
recovery path.
VI. The network enters a semi-stable state.
VII. The dynamic routing protocols converge after the fault, and a
new working path is calculated (based, for example, on some of the
criteria mentioned earlier in Section 2.1.1).
VIII. A new working path is established between the PSL and the PML
(the assumption is that the PSL and PML have not changed).
IX. Traffic is switched over to the new working path.

2.3 Definitions and Terminology

This document assumes the terminology given in [11] and, in addition,
introduces the following new terms.

2.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are
created dynamically after the detection of a fault on the working
path. In other words, a recovery mechanism in which the recovery path
is not pre-established.

Protection Switching

A recovery mechanism in which the recovery path or path segments are
created prior to the detection of a fault on the working path. In
other words, a recovery mechanism in which the recovery path is
pre-established.

Working Path

The protected path that carries traffic before the occurrence of a
fault. The working path exists between a PSL and a PML.
The working path can be of different kinds: a hop-by-hop routed path, a trunk, a link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a fault. In other words, the path on which the traffic is directed by the recovery mechanism. The recovery path is established by MPLS means. The recovery path can either be an equivalent recovery path, which ensures no reduction in quality of service, or a limited recovery path, which does not guarantee the same quality of service (or some other criteria of performance) as the working path. A limited recovery path is not expected to be used for an extended period of time.

Synonyms for a recovery path are: back-up path, alternative path, and protection path.

Protection Counterpart

The "other" path when discussing pre-planned protection switching schemes. The protection counterpart for the working path is the recovery path and vice-versa.

Path Group (PG)

A logical bundling of multiple working paths, each of which is routed identically between a Path Switch LSR and a Path Merge LSR.

Protected Path Group (PPG)

A path group that requires protection.

Protected Traffic Portion (PTP)

The portion of the traffic on an individual path that requires protection. For example, code points in the EXP bits of the shim header may identify a protected portion.

Path Switch LSR (PSL)

The LSR that is responsible for switching or replicating the traffic between the working path and the recovery path.

Path Merge LSR (PML)

An LSR that receives both working path traffic and its corresponding recovery path traffic, and either merges the traffic into a single outgoing path, or, if it is itself the destination, passes the traffic on to the higher layer protocols.
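As an illustration of the PSL and PML roles just defined, the following sketch (all class and method names are invented for this example) contrasts a replicating PSL, as used in 1+1 protection, with a switching PSL, and shows a PML merging the two arrival streams:

```python
class PathSwitchLSR:
    """Sketch of a PSL. In "1+1" mode it replicates each packet onto both
    paths; otherwise it switches, using the recovery path only after a fault."""

    def __init__(self, mode="1:1"):
        self.mode = mode
        self.fault = False

    def forward(self, packet):
        # Returns the list of (path, packet) transmissions for one packet.
        if self.mode == "1+1":
            return [("working", packet), ("recovery", packet)]
        path = "recovery" if self.fault else "working"
        return [(path, packet)]

class PathMergeLSR:
    """Sketch of a PML: merges working and recovery arrivals into a single
    outgoing stream, discarding 1+1 duplicates by sequence number."""

    def __init__(self):
        self.seen = set()

    def merge(self, seq, packet):
        if seq in self.seen:
            return None  # duplicate already delivered via the other path
        self.seen.add(seq)
        return packet
```

The sequence-number de-duplication shown in the PML is one possible design; the draft leaves the merge mechanism unspecified.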
Intermediate LSR

An LSR on a working or recovery path that is neither a PSL nor a PML for that path.

Bypass Tunnel

A path that serves to back up a set of working paths using the label stacking approach [1]. The working paths and the bypass tunnel must all share the same path switch LSR (PSL) and the same path merge LSR (PML).

Switch-Over

The process of switching traffic from the path on which it is flowing onto one or more alternate paths. This may involve moving traffic from a working path onto one or more recovery paths, or moving traffic from one or more recovery paths onto a more optimal working path.

Switch-Back

The process of returning traffic from one or more recovery paths back to the working path(s).

Revertive Mode

A recovery mode in which traffic is automatically switched back from the recovery path to the original working path upon the restoration of the working path to a fault-free condition. This assumes that a failed working path does not automatically surrender its resources to the network.

Non-revertive Mode

A recovery mode in which traffic is not automatically switched back to the original working path after this path is restored to a fault-free condition. (Depending on the configuration, the original working path may, upon returning to a fault-free condition, become the recovery path, or it may be used for new working traffic and no longer be associated with its original recovery path.)

MPLS Protection Domain

The set of LSRs over which a working path and its corresponding recovery path are routed.

MPLS Protection Plan

The set of all LSP protection paths and the mapping from working to protection paths deployed in an MPLS protection domain at a given time.

Liveness Message

A message exchanged periodically between two adjacent LSRs that serves as a link probing mechanism.
It provides an integrity check of the forward and the backward directions of the link between the two LSRs, as well as a check of neighbor aliveness.

Path Continuity Test

A test that verifies the integrity and continuity of a path or path segment. The details of such a test are beyond the scope of this draft. (It could be accomplished, for example, by transmitting a control message along the same links and nodes as the data traffic, or it could be based on detecting the absence of traffic and providing feedback.)

2.3.2 Failure Terminology

Path Failure (PF)

A path failure is a fault detected by MPLS-based recovery mechanisms, defined as the failure of the liveness message test or a path continuity test, indicating that path connectivity is lost.

Path Degraded (PD)

A path degraded is a fault detected by MPLS-based recovery mechanisms that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This may be communicated to the MPLS-based recovery mechanisms by the lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches an LSR that is set up to perform MPLS recovery.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been repaired. Like the FIS, it is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches the LSR that performs recovery of the original path.

2.4 Abbreviations

FIS: Fault Indication Signal.
FRS: Fault Recovery Signal.
LD: Link Degraded.
LF: Link Failure.
PD: Path Degraded.
PF: Path Failure.
PG: Path Group.
PML: Path Merge LSR.
PPG: Protected Path Group.
PSL: Path Switch LSR.
PTP: Protected Traffic Portion.

3.0 MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and complete restoration of traffic affected by a fault in an MPLS-enabled network. The fault may be detected at the IP layer or in lower layers over which IP traffic is transported. The fastest MPLS recovery is assumed to be achieved with protection switching, with an MPLS LSR switch completion time comparable, or equivalent, to the 50 ms switch-over completion time of the SONET layer. This section provides a discussion of the concepts and principles of MPLS-based recovery. The concepts are presented in terms of atomic or primitive terms that may be combined to specify recovery approaches. We do not make any assumptions about the underlying layer 1 or layer 2 transport mechanisms or their recovery mechanisms.

3.1 Configuration of Recovery

An LSR should allow for configuration of the following recovery options:

Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP rerouting, or by some lower layer mechanism such as SONET APS. This is equivalent to having no MPLS-based recovery. This option may be used for low priority traffic, or for traffic that is recovered in another way (for example, load-shared traffic on parallel working paths may be automatically recovered upon a fault along one of the working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled):
The working path is recovered using one or more recovery paths, either via rerouting or via protection switching.
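The configuration options above, together with the path setup and resource allocation options introduced in Sections 3.2 and 3.3, can be pictured as a small per-path configuration record (a sketch; the field names and value strings are invented for illustration and are not taken from any protocol):

```python
from dataclasses import dataclass

# Allowed values, mirroring the options of Sections 3.1-3.3.
RECOVERY_MODES = ("default", "recoverable")                                  # 3.1
PATH_SETUP = ("pre-established", "pre-qualified", "established-on-demand")   # 3.2
RESOURCE_ALLOC = ("pre-reserved", "reserved-on-demand")                      # 3.3

@dataclass
class RecoveryConfig:
    """Sketch of a per-working-path recovery configuration record."""
    mode: str = "default"
    setup: str = "established-on-demand"
    allocation: str = "reserved-on-demand"

    def validate(self):
        if self.mode not in RECOVERY_MODES:
            raise ValueError("unknown recovery mode")
        if self.mode == "recoverable":
            if self.setup not in PATH_SETUP:
                raise ValueError("unknown path setup option")
            if self.allocation not in RESOURCE_ALLOC:
                raise ValueError("unknown resource allocation option")
            # Pre-reserving resources only makes sense for a recovery path
            # that exists before the failure (Section 3.3).
            if (self.allocation == "pre-reserved"
                    and self.setup == "established-on-demand"):
                raise ValueError("pre-reserved requires a pre-established path")
        return self
```

The cross-check in `validate()` encodes the constraint stated in Section 3.3 that the pre-reserved option applies only to protection switching.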
3.2 Initiation of Path Setup

There are three options for the initiation of the recovery path setup.

Pre-established:

This is the same as the protection switching option. Here, one or more recovery paths are established prior to any failure on the working path. The path selection can either be determined by an administrative centralized tool (online or offline), or chosen based on some algorithm implemented at the PSL and possibly at intermediate nodes. To guard against the situation in which the pre-established recovery path fails before, or at the same time as, the working path, the recovery path should have secondary configuration options, as explained in Section 3.3 below.

Pre-qualified:

A recovery path need not be created expressly; it may instead be pre-qualified. A pre-qualified recovery path is not created expressly for protecting the working path, but is instead a path, created for other purposes, that is designated as a recovery path after it is determined to be an acceptable alternative for carrying the working path traffic. Variants include the case where an optical path or trail is configured, but no switches are set.

Established-on-Demand:

This is the same as the rerouting option. Here, a recovery path is established after a failure on its working path has been detected and notified to the PSL.

3.3 Initiation of Resource Allocation

A recovery path may support the same traffic contract as the working path, or it may not. We will distinguish these two situations by using different additive terms. If the recovery path is capable of replacing the working path without degrading service, it will be called an equivalent recovery path. If the recovery path lacks the resources (or resource reservations) to replace the working path without degrading service, it will be called a limited recovery path.
Based on this, there are two options for the initiation of resource allocation:

Pre-reserved:

This option applies only to protection switching. Here, a pre-established recovery path reserves the required resources on all hops along its route during its establishment. Although the reserved resources (e.g., bandwidth and/or buffers) at each node cannot be used to admit more working paths, they are available to be used by all traffic that is present at the node before a failure occurs.

Reserved-on-Demand:

This option may apply either to rerouting or to protection switching. Here, a recovery path reserves the required resources after a failure on the working path has been detected and notified to the PSL, and before the traffic on the working path is switched over to the recovery path.

Note that under both of the options above, depending on the amount of resources reserved on the recovery path, it could be either an equivalent recovery path or a limited recovery path.

3.4 Scope of Recovery

3.4.1 Topology

3.4.1.1 Local Repair

The intent of local repair is to protect against a link or neighbor node fault and to minimize the amount of time required for failure propagation. In local repair (also known as local recovery [12] [9]), the node immediately upstream of the fault is the one to initiate recovery (either rerouting or protection switching). Local repair can be of two types:

Link Recovery/Restoration

In this case, the recovery path may be configured to route around a certain link deemed to be unreliable. If protection switching is used, several recovery paths may be configured for one working path, depending on the specific faulty link that each protects against. Alternatively, if rerouting is used, upon the occurrence of a fault on the specified link each path is rebuilt such that it detours around the faulty link.
In this case, the recovery path need only be disjoint from its working path at a particular link on the working path, and may have overlapping segments with the working path. Traffic on the working path is switched over to an alternate path at the upstream LSR that connects to the failed link. This method is potentially the fastest to perform the switchover, and can be effective in situations where certain path components are much more unreliable than others.

Node Recovery/Restoration

In this case, the recovery path may be configured to route around a neighbor node deemed to be unreliable. Thus the recovery path is disjoint from the working path only at a particular node and at the links associated with the working path at that node. Once again, the traffic on the primary path is switched over to the recovery path at the upstream LSR that directly connects to the failed node, and the recovery path shares overlapping portions with the working path.

3.4.1.2 Global Repair

The intent of global repair is to protect against any link or node fault on a path or on a segment of a path, with the obvious exception of faults occurring at the ingress node of the protected path segment. In global repair, the PSL is usually distant from the failure and needs to be notified by a FIS. Global repair also covers end-to-end path recovery/restoration. In many cases, the recovery path can be made completely link and node disjoint with its working path. This has the advantage of protecting against all link and node faults on the working path (end-to-end path or path segment).

However, global repair is in some cases slower than local repair, since it takes longer for the fault notification message to reach the PSL and trigger the recovery action.

3.4.1.3 Alternate Egress Repair

It is possible to restore service without specifically recovering the faulted path.
For example, for best effort IP service it is possible to select a recovery path that has a different egress point from the working path (i.e., there is no PML). The recovery path egress must simply be a router that is acceptable for forwarding the FEC carried by the working path (without creating loops). In an engineering context, specific alternative FEC/LSP mappings with alternate egresses can be formed.

This may simplify enhancing the reliability of implicitly constructed MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate recovery paths simply on the basis that they are link and node disjoint from the immediate downstream LSR of the working path.

3.4.1.4 Multi-Layer Repair

Multi-layer repair broadens the network designer's tool set for those cases where multiple network layers can be managed together to achieve overall network goals. Specific criteria for determining when multi-layer repair is appropriate are beyond the scope of this draft.

3.4.1.5 Concatenated Protection Domains

A given service may cross multiple networks, and these may employ different recovery mechanisms. It is possible to concatenate protection domains so that service recovery can be provided end-to-end. It is considered that the recovery mechanisms in different domains may operate autonomously, and that multiple points of attachment may be used between domains (to ensure there is no single point of failure). Alternate egress repair requires management of concatenated domains, in that an explicit MPLS point of failure (the PML) is by definition excluded. Details of concatenated protection domains are beyond the scope of this draft.

3.4.2 Path Mapping

Path mapping refers to the methods of mapping traffic from a faulty working path onto the recovery path. There are several options for this, as described below.
Note that the options below should be viewed as atomic terms that only describe how the working and protection paths are mapped to each other. The issues of resource reservation along these paths, and of how switchover is actually performed, lead to the more commonly used composite terms, such as 1+1 and 1:1 protection, which were described in Section 2.1.

1-to-1 Protection

In 1-to-1 protection, the working path has a designated recovery path that is only to be used to recover that specific working path.

n-to-1 Protection

In n-to-1 protection, up to n working paths are protected using only one recovery path. If the intent is to protect against any single fault on any of the working paths, the n working paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft.

n-to-m Protection

In n-to-m protection, up to n working paths are protected using m recovery paths. Once again, if the intent is to protect against any single fault on any of the n working paths, the n working paths and the m recovery paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft. n-to-m protection is for further study.

Split Path Protection

In split path protection, multiple recovery paths are allowed to carry the traffic of a working path based on a certain configurable load splitting ratio. This is especially useful when no single recovery path can be found that can carry the entire traffic of the working path in case of a fault.
Split path protection may require handshaking between the PSL and the PML(s), and may require the PML(s) to correlate the traffic arriving on multiple recovery paths with the working path. Although this is an attractive option, the details of split path protection are beyond the scope of this draft, and are for further study.

3.4.3 Bypass Tunnels

It may be convenient, in some cases, to create a "bypass tunnel" for a PPG between a PSL and PML, thereby allowing multiple recovery paths to be transparent to intervening LSRs [8]. In this case, one LSP (the tunnel) is established between the PSL and PML, following an acceptable route, and a number of recovery paths are supported through the tunnel via label stacking. A bypass tunnel can be used with any of the path mapping options discussed in the previous section.

As with recovery paths, the bypass tunnel may or may not have resource reservations sufficient to provide recovery without service degradation. It is possible that the bypass tunnel has sufficient resources to recover some number of working paths, but not all at the same time. If the number of recovery paths carrying traffic in the tunnel at any given time is restricted, this is similar to the n-to-1 or n-to-m protection cases mentioned in Section 3.4.2.

3.4.4 Recovery Granularity

Another dimension of recovery considers the amount of traffic requiring protection. This may range from a fraction of a path to a bundle of paths.

3.4.4.1 Selective Traffic Recovery

This option allows for the protection of a fraction of the traffic within the same path. The portion of the traffic on an individual path that requires protection is called a protected traffic portion (PTP). A single path may carry different classes of traffic, with different protection requirements.
The protected portion of this traffic may be identified by its class, as, for example, via the EXP bits in the MPLS shim header or via the priority bit in the ATM header.

3.4.4.2 Bundling

Bundling is a technique used to group multiple working paths together in order to recover them simultaneously. The logical bundling of multiple working paths requiring protection, each of which is routed identically between a PSL and a PML, is called a protected path group (PPG). When a fault occurs on the working path carrying the PPG, the PPG as a whole can be protected either by being switched to a bypass tunnel or by being switched to a recovery path.

3.4.5 Recovery Path Resource Use

In the case of pre-reserved recovery paths, there is the question of what use these resources may be put to when the recovery path is not in use. There are two options:

Dedicated-resource:
If the recovery path resources are dedicated, they may not be used for anything except carrying the working traffic. For example, in the case of 1+1 protection, the working traffic is always carried on the recovery path. Even if the recovery path is not always carrying the working traffic, it may not be possible or desirable to allow other traffic to use these resources.

Extra-traffic-allowed:
If the recovery path only carries the working traffic when the working path fails, then it is possible to allow extra traffic to use the reserved resources at other times. Extra traffic is, by definition, traffic that can be displaced (without violating service agreements) whenever the recovery path resources are needed for carrying the working path traffic.

3.5 Fault Detection

MPLS recovery is initiated after the detection of either a lower layer fault or a fault at the IP layer or in the operation of MPLS-based mechanisms.
We consider four classes of impairments: Path Failure, Path Degraded, Link Failure, and Link Degraded.

Path Failure (PF) is a fault that indicates to an MPLS-based recovery scheme that the connectivity of the path is lost. This may be detected by a path continuity test between the PSL and PML. Some, and perhaps the most common, path failures may be detected using a link probing mechanism between neighbor LSRs. An example of a probing mechanism is a liveness message that is exchanged periodically along the working path between peer LSRs. For either a link probing mechanism or a path continuity test to be effective, the test message must be guaranteed to follow the same route as the working or recovery path, over the segment being tested. In addition, the path continuity test must take the path merge points into consideration. In the case of a bi-directional link implemented as two unidirectional links, path failure could mean that either one or both unidirectional links are damaged.

Path Degraded (PD) is a fault that indicates to MPLS-based recovery schemes/mechanisms that the path has connectivity, but that the quality of the connection is unacceptable. This may be detected by a path performance monitoring mechanism, or by some other mechanism for determining the error rate on the path or on some portion of the path. An example is a mechanism local to the LSR that detects excessive discarding of packets at an interface, due, for instance, to label mismatch or to TTL errors.

Link Failure (LF) is an indication from a lower layer that the link over which the path is carried has failed. If the lower layer supports detection and reporting of this fault (that is, any fault that indicates link failure, e.g., SONET LOS), this may be used by the MPLS recovery mechanism.
In some cases, using LF indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.

Link Degraded (LD) is an indication from a lower layer that the link over which the path is carried is performing below an acceptable level. If the lower layer supports detection and reporting of this fault, it may be used by the MPLS recovery mechanism. In some cases, using LD indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.

3.6 Fault Notification

MPLS-based recovery relies on rapid and reliable notification of faults. Once a fault is detected, the node that detected the fault must determine whether the fault is severe enough to require path recovery. If the node is not capable of initiating direct action (e.g., as a PSL), the node should send out a notification of the fault by transmitting a FIS to those of its upstream LSRs that were sending traffic on the working path affected by the fault. This notification is relayed hop-by-hop by each subsequent LSR to its upstream neighbor, until it eventually reaches a PSL. The PSL is the only LSR that can terminate the FIS and initiate a protection switch of the working path to a recovery path.

Since the FIS is a control message, it should be transmitted with high priority to ensure that it propagates rapidly towards the affected PSL(s). Depending on how fault notification is configured in the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2 or a Layer 3 packet [13]. The use of a Layer 2-based notification requires a direct Layer 2 path to the PSL. An example of a FIS could be the liveness message sent by a downstream LSR to its upstream neighbor with an optional fault notification field set, or it can be implicitly denoted by a teardown message.
Alternatively, it could be a separate fault notification packet. The intermediate LSR should identify which of its incoming links (upstream LSRs) to propagate the FIS on. In the case of 1+1 protection, the FIS should also be sent downstream to the PML, where the recovery action is taken.

3.7 Switch-Over Operation

3.7.1 Recovery Trigger

The activation of an MPLS protection switch following the detection or notification of a fault requires a trigger mechanism at the PSL. MPLS protection switching may be initiated due to automatic inputs or external commands. The automatic activation of an MPLS protection switch results from a response to defect or fault conditions detected at the PSL, or to fault notifications received at the PSL. It is possible that the fault detection and trigger mechanisms may be combined, as is the case when a PF, PD, LF, or LD is detected at a PSL and triggers a protection switch to the recovery path. In most cases, however, the detection and trigger mechanisms are distinct, involving the detection of a fault at some intermediate LSR, followed by the propagation of a fault notification back to the PSL via the FIS, which serves as the protection switch trigger at the PSL. MPLS protection switching in response to external commands results when the operator initiates a protection switch by a command to a PSL (or, alternatively, by a configuration command to an intermediate LSR, which transmits the FIS towards the PSL).

Note that the PF fault applies to hard failures (fiber cuts, transmitter failures, or LSR fabric failures), as does the LF fault, with the difference that the LF is a lower layer impairment that may be communicated to MPLS-based recovery mechanisms. The PD (or LD) fault, on the other hand, applies to soft defects (excessive errors due to noise on the link, for instance).
The PD (or LD) results in a fault declaration only when the percentage of lost packets exceeds a given threshold, which is provisioned and may be set based on the service level agreement(s) in effect between a service provider and a customer.

3.7.2 Recovery Action

After a fault is detected, or a FIS is received, by the PSL, the recovery action involves either a rerouting or a protection switching operation. In both scenarios, the next hop label forwarding entry for the recovery path is bound to the working path.

3.8 Switch-Back Operation

When traffic is flowing on the recovery path, a decision can be made whether to let the traffic remain on the recovery path, considering it a new working path, or to switch the traffic back to the old (or a new) working path. This switch-back operation has two styles: one in which the protection counterparts, i.e., the working and recovery paths, are fixed or "pinned" to their routes, and one in which the PSL, or another network entity with real-time knowledge of the failure, dynamically performs re-establishment or controlled rearrangement of the paths comprising the protected service.

3.8.1 Fixed Protection Counterparts

For fixed protection counterparts, the PSL will be pre-configured with the appropriate behavior to adopt when the original fixed path is restored to service. The choices are revertive and non-revertive mode. The choice will typically depend on the relative costs of the working and protection paths, and on the tolerance of the service to the effects of switching paths yet again. These protection modes indicate whether or not there is a preferred path for the protected traffic.

3.8.1.1 Revertive Mode

If the working path is always the preferred path, this path will be used whenever it is available.
Thus, in the event of a fault on this path, its unused resources will not be reclaimed by the network. If the working path has a fault, traffic is switched to the recovery path. In the revertive mode of operation, when the preferred path is restored the traffic is automatically switched back to it.

There are a number of implications to pinned working and recovery paths:
- Upon failure, once the traffic has moved to the recovery path, it is unprotected until the path defect in the original working path is repaired and that path is restored to service.
- Upon failure, once the traffic has moved to the recovery path, the resources associated with the original path remain reserved.

3.8.1.2 Non-revertive Mode

In the non-revertive mode of operation, there is no preferred path, or it may be desirable to minimize further disruption of the service brought on by a revertive switching operation. A switch-back to the original working path is not desired, or is not possible because the original path may no longer exist after the occurrence of a fault on that path.

If there is a fault on the working path, traffic is switched to the recovery path. When or if the faulty path (the original working path) is restored, it may become the recovery path (either by configuration or, if desired, by management actions).

In the non-revertive mode of operation, the working traffic may or may not be restored to a new optimal working path or to the original working path.
This is because it might be useful, in some cases, (a) to administratively perform a protection switch back to the original working path after gaining further assurances about the integrity of the path; or (b) it may be acceptable to continue operation on the recovery path; or (c) it may be desirable to move the traffic to a new optimal working path that is calculated based on network topology and network policies.

3.8.2 Dynamic Protection Counterparts

For dynamic protection counterparts, when the traffic is switched over to a recovery path, the association between the original working path and the recovery path may no longer exist, since the original path itself may no longer exist after the fault. Instead, when the network reaches a stable state following routing convergence, the recovery path may be switched over to a different preferred path, either through optimization based on the new network topology and associated information, or based on pre-configured information.

Dynamic protection counterparts assume that, upon failure, the PSL or another network entity will establish new working paths if a switch-back is to be performed.

3.8.3 Restoration and Notification

MPLS restoration deals with returning the working traffic from the recovery path to the original or to a new working path. Reversion is performed by the PSL either upon receiving notification, via the FRS, that the working path is repaired, or upon receiving notification that a new working path is established.

For fixed counterparts in revertive mode, the LSR that detected the fault on the working path also detects the restoration of the working path. If the working path had experienced an LF defect, the LSR detects a return to normal operation via the receipt of a liveness message from its peer.
If the working path had experienced an LD defect at an LSR interface,
the LSR could detect a return to normal operation via the resumption
of error-free packet reception on that interface. Alternatively, a
lower layer that no longer detects an LF defect may inform the
MPLS-based recovery mechanisms at the LSR that the link to its peer
LSR is operational.

The LSR then transmits FRS to its upstream LSR(s) that were
transmitting traffic on the working path. When the PSL receives the
FRS, it switches the working traffic back to the original working
path.

A similar scheme applies to dynamic counterparts, where, for example,
an update of topology and/or network convergence may trigger the
installation or setup of new working paths and the sending of a
notification to the PSL to perform a switchover.

We note that if there is a way to transmit fault information back
along a recovery path towards a PSL, and if the recovery path is an
equivalent working path, it is possible for the working path and its
recovery path to exchange roles once the original working path is
repaired following a fault. This is because, in that case, the
recovery path effectively becomes the working path, and the restored
working path functions as a recovery path for the original recovery
path. This is important, since it affords the benefits of
non-revertive switch operation outlined in Section 3.8.1 without
leaving the recovery path unprotected.

3.8.4 Reverting to Preferred Path (or Controlled Rearrangement)

In the revertive mode, "make before break" restoration switching can
be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This will
minimize both packet loss and packet reordering.
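As an illustrative sketch of the notification-driven reversion
described in Section 3.8.3, the following shows a PSL switching
traffic to the recovery path on fault notification and back to the
working path on receipt of FRS. The class, method, and attribute
names are assumptions made for illustration only; they are not part
of any MPLS specification.

```python
# Minimal sketch of revertive restoration at a PSL (illustrative only;
# the names used here are assumptions, not drawn from a specification).

class PSL:
    """Path Switch LSR holding a working path and its recovery path."""

    def __init__(self, working_path, recovery_path, revertive=True):
        self.working_path = working_path
        self.recovery_path = recovery_path
        self.revertive = revertive
        self.active_path = working_path   # traffic starts on the working path

    def on_fault_notification(self):
        """FIS received or fault detected: switch to the recovery path."""
        self.active_path = self.recovery_path

    def on_frs(self, repaired_path):
        """FRS received: the working path is repaired or a new one is set up."""
        if self.revertive and self.active_path is self.recovery_path:
            # Revertive mode: switch the working traffic back.
            self.working_path = repaired_path
            self.active_path = repaired_path
        # In non-revertive mode the traffic stays on the recovery path;
        # the repaired path may instead serve as the new recovery path.


psl = PSL("working-LSP", "recovery-LSP")
psl.on_fault_notification()
assert psl.active_path == "recovery-LSP"   # traffic moved to recovery path
psl.on_frs("working-LSP")
assert psl.active_path == "working-LSP"    # revertive switch-back on FRS
```

A non-revertive PSL (revertive=False) would simply leave the traffic
on the recovery path when FRS arrives, matching the behavior
described in Section 3.8.1.2.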
The controlled rearrangement of paths can also be used to satisfy
traffic engineering requirements for load balancing across an MPLS
domain.

3.9 Performance

Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:

I. Resource Class Attribute:

Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In other
words, the recovery path meets the same SLAs as the working path.

Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working path.

A. Lower Class: The recovery path has lower resource requirements or
less stringent performance requirements than the working path.

B. Best Effort Class: The recovery path is best effort.

II. Priority Attribute:

The recovery path has a priority attribute just like the working path
(i.e., the priority attribute of the associated traffic trunks). It
can have the same priority as the working path or a lower priority.

III. Preemption Attribute:

The recovery path can have the same preemption attribute as the
working path or a lower one.

4.0 MPLS Recovery Requirements

The following are the MPLS recovery requirements:

I. MPLS recovery SHALL provide an option to identify protection
groups (PPGs) and protection portions (PTPs).

II. Each PSL SHALL be capable of performing MPLS recovery upon the
detection of impairments or upon receipt of notifications of
impairments.

III. An MPLS recovery method SHALL NOT preclude manual protection
switching commands.
This implies that it would be possible, under administrative
commands, to transfer traffic from a working path to a recovery path,
or to transfer traffic from a recovery path to a working path once
the working path becomes operational following a fault.

IV. A PSL SHALL be capable of performing either a switch-back to the
original working path after the fault is corrected, or a switchover
to a new working path upon the discovery or establishment of a more
optimal working path.

V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects the merged segment, all the
paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered.

5.0 MPLS Recovery Options

There SHOULD be an option for:

I. Configuration of the recovery path as excess or reserved, with
excess as the default. A recovery path configured as excess SHALL
allow lower-priority preemptable traffic access to the protection
bandwidth, while a recovery path configured as reserved SHALL NOT
allow any other traffic access to the protection bandwidth.

II. Configuring the protection alternatives as either rerouting or
protection switching.

III. Enabling restoration as either non-revertive or revertive, with
non-revertive as the default if fixed protection counterparts are
used.

6.0 Comparison Criteria

Possible criteria to use for comparison of MPLS-based recovery
schemes are as follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault. Recovery Time is
the sum of the Fault Detection Time, Hold-off Time, Notification
Time, Recovery Operation Time, and the Traffic Restoration Time.
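The Recovery Time decomposition above can be illustrated with a
short sketch. The component values below are hypothetical, chosen
only to show the sum; they do not come from any measurement.

```python
# Hypothetical component times (in milliseconds) for one recovery event.
components = {
    "fault_detection": 10,      # time to detect the fault
    "hold_off": 0,              # configured wait before acting
    "notification": 15,         # time for FIS to reach the PSL
    "recovery_operation": 20,   # time to perform the protection switch
    "traffic_restoration": 5,   # time until traffic flows on the recovery path
}

# Recovery Time is defined as the sum of the five components above.
recovery_time = sum(components.values())
print(recovery_time)  # 50 ms with these example values
```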
In other words, it is the time between the failure of a node or link
in the network and the time at which a recovery path is installed and
traffic starts flowing on it.

Full Restoration Time

We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of, or have been engineered sufficiently to,
handle traffic in recovery scenarios. Note that this time may or may
not be different from the "Recovery Time", depending on whether
equivalent or limited recovery paths are used.

Setup Vulnerability

The amount of time that a working path or a set of working paths is
left unprotected during such tasks as recovery path computation and
recovery path setup may be used to compare schemes. The nature of
this vulnerability should be taken into account, e.g.: end-to-end
schemes correlate the vulnerability with working paths, local repair
schemes have a topological correlation that cuts across working
paths, and network plan approaches have a correlation that impacts
the entire network.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will depend on the traffic
characteristics of the network. However, it may also depend on the
particular protection plan selection algorithms, as well as the
signaling and re-routing methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to encompass a spectrum of "packet
survivability", which may range from "relative" to "absolute".
Relative survivability may mean that the packet is on an equal
footing with other traffic of, for example, the same diff-serv code
point (DSCP) in contending for the surviving network resources.
Absolute survivability may mean that the survivability of the
protected traffic has explicit guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets. The action of
putting traffic back on preferred paths might also cause packet
re-ordering.

State Overhead

As the number of recovery paths in a protection plan grows, the state
required to maintain them also grows. Schemes may require differing
numbers of paths to maintain certain levels of coverage. The state
required may also depend on the particular scheme used to recover. In
many cases the state overhead will be in proportion to the number of
recovery paths.

Loss

Recovery schemes may introduce a certain amount of packet loss during
switchover to a recovery path. Schemes that introduce loss during
recovery can measure this loss by evaluating recovery times in
proportion to the link speed.

In case of link or node failure, a certain amount of packet loss is
inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I. Fault Types: Recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node faults
into account.

II. Number of concurrent faults: Depending on the layout of recovery
paths in the protection plan, it may be possible to recover from
multiple concurrent faults.

III. Number of recovery paths: For a given fault, there may be one or
more recovery paths.

IV.
Percentage of coverage: Depending on a scheme and its implementation,
a certain percentage of faults may be covered. This may be subdivided
into the percentage of link faults and the percentage of node faults.

V. The number of protected paths may affect how fast the total set of
paths affected by a fault can be recovered. The ratio of protected
paths is n/N, where n is the number of protected paths and N is the
total number of paths.

7.0 Security Considerations

The MPLS recovery that is specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

8.0 Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

9.0 Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on the earlier version of this draft. In
particular, Bora Akyol, Dave Allan, and Neil Harrisson, whose
suggestions and comments were very helpful in revising the document.

10.0 Authors' Addresses

Vishal Sharma
Tellabs Research Center
One Kendall Square
Bldg. 100, Ste. 121
Cambridge, MA 02139-1562
Phone: 617-577-8760
Vishal.Sharma@tellabs.com

Ben Mack-Crane
Tellabs Operations, Inc.
4951 Indiana Avenue
Lisle, IL 60532
Phone: 630-512-7255
Ben.Mack-Crane@tellabs.com

Srinivas Makam
Tellabs Operations, Inc.
4951 Indiana Avenue
Lisle, IL 60532
Phone: 630-512-7217
Srinivas.Makam@tellabs.com

Ken Owens
Tellabs Operations, Inc.
1106 Fourth Street
St. Louis, MO 63126
Phone: 314-918-1579
Ken.Owens@tellabs.com

Changcheng Huang
Dept. of Systems & Computer Engg.
Carleton University
Minto Center, Rm. 3082
1125 Colonial By Drive
Ottawa, Ontario K1S 5B6, Canada
Phone: 613 520-2600 x2477
Changcheng.Huang@sce.carleton.ca

Fiffi Hellstrand
Nortel Networks
St Eriksgatan 115
PO Box 6701
113 85 Stockholm, Sweden
Phone: +46 8 5088 3687
Fiffi@nortelnetworks.com

Jon Weil
Nortel Networks
Harlow Laboratories, London Road
Harlow Essex CM17 9NA, UK
Phone: +44 (0)1279 403935
jonweil@nortelnetworks.com

Brad Cain
Mirror Image Internet
49 Dragon Ct.
Woburn, MA 01801, USA
bcain@mirror-image.com

Loa Andersson
Nortel Networks
St Eriksgatan 115, PO Box 6701
113 85 Stockholm, Sweden
Phone: +46 8 50 88 36 34
loa.andersson@nortelnetworks.com

Bilel Jamoussi
Nortel Networks
3 Federal Street, BL3-03
Billerica, MA 01821, USA
Phone: (978) 288-4506
jamoussi@nortelnetworks.com

Seyhan Civanlar
Coreon, Inc.
1200 South Avenue, Suite 103
Staten Island, NY 10314
Phone: (718) 889 4203
scivanlar@coreon.net

Angela Chiu
AT&T Labs, Rm. 4-204
100 Schulz Drive
Red Bank, NJ 07701
Phone: (732) 345-3441
alchiu@att.com

11.0 References

[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
Switching Architecture", Internet Draft draft-ietf-mpls-arch-07.txt,
Work in Progress, July 2000.

[2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and Thomas,
B., "LDP Specification", Internet Draft draft-ietf-mpls-ldp-11.txt,
Work in Progress, August 2000.

[3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement
for Extensions to RSVP for LSP-Tunnels", Internet Draft
draft-ietf-mpls-rsvp-tunnel-applicability-01.txt, Work in Progress,
April 2000.

[4] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
Internet Draft draft-ietf-mpls-cr-ldp-04.txt, Work in Progress, July
2000.

[5] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
ReSerVation Protocol (RSVP) -- Version 1 Functional Specification",
RFC 2205, September 1997.

[6] Awduche, D.
et al., "Extensions to RSVP for LSP Tunnels", Internet Draft
draft-ietf-mpls-rsvp-lsp-tunnel-07.txt, Work in Progress, August
2000.

[7] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
September 1999.

[8] Andersson, L., Cain, B., and Jamoussi, B., "Requirement Framework
for Fast Re-route with MPLS", Internet Draft
draft-andersson-reroute-frmwrk-00.txt, Work in Progress, October
1999.

[9] Goguen, R. and Swallow, G., "RSVP Label Allocation for Backup
Tunnels", Internet Draft draft-swallow-rsvp-bypass-label-00.txt, Work
in Progress, October 1999.

[10] Makam, S., Sharma, V., Owens, K., and Huang, C.,
"Protection/Restoration of MPLS Networks", Internet Draft
draft-makam-mpls-protection-00.txt, Work in Progress, October 1999.

[11] Callon, R., Doolan, P., Feldman, N., Fredette, A., Swallow, G.,
and Viswanathan, A., "A Framework for Multiprotocol Label Switching",
Internet Draft draft-ietf-mpls-framework-05.txt, Work in Progress,
September 1999.

[12] Haskin, D. and Krishnan, R., "A Method for Setting an
Alternative Label Switched Path to Handle Fast Reroute", Internet
Draft draft-haskin-mpls-fast-reroute-05.txt, Work in Progress,
November 2000.

[13] Owens, K., Makam, V., Sharma, V., Mack-Crane, B., and Huang, C.,
"A Path Protection/Restoration Mechanism for MPLS Networks", Internet
Draft draft-chang-mpls-path-protection-02.txt, Work in Progress,
November 2000.