IETF Draft                                             Vishal Sharma
Multi-Protocol Label Switching                        Ben Mack-Crane
Expires: March 2001                                   Srinivas Makam
                                                           Ken Owens
                                           Tellabs Operations, Inc.

                                                    Changcheng Huang
                                                 Carleton University

                                                    Fiffi Hellstrand
                                                            Jon Weil
                                                       Loa Andersson
                                                      Bilel Jamoussi
                                                     Nortel Networks

                                                           Brad Cain
                                               Mirror Image Internet

                                                     Seyhan Civanlar
                                                     Coreon Networks

                                                         Angela Chiu
                                                           AT&T Labs

                                                      September 2000

                  Framework for MPLS-based Recovery

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts. Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract

Multi-protocol label switching (MPLS) integrates the label swapping
forwarding paradigm with network layer routing. To deliver reliable
service, MPLS requires a set of procedures to provide protection of
the traffic carried on different paths. This requires that the label
switched routers (LSRs) support fault detection, fault notification,
and fault recovery mechanisms, and that MPLS signaling support the
configuration of recovery. With these objectives in mind, this
document specifies a framework for MPLS-based recovery.

Table of Contents

1.0 Introduction
    1.1 Background
    1.2 Motivation for MPLS-Based Recovery
    1.3 Objectives
2.0 Overview
    2.1 Recovery Models
    2.2 Recovery Cycles
        2.2.1 MPLS Recovery Cycle Model
        2.2.2 MPLS Reversion Cycle Model
        2.2.3 Dynamic Reroute Cycle Model
    2.3 Definitions and Terminology
    2.4 Abbreviations
3.0 MPLS Recovery Principles
    3.1 Configuration of Recovery
    3.2 Initiation of Path Setup
    3.3 Initiation of Resource Allocation
    3.4 Scope of Recovery
        3.4.1 Topology
            3.4.1.1 Local Repair
            3.4.1.2 Global Repair
            3.4.1.3 Alternate Egress Repair
            3.4.1.4 Multi-Layer Repair
            3.4.1.5 Concatenated Protection Domains
        3.4.2 Path Mapping
        3.4.3 Bypass Tunnels
        3.4.4 Recovery Granularity
            3.4.4.1 Selective Traffic Recovery
            3.4.4.2 Bundling
        3.4.5 Recovery Path Resource Use
    3.5 Fault Detection
    3.6 Fault Notification
    3.7 Switch Over Operation
        3.7.1 Recovery Trigger
        3.7.2 Recovery Action
    3.8 Switch Back Operation
        3.8.1 Revertive and Non-revertive Mode
        3.8.2 Restoration and Notification
        3.8.3 Reverting to Preferred Path
    3.9 Performance
4.0 Recovery Requirements
5.0 MPLS Recovery Options
6.0 Comparison Criteria
7.0 Security Considerations
8.0 Intellectual Property Considerations
9.0 Acknowledgements
10.0 Authors' Addresses
11.0 References

1.0 Introduction

This memo describes a framework for MPLS-based recovery. We provide
a detailed taxonomy of recovery terminology, and discuss the
motivation for, the objectives of, and the requirements for MPLS-
based recovery. We outline principles for MPLS-based recovery, and
provide comparison criteria that may serve as a basis for comparing
and evaluating different recovery schemes.

1.1 Background

Network routing deployed today is focused primarily on connectivity,
and typically supports only one class of service, the best-effort
class. Multi-protocol label switching, on the other hand, by
integrating forwarding based on the swapping of a link-local label
with network layer routing, allows flexibility in the delivery of
new routing services. Because MPLS permits media-specific forwarding
mechanisms, such as label swapping, more sophisticated features such
as quality-of-service (QoS) and traffic engineering [7] can be
implemented more effectively. An important component of providing
QoS, however, is the ability to transport data reliably and
efficiently. Although current routing algorithms are very robust and
survivable, the amount of time they take to recover from a fault can
be significant, on the order of several seconds or minutes, causing
serious disruption of service for some applications in the interim.
This is unacceptable to many organizations that aim to provide a
highly reliable service, and thus require recovery times on the
order of tens of milliseconds, as specified, for example, in the
GR-253 specification for SONET.

MPLS recovery may be motivated by the notion that there are inherent
limitations to improving the recovery times of current routing
algorithms.
Additional improvement, not obtainable by other means, can be gained
by augmenting these algorithms with MPLS recovery mechanisms. Since
MPLS is likely to be the technology of choice in the future IP-based
transport network, it is useful for MPLS to be able to provide
protection and restoration of traffic. MPLS may facilitate the
convergence of network functionality on a common control and
management plane. Further, a protection priority could be used as a
differentiating mechanism for premium services that require high
reliability.

The remainder of this document provides a framework for MPLS-based
recovery. It is focused at a conceptual level and is meant to
address motivation, objectives, and requirements. Issues of
mechanism, policy, routing plans, and characteristics of traffic
carried by protection paths are beyond the scope of this document.

1.2 Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based recovery) is
useful for a number of reasons. The most important is its ability to
increase network reliability by enabling a faster response to faults
than is possible with traditional Layer 3 (or IP layer) mechanisms
alone, while still providing the visibility of the network afforded
by Layer 3. Furthermore, a protection mechanism using MPLS could
enable IP traffic to be put directly over WDM optical channels,
without an intervening SONET layer. This would facilitate the
construction of IP-over-WDM networks.

The need for MPLS-based recovery arises for the following reasons:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network
that needs to support high reliability/availability.

II. Layer 0 (for example, optical layer) or Layer 1 (for example,
SONET) mechanisms may not be deployed in topologies that meet
carriers' protection goals.

III. The granularity at which the lower layers can protect traffic
may be too coarse for traffic that is switched using MPLS-based
mechanisms.

IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher
layer operations. Thus, while they may provide, for example, link
protection, they cannot easily provide node protection or protection
of traffic transported using MPLS.

Furthermore, there is a need for open standards:

V. Establishing interoperability of protection mechanisms between
routers/LSRs from different vendors in IP or MPLS networks is
urgently required to enable the adoption of MPLS as a viable core
transport and traffic engineering technology.

1.3 Objectives/Goals

We lay down the following objectives for MPLS-based recovery.

I. MPLS-based recovery mechanisms should facilitate fast (tens of
milliseconds) recovery times.

II. MPLS-based recovery should maximize network reliability and
availability. MPLS-based protection of traffic should minimize the
number of single points of failure in the MPLS protected domain.

III. MPLS-based recovery should enhance the reliability of the
protected traffic while minimally or predictably degrading the
traffic carried by the diverted resources.

IV. MPLS-based recovery techniques should be applicable for
protection of traffic at various granularities. For example, it
should be possible to specify MPLS-based recovery for a portion of
the traffic on an individual path, for all traffic on an individual
path, or for all traffic on a group of paths.

V. MPLS-based recovery techniques may be applicable for an entire
end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery actions should not adversely affect other
network operations.

VII. MPLS-based recovery actions in one MPLS protection domain
(defined in Section 2.2) should not adversely affect the recovery
actions in other MPLS protection domains.

VIII. MPLS-based recovery mechanisms should be able to take into
consideration the recovery actions of lower layers.

IX. MPLS-based recovery actions should avoid network-layering
violations. That is, defects in MPLS-based mechanisms should not
trigger lower layer protection switching.

X. MPLS-based recovery mechanisms should minimize the loss of data
and packet reordering during recovery operations. (The current MPLS
specification itself has no explicit requirement on reordering.)

XI. MPLS-based recovery mechanisms should minimize the state
overhead incurred for each recovery path maintained.

XII. MPLS-based recovery mechanisms should be able to preserve the
constraints on traffic after switchover, if desired. That is, if
desired, the recovery path should meet the resource requirements of,
and achieve the same performance characteristics as, the working
path.

2.0 Overview

There are several options for providing protection of traffic using
MPLS. The most generic requirement is the specification of whether
recovery should be via Layer 3 (or IP) rerouting or via MPLS
protection switching or rerouting actions.

Generally, network operators aim to provide the fastest and best
protection mechanism that can be offered at a reasonable cost. The
higher the level of protection, the more resources it consumes, so
network operators are expected to offer a spectrum of service
levels. MPLS-based recovery should give the flexibility to select
the recovery mechanism, to choose the granularity at which traffic
is protected, and to choose the specific types of traffic that are
protected, in order to give operators more control over that
tradeoff.
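Such a spectrum of service levels could, for instance, be captured in an operator policy table. The sketch below is purely illustrative; the class names and option tuples are invented for this example and are not part of the framework.

```python
# Hypothetical operator policy: map service classes to a recovery
# model, a path setup option, and a resource allocation option.
# All names are invented for illustration.

RECOVERY_POLICY = {
    "vll-realtime":  ("protection-switching", "pre-established",       "pre-reserved"),
    "business-data": ("protection-switching", "established-on-demand", "reserved-on-demand"),
    "best-effort":   ("ip-reroute",           None,                    None),
}

def recovery_options(service_class):
    """Return (model, path setup, resources) for a service class,
    falling back to plain IP re-route for unknown classes."""
    return RECOVERY_POLICY.get(service_class, ("ip-reroute", None, None))
```

A table of this kind would let an operator tune the protection/resource tradeoff per class rather than network-wide.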
With MPLS-based recovery, it is possible to provide different levels
of protection for different classes of service, based on their
service requirements. For example, using approaches outlined below,
a VLL service that supports real-time applications like VoIP may be
supported using link/node protection together with pre-established,
pre-reserved path protection, while best-effort traffic may use
established-on-demand path protection or simply rely on IP reroute
or higher layer recovery mechanisms. As another example of their
range of application, MPLS-based recovery strategies may be used to
protect traffic not originally flowing on label switched paths, such
as IP traffic that is normally routed hop-by-hop, as well as traffic
forwarded on label switched paths.

2.1 Recovery Models

There are two basic models for path recovery: rerouting and
protection switching.

Protection switching and rerouting, as defined below, may be used
together. For example, protection switching to a recovery path may
be used for rapid restoration of connectivity, while rerouting
determines a new optimal network configuration, rearranging paths as
needed at a later time [8] [9].

2.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path
segments on demand to restore traffic after the occurrence of a
fault. The new paths may be based upon fault information, network
routing policies, pre-defined configurations, and network topology
information. Thus, upon detecting a fault, paths or path segments to
bypass the fault are established using signaling. Reroute mechanisms
are inherently slower than protection switching mechanisms, since
more must be done following the detection of a fault.
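The on-demand computation step just described can be sketched as a path search that runs only after the fault is known. The minimal BFS below uses an invented four-node topology; the node names and the shortest-hop metric are assumptions of this example, not part of the framework.

```python
from collections import deque

# Invented topology for illustration: A-B-D and A-C-D.
TOPOLOGY = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def reroute(src, dst, failed_link):
    """Compute a recovery path on demand, avoiding the failed link
    (in either direction), by shortest-hop BFS."""
    bad = {failed_link, tuple(reversed(failed_link))}
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in TOPOLOGY[path[-1]]:
            if (path[-1], nxt) not in bad and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no recovery path exists

# Working path A-B-D fails on link (A, B); the reroute avoids it.
```

Because no recovery state exists until `reroute` runs, no resources are held in the fault-free case, which is exactly the frugality/latency tradeoff described in the text.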
However, reroute mechanisms are simpler and more frugal, since no
resources are committed until after the fault occurs and the
location of the fault is known.

Pre-planned techniques need to take into account all possible
failures in the protected domain, such that "blind switching" upon
detection of a failure has a high probability of providing useful
recovery.

Once the network routing algorithms have converged after a fault, it
may be preferable, in some cases, to reoptimize the network by
performing a reroute based on the current state of the network and
network policies. This is discussed further in Section 3.8, and will
be clarified further in upcoming revisions of this document.

In terms of the principles defined in Section 3, reroute recovery
employs paths established on demand with resources reserved on
demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery
path or path segment, based upon network routing policies, the
restoration requirements of the traffic on the working path, and
administrative considerations. The recovery path may or may not be
link and node disjoint with the working path [10]. However, if the
recovery path shares sources of failure with the working path, the
overall reliability of the construct is degraded. When a fault is
detected, the protected traffic is switched over to the recovery
path(s) and restored.

In terms of the principles in Section 3, protection switching
employs pre-established recovery paths and, if resource reservation
is required on the recovery path, pre-reserved resources.

2.1.2.1 Subtypes of Protection Switching

The resources (bandwidth, buffers, processing) on the recovery path
may be used to carry either a copy of the working path traffic or
extra traffic that is displaced when a protection switch occurs.
This leads to two subtypes of protection switching.

In 1+1 ("one plus one") protection, the resources (bandwidth,
buffers, processing capacity) on the recovery path are fully
reserved, and carry the same traffic as the working path. Selection
between the traffic on the working and recovery paths is made at the
path merge LSR (PML). In effect, the PSL function is reduced to
establishing the working and protection paths and performing a
simple replication function; the recovery intelligence is delegated
to the PML.

In 1:1 ("one for one") protection, the resources (if any) allocated
on the recovery path are fully available to preemptible low priority
traffic, except when the recovery path is in use due to a fault on
the working path. In other words, in 1:1 protection, the protected
traffic normally travels only on the working path, and is switched
to the recovery path only when the working path has a fault. Once
the protection switch is initiated, the low priority traffic being
carried on the recovery path may be displaced by the protected
traffic. This method affords a way to make efficient use of the
recovery path resources.

This concept can be extended to 1:n (one for n) and m:n (m for n)
protection.

Additional specifications of the recovery actions are given in
later sections.

2.2 The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle,
the MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The
first cycle detects a fault and restores traffic onto MPLS-based
recovery paths. If the recovery path is non-optimal, this cycle may
be followed by either of the latter two to return the network to an
optimized state. The reversion cycle applies to explicitly routed
traffic that does not rely on any dynamic routing protocols to have
converged.
The dynamic re-routing cycle applies to traffic that is forwarded
based on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1.
Definitions and a key to abbreviations follow.

 --Network Impairment
 |    --Fault Detected
 |    |    --Start of Notification
 |    |    |    --Start of Recovery Operation
 |    |    |    |    --Recovery Operation Complete
 |    |    |    |    |    --Path Traffic Restored
 |    |    |    |    |    |
 v    v    v    v    v    v
 ----------------------------------------------------------------
 | T1 | T2   | T3   | T4   | T5   |

            Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

T1 Fault Detection Time
T2 Hold-off Time
T3 Notification Time
T4 Recovery Operation Time
T5 Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the
moment the fault is detected by MPLS-based recovery mechanisms. This
time may be highly dependent on lower layer protocols.

Hold-Off Time

The configured waiting time between the detection of a fault and the
taking of MPLS-based recovery action, to allow time for lower layer
protection to take effect. The Hold-Off Time may be zero.

Note: The Hold-Off Time may occur after the Notification Time
interval if the node responsible for the switchover, the Path Switch
LSR (PSL), rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by
the LSR detecting the fault and the time at which the Path Switch
LSR (PSL) begins the recovery operation. This is zero if the PSL
detects the fault itself or infers a fault from an event such as an
adjacency failure.

Note: If the PSL detects the fault itself, there still may be a
Hold-Off Time period between detection and the start of the recovery
operation.

Recovery Operation Time

The time between the first and last recovery actions. This may
include message exchanges between the PSL and PML to coordinate
recovery actions.

Traffic Restoration Time

The time between the last recovery action and the time that the
traffic (if present) is completely recovered. This interval is
intended to account for the time required for traffic to once again
arrive at the point in the network that experienced disrupted or
degraded service due to the fault (e.g., the PML). This time may
depend on the location of the fault, the recovery mechanism, and the
propagation delay along the recovery path.

2.2.2 MPLS Reversion Cycle Model

In revertive mode, protection switching requires the traffic to be
switched back to a preferred path when the fault on that path is
cleared. The MPLS reversion cycle model is illustrated in Figure 2.
Note that the cycle shown below follows the recovery cycle shown in
Figure 1.

 --Network Impairment Repaired
 |    --Fault Cleared
 |    |    --Path Available
 |    |    |    --Start of Reversion Operation
 |    |    |    |    --Reversion Operation Complete
 |    |    |    |    |    --Traffic Restored on Preferred Path
 |    |    |    |    |    |
 v    v    v    v    v    v
 ---------------------------------------------------------------
 | T7 | T8   | T9   | T10  | T11  |

            Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.

T7 Fault Clearing Time
T8 Wait-to-Restore Time
T9 Notification Time
T10 Reversion Operation Time
T11 Traffic Restoration Time

Note that time T6 (not shown above) is the time during which the
network impairment is not yet repaired and traffic is flowing on the
recovery path.
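In both Figure 1 and Figure 2, the disruption seen by traffic is bounded by the sum of the cycle intervals. A small illustrative check against the tens-of-milliseconds objective of Section 1.3 can be sketched as follows; all interval values are invented for the example, not measurements.

```python
# Illustrative only: sum the recovery cycle intervals of Figure 1.
# The interval values and the 50 ms budget are invented figures.

def total_restoration_time(t1, t2, t3, t4, t5):
    """Sum the cycle intervals (milliseconds): fault detection,
    hold-off, notification, recovery operation, and traffic
    restoration."""
    return t1 + t2 + t3 + t4 + t5

budget_ms = 50  # a "tens of milliseconds" target, per Section 1.3
cycle_ms = total_restoration_time(t1=5, t2=0, t3=12, t4=20, t5=8)
assert cycle_ms <= budget_ms
```

Note how a non-zero Hold-off Time (T2), while useful for letting lower layers act first, comes straight out of this budget.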
Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time
that MPLS-based mechanisms learn that the fault has been cleared.
This time may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and the
MPLS-based recovery action(s). Waiting time may be needed to ensure
that the path is stable and to avoid flapping in cases where a fault
is intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time
interval if the PSL is configured to wait.

Notification Time

The time between initiation of an FRS by the LSR clearing the fault
and the time at which the path switch LSR begins the reversion
operation. This is zero if the PSL clears the fault itself.

Note: If the PSL clears the fault itself, there still may be a
Wait-to-Restore Time period between fault clearing and the start of
the reversion operation.

Reversion Operation Time

The time between the first and last reversion actions. This may
include message exchanges between the PSL and PML to coordinate
reversion actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic
(if present) is completely restored on the preferred path. This
interval is expected to be quite small, since both paths are working
and care may be taken to limit the traffic disruption (e.g., using
"make before break" techniques and synchronous switch-over).

In practice, the only interesting times in the reversion cycle are
the Wait-to-Restore Time and the Traffic Restoration Time (or some
other measure of traffic disruption).
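The Wait-to-Restore behaviour defined above might be modeled as a simple timer that restarts whenever an intermittent fault reappears. The class and its interface below are invented for illustration; timestamps are plain numbers (seconds).

```python
# A sketch (not a specified mechanism) of Wait-to-Restore damping at
# the PSL: revert to the preferred path only once it has stayed
# fault-free for a configured interval.

class WaitToRestore:
    def __init__(self, wtr_seconds):
        self.wtr = wtr_seconds
        self.cleared_at = None        # when the fault was last cleared

    def fault_cleared(self, now):
        self.cleared_at = now         # start (or restart) the wait

    def fault_raised(self, now):
        self.cleared_at = None        # intermittent fault: cancel the wait

    def may_revert(self, now):
        """True once the preferred path has been fault-free for the
        full Wait-to-Restore interval."""
        return (self.cleared_at is not None
                and now - self.cleared_at >= self.wtr)
```

A flapping path thus keeps resetting the timer and traffic stays on the recovery path until the working path proves stable.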
Given that both paths are available, there is no need for rapid
operation, and a well-controlled switch-back with minimal disruption
is desirable.

2.2.3 Dynamic Re-routing Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state
after a network impairment has occurred. A re-optimized network is
achieved after the routing protocols have converged and the traffic
is moved from a recovery path to a (possibly new) working path. The
steps involved in this mode are illustrated in Figure 3.

Note that the cycle shown below may follow the recovery cycle shown
in Figure 1, the reversion cycle shown in Figure 2, or both (in the
event that both the recovery cycle and the reversion cycle take
place before the routing protocols converge, and after convergence
it is determined, based on on-line algorithms, off-line traffic
engineering tools, network configuration, or a variety of other
possible criteria, that there is a better route for the working
path).

 --Network Enters a Semi-stable State after an Impairment
 |    --Dynamic Routing Protocols Converge
 |    |    --Initiate Setup of New Working Path between PSL and PML
 |    |    |    --Switchover Operation Complete
 |    |    |    |    --Traffic Moved to New Working Path
 |    |    |    |    |
 v    v    v    v    v
 ---------------------------------------------------------------
 | T12  | T13  | T14  | T15  |

            Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.

T12 Network Route Convergence Time
T13 Hold-down Time (optional)
T14 Switchover Operation Time
T15 Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for
the network routing protocols to converge and for the network to
reach a stable state.

Holddown Time

We define the holddown period as a bounded time for which a recovery
path must be used. In some scenarios it may be difficult to
determine whether the working path is stable. In these cases a
holddown time may be used to prevent excessive flapping of traffic
between a working and a recovery path.

Switchover Operation Time

The time between the first and last switchover actions. This may
include message exchanges between the PSL and PML to coordinate the
switchover actions.

As an example of the recovery cycle, we present the sequence of
events that occurs after a network impairment when a protection
switch is followed by dynamic rerouting:

I. A link or path fault occurs.
II. Signaling (FIS) is initiated for the detected fault.
III. The FIS arrives at the PSL.
IV. The PSL initiates a protection switch to a pre-configured
recovery path.
V. The PSL switches the traffic over from the working path to the
recovery path.
VI. The network enters a semi-stable state.
VII. Dynamic routing protocols converge after the fault, and a new
working path is calculated (based, for example, on some of the
criteria mentioned in Section 2.1.1).
VIII. A new working path is established between the PSL and the PML
(assuming the PSL and PML have not changed).
IX. Traffic is switched over to the new working path.

2.3 Definitions and Terminology

This document assumes the terminology given in [11] and, in
addition, introduces the following new terms.

2.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are
created dynamically after the detection of a fault on the working
path. In other words, a recovery mechanism in which the recovery
path is not pre-established.
Protection Switching

A recovery mechanism in which the recovery path or path segments are
created prior to the detection of a fault on the working path. In
other words, a recovery mechanism in which the recovery path is
pre-established.

Working Path

The protected path that carries traffic before the occurrence of a
fault. The working path exists between a PSL and PML. The working
path can be of different kinds: a hop-by-hop routed path, a trunk, a
link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a
fault. In other words, the path onto which the traffic is directed
by the recovery mechanism. The recovery path is established by MPLS
means. The recovery path can either be an equivalent recovery path,
ensuring no reduction in quality of service, or a limited recovery
path, which does not guarantee the same quality of service (or some
other criteria of performance) as the working path. A limited
recovery path is not expected to be used for an extended period of
time.

Synonyms for a recovery path are: back-up path, alternative path,
and protection path.

Protection Counterpart

The "other" path when discussing pre-planned protection switching
schemes. The protection counterpart for the working path is the
recovery path and vice-versa.

Path Group (PG)

A logical bundling of multiple working paths, each of which is
routed identically between a Path Switch LSR and a Path Merge LSR.

Protected Path Group (PPG)

A path group that requires protection.

Protected Traffic Portion (PTP)

The portion of the traffic on an individual path that requires
protection. For example, code points in the EXP bits of the shim
header may identify a protected portion.
Path Switch LSR (PSL)

An LSR that is the transmitter of both the working path traffic and
its corresponding recovery path traffic. The PSL is responsible for
switching or replicating the traffic between the working path and
the recovery path.

Path Merge LSR (PML)

An LSR that receives both working path traffic and its corresponding
recovery path traffic, and either merges their traffic into a single
outgoing path, or, if it is itself the destination, passes the
traffic on to the higher layer protocols.

Intermediate LSR

An LSR on a working or recovery path that is neither a PSL nor a PML
for that path.

Bypass Tunnel

A path that serves to back up a set of working paths using the label
stacking approach. The working paths and the bypass tunnel must all
share the same path switch LSR (PSL) and the same path merge LSR
(PML).

Switch-Over

The process of switching the traffic from the path on which it is
flowing onto one or more alternate path(s). This may involve moving
traffic from a working path onto one or more recovery paths, or
moving traffic from a recovery path(s) onto a more optimal working
path(s).

Switch-Back

The process of returning the traffic from one or more recovery paths
back to the working path(s).

Revertive Mode

A recovery mode in which traffic is automatically switched back from
the recovery path to the original working path upon the restoration
of the working path to a fault-free condition.

Non-revertive Mode

A recovery mode in which traffic is not automatically switched back
to the original working path after this path is restored to a
fault-free condition. (Depending on the configuration, the original
working path may, upon moving to a fault-free condition, become the
recovery path, or it may be used for new working traffic and no
longer be associated with its original recovery path.)
MPLS Protection Domain

The set of LSRs over which a working path and its corresponding
recovery path are routed.

MPLS Protection Plan

The set of all LSP protection paths and the mapping from working to
protection paths deployed in an MPLS protection domain at a given
time.

Liveness Message

A message exchanged periodically between two adjacent LSRs that
serves as a link probing mechanism. It provides an integrity check
of the forward and the backward directions of the link between the
two LSRs as well as a check of neighbor aliveness.

Path Continuity Test

A test that verifies the integrity and continuity of a path or path
segment. The details of such a test are beyond the scope of this
draft. (This could be accomplished, for example, by transmitting a
control message along the same links and nodes as the data traffic,
or similarly could be measured by the absence of traffic and by
providing feedback.)

2.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms,
defined as the failure of the liveness message test or a path
continuity test, which indicates that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms
that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This
may be communicated to the MPLS-based recovery mechanisms by the
lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the
link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It
is relayed by each intermediate LSR to its upstream or downstream
neighbor, until it reaches an LSR that is set up to perform MPLS
recovery.
Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been
repaired. Like the FIS, it is relayed by each intermediate LSR to
its upstream or downstream neighbor, until it reaches the LSR that
performs recovery of the original path.

2.4 Abbreviations

FIS:  Fault Indication Signal.
FRS:  Fault Recovery Signal.
LD:   Link Degraded.
LF:   Link Failure.
PD:   Path Degraded.
PF:   Path Failure.
PG:   Path Group.
PML:  Path Merge LSR.
PPG:  Protected Path Group.
PSL:  Path Switch LSR.
PTP:  Protected Traffic Portion.

3.0 MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and
complete restoration of traffic affected by a fault in an
MPLS-enabled network. The fault may be detected at the IP layer or
in lower layers over which IP traffic is transported. Fast MPLS
protection may be viewed as an MPLS LSR switch completion time that
is comparable to, or equivalent to, the 50 ms switch-over completion
time of the SONET layer. This section provides a discussion of the
concepts and principles of MPLS-based recovery. The concepts are
presented in terms of atomic or primitive terms that may be combined
to specify recovery approaches. We do not make any assumptions about
the underlying layer 1 or layer 2 transport mechanisms or their
recovery mechanisms.

3.1 Configuration of Recovery

An LSR should allow for configuration of the following recovery
options:

Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP
rerouting. This is equivalent to having no MPLS-based recovery.
This option may be used for low priority traffic or for traffic that
is recovered in another way (for example, load-shared traffic on
parallel working paths may be automatically recovered upon a fault
along one of the working paths by distributing it among the
remaining working paths).

Recoverable (MPLS-based recovery enabled):
This working path is recovered using one or more recovery paths,
either via rerouting or via protection switching.

3.2 Initiation of Path Setup

As explained in Section 2.2, there are two options for the
initiation of the recovery path setup.

Pre-established:

This is the same as the protection switching option. Here a recovery
path(s) is established prior to any failure on the working path. The
path selection can either be determined by an administrative
centralized tool (online or offline), or chosen based on some
algorithm implemented at the PSL and possibly intermediate nodes. To
guard against the situation in which the pre-established recovery
path fails before or at the same time as the working path, the
recovery path should have secondary configuration options, as
explained in Section 3.3 below.

Pre-Qualified:

A recovery path need not be pre-established; it may instead be
pre-qualified. A pre-qualified recovery path is not created
expressly for protecting the working path, but is instead a path
created for other purposes that is designated as a recovery path
after determination that it is an acceptable alternative for
carrying the working path traffic. Variants include the case where
an optical path or trail is configured, but no switches are set.

Established-on-Demand:

This is the same as the rerouting option. Here, a recovery path is
established after a failure on its working path has been detected
and notified to the PSL.
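The difference between these setup options can be sketched in a few
lines of Python; the class and method names below are purely
illustrative and are not part of any MPLS specification:

```python
# Illustrative sketch (hypothetical names): how a PSL might react to a
# Fault Indication Signal under each path-setup option of Section 3.2.

class PSL:
    def __init__(self, setup_mode, pre_established_path=None):
        # setup_mode: "pre-established", "pre-qualified", or "on-demand"
        self.setup_mode = setup_mode
        self.recovery_path = pre_established_path  # None unless set up early

    def compute_recovery_path(self):
        # Placeholder for path selection (administrative tool or a
        # constraint-based algorithm at the PSL).
        return "recovery-lsp"

    def on_fis(self):
        """Handle a FIS arriving at the PSL."""
        if self.setup_mode == "on-demand" or self.recovery_path is None:
            # Rerouting: the recovery path is established only after
            # the failure has been detected and notified.
            self.recovery_path = self.compute_recovery_path()
        # Both variants end with the same switch-over action.
        return f"switched to {self.recovery_path}"

psl = PSL("pre-established", pre_established_path="backup-lsp")
print(psl.on_fis())   # protection switching: path already exists

psl2 = PSL("on-demand")
print(psl2.on_fis())  # rerouting: path computed after the fault
```

The sketch makes the trade-off explicit: the pre-established variant
pays for the path in advance and switches immediately, while the
on-demand variant pays the path-computation cost inside the recovery
cycle.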
3.3 Initiation of Resource Allocation

A recovery path may or may not support the same traffic contract as
the working path. We distinguish these two situations by using
different additive terms. If the recovery path is capable of
replacing the working path without degrading service, it is called
an equivalent recovery path. If the recovery path lacks the
resources (or resource reservations) to replace the working path
without degrading service, it is called a limited recovery path.
Based on this, there are two options for the initiation of resource
allocation:

Pre-reserved:

This option applies only to protection switching. Here a
pre-established recovery path reserves the required resources on all
hops along its route during its establishment. Although the reserved
resources (e.g., bandwidth and/or buffers) at each node cannot be
used to admit more working paths, they are available for use by all
traffic that is present at the node before a failure occurs.

Reserved-on-Demand:

This option may apply either to rerouting or to protection
switching. Here a recovery path reserves the required resources
after a failure on the working path has been detected and notified
to the PSL, and before the traffic on the working path is switched
over to the recovery path.

Note that under both of the options above, depending on the amount
of resources reserved on the recovery path, it could be either an
equivalent recovery path or a limited recovery path.

3.4 Scope of Recovery

3.4.1 Topology

3.4.1.1 Local Repair

The intent of local repair is to protect against a single link or
neighbor node fault. In local repair (also known as local recovery
[12] [9]), the node immediately upstream of the fault is the one to
initiate recovery (either rerouting or protection switching).
Local repair can be of two types:

Link Recovery/Restoration

In this case, the recovery path may be configured to route around a
certain link deemed to be unreliable. If protection switching is
used, several recovery paths may be configured for one working path,
depending on the specific faulty link that each protects against.

Alternatively, if rerouting is used, upon the occurrence of a fault
on the specified link each path is rebuilt such that it detours
around the faulty link.

In this case, the recovery path need only be disjoint from its
working path at a particular link on the working path, and may have
overlapping segments with the working path. Traffic on the working
path is switched over to an alternate path at the upstream LSR that
connects to the failed link. This method is potentially the fastest
to perform the switchover, and can be effective in situations where
certain path components are much more unreliable than others.

Node Recovery/Restoration

In this case, the recovery path may be configured to route around a
neighbor node deemed to be unreliable. Thus the recovery path is
disjoint from the working path only at a particular node and at the
links associated with the working path at that node. Once again, the
traffic on the primary path is switched over to the recovery path at
the upstream LSR that directly connects to the failed node, and the
recovery path shares overlapping portions with the working path.

3.4.1.2 Global Repair

The intent of global repair is to protect against any link or node
fault on a label switched path or on a segment of a label switched
path, with the obvious exception of faults occurring at the ingress
node.
In global repair (also known as path recovery/restoration), the node
that initiates the recovery is the ingress to the label switched
path and so may be distant from the faulty link or node. In some
cases, a fault notification (in the form of a FIS) must be sent from
the node detecting the fault to the PSL. In many cases, the recovery
path can be made completely link and node disjoint with its working
path. This has the advantage of protecting against all link and node
fault(s) on the working path (or path segment), and of being more
efficient than per-hop link or node recovery. In addition, it can
potentially be more optimal in resource usage than link or node
recovery. However, it is in some cases slower than local repair,
since it takes longer for the fault notification message to reach
the PSL and trigger the recovery action.

3.4.1.3 Alternate Egress Repair

It is possible to restore service without specifically recovering
the faulted path. For example, for best effort IP service it is
possible to select a recovery path that has a different egress point
from the working path (i.e., there is no PML). The recovery path
egress must simply be a router that is acceptable for forwarding the
FEC carried by the working path (without creating loops). In an
engineering context, specific alternative FEC/LSP mappings with
alternate egresses can be formed.

This may simplify enhancing the reliability of implicitly
constructed MPLS topologies. A PSL may qualify LSP/FEC bindings as
candidate recovery paths simply by their being link and node
disjoint from the immediate downstream LSR of the working path.

3.4.1.4 Multi-Layer Repair

Multi-layer repair broadens the network designer's tool set for
those cases where multiple network layers can be managed together to
achieve overall network goals.
Specific criteria for determining when multi-layer repair is
appropriate are beyond the scope of this draft.

3.4.1.5 Concatenated Protection Domains

A given service may cross multiple networks, and these may employ
different recovery mechanisms. It is possible to concatenate
protection domains so that service recovery can be provided
end-to-end. It is considered that the recovery mechanisms in
different domains may operate autonomously, and that multiple points
of attachment may be used between domains (to ensure there is no
single point of failure). Alternate egress repair requires
management of concatenated domains, in that an explicit MPLS point
of failure (the PML) is by definition excluded. Details of
concatenated protection domains are beyond the scope of this draft.

3.4.2 Path Mapping

Path mapping refers to the methods of mapping traffic from a faulty
working path onto the recovery path. There are several options for
this, as described below. Note that the options below should be
viewed as atomic terms that only describe how the working and
protection paths are mapped to each other. The issues of resource
reservation along these paths, and how switchover is actually
performed, lead to the more commonly used composite terms, such as
1+1 and 1:1 protection, which were described in Section 2.1.

1-to-1 Protection

In 1-to-1 protection, the working path has a designated recovery
path that is only to be used to recover that specific working path.

n-to-1 Protection

In n-to-1 protection, up to n working paths are protected using only
one recovery path. If the intent is to protect against any single
fault on any of the working paths, the n working paths should be
diversely routed between the same PSL and PML.
In some cases, handshaking between the PSL and PML may be required
to complete the recovery, the details of which are beyond the scope
of this draft.

n-to-m Protection

In n-to-m protection, up to n working paths are protected using m
recovery paths. Once again, if the intent is to protect against any
single fault on any of the n working paths, the n working paths and
the m recovery paths should be diversely routed between the same PSL
and PML. In some cases, handshaking between the PSL and PML may be
required to complete the recovery, the details of which are beyond
the scope of this draft. N-to-m protection is for further study.

Split Path Protection

In split path protection, multiple recovery paths are allowed to
carry the traffic of a working path based on a certain configurable
load splitting ratio. This is especially useful when no single
recovery path can be found that can carry the entire traffic of the
working path in case of a fault. Split path protection may require
handshaking between the PSL and the PML(s), and may require the
PML(s) to correlate the traffic arriving on multiple recovery paths
with the working path. Although this is an attractive option, the
details of split path protection are beyond the scope of this draft,
and are for further study.

3.4.3 Bypass Tunnels

It may be convenient, in some cases, to create a "bypass tunnel" for
a PPG between a PSL and PML, thereby allowing multiple recovery
paths to be transparent to intervening LSRs. In this case, one LSP
(the tunnel) is established between the PSL and PML following an
acceptable route, and a number of recovery paths are supported
through the tunnel via label stacking. A bypass tunnel can be used
with any of the path mapping options discussed in the previous
section.
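The label stacking that makes the recovery paths transparent to
transit LSRs can be illustrated with a minimal sketch; the label
values and function names are invented for the example:

```python
# Minimal sketch of label stacking into a bypass tunnel (Section
# 3.4.3). Label values and names are illustrative only.

def enter_bypass_tunnel(label_stack, inner_label, tunnel_label):
    """At the PSL: push the recovery path's inner label, then the
    tunnel label, so intermediate LSRs switch only on the tunnel
    label and never see the individual recovery paths."""
    return [tunnel_label, inner_label] + label_stack

def exit_bypass_tunnel(label_stack):
    """At (or just before) the PML: the tunnel label is popped,
    exposing the inner label that identifies the recovery path."""
    return label_stack[1:]

stack = enter_bypass_tunnel([], inner_label=2001, tunnel_label=999)
assert stack == [999, 2001]           # transit LSRs see only 999
assert exit_bypass_tunnel(stack) == [2001]
```

Because each recovery path is distinguished only by its inner label,
adding or removing a protected working path changes state at the PSL
and PML alone, not at the intervening LSRs.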
As with recovery paths, the bypass tunnel may or may not have
resource reservations sufficient to provide recovery without service
degradation. It is possible that the bypass tunnel may have
sufficient resources to recover some number of working paths, but
not all of them at the same time. If the number of recovery paths
carrying traffic in the tunnel at any given time is restricted, this
is similar to the n-to-1 or n-to-m protection cases mentioned in
Section 3.4.2.

3.4.4 Recovery Granularity

Another dimension of recovery considers the amount of traffic
requiring protection. This may range from a fraction of a path to a
bundle of paths.

3.4.4.1 Selective Traffic Recovery

This option allows for the protection of a fraction of the traffic
within the same path. The portion of the traffic on an individual
path that requires protection is called a protected traffic portion
(PTP). A single path may carry different classes of traffic, with
different protection requirements. The protected portion of this
traffic may be identified by its class, for example, via the EXP
bits in the MPLS shim header or via the priority bit in the ATM
header.

3.4.4.2 Bundling

Bundling is a technique used to group multiple working paths
together in order to recover them simultaneously. The logical
bundling of multiple working paths requiring protection, each of
which is routed identically between a PSL and a PML, is called a
protected path group (PPG). When a fault occurs on the working path
carrying the PPG, the PPG as a whole can be protected, either by
being switched to a bypass tunnel or by being switched to a recovery
path.

3.4.5 Recovery Path Resource Use

In the case of pre-reserved recovery paths, there is the question of
what use these resources may be put to when the recovery path is not
in use.
There are two options:

Dedicated-resource:
If the recovery path resources are dedicated, they may not be used
for anything except carrying the working traffic. For example, in
the case of 1+1 protection, the working traffic is always carried on
the recovery path. Even if the recovery path is not always carrying
the working traffic, it may not be possible or desirable to allow
other traffic to use these resources.

Extra-traffic-allowed:
If the recovery path only carries the working traffic when the
working path fails, then it is possible to allow extra traffic to
use the reserved resources at other times. Extra traffic is, by
definition, traffic that can be displaced (without violating service
agreements) whenever the recovery path resources are needed for
carrying the working path traffic.

3.5 Fault Detection

MPLS recovery is initiated after the detection of either a lower
layer fault or a fault at the IP layer or in the operation of
MPLS-based mechanisms. We consider four classes of impairments: Path
Failure, Path Degraded, Link Failure, and Link Degraded.

Path Failure (PF) is a fault that indicates to an MPLS-based
recovery scheme that the connectivity of the path is lost. This may
be detected by a path continuity test between the PSL and PML. Some,
and perhaps the most common, path failures may be detected using a
link probing mechanism between neighbor LSRs. An example of a
probing mechanism is a liveness message that is exchanged
periodically along the working path between peer LSRs. For either a
link probing mechanism or a path continuity test to be effective,
the test message must be guaranteed to follow the same route as the
working or recovery path, over the segment being tested. In
addition, the path continuity test must take the path merge points
into consideration.
In the case of a bi-directional link implemented as two
unidirectional links, path failure could mean that either one or
both unidirectional links are damaged.

Path Degraded (PD) is a fault that indicates to MPLS-based recovery
schemes/mechanisms that the path has connectivity, but that the
quality of the connection is unacceptable. This may be detected by a
path performance monitoring mechanism, or some other mechanism for
determining the error rate on the path or some portion of the path.
Such detection is local to the LSR and consists, for example, of
observing excessive discarding of packets at an interface, due
either to label mismatch or to TTL errors.

Link Failure (LF) is an indication from a lower layer that the link
over which the path is carried has failed. If the lower layer
supports detection and reporting of this fault (that is, any fault
that indicates link failure, e.g., SONET LOS), this may be used by
the MPLS recovery mechanism. In some cases, using LF indications may
provide faster fault detection than using only MPLS-based fault
detection mechanisms.

Link Degraded (LD) is an indication from a lower layer that the link
over which the path is carried is performing below an acceptable
level. If the lower layer supports detection and reporting of this
fault, it may be used by the MPLS recovery mechanism. In some cases,
using LD indications may provide faster fault detection than using
only MPLS-based fault detection mechanisms.

3.6 Fault Notification

Protection switching relies on rapid and reliable notification of
faults. Once a fault is detected, the node that detected the fault
must determine if the fault is severe enough to require path
recovery.
The node should then send out a notification of the fault by
transmitting a FIS to those of its upstream LSRs that were sending
traffic on the working path affected by the fault. This notification
is relayed hop-by-hop by each subsequent LSR to its upstream
neighbor, until it eventually reaches a PSL. A PSL is the only LSR
that can terminate the FIS and initiate a protection switch of the
working path to a recovery path.

Since the FIS is a control message, it should be transmitted with
high priority to ensure that it propagates rapidly towards the
affected PSL(s). Depending on how fault notification is configured
in the LSRs of an MPLS domain, the FIS could be sent either as a
Layer 2 or a Layer 3 packet. An example of a FIS could be the
liveness message sent by a downstream LSR to its upstream neighbor,
with an optional fault notification field set. Alternatively, it
could be a separate fault notification packet. An intermediate LSR
should identify which of its incoming links (upstream LSRs) to
propagate the FIS on. In the case of 1+1 protection, the FIS should
also be sent downstream to the PML, where the recovery action is
taken.

3.7 Switch-Over Operation

3.7.1 Recovery Trigger

The activation of an MPLS protection switch following the detection
or notification of a fault requires a trigger mechanism at the PSL.

MPLS protection switching may be initiated due to automatic inputs
or external commands. The automatic activation of an MPLS protection
switch results from a response to defect or fault conditions
detected at the PSL, or to fault notifications received at the PSL.
It is possible that the fault detection and trigger mechanisms may
be combined, as is the case when a PF, PD, LF, or LD is detected at
a PSL and triggers a protection switch to the recovery path.
In most cases, however, the detection and trigger mechanisms are
distinct, involving the detection of a fault at some intermediate
LSR followed by the propagation of a fault notification back to the
PSL via the FIS, which serves as the protection switch trigger at
the PSL. MPLS protection switching in response to external commands
results when the operator initiates a protection switch by a command
to a PSL (or, alternatively, by a configuration command to an
intermediate LSR, which transmits the FIS towards the PSL).

Note that the PF fault applies to hard failures (fiber cuts,
transmitter failures, or LSR fabric failures), as does the LF fault,
with the difference that the LF is a lower layer impairment that may
be communicated to MPLS-based recovery mechanisms. The PD (or LD)
fault, on the other hand, applies to soft defects (excessive errors
due to noise on the link, for instance). The PD (or LD) results in a
fault declaration only when the percentage of lost packets exceeds a
given threshold, which is provisioned and may be set based on the
service level agreement(s) in effect between a service provider and
a customer.

3.7.2 Recovery Action

After a fault is detected or a FIS is received by the PSL, the
recovery action involves either a rerouting or a protection
switching operation. In both scenarios, the next hop label
forwarding entry for the recovery path is bound to the working path.

3.8 Switch-Back Operation

3.8.1 Revertive and Non-Revertive Modes

These protection modes indicate whether or not there is a preferred
path for the protected traffic.

3.8.1.1 Revertive Mode

If the working path is always the preferred path, this path will be
used whenever it is available. If the working path has a fault,
traffic is switched to the recovery path.
In the revertive mode of operation, when the preferred path is
restored the traffic is automatically switched back to it.

3.8.1.2 Non-revertive Mode

In the non-revertive mode of operation, there is no preferred path.
A switch-back to the "original" working path is not desired or not
possible, since the original path may no longer exist after the
occurrence of a fault on that path.

If there is a fault on the working path, traffic is switched to the
recovery path. When or if the faulty path (the original working
path) is restored, it may become the recovery path (either by
configuration, or, if desired, by management actions). This applies
to explicitly routed working paths.

When the traffic is switched over to a recovery path, the
association between the original working path and the recovery path
may no longer exist, since the original path itself may no longer
exist after the fault. Instead, when the network reaches a stable
state following routing convergence, the recovery path may be
switched over to a different preferred path, based either on
pre-configured information or on optimization over the new network
topology and associated information.

3.8.2 Restoration and Notification

MPLS restoration deals with returning the working traffic from the
recovery path to the original or a new working path. Reversion is
performed by the PSL upon receiving notification, via the FRS, that
the working path is repaired, or upon receiving notification that a
new working path has been established.

As before, the LSR that detected the fault on the working path also
detects the restoration of the working path. If the working path had
experienced a LF defect, the LSR detects a return to normal
operation via the receipt of a liveness message from its peer.
If the working path had experienced a LD defect at an LSR interface,
the LSR could detect a return to normal operation via the resumption
of error-free packet reception on that interface. Alternatively, a
lower layer that no longer detects a LF defect may inform the
MPLS-based recovery mechanisms at the LSR that the link to its peer
LSR is operational.

The LSR then transmits the FRS to its upstream LSR(s) that were
transmitting traffic on the working path. This is relayed hop-by-hop
until it reaches the PSL(s), at which point the PSL switches the
working traffic back to the original working path.

In the non-revertive mode of operation, the working traffic may or
may not be restored to the original working path. This is because it
might be useful, in some cases, to: (a) administratively perform a
protection switch back to the original working path after gaining
further assurances about the integrity of the path, (b) continue
operation without the recovery path being protected, or (c) move the
traffic to a new working path that is calculated based on network
topology and network policies, after the dynamic routing protocols
have converged.

We note that if there is a way to transmit fault information back
along a recovery path towards a PSL, and if the recovery path is an
equivalent recovery path, it is possible for the working path and
its recovery path to exchange roles once the original working path
is repaired following a fault. This is because, in that case, the
recovery path effectively becomes the working path, and the restored
working path functions as a recovery path for the original recovery
path. This is important, since it affords the benefits of
non-revertive switch operation outlined in Section 3.8.1, without
leaving the recovery path unprotected.
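The restoration notification flow described above can be sketched as
follows; the node names and data structures are illustrative only
and do not correspond to any on-the-wire format:

```python
# Sketch of FRS relay and revertive switch-back (Section 3.8.2).
# Names and structures are illustrative, not protocol-accurate.

def relay_frs(upstream_nodes):
    """Relay the Fault Recovery Signal hop-by-hop upstream until a
    PSL is reached; return the nodes the FRS traversed."""
    traversed = []
    for node in upstream_nodes:
        traversed.append(node["name"])
        if node["is_psl"]:
            break  # the PSL terminates the FRS
    return traversed

def on_frs(psl_state, revertive=True):
    """At the PSL: in revertive mode, switch traffic back to the
    repaired working path; otherwise leave it on the recovery path."""
    if revertive:
        psl_state["active"] = psl_state["working"]
    return psl_state["active"]

nodes = [{"name": "LSR3", "is_psl": False},
         {"name": "LSR2", "is_psl": False},
         {"name": "LSR1", "is_psl": True},
         {"name": "LSR0", "is_psl": False}]  # beyond the PSL; not reached
assert relay_frs(nodes) == ["LSR3", "LSR2", "LSR1"]

state = {"working": "W1", "recovery": "R1", "active": "R1"}
assert on_frs(state, revertive=True) == "W1"
```

The `revertive` flag captures the distinction drawn in Section
3.8.1: in non-revertive mode the PSL receives the same FRS but takes
no automatic switch-back action.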
3.8.3 Reverting to Preferred Path (or Controlled Rearrangement)

In the revertive mode, "make before break" restoration switching
can be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This
minimizes both packet loss and packet reordering. The controlled
rearrangement of paths can also be used to satisfy traffic
engineering requirements for load balancing across an MPLS domain.

3.9 Performance

Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:

I. Resource Class Attribute:

Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In
other words, the recovery path meets the same SLAs as the working
path.

Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working
path.

A. Lower Class: The recovery path has lower resource requirements
or less stringent performance requirements than the working path.

B. Best Effort Class: The recovery path is best effort.

II. Priority Attribute:

The recovery path has a priority attribute just like the working
path (i.e., the priority attribute of the associated traffic
trunks). It can have the same priority as the working path or a
lower one.

III. Preemption Attribute:

The recovery path can have the same preemption attribute as the
working path or a lower one.

4.0 MPLS Recovery Requirements

The following are the MPLS recovery requirements:

I. MPLS recovery SHALL provide an option to identify protection
groups (PPGs) and protection portions (PTPs).

II. Each PSL SHALL be capable of performing MPLS recovery upon the
detection of impairments or upon receipt of notifications of
impairments.
III. An MPLS recovery method SHALL NOT preclude manual protection
switching commands. This implies that it would be possible, under
administrative commands, to transfer traffic from a working path to
a recovery path, or to transfer traffic from a recovery path to a
working path once the working path becomes operational following a
fault.

IV. A PSL SHALL be capable of performing either a switchback to
the original working path after the fault is corrected, or a
switchover to a new working path upon the discovery of a more
optimal working path.

V. The recovery model should take into consideration path merging
at intermediate LSRs. If a fault affects the merged segment, all
the paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered.

5.0 MPLS Recovery Options

There SHOULD be an option for:

I. Configuration of the recovery path as excess or reserved, with
excess as the default. A recovery path that is configured as
excess SHALL provide lower-priority preemptable traffic access to
the protection bandwidth, while a recovery path configured as
reserved SHALL NOT provide any other traffic access to the
protection bandwidth.

II. Each protected path SHALL provide an option for configuring the
protection alternative as either rerouting or protection
switching.

III. Each protected path SHALL provide a configuration option for
enabling restoration as either non-revertive or revertive, with
revertive as the default.

6.0 Comparison Criteria

Possible criteria to use for comparison of MPLS-based recovery
schemes are as follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault.
Recovery Time is the sum of the Fault Detection Time, Hold-off
Time, Notification Time, Recovery Operation Time, and Traffic
Restoration Time. In other words, it is the time between the
failure of a node or link in the network and the time at which a
recovery path is installed and traffic starts flowing on it.

Full Restoration Time

We define full restoration time as the time required for a
permanent restoration. This is the time required for traffic to be
routed onto links that are capable of, or have been engineered
sufficiently for, handling traffic in recovery scenarios. Note that
this time may or may not be different from the "Recovery Time",
depending on whether equivalent or limited recovery paths are used.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will depend on the traffic
characteristics of the network. However, it may also depend on the
particular protection plan selection algorithms, as well as on the
signaling and re-routing methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to encompass a spectrum of
"packet survivability", which may range from "relative" to
"absolute". Relative survivability may mean that the packet is on
an equal footing with other traffic of, for example, the same
diff-serv code point (DSCP) in contending for the surviving network
resources. Absolute survivability may mean that the survivability
of the protected traffic has explicit guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets.
Also, the action of putting traffic back on preferred paths might
cause packet re-ordering.

State Overhead

As the number of recovery paths in a protection plan grows, the
state required to maintain them also grows. Schemes may require
differing numbers of paths to maintain certain levels of coverage.
The state required may also depend on the particular scheme used to
recover. In many cases the state overhead will be in proportion to
the number of recovery paths.

Loss

Recovery schemes may introduce a certain amount of packet loss
during switchover to a recovery path. Schemes that introduce loss
during recovery can measure this loss by evaluating recovery times
in proportion to the link speed.

In the case of a link or node failure, a certain amount of packet
loss is inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I. Fault Types: Recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node
faults into account.

II. Number of concurrent faults: Depending on the layout of
recovery paths in the protection plan, it may be possible to
recover from multiple fault scenarios.

III. Number of recovery paths: For a given fault, there may be one
or more recovery paths.

IV. Percentage of coverage: Depending on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into the percentage of link faults and the
percentage of node faults.

V. The number of protected paths may affect how fast the total set
of paths affected by a fault can be recovered. The ratio of
protected paths is n/N, where n is the number of protected paths
and N is the total number of paths.
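Two of the comparison criteria above are simple arithmetic and can
be sketched directly: Recovery Time as the sum of its five component
times, and the protection ratio n/N. The function names and the
example component times below are assumptions made for illustration.

```python
# Illustrative sketch of two comparison metrics from Section 6.0.
# Component times are hypothetical example values in milliseconds.

def recovery_time_ms(detection, hold_off, notification,
                     recovery_op, traffic_restoration):
    """Recovery Time = Fault Detection Time + Hold-off Time +
    Notification Time + Recovery Operation Time +
    Traffic Restoration Time."""
    return (detection + hold_off + notification +
            recovery_op + traffic_restoration)

def protection_ratio(n_protected, n_total):
    """Ratio of protected paths, n/N."""
    return n_protected / n_total

# Example with assumed component times:
t = recovery_time_ms(detection=10, hold_off=0, notification=15,
                     recovery_op=20, traffic_restoration=5)
print(t)                           # 50 ms total recovery time
print(protection_ratio(40, 100))   # 0.4
```

A scheme with a larger hold-off time (e.g., to let a lower layer
recover first) trades a longer Recovery Time for fewer spurious
protection switches.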
7.0 Security Considerations

The MPLS recovery that is specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

8.0 Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed
in regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

9.0 Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on the earlier version of this draft. In
particular, Bora Akyol, Dave Allan, and Neil Harrisson, whose
suggestions and comments were very helpful in revising the
document.

10.0 Authors' Addresses

Vishal Sharma                       Ben Mack-Crane
Tellabs Research Center             Tellabs Operations, Inc.
One Kendall Square                  4951 Indiana Avenue
Bldg. 100, Ste. 121                 Lisle, IL 60532
Cambridge, MA 02139-1562            Phone: 630-512-7255
Phone: 617-577-8760                 Ben.Mack-Crane@tellabs.com
Vishal.Sharma@tellabs.com

Srinivas Makam                      Ken Owens
Tellabs Operations, Inc.            Tellabs Operations, Inc.
4951 Indiana Avenue                 1106 Fourth Street
Lisle, IL 60532                     St. Louis, MO 63126
Phone: 630-512-7217                 Phone: 314-918-1579
Srinivas.Makam@tellabs.com          Ken.Owens@tellabs.com

Changcheng Huang                    Fiffi Hellstrand
Dept. of Systems & Computer Engg.   Nortel Networks
Carleton University                 St Eriksgatan 115
Minto Center, Rm. 3082              PO Box 6701
1125 Colonial By Drive              113 85 Stockholm, Sweden
Ottawa, Ontario K1S 5B6, Canada     Phone: +46 8 5088 3687
Phone: 613 520-2600 x2477           Fiffi@nortelnetworks.com
Changcheng.Huang@sce.carleton.ca

Jon Weil                            Brad Cain
Nortel Networks                     Mirror Image Internet
Harlow Laboratories, London Road    49 Dragon Ct.
Harlow Essex CM17 9NA, UK           Woburn, MA 01801, USA
Phone: +44 (0)1279 403935           bcain@mirror-image.com
jonweil@nortelnetworks.com

Loa Andersson                       Bilel Jamoussi
Nortel Networks                     Nortel Networks
St Eriksgatan 115, PO Box 6701      3 Federal Street, BL3-03
113 85 Stockholm, Sweden            Billerica, MA 01821, USA
Phone: +46 8 50 88 36 34            Phone: (978) 288-4506
loa.andersson@nortelnetworks.com    jamoussi@nortelnetworks.com

Seyhan Civanlar                     Angela Chiu
Coreon, Inc.                        AT&T Labs, Rm. 4-204
1200 South Avenue, Suite 103        100 Schulz Drive
Staten Island, NY 10314             Red Bank, NJ 07701
Phone: (718) 889-4203               Phone: (732) 345-3441
scivanlar@coreon.net                alchiu@att.com

11.0 References

[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol
    Label Switching Architecture", Work in Progress, Internet
    Draft, August 1999.

[2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and
    Thomas, B., "LDP Specification", Work in Progress, Internet
    Draft, September 1999.

[3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement
    for Extensions to RSVP for LSP-Tunnels", draft-ietf-mpls-rsvp-
    tunnel-applicability-00.txt, Work in Progress, September 1999.

[4] Jamoussi, B., "Constraint-Based LSP Setup using LDP", Work in
    Progress, Internet Draft, September 1999.

[5] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
    ReSerVation Protocol (RSVP) -- Version 1 Functional
    Specification", RFC 2205, September 1997.

[6] Awduche, D., et al., "Extensions to RSVP for LSP Tunnels", Work
    in Progress, Internet Draft, September 1999.

[12] Haskin, D. and Krishnan, R., "A Method for Setting an
     Alternative Label Switched Path to Handle Fast Reroute",
     draft-haskin-mpls-fast-reroute-01.txt, Work in Progress, 1999.