RTG Working Group                                        C. Bookham, Ed.
Internet-Draft                                                  A. Stone
Intended status: Informational                                     Nokia
Expires: September 10, 2020                                  J. Tantsura
                                                                  Apstra
                                                              M. Durrani
                                                             Equinix Inc
                                                           March 9, 2020

          An Architecture for Network Function Interconnect
                   draft-bookham-rtgwg-nfix-arch-00

Abstract

The emergence of technologies such as 5G, the Internet of Things (IoT), and Industry 4.0, coupled with the move towards network function virtualization, means that the service requirements demanded from networks are changing.
This document describes an architecture for a Network Function Interconnect (NFIX) that allows for interworking of physical and virtual network functions in a unified and scalable manner across wide-area network and data center domains while maintaining the ability to deliver against SLAs.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119][RFC8174] when, and only when, they appear in all capitals, as shown here.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 10, 2020.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Introduction
2.  Terminology
3.  Motivation
4.  Requirements
5.  Theory of Operation
    5.1.  VNF Assumptions
    5.2.  Overview
    5.3.  Use of a Centralized Controller
    5.4.  Transport Layer
        5.4.1.  Intra-Domain Routing
        5.4.2.  Inter-Domain Routing
        5.4.3.  Intra-Domain and Inter-Domain Traffic-Engineering
    5.5.  Service Layer
    5.6.  Service Differentiation
    5.7.  Automated Service Activation
    5.8.  Service Function Chaining
    5.9.  Stability and Availability
        5.9.1.  IGP Reconvergence
        5.9.2.  Data Center Reconvergence
        5.9.3.  Exchange of Inter-Domain Routes
        5.9.4.  Controller Redundancy
        5.9.5.  Path and Segment Liveliness
    5.10. Scalability
        5.10.1.  Asymmetric Model B for VPN Families
6.  Illustration of Use
    6.1.  Reference Topology
    6.2.  PNF to PNF Connectivity
    6.3.  VNF to PNF Connectivity
    6.4.  VNF to VNF Connectivity
7.  Conclusions
8.  Security Considerations
9.  Acknowledgements
10. Contributors
11. IANA Considerations
12. References
    12.1.  Normative References
    12.2.  Informative References
Authors' Addresses

1. Introduction

With the introduction of technologies such as 5G, the Internet of Things (IoT), and Industry 4.0, service requirements are changing. In addition to the ever-increasing demand for more capacity, these services have other stringent service requirements that need to be met, such as ultra-reliable and/or low-latency communication.

Parallel to this, there is a continued trend towards network function virtualization. Operators are building digitalized infrastructure capable of hosting numerous virtualized network functions (VNFs): infrastructure that can scale in and scale out depending on application demand and can deliver flexibility and service velocity. Much of this virtualization activity is driven by the aforementioned emerging technologies as new infrastructure is deployed in support of them.
To try to meet the new service requirements, some of these VNFs are becoming more dispersed, so it is common for networks to have a mix of centralized medium- or large-sized data centers together with more distributed, smaller 'edge-clouds'. VNFs hosted within these data centers require seamless connectivity to each other and to their existing physical network function (PNF) counterparts. This connectivity also needs to deliver against agreed SLAs.

Coupled with the deployment of virtualization is automation. Many of these VNFs are deployed within SDN-enabled data centers, where automation is simply a must-have capability to improve service activation lead-times. The expectation is that services will be instantiated in an abstract point-and-click manner and be automatically created by the underlying network, dynamically adapting to service connectivity changes as virtual entities move between hosts.

This document describes an architecture for a Network Function Interconnect (NFIX) that allows for interworking of physical and virtual network functions in a unified and scalable manner. It describes a mechanism for establishing connectivity across multiple discrete domains in both the wide-area network (WAN) and the data center (DC) while maintaining the ability to deliver against SLAs. To achieve this, NFIX works with the underlying topology to build a unified over-the-top topology.

The NFIX architecture described in this document does not define any new protocols, but rather outlines an architecture utilizing a collaboration of existing standards-based protocols.

2. Terminology

o  A physical network function (PNF) refers to a network device, such as a Provider Edge (PE) router, that connects physically to the wide-area network.
o  A virtualized network function (VNF) refers to a network device, such as a provider edge (PE) router, that is hosted on an application server. The VNF may be bare-metal, in that it consumes the entire resources of the server, or it may be one of numerous virtual functions instantiated as a VM or a number of containers on a given server, controlled by a hypervisor or container management platform.

o  A Data Center Interconnect (DCI) refers to the network function that spans the border between the wide-area and data center networks, typically interworking the different encapsulation techniques employed within each domain.

o  An Interconnect controller is the controller responsible for managing the NFIX fabric and services.

o  A DC controller is the term used for a controller that resides within an SDN-enabled data center and is responsible for the DC network(s).

3. Motivation

Industrial automation and business-critical environments use applications that are demanding on the network. These applications present different requirements, from low latency to high throughput to application-specific traffic conditioning, or a combination of these. The evolution to 5G equally presents challenges for mobile back-, front-, and mid-haul networks. The need for ultra-reliable low-latency communication means that operators must re-evaluate their network architecture to meet these requirements.

At the same time, the service edge is evolving. Where the service edge device was historically a PNF, the adoption of virtualization means VNFs are becoming more commonplace. Typically, these VNFs are hosted in some form of data center environment but require end-to-end connectivity to other VNFs and/or other PNFs. This represents a challenge because transport layer connectivity generally differs between the WAN and the data center environment.
The WAN includes all levels of hierarchy (core, aggregation, access) that form the network's footprint, where transport layer connectivity using IP/MPLS is commonplace. In the data center, native IP is commonplace, utilizing network virtualization overlay (NVO) technologies such as Virtual Extensible LAN (VXLAN) [RFC7348], Network Virtualization Using Generic Routing Encapsulation (NVGRE) [RFC7637], or Generic Network Virtualization Encapsulation (GENEVE) [I-D.ietf-nvo3-geneve]. There is a requirement to seamlessly integrate these islands, avoiding heavy lifting at interconnects, as well as providing a means to provision end-to-end services with a single touch point at the edge.

The service edge boundary is also changing. Some functions that were previously reasonably centralized are now becoming more distributed. One reason for this is to attempt to deal with low-latency requirements. Another is that operators seek to reduce costs by deploying low/medium-capacity VNFs closer to the edge. Equally, virtualization also sees some of the access network moving towards the core. Examples of this include cloud-RAN or Software-Defined Access Networks.

Historically, service providers have architected data centers independently from the wide-area network, creating two independent domains or islands. As VNFs become part of the service landscape, the service data-path must be extended across the WAN into the data center infrastructure, but in a manner that still allows operators to meet deterministic performance requirements.
Methods for stitching WAN and DC infrastructures together with some form of service-interworking at the data center interconnect have been implemented and deployed, but this service-interworking approach has several limitations:

o  The data center environment typically uses encapsulation techniques such as VXLAN or NVGRE, while the WAN typically uses encapsulation techniques such as MPLS [RFC3031]. Underlying optical infrastructure might also need to be programmed. These are incompatible and require interworking at the service layer.

o  It typically requires heavy-touch service provisioning on the data center interconnect. In an end-to-end service, midpoint provisioning is undesirable and should be avoided.

o  Automation is difficult, largely due to the first two points but with additional contributing factors. In the virtualization world automation is a must-have capability.

o  When a service is operating at Layer 3 in a data center with redundant interconnects, the risk of routing loops exists. There is no inherent loop avoidance mechanism when redistributing routes between address families, so extreme care must be taken. Proposals such as the Domain Path (D-PATH) attribute [I-D.ietf-bess-evpn-ipvpn-interworking] attempt to address this issue but are not yet widely implemented or deployed.

o  Some or all of the above make the service-interworking gateway cumbersome, with questionable scaling attributes.

Hence there is a requirement to create an open, scalable, and unified network architecture that brings together the wide-area network and data center domains. It is not an architecture exclusively targeted at greenfield deployments, nor does it require a flag-day upgrade to deploy in a brownfield network.
It is an evolutionary step towards a consolidated network that uses the constructs of seamless MPLS [I-D.ietf-mpls-seamless-mpls] as a baseline and extends upon that to include topologies that may not be link-state based and to provide end-to-end path control. Overall, the NFIX architecture aims to deliver the following:

o  Allow for an evolving service edge boundary without having to constantly restructure the architecture.

o  Provide a mechanism for seamless connectivity between VNF and VNF, VNF and PNF, and PNF and PNF, with deterministic SLAs, and with the ability to provide differentiated SLAs to suit different service requirements.

o  Deliver a unified transport fabric using Segment Routing (SR) [RFC8402] where service delivery mandates touching only the service edge, without imposing additional encapsulation requirements in the DC.

o  Embrace automation by providing an environment where any end-to-end connectivity can be instantiated in a single request while maintaining SLAs.

4. Requirements

The following section outlines the requirements that the proposed solution must meet. From an overall perspective, the proposed generic architecture must:

o  Deliver end-to-end transport LSPs using traffic-engineering (TE) as required to meet the appropriate SLAs for the service(s) using those LSPs. End-to-end refers to VNF and/or PNF connectivity or a combination of both.

o  Provide a solution that allows for optimal end-to-end path placement, where optimal not only meets the requirements of the path in question but also meets the global network objectives.

o  Support varying types of VNF physical network attachment and logical (underlay/overlay) connectivity.

o  Facilitate automation of service provision.
As such, the solution should avoid heavy-touch service provisioning and decapsulation/encapsulation at data center interconnects.

o  Provide a framework for delivering logical end-to-end networks using differentiated logical topologies and/or constraints.

o  Provide a high level of stability; faults in one domain should not propagate to another domain.

o  Provide a mechanism for homogeneous end-to-end OAM.

o  Hide/localize instabilities in the different domains that participate in the end-to-end service.

o  Provide a mechanism to minimize the label-stack depth required at path head-ends for SR-TE LSPs.

o  Offer a high level of scalability.

o  Although not considered in scope for the current version of this document, the solution should not preclude the deployment of multicast. This subject may be covered in later versions of this document.

5. Theory of Operation

This section describes the NFIX architecture, including the building blocks and protocol machinery used to form the fabric. Where considered appropriate, rationale is given for the selection of an architectural component where other seemingly applicable choices could have been made.

5.1. VNF Assumptions

For the sake of simplicity, references to VNF are made in a broad sense. The way in which a VNF is instantiated and provided network connectivity will differ based on environment and VNF capability, but for conciseness this is not explicitly detailed with every reference to a VNF. Common examples of VNF variants include, but are not limited to:

o  A VNF that functions as a routing device and has full IP routing and MPLS capabilities. It can be connected simultaneously to the data center fabric underlay and overlay and serves as the NVO tunnel endpoint [RFC8014]. Examples of this might be a virtualized PE router or a virtualized Broadband Network Gateway (BNG).
o  A VNF that functions as a device (host or router) with limited IP routing capability. It does not connect directly to the data center fabric underlay, but rather connects to one or more external physical or virtual devices that serve as the NVO tunnel endpoint(s). It may, however, have single or multiple connections to the overlay. Examples of this might be a mobile network control or management plane function.

o  A VNF that has no routing capability. It is a virtualized function hosted within an application server and is managed by a hypervisor or container host. The hypervisor/container host acts as the NVO endpoint and interfaces to some form of SDN controller responsible for programming the forwarding plane of the virtualization host using, for example, OpenFlow. Examples of this might be an Enterprise application server or a web server.

Where considered necessary, exceptions to the examples provided above, or a focus on a particular scenario, will be highlighted.

5.2. Overview

The NFIX architecture makes no assumptions about how the network is physically composed, nor does it impose any dependencies upon it. It also makes no assumptions about IGP hierarchies. The use of areas/levels or discrete IGP instances within the WAN is fully endorsed to enhance scalability and constrain fault propagation. The overall architecture uses the constructs of seamless MPLS as a baseline and extends upon that. The concept of decomposing the network into multiple domains is one that has been widely deployed and has been proven to scale in networks with large numbers of nodes.

The proposed architecture uses segment routing (SR) as its preferred choice of transport.
Segment routing is chosen for the construction of end-to-end LSPs given its ability to traffic-engineer through source-routing while concurrently scaling exceptionally well, as it holds no network state other than at the ingress node. This document uses SR instantiated on an MPLS forwarding plane (SR-MPLS), although it does not preclude the use of SRv6, either now or at some point in the future. The rationale for selecting SR-MPLS is simply maturity and more widespread applicability across a potentially broad range of network devices. This document may be updated in future versions to include more description of SRv6 applicability.

5.3. Use of a Centralized Controller

It is recognized that for most operators the move towards the use of a controller within the wide-area network is a significant change in operating model. In the NFIX architecture it is a necessary component. Its use is not simply to offload inter-domain path calculation from network elements; it provides many more benefits:

o  It offers the ability to enforce constraints on paths that originate/terminate on different network elements, thereby providing path diversity, bidirectionality/co-routing, and/or disjointness.

o  It avoids the collisions, re-tries, and packing problems that have been observed in networks using distributed TE path calculation, where head-ends make autonomous decisions.

o  A controller can take a global view of path placement strategies, including the ability to make path placement decisions over a high number of LSPs concurrently, as opposed to considering each LSP independently. In turn, this allows for 'global' optimization of network resources such as available capacity.

o  A controller can make decisions based on near-real-time network state and optimize paths accordingly.
For example, if a network link becomes congested it may recompute some of the paths transiting that link onto other links that may not be quite as optimal but do have available capacity. Or, if a link's latency crosses a certain threshold, it may choose to reoptimize some latency-sensitive paths away from that link.

o  The logic of a controller can be extended beyond pure path computation and placement. If the controller is aware of services, service requirements, and available paths within the network, it can cross-correlate between them and ensure that the appropriate paths are used for the appropriate services.

o  The controller can provide assurance and verification of the underlying SLA provided to a given service.

As the main objective of the NFIX architecture is to unify the data center and wide-area network domains, using the term controller alone is not sufficiently precise. The centralized controller may need to interface to other controllers that potentially reside within an SDN-enabled data center. Therefore, to avoid interchangeably using the term controller for both functions, we distinguish between them simply by using the terms 'DC controller', which as the name suggests is responsible for the DC, and 'Interconnect controller', responsible for managing the extended SR fabric and services.

The Interconnect controller learns wide-area network topology information and the allocation of segment routing SIDs within that domain using BGP link-state [RFC7752] with appropriate SR extensions. Equally, it learns data center topology information and Prefix-SID allocation using BGP labeled unicast [RFC8277] with appropriate SR extensions, or BGP link-state if a link-state IGP is used within the data center.
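The topology-learning flow described above can be sketched as follows. This is a minimal illustration only: the class, method names, addresses, and SID values are all invented, and a real controller would consume these feeds through a BGP speaker rather than static calls.

```python
# Hypothetical sketch: an Interconnect controller merging topology learned
# from BGP-LS (WAN) and BGP labeled unicast (data center) into one view.

class Topology:
    def __init__(self):
        self.nodes = {}   # node key -> {"domain": ..., "prefix_sid": ...}
        self.links = []   # (a, b, igp_metric)

    def add_bgp_ls_node(self, router_id, instance_id, prefix_sid):
        # BGP-LS Node NLRI: the Instance-ID identifies the IGP routing domain.
        self.nodes[router_id] = {"domain": instance_id,
                                 "prefix_sid": prefix_sid}

    def add_bgp_lu_route(self, prefix, prefix_sid, dc_id):
        # BGP-LU route carrying a Prefix-SID; dc_id stands in for whatever
        # 'data center membership identification' is chosen.
        self.nodes[prefix] = {"domain": dc_id, "prefix_sid": prefix_sid}

    def add_link(self, a, b, metric):
        self.links.append((a, b, metric))

topo = Topology()
topo.add_bgp_ls_node("192.0.2.1", instance_id=1, prefix_sid=101)      # WAN ABR
topo.add_bgp_lu_route("192.0.2.11/32", prefix_sid=111, dc_id="dc1")   # DC leaf
topo.add_link("192.0.2.1", "192.0.2.11/32", 10)
```

The same structure holds whichever feed supplies a node; only the key that scopes it to a domain differs, which is why the membership identification discussed later matters for BGP-LU-learned topology.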
If Route-Reflection is used for the exchange of BGP link-state or labeled unicast NLRI within one or more domains, the Interconnect controller need only peer as a client with those Route-Reflectors in order to learn topology information.

Where BGP link-state is used to learn the topology of a data center (or any IGP routing domain), the BGP-LS Instance Identifier (Instance-ID) is carried within Node/Link/Prefix NLRI and is used to identify a given IGP routing domain. Where labeled unicast BGP is used to discover the topology of one or more data center domains, there is no equivalent way for the Interconnect controller to achieve this level of routing domain correlation. The controller may learn some splintered connectivity map consisting of 10 leaf switches, four spine switches, and four DCIs, but it needs some form of key to inform it that leaf switches 1-5, spine switches 1 and 2, and DCIs 1 and 2 belong to data center 1, while leaf switches 6-10, spine switches 3 and 4, and DCIs 3 and 4 belong to data center 2. What is needed is a form of 'data center membership identification' to provide this correlation. Optionally, this could be achieved at the BGP level using a standard community to represent each data center, or it could be done at a more abstract level where, for example, the DC controller provides the membership identification to the Interconnect controller through an application programming interface (API).

Understanding real-time network state is an important part of the Interconnect controller's role, and only with this information is the controller able to make informed decisions and take preventive or corrective actions as necessary.
There are numerous methods implemented and deployed that allow for the harvesting of network state, including (but not limited to) IPFIX [RFC7011], Netconf/YANG [RFC6241][RFC6020], streaming telemetry, and the BGP Monitoring Protocol (BMP) [RFC7854].

5.4. Transport Layer

This section describes the mechanisms and protocols that are used to establish end-to-end transport LSPs, where end-to-end refers to VNF-to-VNF, PNF-to-PNF, or VNF-to-PNF.

5.4.1. Intra-Domain Routing

In a seamless MPLS architecture, domains are based on geographic dispersion (core, aggregation, access). Within this document a domain is considered to be any entity with a captive topology, be it a link-state topology or otherwise. Where reference is made to the wide-area network domain, it refers to one or more domains that constitute the wide-area network.

This section discusses the basic building blocks required within the wide-area network and the data center, noting from above that the wide-area network may itself consist of multiple domains.

5.4.1.1. Wide-Area Network Domains

The wide-area network includes all levels of hierarchy (core, aggregation, access) that constitute the network's MPLS footprint, as well as the Data Center Interconnects (DCIs). Each domain that constitutes part of the wide-area network runs a link-state interior gateway protocol (IGP) such as IS-IS or OSPF, and each domain may use IGP-inherent hierarchy (OSPF areas, IS-IS levels) with an assumption that visibility is domain-wide using, for example, L2-to-L1 redistribution. Alternatively, or additionally, there may be multiple domains that are split by using separate and distinct instances of the IGP.
There is no requirement for IGP redistribution of any link or loopback addresses between domains.

Each IGP should be enabled with the relevant extensions for segment routing [RFC8667][RFC8665], and each SR-capable router should advertise a Node-SID for its loopback address and an Adjacency-SID (Adj-SID) for every connected interface (unidirectional adjacency) belonging to the SR domain. SR Global Blocks (SRGBs) can be allocated to each domain as deemed appropriate to specific network requirements. Border routers belonging to multiple domains have an SRGB for each domain.

The default forwarding path for intra-domain transport LSPs that do not require TE is simply an SR LSP containing a single label, advertised by the destination as a Node-SID and representing the ECMP-aware shortest path to that destination. Intra-domain TE transport LSPs are constructed as required by the Interconnect controller. Once a path is calculated it is advertised as an explicit SR Policy [I-D.ietf-spring-segment-routing-policy] containing one or more paths expressed as one or more segment-lists. An SR Policy is identified through the tuple [headend, color, endpoint], and this tuple is used extensively by the Interconnect controller to associate services with an underlying SR Policy that meets their objectives.

5.4.1.2. Data Center Domain

The data center domain includes all fabric switches, network virtualization edge (NVE) devices, and the Data Center Interconnects. The data center routing design may align with the framework of [RFC7938], running single-hop eBGP sessions established over direct point-to-point links, or it may use an IGP for the dissemination of topology information.

The chosen method of transport or encapsulation within the data center for NFIX is SR-MPLS over IP/UDP [RFC8663] or, where possible, native SR-MPLS.
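A minimal sketch of this encapsulation may help. The framing below follows [RFC8663] and the MPLS-in-UDP format it builds on (an MPLS label stack carried in a UDP payload, destination port 6635, with flow entropy encoded in the UDP source port); all labels, the payload, and the entropy value are invented, and the outer IP header is omitted for brevity.

```python
import struct

MPLS_OVER_UDP_DST_PORT = 6635  # well-known MPLS-in-UDP destination port

def mpls_entry(label, tc=0, bos=0, ttl=64):
    """Pack one 4-byte MPLS label stack entry (label/TC/bottom-of-stack/TTL)."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (bos << 8) | ttl)

def encapsulate(segment_list, payload, entropy):
    """Build UDP(sport=entropy, dport=6635) + MPLS label stack + payload."""
    stack = b"".join(
        mpls_entry(lbl, bos=(i == len(segment_list) - 1))  # BoS on last label
        for i, lbl in enumerate(segment_list)
    )
    length = 8 + len(stack) + len(payload)  # UDP header + stack + payload
    udp = struct.pack("!HHHH", entropy, MPLS_OVER_UDP_DST_PORT, length, 0)
    return udp + stack + payload

# Two invented SR-MPLS labels pushed onto an invented payload; the UDP
# source port (entropy) is what fabric switches hash on for ECMP.
pkt = encapsulate([16001, 24008], b"ip-packet", entropy=49152)
```

Because the label stack is opaque to the Clos fabric, per-flow entropy lives entirely in the UDP source port, which is the property the following paragraph relies on.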
The choice of SR-MPLS over IP/UDP or native SR-MPLS allows for good entropy to maximize the use of equal-cost Clos fabric links and allows for a lightweight interworking function at the DCI without the requirement for midpoint service provisioning. Loopback addresses of network elements within the data center are advertised using labeled unicast BGP with the addition of the SR Prefix-SID extensions [RFC8669] containing a globally unique and persistent Prefix-SID. The data-plane encapsulation of SR-MPLS over IP/UDP or native SR-MPLS allows network elements within the data center to consume BGP Prefix-SIDs and legitimately use those in the forwarding plane.

5.4.2. Inter-Domain Routing

Inter-domain routing is responsible for establishing connectivity between any domains that form the wide-area network, and between the wide-area network and data center domains. It is considered unlikely that every end-to-end LSP will require a TE path, hence there is a requirement for a default end-to-end forwarding path. This default forwarding path may also become the path of last resort in the event of a non-recoverable failure of a TE path. Similar to the seamless MPLS architecture, this inter-domain MPLS connectivity is realized using labeled unicast BGP [RFC8277] with the addition of the SR Prefix-SID extensions.

Within each wide-area network domain, all service edge routers, DCIs, and ABRs/ASBRs form part of the labeled BGP mesh, which can be either a full mesh or, more likely, based on the use of route-reflection. Each of these routers advertises its respective loopback addresses into labeled BGP together with an MPLS label and a globally unique Prefix-SID. Routes are advertised between wide-area network domains by ABRs/ASBRs that impose next-hop-self on advertised routes.
The 584 function of imposing next-hop-self for labeled routes means that the 585 ABR/ASBR allocates a new label for advertised routes and programs a 586 label-swap entry in the forwarding plane for received and advertised 587 routes. In short, it becomes part of the forwarding path. 589 DCI routers have labeled BGP sessions towards the wide-area network 590 and labeled BGP sessions towards the data center. Routes are 591 bidirectionally advertised between the domains subject to policy, 592 with the DCI imposing itself as next-hop on advertised routes. As 593 above, the function of imposing next-hop-self for labeled routes 594 implies allocation of a new label for advertised routes and a label- 595 swap entry being programmed in the forwarding plane for received and 596 advertised labels. The DCI thereafter becomes the anchor point 597 between the wide-area network domain and the data center domain. 599 Within the wide-area network next-hops for labeled unicast routes 600 containing Prefix-SIDs are resolved to SR LSPs, and within the data 601 center domain next-hops for labeled unicast routes containing Prefix- 602 SIDs are resolved to SR LSPs or IP/UDP tunnels. This provides end- 603 to-end connectivity without a traffic-engineering capability. 605 5.4.4. Intra-Domain and Inter-Domain Traffic-Engineering 607 A capability to traffic-engineer intra- and inter-domain end-to-end 608 paths is considered a key requirement in order to meet the service 609 objectives previously outlined. To achieve optimal end-to-end path 610 placement the key components to be considered are path calculation, 611 path activation, and FEC-to-path binding procedures. 613 In the NFIX architecture end-to-end path calculation is performed by 614 the Interconnect controller. The mechanics of how the objectives of 615 each path are calculated are beyond the scope of this document.
Once a 616 path is calculated based upon its objectives and constraints, the 617 path is advertised from the controller to the LSP headend as an 618 explicit SR Policy containing one or more paths expressed as one or 619 more segment-lists. An SR Policy is identified through the tuple 620 [headend, color, endpoint] and this tuple is used extensively by the 621 Interconnect controller to associate services with an underlying SR 622 Policy that meets its objectives. 624 The segment-list of an SR Policy encodes a source-routed path towards 625 the endpoint. When calculating the segment-list the Interconnect 626 controller makes comprehensive use of the Binding-SID (BSID), 627 instantiating BSID anchors as necessary at path midpoints when 628 calculating and activating a path. The use of BSID is considered 629 fundamental to segment routing as described in 630 [I-D.filsfils-spring-sr-policy-considerations]. It provides opacity 631 between domains, ensuring that any segment churn is constrained to a 632 single domain. It also reduces the number of segments/labels that 633 the headend needs to impose, which is particularly important given 634 that network elements within a data center generally have limited 635 label imposition capabilities. In the context of the NFIX 636 architecture it is also the vehicle that allows for removal of heavy 637 midpoint provisioning at the DCI. 639 For example, assume that VNF1 is situated in data center 1, which is 640 interconnected to the wide-area network via DCI1. VNF1 requires 641 connectivity to VNF2, situated in data center 2, which is 642 interconnected to the wide-area network via DCI2. Assuming there is 643 no existing TE path that meets VNF1's requirements, the Interconnect 644 controller will: 646 o Instantiate an SR Policy on DCI1 with BSID n and a segment-list 647 containing the relevant segments of a TE path to DCI2. DCI1 648 therefore becomes a BSID anchor.
650 o Instantiate an SR Policy on VNF1 with BSID m and a segment-list 651 containing segments {DCI1, n, VNF2}. 653 +---------------+ +----------------+ +---------------+ 654 | Data Center 1 | | Wide-Area | | Data Center 2 | 655 | +----+ +----+ 3 +----+ +----+ | 656 | |VNF1| |DCI1|-1 / \ 5--|DCI2| |VNF2| | 657 | +----+ +----+ \ / \ / +----+ +----+ | 658 | | | 2 4 | | | 659 +---------------+ +----------------+ +---------------+ 660 SR Policy SR Policy 661 BSID m BSID n 662 {DCI1,n,VNF2} {1,2,3,4,5,DCI2} 664 Traffic-Engineered Path using BSID 666 Figure 1 668 5.5. Service Layer 670 The service layer is intended to deliver Layer 2 and/or Layer 3 VPN 671 connectivity between network functions to create an overlay utilizing 672 the transport layer described in section 5.4. To do this the 673 solution employs the EVPN and/or VPN-IPv4/IPv6 address families to 674 exchange Layer 2 and Layer 3 Network Layer Reachability Information 675 (NLRI). When these NLRI are exchanged between domains it is typical 676 for the border router to set next-hop-self on advertised routes. 677 With the proposed transport layer however, this is not required and 678 EVPN/VPN-IPv4/IPv6 routes should be passed end-to-end without transit 679 routers modifying the next-hop attribute. 681 Section 5.4.2 describes the use of labeled unicast BGP to exchange 682 inter-domain routes to establish a default forwarding path. Labeled- 683 unicast BGP is used to exchange prefix reachability between service 684 edge routers, with domain border routes imposing next-hop-self on 685 routes advertised between domains. This provides a default inter- 686 domain forwarding path and provides the required connectivity to 687 establish inter-domain BGP sessions between service edges for the 688 exchange of EVPN and/or VPN-IPv4/IPv6 NLRI. 
If route-reflection is 689 used for the EVPN and/or VPN-IPv4/IPv6 address families within one or 690 more domains, it may be desirable to create inter-domain BGP sessions 691 between route-reflectors. In this case the peering addresses of the 692 route-reflectors should also be exchanged between domains using 693 labeled unicast BGP. This creates a connectivity model analogous to 694 BGP/MPLS IP-VPN Inter-AS option C [RFC4364]. 696 +----------------+ +----------------+ +----------------+ 697 | +----+ | | +----+ | | +----+ | 698 +----+ | RR | +----+ | RR | +----+ | RR | +----+ 699 | NF | +----+ | DCI| +----+ | DCI| +----+ | NF | 700 +----+ +----+ +----+ +----+ 701 | Domain | | Domain | | Domain | 702 +----------------+ +----------------+ +----------------+ 703 <-------> <-----> NHS <-- BGP-LU ---> NHS <-----> <------> 704 <-------> <--------- EVPN/VPN-IPv4/v6 ----------> <------> 706 Inter-Domain Service Layer 708 Figure 2 710 EVPN and/or VPN-IPv4/v6 routes received from a peer in a different 711 domain will contain a next-hop equivalent to the router that sourced 712 the route. The next-hop of these routes can be resolved to a labeled- 713 unicast route (default forwarding path) or to an SR policy (traffic- 714 engineered forwarding path) as appropriate to the service 715 requirements. The exchange of EVPN and/or VPN-IPv4/IPv6 routes in 716 this manner implies that Route-Distinguisher and Route-Target values 717 remain intact end-to-end. 719 The use of end-to-end EVPN and/or VPN-IPv4/IPv6 address families 720 without the imposition of next-hop-self at border routers complements 721 the gateway-less transport layer architecture. It negates the 722 requirement for midpoint service provisioning and as such provides 723 the following benefits: 725 o Avoids the translation of MAC/IP EVPN routes to IP-VPN routes (and 726 vice versa) that is typically associated with service 727 interworking.
729 o Avoids instantiation of MAC-VRFs and IP-VPNs for each tenant 730 resident in the DCI. 732 o Avoids provisioning of demarcation functions between the data 733 center and wide-area network such as QoS, access-control, 734 aggregation and isolation. 736 5.6. Service Differentiation 738 As discussed in section 5.4.3, the use of TE paths is a key 739 capability of the NFIX solution framework described in this document. 740 The Interconnect controller computes end-to-end TE paths between NFs 741 and programs DC nodes, DCIs, ABR/ASBRs, via SR Policy, with the 742 necessary label forwarding entries for each [headend, color, 743 endpoint]. The collection of [headend, endpoint] pairs for the same 744 color constitutes a logical network topology, where each topology 745 satisfies a given SLA requirement. 747 The Interconnect controller discovers the endpoints associated with a 748 given topology (color) upon the reception of EVPN or IPVPN routes 749 advertised by the endpoint. The EVPN and IPVPN NLRIs are advertised 750 by the endpoint nodes along with a color extended community which 751 identifies the topology to which the owner of the NLRI belongs. At a 752 coarse level all the EVPN/IPVPN routes of the same VPN can be 753 advertised with the same color, and therefore a TE topology would be 754 established on a per-VPN basis. At a finer level, IPVPN and 755 especially EVPN provide a more granular way of coloring routes that 756 allows the Interconnect controller to associate multiple 757 topologies with the same VPN. For example: 759 o All the EVPN MAC/IP routes for a given VNF may be advertised with 760 the same color. This would allow the Interconnect controller to 761 associate topologies per VNF within the same VPN; that is, VNF1 762 could be blue (e.g., low-latency topology) and VNF2 could be green 763 (e.g., high-throughput).
765 o The EVPN MAC/IP routes and Inclusive Multicast Ethernet Tag (IMET) 766 route for VNF1 may be advertised with different colors, e.g., red 767 and brown, respectively. This would allow the association of 768 e.g., a low-latency topology for unicast traffic to VNF1 and a best- 769 effort topology for BUM traffic to VNF1. 771 o Each EVPN MAC/IP route or IP-Prefix route from a given VNF may be 772 advertised with a different color. This would allow the association 773 of topologies at the host level or host route granularity. 775 5.7. Automated Service Activation 777 The automation of network and service connectivity for instantiation 778 and mobility of virtual machines is a highly desirable attribute 779 within data centers. Since this concerns service connectivity, it 780 should be clear that this automation is relevant to virtual functions 781 that belong to a service as opposed to a virtual network function 782 that delivers services, such as a virtual PE router. 784 Within an SDN-enabled data center, a typical hierarchy from top to 785 bottom would include a policy engine (or policy repository), one or 786 more DC controllers, numerous hypervisors/container hosts that 787 function as NVO endpoints, and finally the virtual 788 machines (VMs)/containers, which we'll refer to generically as 789 virtualization hosts. 791 The mechanisms used to communicate between the policy engine and DC 792 controller, and between the DC controller and hypervisor/container 793 are not relevant here and as such they are not discussed further. 794 What is important is the interface and information exchange between 795 the Interconnect controller and the data center SDN functions: 797 o The Interconnect controller interfaces with the data center policy 798 engine and publishes the available colors, where each color 799 represents a topological service connectivity map that meets a set 800 of constraints and SLA objectives. This interface is a 801 straightforward API.
803 o The Interconnect controller interfaces with the DC controller to 804 learn overlay routes. This interface is BGP and uses the EVPN 805 Address Family. 807 With the above framework in place, automation of network and service 808 connectivity can be implemented as follows: 810 o The virtualization host is turned up. The NVO endpoint notifies 811 the DC controller of the startup. 813 o The DC controller retrieves service information, IP addressing 814 information, and service 'color' for the virtualization host from 815 the policy engine. The DC controller subsequently programs the 816 associated forwarding information on the virtualization host. 817 Since the DC controller is now aware of MAC and IP address 818 information for the virtualization host, it advertises that 819 information as an EVPN MAC Advertisement Route into the overlay. 821 o The Interconnect controller receives the EVPN MAC Advertisement 822 Route (potentially via a Route-Reflector) and correlates it with 823 locally held service information and SLA requirements using Route 824 Target and Color communities. If the relevant SR policies are not 825 already in place to support the service requirements and logical 826 connectivity, including any binding-SIDs, they are calculated and 827 advertised to the relevant headends. 829 The same automated service activation principles can also be used to 830 support the scenario where virtualization hosts are moved between 831 hypervisors/container hosts for resourcing or other reasons. We 832 refer to this simply as mobility. If a virtualization host is turned 833 down, the parent NVO endpoint notifies the DC controller, which in 834 turn notifies the policy engine and withdraws any EVPN MAC 835 Advertisement Routes. Thereafter all associated state is removed. 836 When the virtualization host is turned up on a different hypervisor/ 837 container host, the automated service connectivity process outlined 838 above is simply repeated. 840 5.8.
Service Function Chaining 842 Service Function Chaining (SFC) defines an ordered set of abstract 843 service functions and the subsequent steering of traffic through 844 them. Packets are classified at ingress for processing by the 845 required set of service functions (SFs) in an SFC-capable domain and 846 are then forwarded through each SF in turn for processing. The 847 ability to dynamically construct SFCs containing the relevant SFs in 848 the right sequence is a key requirement for operators. 850 To enable flexible service function deployment models that support 851 agile service insertion, the NFIX architecture adopts the use of BGP 852 as the control plane to distribute SFC information. The BGP control 853 plane for Network Service Header (NSH) SFC 854 [I-D.ietf-bess-nsh-bgp-control-plane] is used for this purpose and 855 defines two route types: the Service Function Instance Route (SFIR) 856 and the Service Function Path Route (SFPR). 858 The SFIR is used to advertise the presence of a service function 859 instance (SFI) as a function type (e.g., firewall, TCP optimizer) and 860 is advertised by the node hosting that SFI. The SFIR is advertised 861 together with a BGP Tunnel Encapsulation attribute containing details 862 of how to reach that particular service function through the underlay 863 network (i.e., IP address and encapsulation information). 865 The SFPRs contain service function path (SFP) information and one 866 SFPR is originated for each SFP. Each SFPR contains the service path 867 identifier (SPI) of the path, the sequence of service function types 868 that make up the path (each of which has at least one instance 869 advertised in an SFIR), and the service index (SI) for each listed 870 service function to identify its position in the path.
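The SFIR/SFPR state described above can be sketched with an illustrative Python model (not normative); all function types, SPI/SI values, and SFF addresses below are hypothetical assumptions for this sketch:

```python
# Illustrative model of SFIR/SFPR state and the SPI/SI lookup an SFF
# performs. Names, SPI/SI values, and SFF addresses are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class NSH:
    spi: int  # service path identifier, selects an SFPR
    si: int   # service index, position within the path

# SFPR state: one entry per SFP, mapping each SI to a service function type.
sfprs = {100: {255: "firewall", 254: "tcp-optimizer"}}

# SFIR state: service function type -> tunnel endpoint of the hosting node.
sfirs = {"firewall": "sff1.example", "tcp-optimizer": "sff2.example"}

def classify(spi: int) -> NSH:
    """Classifier behavior: impose an NSH with the path's first (highest) SI."""
    return NSH(spi=spi, si=max(sfprs[spi]))

def next_hop(nsh: NSH) -> Optional[str]:
    """SFF lookup: resolve SPI/SI to the tunnel towards the next SF
    instance, or None once the last hop of the SFP has been passed."""
    sf_type = sfprs.get(nsh.spi, {}).get(nsh.si)
    return sfirs[sf_type] if sf_type else None

# Walk a packet through the chain; the SI is decremented after each SF.
nsh = classify(100)
hops = []
while (hop := next_hop(nsh)) is not None:
    hops.append(hop)
    nsh.si -= 1

print(hops)  # -> ['sff1.example', 'sff2.example']
```

The decrementing SI is what carries the packet's position along the path; when no SFPR entry matches the current SI, the end of the SFP has been reached.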
872 Once a Classifier has determined which flows should be mapped to a 873 given SFP, it imposes an NSH [RFC8300] on those packets, setting the 874 SPI to that of the selected service path (advertised in an SFPR), and 875 the SI to the first hop in the path. As NSH is encapsulation 876 agnostic, the NSH encapsulated packet is then forwarded through the 877 appropriate tunnel to reach the service function forwarder (SFF) 878 supporting that service function instance (advertised in an SFIR). 879 The SFF removes the tunnel encapsulation and forwards the packet with 880 the NSH to the relevant SF based upon a lookup of the SPI/SI. When 881 it is returned from the SF with a decremented SI value, the SFF 882 forwards the packet to the next hop in the SFP using the tunnel 883 information advertised by that SFI. This procedure is repeated until 884 the last hop of the SFP is reached. 886 The use of the NSH in this manner allows for service chaining with 887 topological and transport independence. It also allows for the 888 deployment of SFIs in a condensed or dispersed fashion depending on 889 operator preference or resource availability. Service function 890 chains are built in their own overlay network and share a common 891 underlay network, where that common underlay network is the NFIX 892 fabric described in section 5.4. BGP updates containing an SFIR or 893 SFPR are advertised in conjunction with one or more Route Targets 894 (RTs), and each node in a service function overlay network is 895 configured with one or more import RTs. As a result, nodes will only 896 import routes that are applicable and that local policy dictates. 897 This provides the ability to support multiple service function 898 overlay networks or the construction of service function chains 899 within L3VPN or EVPN services. 901 Although SFCs are constructed in a unidirectional manner, the BGP 902 control plane for NSH SFC allows for the optional association of 903 multiple paths (SFPRs). 
This provides the ability to construct a 904 bidirectional service function chain in the presence of multiple 905 equal-cost paths between source and destination to avoid problems 906 that SFs may suffer with traffic asymmetry. 908 The proposed SFC model can be considered decoupled in that the use of 909 SR as a transport between SFFs is completely independent of the use 910 of NSH to define the SFC. That is, it uses an NSH-based SFC and SR 911 is just one of many encapsulations that could be used between SFFs. 912 A similar, more integrated approach proposes encoding a service 913 function as a segment so that an SFC can be constructed as a segment- 914 list. In this case it can be considered an SR-based SFC with an NSH- 915 based service plane since the SF is unaware of the presence of the 916 SR. Functionally both approaches are very similar and as such both 917 could be adopted and could work in parallel. Construction of SFCs 918 based purely on SR (SF is SR-aware) is not considered at this time. 920 5.9. Stability and Availability 922 Any network architecture should have the capability to self-restore 923 following the failure of a network element. The time to reconverge 924 following the failure needs to be minimal to avoid evident 925 disruptions in service. This section discusses protection mechanisms 926 that are available for use and their applicability to the proposed 927 architecture. 929 5.9.1. IGP Reconvergence 931 Within the construct of an IGP topology the Topology Independent Loop 932 Free Alternate (TI-LFA) [I-D.ietf-rtgwg-segment-routing-ti-lfa] can 933 be used to provide a local repair mechanism that offers both link and 934 node protection. 936 TI-LFA is a repair mechanism, and as such it is reactive and 937 initially needs to detect a given failure. To provide fast failure 938 detection Bidirectional Forwarding Detection (BFD) is used.
939 Consideration needs to be given to the restoration capabilities of 940 the underlying transmission when deciding values for message 941 intervals and multipliers to avoid race conditions, but failure 942 detection in the order of 50 milliseconds can reasonably be 943 anticipated. Where Link Aggregation Groups (LAG) are used, micro-BFD 944 [RFC7130] can be used to similar effect. Indeed, to allow for 945 potential incremental growth in capacity it is not uncommon for 946 operators to provision all network links as LAG and use micro-BFD 947 from the outset. 949 5.9.2. Data Center Reconvergence 951 Clos fabrics are extremely common within data centers, and 952 fundamental to a Clos fabric is the ability to load-balance using 953 Equal Cost Multipath (ECMP). The number of ECMP paths will vary 954 dependent on the number of devices in the parent tier but will never 955 be less than two for redundancy purposes with traffic hashed over the 956 available paths. In this scenario the availability of a backup path 957 in the event of failure is implicit. Commonly within the DC, rather 958 than computing protect paths (like LFA), techniques such as 'fast 959 rehash' are often utilized. In this particular case, the failed 960 next-hop is removed from the multi-path forwarding data structure and 961 traffic is then rehashed over the remaining active paths. 963 In BGP-only data centers this relies on the implementation of BGP 964 multipath. As network elements in the lower tier of a Clos fabric 965 will frequently belong to different ASNs, this includes the ability 966 to load-balance to a prefix with different AS_PATH attribute values 967 while having the same AS_PATH length; sometimes referred to as 968 'multipath relax' or 'multipath multiple-AS' [RFC7938]. 970 Failure detection relies upon declaring a BGP session down and 971 removing any prefixes learnt over that session as soon as the link is 972 declared down. 
As links between network elements predominantly use 973 direct point-to-point fiber, a link failure should be detected within 974 milliseconds. BFD is also commonly used to detect IP layer failures. 976 5.9.3. Exchange of Inter-Domain Routes 978 Labeled unicast BGP together with SR Prefix-SID extensions are used 979 to exchange PNF and/or VNF endpoints between domains to create end- 980 to-end connectivity without TE. When advertising between domains we 981 assume that a given BGP prefix is advertised by at least two border 982 routers (DCIs, ABRs, ASBRs) making prefixes reachable via at least 983 two next-hops. 985 BGP Prefix Independent Convergence (PIC) [I-D.ietf-rtgwg-bgp-pic] 986 allows failover to a pre-computed and pre-installed secondary next- 987 hop when the primary next-hop fails and is independent of the number 988 of destination prefixes that are affected by the failure. It should 989 be clear that when the primary BGP next-hop fails, BGP PIC depends 990 on the availability of a secondary next-hop in the Pathlist. To 991 ensure that multiple paths to the same destination are visible the 992 BGP ADD-PATH capability [RFC7911] can be used to allow for advertisement of 993 multiple paths for the same address prefix. Dual-homed EVPN/IP-VPN 994 prefixes also have the alternative option of allocating different 995 Route-Distinguishers (RDs). To trigger the switch from primary to 996 secondary next-hop PIC needs to detect the failure and many 997 implementations support 'next-hop tracking' for this purpose. Next- 998 hop tracking monitors the routing-table and if the next-hop prefix is 999 removed will immediately invalidate all BGP prefixes learnt through 1000 that next-hop. In the absence of next-hop tracking, multihop BFD 1001 [RFC5883] could optionally be used as a fast failure detection 1002 mechanism. 1004 5.9.4.
Controller Redundancy 1006 With the Interconnect controller providing an integral part of the 1007 network's capabilities, a redundant controller design is clearly 1008 prudent. To this end we can consider both availability and 1009 redundancy. Availability refers to the survivability of a single 1010 controller system in a failure scenario. A common strategy for 1011 increasing the availability of a single controller system is to build 1012 the system in a high-availability cluster such that it becomes a 1013 confederation of redundant constituent parts as opposed to a single 1014 monolithic system. Should a single part fail, the system can still 1015 survive without the requirement to fail over to a standby controller 1016 system. Methods for detection of a failure of one or more member 1017 parts of the cluster are implementation specific. 1019 To provide contingency for a complete system failure a geo-redundant 1020 standby controller system is required. When redundant controllers 1021 are deployed a coherent strategy is needed that provides a master/ 1022 standby election mechanism, the ability to propagate the outcome of 1023 that election to network elements as required, an inter-system 1024 failure detection mechanism, and the ability to synchronize state 1025 across both systems such that the standby controller is fully aware 1026 of current state should it need to transition to master controller. 1028 Master/standby election, state synchronization, and failure detection 1029 between geo-redundant sites can largely be considered a local 1030 implementation matter. The requirement to propagate the outcome of 1031 the master/standby election to network elements depends on a) the 1032 mechanism that is used to instantiate SR policies, and b) whether the 1033 SR policies are controller-initiated or headend-initiated, and these 1034 are discussed in the following sub-sections.
In either scenario, 1035 state of SR policies should be advertised northbound to both master/ 1036 standby controllers using either PCEP LSP State Report messages or SR 1037 policy extensions to BGP link-state 1038 [I-D.ietf-idr-te-lsp-distribution]. 1040 5.9.4.1. SR Policy Initiator 1042 Controller-initiated SR policies are suited for auto-creation of 1043 tunnels based on service route discovery and policy-driven route/flow 1044 programming and are ephemeral. Headend-initiated tunnels allow for 1045 permanent configuration state to be held on the headend and are 1046 suitable for static services that are not subject to dynamic changes. 1047 If all SR policies are controller-initiated, it negates the 1048 requirement to propagate the outcome of the master/standby election 1049 to network elements. This is because headends have no requirement 1050 for unsolicited requests to a controller, and therefore have no 1051 requirement to know which controller is master and which one is 1052 standby. A headend may respond to a message from a controller, but 1053 it is not unsolicited. 1055 If some or all SR policies are headend-initiated, then the 1056 requirement to propagate the outcome of the master/standby election 1057 exists. This is further discussed in the following sub-section. 1059 5.9.4.2. SR Policy Instantiation Mechanism 1061 While candidate paths of SR policies may be provided using BGP, PCEP, 1062 Netconf, or local policy/configuration, this document primarily 1063 considers the use of PCEP or BGP. 1065 When PCEP [RFC5440][RFC8231][RFC8281] is used for instantiation of 1066 candidate paths of SR policies 1067 [I-D.barth-pce-segment-routing-policy-cp] every headend/PCC should 1068 establish a PCEP session with the master and standby controllers. To 1069 signal standby state to the PCC the standby controller may use a PCEP 1070 Notification message to set the PCEP session into overload state. 
1071 While in this overload state the standby controller will accept path 1072 computation LSP state report (PCRpt) messages without delegation but 1073 will reject path computation requests (PCReq) and any path 1074 computation reports (PCRpt) with the delegation bit set. Further, 1075 the standby controller will not originate path computation initiate 1076 messages (PCInit) or path computation update request messages 1077 (PCUpd). In the event of the failure of the master controller, the 1078 standby controller will transition to active and remove the PCEP 1079 overload state. Following expiration of the PCEP redelegation 1080 timeout at the PCC any LSPs will be redelegated to the newly 1081 transitioned active controller. LSP state is not impacted unless 1082 redelegation is not possible before the state timeout interval 1083 expires. 1085 When BGP is used for instantiation of SR policies every headend 1086 should establish a BGP session with the master and standby controller 1087 capable of exchanging SR TE Policy SAFI. Candidate paths of SR 1088 policies are advertised only by the active controller. If the master 1089 controller should experience a failure, then SR policies learnt from 1090 that controller may be removed before they are re-advertised by the 1091 standby (or newly-active) controller. To avoid this possibility two 1092 options exist: 1094 o Provide a static backup SR policy. 1096 o Fallback to the default forwarding path. 1098 5.9.5. Path and Segment Liveliness 1100 When using traffic-engineered SR paths only the ingress router holds 1101 any state. The exception here is where BSIDs are used, which also 1102 implies some state is maintained at the BSID anchor. As there is no 1103 control plane set-up, it follows that there is no feedback loop from 1104 transit nodes of the path to notify the headend when a non-adjacent 1105 point of the SR path fails.
The Interconnect controller however is 1106 aware of all paths that are impacted by a given network failure and 1107 should take the appropriate action. This action could include 1108 withdrawing an SR policy if a suitable candidate path is already in 1109 place, or simply sending a new SR policy with a different segment- 1110 list and a higher preference value assigned to it. 1112 Verification of data plane liveliness is the responsibility of the 1113 path headend. A given SR policy may be associated with multiple 1114 candidate paths and for the sake of clarity, we'll assume two for 1115 redundancy purposes (which can be diversely routed). Verification of 1116 the liveliness of these paths can be achieved using seamless BFD 1117 (S-BFD)[RFC7880], which provides an in-band failure detection 1118 mechanism capable of detecting failure in the order of milliseconds. 1119 Upon failure of the active path, failover to a secondary candidate 1120 path can be activated at the path headend. Details of the actual 1121 failover and revert mechanisms are a local implementation matter. 1123 S-BFD provides a fast and scalable failure detection mechanism but is 1124 unlikely to be implemented in many VNFs given their inability to 1125 offload the process to purpose-built hardware. In the absence of an 1126 active failure detection mechanism such as S-BFD the failover from 1127 active path to secondary candidate path can be triggered using 1128 continuous path validity checks. One of the criteria that a 1129 candidate path uses to determine its validity is the ability to 1130 perform path resolution for the first SID to one or more outgoing 1131 interface(s) and next-hop(s). From the perspective of the VNF 1132 headend the first SID in the segment-list will very likely be the DCI 1133 (as BSID anchor) but could equally be another Prefix-SID hop within 1134 the data center. 
Should this segment experience a non-recoverable 1135 failure, the headend will be unable to resolve the first SID and the 1136 path will be considered invalid. This will trigger a failover action 1137 to a secondary candidate path. 1139 Injection of S-BFD packets is not just constrained to the source of 1140 an end-to-end LSP. When an S-BFD packet is injected into an SR 1141 policy path it is encapsulated with the label stack of the associated 1142 segment-list. It is possible therefore to run S-BFD from a BSID 1143 anchor for just that section of the end-to-end path (for example, 1144 from DCI to DCI). This allows a BSID anchor to detect failure of a 1145 path and take corrective action, while maintaining opacity between 1146 domains. 1148 5.10. Scalability 1150 There are many aspects to consider regarding scalability of the NFIX 1151 architecture. The building blocks of NFIX are standards-based 1152 technologies individually designed to scale for internet provider 1153 networks. When combined they provide a flexible and scalable 1154 solution: 1156 o BGP has been proven to scale and operate with millions of routes 1157 being exchanged. Specifically, BGP labeled unicast has been 1158 deployed and proven to scale in existing seamless-MPLS networks. 1160 o By placing forwarding instructions in the header of a packet, 1161 segment routing reduces the amount of state required in the 1162 network, allowing a greater number of transport tunnels to be 1163 supported. This makes it feasible for the NFIX architecture to 1164 permit automated SR policy creation without having an 1165 impact on state in the core of the network. 1167 o The choice of utilizing native SR-MPLS or SR over IP in the data 1168 center continues to permit horizontal scaling without introducing 1169 new state inside of the data center fabric while still permitting 1170 seamless end-to-end path forwarding integration.
1172 o BSIDs play a key role in the NFIX architecture as their use 1173 provides the ability to traffic-engineer across large network 1174 topologies consisting of many hops regardless of hardware 1175 capability at the headend. From a scalability perspective the use 1176 of BSIDs improves scale because detailed information about the SR 1177 paths in a domain is abstracted and 1178 localized only to the BSID anchor point. When BSIDs are re-used 1179 amongst one or many headends they reduce the amount of path 1180 calculation and updates required at network edges while still 1181 providing seamless end-to-end path forwarding. 1183 o The architecture of NFIX continues to use an independent DC 1184 controller. This allows continued independent scaling of data 1185 center management in both policy and local forwarding functions, 1186 while off-loading the end-to-end optimal path placement and 1187 automation to the Interconnect controller. Optimal path 1188 placement is already a scalable function provided in a PCE 1189 architecture. The Interconnect controller must compute paths, but 1190 it is not burdened by the management of virtual entity lifecycle 1191 and associated forwarding policies. 1193 It must be acknowledged that with the amalgamation of the technology 1194 building blocks and the automation required by NFIX, there is an 1195 additional burden on the Interconnect controller. The scaling 1196 considerations are dependent on many variables, but an implementation 1197 of an Interconnect controller shares many traits and 1198 scaling concerns with a PCE, where the controller and PCE both must: 1200 o Discover and listen to topological state changes of the IP/MPLS 1201 topology. 1203 o Compute traffic-engineered intra- and inter-domain paths across 1204 large service provider topologies. 1206 o Synchronize, track, and update thousands of LSPs to network devices 1207 upon network state changes.
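The scaling benefit of BSID abstraction and re-use described above can be sketched in a few lines of code. The following is an illustrative model only, not part of the architecture; the Headend and BsidAnchor classes, node names, and BSID values are all hypothetical, loosely echoing the node names used later in the reference topology.

```python
# Illustrative sketch only: a toy model of how a Binding SID (BSID)
# abstracts path detail away from headends. All names are hypothetical.

from dataclasses import dataclass, field


@dataclass
class BsidAnchor:
    """Border node that expands a BSID into a locally held segment-list."""
    name: str
    expansions: dict = field(default_factory=dict)  # BSID -> segment-list

    def bind(self, bsid: int, segment_list: list) -> None:
        self.expansions[bsid] = segment_list


@dataclass
class Headend:
    """Path headend holding one SR policy per (color, endpoint) pair."""
    name: str
    policies: dict = field(default_factory=dict)  # (color, endpoint) -> segment-list

    def install(self, color: str, endpoint: str, segment_list: list) -> None:
        self.policies[(color, endpoint)] = segment_list

    def imposed_labels(self, color: str, endpoint: str) -> int:
        """Number of segments the headend must impose for this policy."""
        return len(self.policies[(color, endpoint)])


# Without a BSID, the headend imposes the full strict segment-list.
flat = Headend("R1-flat")
flat.install("blue", "AGN2", ["R5", "R6", "R7", "R8", "AGN3", "AGN2"])

# With a BSID, everything beyond the anchor (R8) is abstracted: the
# segment-list at the headend is shorter, and the same BSID can be
# re-used by other policies and other headends crossing R8.
anchor = BsidAnchor("R8")
anchor.bind(1000, ["AGN3", "AGN2"])

r1 = Headend("R1")
r1.install("blue", "AGN2", ["R5", "R6", "R7", "R8", 1000])

# A change in the far-side domain is repaired by updating only the
# anchor's expansion; no headend referencing BSID 1000 is touched.
anchor.bind(1000, ["AGN3", "AGN1", "AGN2"])
```

The final rebind illustrates the point made above: re-optimization inside a domain is localized to the BSID anchor, while the path calculation and update load at the network edges is unchanged.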
1209 Both entail topologies that contain tens of thousands of nodes and 1210 links. The Interconnect controller in an NFIX architecture takes on 1211 the additional role of becoming end-to-end service aware and 1212 discovering data center entities that were traditionally excluded 1213 from a controller's scope. Although not exhaustive, an NFIX 1214 Interconnect controller is impacted by the following: 1216 o The number of individual services, the number of endpoints that 1217 may exist in each service, the distribution of endpoints in a 1218 virtualized environment, and how many data centers may exist. 1219 Medium or large-sized data centers may be capable of hosting more 1220 virtual endpoints per host, but with the move to smaller edge- 1221 clouds the number of headends that require inter-connectivity 1222 increases compared to the density of localized routing in a 1223 centralized data center model. The outcome has an impact on the 1224 number of headend devices which may require tunnel management by 1225 the Interconnect controller. 1227 o Assuming a given BSID satisfies the SLA, the ability to re-use 1228 BSIDs across multiple services reduces the number of paths to track 1229 and manage. However, the number of colors or unique SLA definitions, 1230 and criteria such as bandwidth constraints, impact WAN traffic 1231 distribution requirements. As BSIDs play a key role for VNF 1232 connectivity, this potentially increases the number of BSID paths 1233 required to permit appropriate traffic distribution. This also 1234 impacts the number of tunnels which may be re-used on a given 1235 headend for different services. 1237 o The frequency of virtualized hosts being created and destroyed and 1238 the general activity within a given service.
The controller must 1239 analyze, track, and correlate the activity of relevant BGP routes 1240 to track the addition and removal of service hosts or host subnets, and 1241 determine whether new SR policies should be instantiated, or stale 1242 unused SR policies should be removed from the network. 1244 o The choice of SR instantiation mechanism impacts the number of 1245 communication sessions the controller may require. For example, 1246 the BGP-based mechanism may only require a small number of 1247 sessions to route reflectors, whereas PCEP may require a 1248 connection to every possible leaf in the network and any BSID 1249 anchors. 1251 o The number of hops within one or many WAN domains may affect the 1252 number of BSIDs required to provide transit for VNF/PNF, PNF/PNF, 1253 or VNF/VNF inter-connectivity. 1255 o Relative to WAN topologies, data centers 1256 are generally topologically denser in node and link connectivity, 1257 all of which must be discovered by the Interconnect controller, 1258 resulting in a much larger and denser link-state database on the 1259 Interconnect controller. 1261 5.10.1. Asymmetric Model B for VPN Families 1263 With the instantiation of multiple TE paths between any two VNFs in 1264 the NFIX network, the number of SR Policy (remote endpoint, color) 1265 routes, BSIDs, and labels to support on VNFs becomes a choke point in 1266 the architecture. The fact that some VNFs are limited in terms of 1267 forwarding resources makes this an important scale issue. 1269 As an example, if VNF1 and VNF2 in Figure 1 are associated with 1270 multiple topologies 1..n, the Interconnect controller will 1271 instantiate n TE paths in VNF1 to reach VNF2: 1273 [VNF1,color-1,VNF2] --> BSID 1 1275 [VNF1,color-2,VNF2] --> BSID 2 1277 ...
1279 [VNF1,color-n,VNF2] --> BSID n 1281 Similarly, m TE paths may be instantiated on VNF1 to reach VNF3, 1282 another p TE paths to reach VNF4, and so on for all the VNFs that 1283 VNF1 needs to communicate with in DC2. As can be observed, the 1284 number of forwarding resources to be instantiated on VNF1 may 1285 grow significantly with the number of remote [endpoint, color] pairs, 1286 compared with a best-effort architecture in which the number of 1287 forwarding resources in VNF1 grows with the number of endpoints only. 1289 This scale issue on the VNFs can be relieved by the use of an 1290 asymmetric model B service layer. The concept is illustrated in 1291 Figure 3. 1293 +------------+ 1294 <-------------------------------------| WAN | 1295 | SR Policy +-------------------| Controller | 1296 | BSID m | SR Policy +------------+ 1297 v {DCI1,n,DCI2} v BSID n 1298 {1,2,3,4,5,DCI2} 1299 +----------------+ +----------------+ +----------------+ 1300 | +----+ | | | | +----+ | 1301 +----+ | RR | +----+ +----+ | RR | +----+ 1302 |VNF1| +----+ |DCI1| |DCI2| +----+ |VNF2| 1303 +----+ +----+ +----+ +----+ 1304 | DC1 | | WAN | | DC2 | 1305 +----------------+ +----------------+ +----------------+ 1307 <-------- <-------------------------- NHS <------ <------ 1308 EVPN/VPN-IPv4/v6(colored) 1310 +-----------------------------------> +-------------> 1311 TE path to DCI2 ECMP path to VNF2 1312 (BSID to segment-list 1313 expansion on DCI1) 1315 Asymmetric Model B Service Layer 1317 Figure 3 1319 Consider that the n different topologies needed between VNF1 and VNF2 1320 are really only relevant to the different TE paths that exist in the WAN. 1321 The WAN is the domain in the network where there can be significant 1322 differences in latency, throughput or packet loss depending on the 1323 sequence of nodes and links the traffic goes through. Based on that 1324 assumption, while a TE path is required from VNF1 to DCI2 in Figure 3, 1325 traffic from DCI2 to VNF2 can simply take an ECMP path.
In this case an 1326 asymmetric model B service layer can significantly relieve the scale 1327 pressure on VNF1. 1329 From a service layer perspective, the NFIX architecture described up 1330 to now can be considered 'symmetric', meaning that the EVPN/IPVPN 1331 advertisements from e.g., VNF2 in Figure 1 are received on VNF1 with 1332 the next-hop of VNF2, and vice versa for VNF1's routes on VNF2. SR 1333 policies to each VNF2 [endpoint, color] pair are then required on 1334 VNF1. 1336 In the 'asymmetric' service design illustrated in Figure 3, VNF2's 1337 EVPN/IPVPN routes are received on VNF1 with the next-hop of DCI2, and 1338 VNF1's routes are received on VNF2 with the next-hop of DCI1. Now SR 1339 policies instantiated on VNFs can be reduced to only the number of TE 1340 paths required to reach the remote DCI. For example, considering n 1341 topologies, in a symmetric model VNF1 has to be instantiated with n 1342 SR policy paths per remote VNF in DC2, whereas in the asymmetric 1343 model of Figure 3, VNF1 only requires n SR policy paths per DC, i.e., 1344 to DCI2. 1346 Asymmetric model B is a simple design choice that only requires the 1347 ability (on the DCI nodes) to set next-hop-self on the EVPN/IPVPN 1348 routes advertised to the WAN neighbors and to not set next-hop-self 1349 for routes advertised to the DC neighbors. With this option, the 1350 Interconnect controller only needs to establish TE paths from VNFs to 1351 remote DCIs, as opposed to from VNFs to remote VNFs. 1353 6. Illustration of Use 1355 For the purpose of illustration, this section provides some examples 1356 of how different end-to-end tunnels are instantiated (including the 1357 relevant protocols, SID values/label stacks, etc.) and how services 1358 are then overlaid onto those LSPs. 1360 6.1. Reference Topology 1362 The following network diagram illustrates the reference network 1363 topology that is used for illustration purposes in this section.
1364 Within the data centers leaf and spine network elements may be 1365 present but are not shown for the purpose of clarity. 1367 +----------+ 1368 |Controller| 1369 +----------+ 1370 / | \ 1371 +----+ +----+ +----+ +----+ 1372 ~ ~ ~ ~ | R1 |----------| R2 |----------| R3 |-----|AGN1| ~ ~ ~ ~ 1373 ~ +----+ +----+ +----+ +----+ ~ 1374 ~ DC1 | / | | DC2 ~ 1375 +----+ | L=5 +----+ L=5 / | +----+ +----+ 1376 | Sn | | +-------| R4 |--------+ | |AGN2| | Dn | 1377 +----+ | / M=20 +----+ M=20 | +----+ +----+ 1378 ~ | / | | ~ 1379 ~ +----+ +----+ +----+ +----+ +----+ ~ 1380 ~ ~ ~ ~ | R5 |-----| R6 |----| R7 |-----| R8 |-----|AGN3| ~ ~ ~ ~ 1381 +----+ +----+ +----+ +----+ +----+ 1383 Reference Topology 1385 Figure 4 1387 The following applies to the reference topology in Figure 4: 1389 o Data center 1 and data center 2 both run BGP/SR. Both data 1390 centers run leaf/spine topologies, which are not shown for the 1391 purpose of clarity. 1393 o R1 and R5 function as data center interconnects for DC 1. AGN1 1394 and AGN3 function as data center interconnects for DC 2. 1396 o Routers R1 through R8 form an independent ISIS-OSPF/SR instance. 1398 o Routers R3, R8, AGN1, AGN2, and AGN3 form an independent ISIS- 1399 OSPF/SR instance. 1401 o All IGP link metrics within the wide area network are metric 10 1402 except for links R5-R4 and R4-R3, which are both metric 20. 1404 o All links have a unidirectional latency of 10 milliseconds except 1405 for links R5-R4 and R4-R3, which both have a unidirectional latency 1406 of 5 milliseconds. 1408 o Source 'Sn' and destination 'Dn' represent one or more network 1409 functions. 1411 6.2. PNF to PNF Connectivity 1413 The first example demonstrates the simplest form of connectivity: PNF 1414 to PNF. The example illustrates the instantiation of a 1415 unidirectional TE path from R1 to AGN2 and its consumption by an EVPN 1416 service. The service has a requirement for high throughput with no 1417 strict latency requirements.
These service requirements are 1418 cataloged and represented using the color blue. 1420 o An EVPN service is provisioned at R1 and AGN2. 1422 o The Interconnect controller computes the path from R1 to AGN2 and 1423 calculates that the optimal path based on the service requirements 1424 and overall network optimization is R1-R5-R6-R7-R8-AGN3-AGN2. The 1425 segment-list to represent the calculated path could be constructed 1426 in numerous ways. It could be strict hops represented by a series 1427 of Adj-SIDs. It could be loose hops using ECMP-aware Node-SIDs, 1428 for example {R7, AGN2}, or it could be a combination of both Node- 1429 SIDs and Adj-SIDs. Equally, BSIDs could be used to reduce the 1430 number of labels that need to be imposed at the headend. In this 1431 example, strict Adj-SID hops are used with a BSID at the area 1432 border router R8, but this should not be interpreted as the only 1433 way a path and segment-list can be represented. 1435 o The Interconnect controller advertises a BGP SR Policy to R8 with 1436 BSID 1000, and a segment-list containing segments {AGN3, AGN2}. 1438 o The Interconnect controller advertises a BGP SR Policy to R1 with 1439 BSID 1001, and a segment-list containing segments {R5, R6, R7, R8, 1440 1000}. The policy is identified using the tuple [headend = R1, 1441 color = blue, endpoint = AGN2]. 1443 o AGN2 advertises an EVPN MAC Advertisement Route for MAC M1, which 1444 is learned by R1. The route has a next-hop of AGN2, an MPLS label 1445 of L1, and it carries a color extended community with the value 1446 blue. 1448 o R1 has a valid SR policy [color = blue, endpoint = AGN2] with 1449 segment-list {R5, R6, R7, R8, 1000}. R1 therefore associates the 1450 MAC address M1 with that policy and programs the relevant 1451 information into the forwarding path. 1453 o The Interconnect controller also learns the EVPN MAC Route 1454 advertised by AGN2. The purpose of this is two-fold.
It allows 1455 the controller to correlate the service overlay with the 1456 underlying transport LSPs, thus creating a service connectivity 1457 map. It also allows the controller to dynamically create LSPs 1458 based upon service requirements if they do not already exist, or 1459 to optimize them if network conditions change. 1461 6.3. VNF to PNF Connectivity 1463 The next example demonstrates VNF to PNF connectivity and illustrates 1464 the instantiation of a unidirectional TE path from S1 to AGN2. The 1465 path is consumed by an IP-VPN service that has a basic set of service 1466 requirements and as such simply uses IGP metric as a path computation 1467 objective. These basic service requirements are cataloged and 1468 represented using the color red. 1470 In this example S1 is a VNF with full IP routing and MPLS capability 1471 that interfaces to the data center underlay/overlay and serves as the 1472 NVO tunnel endpoint. 1474 o An IP-VPN service is provisioned at S1 and AGN2. 1476 o The Interconnect controller computes the path from S1 to AGN2 and 1477 calculates that the optimal path based on IGP metric is 1478 R1-R2-R3-AGN1-AGN2. 1480 o The Interconnect controller advertises a BGP SR Policy to R1 with 1481 BSID 1002, and a segment-list containing segments {R2, R3, AGN1, 1482 AGN2}. 1484 o The Interconnect controller advertises a BGP SR Policy to S1 with 1485 BSID 1003, and a segment-list containing segments {R1, 1002}. The 1486 policy is identified using the tuple [headend = S1, color = red, 1487 endpoint = AGN2]. 1489 o Source S1 learns a VPN-IPv4 route for prefix P1, next-hop AGN2. 1490 The route has a VPN label of L1, and it carries a color extended 1491 community with value red. 1493 o S1 has a valid SR policy [color = red, endpoint = AGN2] with 1494 segment-list {R1, 1002} and BSID 1003. S1 therefore associates 1495 the VPN-IPv4 prefix P1 with that policy and programs the relevant 1496 information into the forwarding path.
1498 o As in the previous example, the Interconnect controller also learns 1499 the VPN-IPv4 route advertised by AGN2 in order to correlate the 1500 service overlay with the underlying transport LSPs, creating or 1501 optimizing them as required. 1503 6.4. VNF to VNF Connectivity 1505 The last example demonstrates VNF to VNF connectivity and illustrates 1506 the instantiation of a unidirectional TE path from S2 to D2. The 1507 path is consumed by an EVPN service that requires low latency as a 1508 service requirement and as such uses latency as a path computation 1509 objective. This service requirement is cataloged and represented 1510 using the color green. 1512 In this example S2 is a VNF that has no routing capability. It is 1513 hosted by hypervisor H1, which in turn has an interface to a DC 1514 controller through which forwarding instructions are programmed. H1 1515 serves as the NVO tunnel endpoint and overlay next-hop. 1517 D2 is a VNF with partial routing capability that is connected to a 1518 leaf switch L1. L1 connects to the underlay/overlay in data center 2 and 1519 serves as the NVO tunnel endpoint for D2. L1 advertises BGP Prefix- 1520 SID 9001 into the underlay. 1522 o The relevant details of the EVPN service are entered in the data 1523 center policy engines within data centers 1 and 2. 1525 o Source S2 is turned up. Hypervisor H1 notifies its parent DC 1526 controller, which in turn retrieves the service (EVPN) 1527 information, color, IP and MAC information from the policy engine 1528 and subsequently programs the associated forwarding entries onto 1529 S2. The DC controller also dynamically advertises an EVPN MAC 1530 Advertisement Route for S2's IP and MAC into the overlay with 1531 next-hop H1. (This would trigger the return path set-up between 1532 L1 and H1, which is not covered in this example.) 1534 o The DC controller in data center 1 learns an EVPN MAC 1535 Advertisement Route for D2, MAC M, next-hop L1.
The route has an 1536 MPLS label of L2, and it carries a color extended community with 1537 the value green. 1539 o The Interconnect controller computes the path between H1 and L1 1540 and calculates that the optimal path based on latency is 1541 R5-R4-R3-AGN1. 1543 o The Interconnect controller advertises a BGP SR Policy to R5 with 1544 BSID 1004, and a segment-list containing segments {R4, R3, AGN1}. 1546 o The Interconnect controller advertises a BGP SR Policy to the DC 1547 controller in data center 1 with BSID 1005 and a segment-list 1548 containing segments {R5, 1004, 9001}. The policy is identified 1549 using the tuple [headend = H1, color = green, endpoint = L1]. 1551 o The DC controller in data center 1 has a valid SR policy [color = 1552 green, endpoint = L1] with segment-list {R5, 1004, 9001} and BSID 1553 1005. The controller therefore associates the MAC Advertisement 1554 Route with that policy, and programs the associated forwarding 1555 rules into S2. 1557 o As in the previous example the Interconnect controller also learns 1558 the MAC Advertisement Route advertised by D2 in order to correlate 1559 the service overlay with the underlying transport LSPs, creating 1560 or optimizing them as required. 1562 7. Conclusions 1564 The NFIX architecture provides an evolutionary path to a unified 1565 network fabric. It uses the base constructs of seamless-MPLS and 1566 adds end-to-end transport LSPs capable of delivering against SLAs, 1567 seamless data center interconnect, service differentiation, service 1568 function chaining, and a Layer-2/Layer-3 infrastructure capable of 1569 interconnecting PNF-to-PNF, PNF-to-VNF, and VNF-to-VNF. 
1571 NFIX establishes a dynamic, seamless, and automated connectivity 1572 model that overcomes the operational barriers and interworking issues 1573 between data centers and the wide-area network and delivers the 1574 following using standards-based protocols: 1576 o A unified routing control plane: Multiprotocol BGP (MP-BGP) to 1577 acquire inter-domain NLRI from the IP/MPLS transport underlay and 1578 the virtualized IP-VPN/EVPN service overlay. 1580 o A unified forwarding control plane: SR provides dynamic service 1581 tunnels with fast restoration options to meet deterministic 1582 bandwidth, latency, and path diversity constraints. SR utilizes 1583 the appropriate data path encapsulation for seamless, end-to-end 1584 connectivity between distributed edge and core data centers across 1585 the wide-area network. 1587 o Service Function Chaining: Leverages SFC extensions for BGP and 1588 segment routing to interconnect network and service functions into 1589 SFPs, with support for various data path implementations. 1591 o Service Differentiation: Provides a framework that allows for 1592 construction of logical end-to-end networks with differentiated 1593 logical topologies and/or constraints through use of SR policies 1594 and coloring. 1596 o Automation: Facilitates automation of service provisioning and 1597 avoids heavy service interworking at DCIs. 1599 NFIX is deployable on existing data center and wide-area network 1600 infrastructures and allows the underlying data forwarding plane to 1601 evolve with minimal impact on the services plane. 1603 8. Security Considerations 1605 The NFIX architecture based on SR-MPLS is subject to the same 1606 security concerns as any MPLS network. No new protocols are 1607 introduced; hence, security issues of the protocols encompassed by 1608 this architecture are addressed within the relevant individual 1609 standards documents.
It is recommended that the security framework 1610 for MPLS and GMPLS networks defined in [RFC5920] is adhered to. 1611 Although [RFC5920] focuses on the use of the RSVP-TE and LDP control 1612 planes, the practices and procedures are extendable to an SR-MPLS 1613 domain. 1615 The NFIX architecture makes extensive use of Multiprotocol BGP, and 1616 it is recommended that the TCP Authentication Option (TCP-AO) 1617 [RFC5925] is used to protect the integrity of long-lived BGP sessions 1618 and any other TCP-based protocols. 1620 Where PCEP is used between the controller and path headend, the use of 1621 PCEPS [RFC8253] is recommended to provide confidentiality to PCEP 1622 communication using Transport Layer Security (TLS). 1624 9. Acknowledgements 1626 The authors would like to acknowledge Mustapha Aissaoui, Wim 1627 Henderickx, and Gunter Van de Velde. 1629 10. Contributors 1631 The following people contributed to the content of this document: 1632 Juan Rodriguez 1633 Nokia 1634 United States of America 1636 Email: juan.rodriguez@nokia.com 1638 Jorge Rabadan 1639 Nokia 1640 United States of America 1642 Email: jorge.rabadan@nokia.com 1646 11. IANA Considerations 1648 This memo does not include any requests to IANA for allocation. 1650 12. References 1652 12.1. Normative References 1654 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1655 Requirement Levels", BCP 14, RFC 2119, March 1997, 1656 . 1658 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1659 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1660 May 2017, . 1662 12.2. Informative References 1664 [I-D.ietf-nvo3-geneve] 1665 Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic 1666 Network Virtualization Encapsulation", draft-ietf- 1667 nvo3-geneve-14 (work in progress), September 2019. 1669 [I-D.ietf-mpls-seamless-mpls] 1670 Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz, 1671 M., and D.
Steinberg, "Seamless MPLS Architecture", draft- 1672 ietf-mpls-seamless-mpls-07 (work in progress), June 2014. 1674 [I-D.ietf-bess-evpn-ipvpn-interworking] 1675 Rabadan, J. and A. Sajassi, "EVPN Interworking with 1676 IPVPN", draft-ietf-bess-evpn-ipvpn-interworking-02 (work 1677 in progress), November 2019. 1679 [I-D.ietf-spring-segment-routing-policy] 1680 Filsfils, C., Sivabalan, S., Voyer, D., Bogdanov, A., and 1681 P. Mattes, "Segment Routing Policy Architecture", draft- 1682 ietf-spring-segment-routing-policy-06 (work in progress), 1683 December 2019. 1685 [I-D.ietf-rtgwg-segment-routing-ti-lfa] 1686 Litkowski, S., Bashandy, A., Filsfils, C., Decraene, B., 1687 Francois, P., Voyer, D., Clad, F., and P. Camarillo, 1688 "Topology Independent Fast Reroute using Segment Routing", 1689 draft-ietf-rtgwg-segment-routing-ti-lfa-03 (work in 1690 progress), March 2020. 1692 [I-D.ietf-bess-nsh-bgp-control-plane] 1693 Farrel, A., Drake, J., Rosen, E., Uttaro, J., and L. 1694 Jalil, "BGP Control Plane for NSH SFC", draft-ietf-bess- 1695 nsh-bgp-control-plane-13 (work in progress), December 1696 2019. 1698 [I-D.ietf-idr-te-lsp-distribution] 1699 Previdi, S., Talaulikar, K., Dong, J., Chen, M., Gredler, 1700 H., and J. Tantsura, "Distribution of Traffic Engineering 1701 (TE) Policies and State using BGP-LS", draft-ietf-idr-te- 1702 lsp-distribution-12 (work in progress), October 2019. 1704 [I-D.barth-pce-segment-routing-policy-cp] 1705 Koldychev, M., Sivabalan, S., Barth, C., Li, C., and H. 1706 Bidgoli, "PCEP extension to support Segment Routing Policy 1707 Candidate Paths", draft-barth-pce-segment-routing-policy- 1708 cp-04 (work in progress), October 2019. 1710 [I-D.filsfils-spring-sr-policy-considerations] 1711 Filsfils, C., Talaulikar, K., Krol, P., Horneffer, M., and 1712 P. Mattes, "SR Policy Implementation and Deployment 1713 Considerations", draft-filsfils-spring-sr-policy- 1714 considerations-04 (work in progress), October 2019. 
1716 [I-D.ietf-rtgwg-bgp-pic] 1717 Bashandy, A., Filsfils, C., and P. Mohapatra, "BGP Prefix 1718 Independent Convergence", draft-ietf-rtgwg-bgp-pic-11 1719 (work in progress), February 2020. 1721 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 1722 BGP for Routing in Large-Scale Data Centers", RFC 7938, 1723 DOI 10.17487/RFC7938, August 2016, 1724 . 1726 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 1727 S. Ray, "North-Bound Distribution of Link-State and 1728 Traffic Engineering (TE) Information Using BGP", RFC 7752, 1729 DOI 10.17487/RFC7752, March 2016, 1730 . 1732 [RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address 1733 Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017, 1734 . 1736 [RFC8667] Previdi, S., Ed., Ginsberg, L., Ed., Filsfils, C., 1737 Bashandy, A., Gredler, H., and B. Decraene, "IS-IS 1738 Extensions for Segment Routing", RFC 8667, 1739 DOI 10.17487/RFC8667, December 2019, 1740 . 1742 [RFC8665] Psenak, P., Ed., Previdi, S., Ed., Filsfils, C., Gredler, 1743 H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF 1744 Extensions for Segment Routing", RFC 8665, 1745 DOI 10.17487/RFC8665, December 2019, 1746 . 1748 [RFC8669] Previdi, S., Filsfils, C., Lindem, A., Ed., Sreekantiah, 1749 A., and H. Gredler, "Segment Routing Prefix Segment 1750 Identifier Extensions for BGP", RFC 8669, 1751 DOI 10.17487/RFC8669, December 2019, 1752 . 1754 [RFC8663] Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx, 1755 W., and Z. Li, "MPLS Segment Routing over IP", RFC 8663, 1756 DOI 10.17487/RFC8663, December 2019, 1757 . 1759 [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, 1760 "Advertisement of Multiple Paths in BGP", RFC 7911, 1761 DOI 10.17487/RFC7911, July 2016, 1762 . 1764 [RFC7880] Pignataro, C., Ward, D., Akiya, N., Bhatia, M., and S. 1765 Pallagatti, "Seamless Bidirectional Forwarding Detection 1766 (S-BFD)", RFC 7880, DOI 10.17487/RFC7880, July 2016, 1767 . 1769 [RFC4364] Rosen, E. 
and Y. Rekhter, "BGP/MPLS IP Virtual Private 1770 Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 1771 2006, . 1773 [RFC5920] Fang, L., Ed., "Security Framework for MPLS and GMPLS 1774 Networks", RFC 5920, DOI 10.17487/RFC5920, July 2010, 1775 . 1777 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1778 "Specification of the IP Flow Information Export (IPFIX) 1779 Protocol for the Exchange of Flow Information", STD 77, 1780 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1781 . 1783 [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., 1784 and A. Bierman, Ed., "Network Configuration Protocol 1785 (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, 1786 . 1788 [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for 1789 the Network Configuration Protocol (NETCONF)", RFC 6020, 1790 DOI 10.17487/RFC6020, October 2010, 1791 . 1793 [RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP 1794 Monitoring Protocol (BMP)", RFC 7854, 1795 DOI 10.17487/RFC7854, June 2016, 1796 . 1798 [RFC8300] Quinn, P., Ed., Elzur, U., Ed., and C. Pignataro, Ed., 1799 "Network Service Header (NSH)", RFC 8300, 1800 DOI 10.17487/RFC8300, January 2018, 1801 . 1803 [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation 1804 Element (PCE) Communication Protocol (PCEP)", RFC 5440, 1805 DOI 10.17487/RFC5440, March 2009, 1806 . 1808 [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, 1809 L., Sridhar, T., Bursell, M., and C. Wright, "Virtual 1810 eXtensible Local Area Network (VXLAN): A Framework for 1811 Overlaying Virtualized Layer 2 Networks over Layer 3 1812 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, 1813 . 1815 [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network 1816 Virtualization Using Generic Routing Encapsulation", 1817 RFC 7637, DOI 10.17487/RFC7637, September 2015, 1818 . 1820 [RFC3031] Rosen, E., Viswanathan, A., and R. 
Callon, "Multiprotocol 1821 Label Switching Architecture", RFC 3031, 1822 DOI 10.17487/RFC3031, January 2001, 1823 . 1825 [RFC8014] Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. 1826 Narten, "An Architecture for Data-Center Network 1827 Virtualization over Layer 3 (NVO3)", RFC 8014, 1828 DOI 10.17487/RFC8014, December 2016, 1829 . 1831 [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., 1832 Decraene, B., Litkowski, S., and R. Shakir, "Segment 1833 Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, 1834 July 2018, . 1836 [RFC5883] Katz, D. and D. Ward, "Bidirectional Forwarding Detection 1837 (BFD) for Multihop Paths", RFC 5883, DOI 10.17487/RFC5883, 1838 June 2010, . 1840 [RFC8231] Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path 1841 Computation Element Communication Protocol (PCEP) 1842 Extensions for Stateful PCE", RFC 8231, 1843 DOI 10.17487/RFC8231, September 2017, 1844 . 1846 [RFC8281] Crabbe, E., Minei, I., Sivabalan, S., and R. Varga, "Path 1847 Computation Element Communication Protocol (PCEP) 1848 Extensions for PCE-Initiated LSP Setup in a Stateful PCE 1849 Model", RFC 8281, DOI 10.17487/RFC8281, December 2017, 1850 . 1852 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 1853 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 1854 June 2010, . 1856 [RFC8253] Lopez, D., Gonzalez de Dios, O., Wu, Q., and D. Dhody, 1857 "PCEPS: Usage of TLS to Provide a Secure Transport for the 1858 Path Computation Element Communication Protocol (PCEP)", 1859 RFC 8253, DOI 10.17487/RFC8253, October 2017, 1860 . 
1862 Authors' Addresses 1863 Colin Bookham (editor) 1864 Nokia 1865 740 Waterside Drive 1866 Almondsbury, Bristol 1867 UK 1869 Email: colin.bookham@nokia.com 1871 Andrew Stone 1872 Nokia 1873 600 March Road 1874 Kanata, Ontario 1875 Canada 1877 Email: andrew.stone@nokia.com 1879 Jeff Tantsura 1880 Apstra 1881 333 Middlefield Road #200 1882 Menlo Park, CA 94025 1883 USA 1885 Email: jefftant.ietf@gmail.com 1887 Muhammad Durrani 1888 Equinix Inc 1889 1188 Arques Ave 1890 Sunnyvale CA 1891 USA 1893 Email: mdurrani@equinix.com