idnits 2.17.00 (12 Aug 2021) /tmp/idnits1497/draft-mhmcsfh-ippm-pam-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document date (4 March 2022) is 71 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) No issues found here. Summary: 0 errors (**), 0 flaws (~~), 0 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group G. Mirsky 3 Internet-Draft J. Halpern 4 Intended status: Standards Track Ericsson 5 Expires: 5 September 2022 X. Min 6 ZTE Corp. 7 A. Clemm 8 J. Strassner 9 Futurewei 10 J. Francois 11 Inria 12 L. Han 13 China Mobile 14 4 March 2022 16 Precision Availability Metrics for SLO-Governed End-to-End Services 17 draft-mhmcsfh-ippm-pam-00 19 Abstract 21 This document defines a set of metrics for networking services with 22 performance requirements expressed as Service Level Objectives (SLO). 23 These metrics, referred to as Precision Availability Metrics (PAM), 24 can be used to assess the service levels that are being delivered. 
25 Specifically, PAM can be used to determine the degree of compliance 26 with which service levels are being delivered relative to pre-defined 27 SLOs. PAM can be used as part of accounting records, to account 28 for the actual quality with 29 which services were delivered relative to their SLOs and whether or not any SLO violations 30 occurred. Also, PAM can be used to continuously monitor the quality 31 with which the service is delivered. 33 Status of This Memo 35 This Internet-Draft is submitted in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF). Note that other groups may also distribute 40 working documents as Internet-Drafts. The list of current Internet- 41 Drafts is at https://datatracker.ietf.org/drafts/current/. 43 Internet-Drafts are draft documents valid for a maximum of six months 44 and may be updated, replaced, or obsoleted by other documents at any 45 time. It is inappropriate to use Internet-Drafts as reference 46 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on 5 September 2022. 50 Copyright Notice 52 Copyright (c) 2022 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 57 license-info) in effect on the date of publication of this document. 58 Please review these documents carefully, as they describe your rights 59 and restrictions with respect to this document. Code Components 60 extracted from this document must include Revised BSD License text as 61 described in Section 4.e of the Trust Legal Provisions and are 62 provided without warranty as described in the Revised BSD License. 64 Table of Contents 66 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 67 2. Conventions used in this document .
. . . . . . . . . . . . . 4 68 2.1. Terminology and Acronyms . . . . . . . . . . . . . . . . 4 69 3. Performance Availability Metrics . . . . . . . . . . . . . . 4 70 3.1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 71 3.2. Derived Performance Availability Metrics . . . . . . . . 6 72 3.3. Network Availability in Performance Availability 73 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 7 74 4. Statistical SLO . . . . . . . . . . . . . . . . . . . . . . . 7 75 5. Availability of Anything-as-a-Service . . . . . . . . . . . . 8 76 6. Other PAM Benefits . . . . . . . . . . . . . . . . . . . . . 10 77 7. Discussion Items . . . . . . . . . . . . . . . . . . . . . . 10 78 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 79 9. Security Considerations . . . . . . . . . . . . . . . . . . . 11 80 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 11 81 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 82 11.1. Informative References . . . . . . . . . . . . . . . . . 12 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 85 1. Introduction 87 Network operators and network users often need to assess the quality 88 with which network services are being delivered. In particular in 89 cases where service level guarantees are given and service level 90 objectives (SLOs) are defined, it is essential to provide a measure 91 of the degree with which actual service levels that are delivered 92 comply with SLOs that were promised. Examples of service levels 93 include end-to-end latency and packet loss. Simple examples of SLOs 94 associated with such service levels would be target values for the 95 maximum end-to-end latency or maximum amount of loss that would be 96 deemed acceptable. 98 To express the quality of delivered networking services versus their 99 SLOs, corresponding metrics are needed that can be used to 100 characterize the quality of the service being provided. 
Of concern 101 is not so much the absolute service level (for example, the actual 102 latency experienced), but whether the service is provided in 103 accordance with the contracted service levels: for instance, whether 104 the latency that is experienced falls within the acceptable range 105 that has been contracted for the service. The specific quality 106 depends on the SLO that is in effect. Different groups of 107 applications set forth requirements for varying sets of service 108 levels with different target values. Such applications range from 109 Augmented Reality/Virtual Reality to mission-critical control of 110 industrial processes. Non-conformance to an SLO might result in 111 anything from degraded quality of experience for gamers to 112 jeopardized safety of a large area. However, as those 113 applications represent significant business opportunities, they 114 demand dependable technical solutions. 116 The same service level may be deemed perfectly acceptable for one 117 application, yet unacceptable for another, depending on the needs 118 of the application. Hence it is not sufficient to simply measure 119 service levels over time; the quality of the 120 service being provided must be assessed with the applicable SLO in mind. However, at 121 this point, there are no metrics in place that account 122 for the quality with which services are delivered relative to their 123 SLOs, or whether those SLOs are being met at all times. 124 Such metrics and the instrumentation to support them are essential 125 for a number of purposes, including monitoring (to ensure that 126 networking services are performing according to their objectives) as 127 well as accounting (to maintain a record of service levels delivered, 128 important for monetization of such services as well as for triaging 129 of problems).
131 The state of the art in metrics available today includes (for 132 example) interface metrics, useful to obtain data on traffic volume 133 and behavior that can be observed at an interface [RFC2863] 134 [RFC8343], but agnostic of actual end-to-end service levels and not 135 specific to distinct flows. Flow records [RFC7011] [RFC7012] 136 maintain statistics about flows, including flow volume and flow 137 duration, but again contain very little information about end-to-end 138 service levels, let alone whether the service levels delivered 139 meet their targets, i.e., their associated SLOs. 141 This specification introduces a new set of metrics, Precision 142 Availability Metrics (PAM), aimed at capturing end-to-end service 143 levels for a flow, specifically the degree to which flows comply with 144 the SLOs that are in effect. The term "availability" reflects the 145 fact that a service which is characterized by its SLOs is considered 146 unavailable whenever those SLOs are violated, even if basic 147 connectivity is still working. "Precision" refers to the fact that 148 services whose end-to-end service levels are governed by SLOs 149 must be delivered precisely according to the 150 associated quality and performance requirements. It should be noted 151 that "precision" refers to what is being assessed, not to the 152 mechanism used to measure it; in other words, it does not refer to 153 the precision of the mechanism with which actual service levels are 154 measured. The specification and implementation of methods that 155 provide for accurate measurements is a separate topic, independent of 156 the definition of the metrics in which the results of such 157 measurements would be expressed. 159 [Ed.note: It should be noted that at this point, the set of metrics 160 proposed here is a "starter set" that is intended to 161 spark further discussion.
Other metrics are certainly conceivable; 162 we expect that the list of metrics will evolve as part of the Working 163 Group discussions.] 165 2. Conventions used in this document 167 2.1. Terminology and Acronyms 169 [Ed.Note: needs updating.] 171 PAM Precision Availability Metric 173 OAM Operations, Administration, and Maintenance 175 EI Errored Interval 177 EIR Errored Interval Ratio 179 SEI Severely Errored Interval 181 SEIR Severely Errored Interval Ratio 183 EFI Error-Free Interval 185 3. Performance Availability Metrics 186 3.1. Preliminaries 188 When analyzing the availability metrics of a service flow between two 189 nodes, we need to select a time interval as the unit of PAM. In 190 [ITU.G.826], a time interval of one second is used. That is 191 reasonable, but some services may require different granularity. For 192 that reason, the time interval in PAM is viewed as a variable 193 parameter, though constant for a particular measurement session. 194 Further, for the purpose of PAM, each time interval, e.g., second or 195 decamillisecond, is classified as either an Errored Interval (EI), a 196 Severely Errored Interval (SEI), or an Error-Free Interval (EFI). These 197 are defined as follows: 199 * An EI is a time interval during which at least one of the 200 performance parameters degraded below its pre-defined optimal 201 level threshold or a defect was detected. 203 * An SEI is a time interval during which at least one of the 204 performance parameters degraded below its pre-defined critical 205 threshold or a defect was detected. 207 * Consequently, an EFI is a time interval during which all 208 performance parameters are at or above their respective pre- 209 defined optimal levels, and no defect has been detected. 211 The definition of the defect state in the network is also 212 necessary for understanding PAM. In this document, a defect is 213 interpreted as the state of inability to communicate between a 214 particular set of nodes.
It is important to note that the defect is 215 defined as a state, and thus it has conditions that define entry 216 into it and exit out of it. Also, the defect state exists only in 217 connection with a particular group of nodes in the network, not the 218 network as a domain. 220 From these definitions, a set of basic metrics can be defined that 221 count the number of time intervals that fall into each category: 223 * EI count. 225 * SEI count. 227 * EFI count. 229 3.2. Derived Performance Availability Metrics 231 A set of metrics can be created based on the PAM introduced in Section 3. 232 In this document, these metrics are referred to as derived PAM. Some 233 of these metrics are modeled after Mean Time Between Failures (MTBF) 234 metrics, a "failure" in this context referring to a failure to 235 deliver a packet according to its SLO. 237 * Time since the last errored interval (e.g., since the last errored ms, 238 since the last errored second). (This parameter is suitable for the 239 monitoring of the current health.) [Ed. note: Need a definition 240 of "current health". Is there an alternative to "current"? Past 241 health?] 243 * Packets since the last errored packet. (This parameter is 244 suitable for the monitoring of the current health.) 246 * Mean time between EIs (e.g., between errored milliseconds, errored 247 seconds) is the arithmetic mean of the time between consecutive EIs. 249 * Mean packets between EIs is the arithmetic mean of the number of 250 SLO-compliant packets between consecutive EIs. (Another variation 251 of "MTBF" in a service setting.) 253 An analogous set of metrics can be produced for SEIs: 255 * Time since the last SEI (e.g., since the last severely errored ms, since 256 the last severely errored second). (This parameter is suitable for the monitoring 257 of the current health.) 259 * Mean time between SEIs (e.g., between severely errored 260 milliseconds, severely errored seconds) is the arithmetic mean of the 261 time between consecutive SEIs.
263 * Mean packets between SEIs is the arithmetic mean of the number of 264 SLO-compliant packets between consecutive SEIs. (Another 265 variation of "MTBF" in a service setting.) 267 It is helpful to determine the period in which the path currently is, 268 PAM-wise. But because switching between periods requires ten 269 consecutive intervals, shorter conditions may not be adequately 270 reflected. Two additional PAMs can be used; they are defined as 271 follows: 273 * Errored Interval Ratio (EIR) is the ratio of the number of EIs to the total 274 number of time intervals in the availability 275 periods during a fixed measurement interval. 277 * Severely Errored Interval Ratio (SEIR) is the ratio of the number of SEIs to 278 the total number of time intervals in the 279 availability periods during a fixed measurement interval. 281 3.3. Network Availability in Performance Availability Metrics 283 The definitions of EI, SEI, and EFI allow for characterization of the 284 communication between two nodes relative to the level of required and 285 acceptable performance, and of when performance degrades below the 286 acceptable level. The former condition is referred to in this document 287 as network availability, the latter as network unavailability. Based 288 on these definitions, an SEI is a time interval of network 289 unavailability, while an EI or EFI represents an interval of network 290 availability. But since the conditions of the network are 291 ever-changing, periods of network availability and unavailability need 292 to be defined with a duration longer than one time interval to reduce 293 the number of state changes while correctly reflecting the network 294 condition.
The method to determine the state of the network in terms 295 of PAM is described below: 297 * If ten consecutive SEIs have been detected, then the PAM state of the 298 network is determined to be unavailable, and the beginning of that 299 period of the unavailability state is at the start of the first SEI in 300 the sequence of consecutive SEIs. 302 * Similarly, ten consecutive non-SEIs, i.e., either EIs or EFIs, 303 indicate that the network is in an availability period, i.e., 304 available. The start of that period is at the beginning of the 305 first non-SEI. 307 * As a result of these two definitions, a sequence of fewer than ten 308 consecutive SEIs or non-SEIs does not change the PAM state of the 309 network. For example, if the PAM state is determined as 310 unavailable, a sequence of seven EFIs is not viewed as an 311 availability period. 313 4. Statistical SLO 315 It should be noted that certain Service Level Agreements (SLAs) may be 316 statistical, requiring the service levels of packets in a flow to 317 adhere to specific distributions. For example, an SLA might state 318 that any given SLO applies only to a certain percentage of packets, 319 allowing for a certain level of, for example, packet loss or 320 packets exceeding a delay threshold. Each such event, in 321 that case, does not necessarily constitute an SLO violation. 322 However, it is still useful to maintain those statistics, as the 323 number of out-of-SLO packets still matters when looked at in 324 proportion to the total number of packets. 326 In that vein, an SLA might establish an SLO for, say, end-to-end 327 latency to not exceed 20ms for 99% of packets, to not exceed 25ms for 328 99.999% of packets, and to never exceed 30ms for any packet. In 329 that case, any individual packet missing the 20 ms latency target 330 cannot be considered an SLO violation in itself; compliance with 331 the SLO may need to be assessed after the fact.
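As an illustration, a tiered latency SLO of this kind can be assessed after the fact from per-packet latency samples. The following sketch is not part of this specification: the 20/25/30 ms bucket boundaries and the 99% / 99.999% targets are the example values above, and the function names are illustrative.

```python
# Sketch: maintain a latency histogram whose buckets correspond to
# tiered SLOs, then assess statistical SLO compliance after the fact.
# Boundaries and targets mirror the example SLO in the text.
import bisect

# Bucket boundaries in milliseconds; a sample falls into the first
# bucket whose boundary it does not exceed, else the overflow bucket.
BOUNDARIES_MS = [20.0, 25.0, 30.0]

def build_histogram(latencies_ms):
    """Count packets per bucket: <=20ms, <=25ms, <=30ms, >30ms."""
    counts = [0] * (len(BOUNDARIES_MS) + 1)
    for lat in latencies_ms:
        counts[bisect.bisect_left(BOUNDARIES_MS, lat)] += 1
    return counts

def complies(counts):
    """Check the example statistical SLO: 99% of packets within 20 ms,
    99.999% within 25 ms, and no packet beyond 30 ms."""
    total = sum(counts)
    if total == 0:
        return True  # no traffic, no violation
    within_20 = counts[0]
    within_25 = counts[0] + counts[1]
    beyond_30 = counts[3]
    return (within_20 / total >= 0.99
            and within_25 / total >= 0.99999
            and beyond_30 == 0)
```

A real implementation would update the histogram incrementally per packet rather than post-processing a stored list of latency samples.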
333 To support statistical SLAs more directly, it is feasible to support 334 additional metrics, such as metrics that represent histograms for 335 service level parameters, with buckets corresponding to individual 336 service level objectives. For the example just given, a histogram 337 for a given flow could be maintained with four buckets: one 338 containing the count of packets within 20ms, a second with a count of 339 packets between 20 and 25ms (or simply all within 25ms), a third with 340 a count of packets between 25 and 30ms (or merely all packets within 341 30ms), and a fourth with a count of anything beyond (or simply a total 342 count). Of course, the number of buckets and the boundaries between 343 those buckets should correspond to the needs of the application 344 and the respective SLA, i.e., to the specific guarantees and SLOs that were 345 provided. The definition of histogram metrics is for further study. 347 5. Availability of Anything-as-a-Service 349 Anything as a Service (XaaS) describes a general category of services 350 related to cloud computing and remote access. These services include 351 the vast number of products, tools, and technologies that are 352 delivered to users as a service over the Internet. In this document, 353 the availability of XaaS is viewed as the ability to access the 354 service over a period of time with pre-defined performance 355 objectives. Among the advantages of the XaaS model are: 357 * Improving the expense model by purchasing services from providers 358 on a subscription basis rather than buying individual products 359 (e.g., software, hardware, servers, security, infrastructure), 360 installing them on-site, and then linking everything together to create 361 networks. 363 * Speeding the delivery of new applications and business processes by quickly 364 adapting to changing market conditions with new applications or solutions. 366 * Shifting IT resources to specialized higher-value projects that 367 use the core expertise of the company.
369 But the XaaS model also has potential challenges: 371 * Possible downtime resulting from issues of Internet reliability, 372 resilience, and provisioning and managing the infrastructure 373 resources. 375 * Performance issues caused by depleted resources like bandwidth or 376 computing power, inefficiencies of virtualized environments, and the 377 ongoing management and security of multi-cloud services. 379 * Complexity that impacts the enterprise IT team, which must 380 continually learn about the provided services. 382 The framework and metrics of PAM defined in Section 3 allow a 383 provider of XaaS and their customers to quantify, measure, and monitor 384 for conformance something often viewed as ephemeral: the 385 availability of the delivered service. There are other 386 definitions and methods of expressing availability. For example, 387 [HighAvailability-WP] uses the following equation: 389 Availability Average = MTBF/(MTBF + MTTR), 390 where: 391 MTBF (Mean Time Between Failures) - mean time between 392 individual component failures. For example, a hard drive 393 malfunction or hypervisor reboot. 394 MTTR (Mean Time To Repair) - how long it takes to fix 395 the broken component or for the application to come back online. 397 While this approach estimates the expected availability of an XaaS, 398 PAM reflects the near-real-time availability of a service as 399 experienced by a user. It also provides valuable data for more 400 accurate and realistic MTBF and MTTR estimates in a particular environment, 401 and simplifies comparison of different solutions that may use 402 redundant servers (web and database) or load balancers. 404 In another field of communication, mobile voice and data services, 405 service availability is understood as "the 406 probability of successful service reception: a given area is declared 407 'in-coverage' if the service in that area is available with a 408 pre-specified minimum rate of success.
Service availability has the 409 advantage of being more easily understandable for consumers and is 410 expressed as a percentage of the number of attempts to access a given 411 service." [BEREC-CP]. The definition of availability used for 412 PAM throughout this document is close to the one quoted above. It 413 might be considered an extension that allows regulators, 414 operators, and consumers to compare not only the rate of successfully 415 establishing a connection but also the quality of the connection during 416 its lifetime. 418 6. Other PAM Benefits 420 PAM provides a number of important benefits compared with other, more 421 conventional performance metrics. Without PAM, it would be possible 422 to conduct ongoing measurements of service levels, maintain a 423 time-series of service level records, and then assess compliance with 424 specific SLOs after the fact. However, doing so would require 425 vast amounts of data to be generated, 426 exported, transmitted, collected, and stored. In addition, extensive 427 postprocessing would be required to compare that data against SLOs 428 and analyze its compliance. Being able to perform these tasks at 429 scale and in real time would present significant additional 430 challenges. 432 Adding PAM allows for a more compact expression of service level 433 compliance. In that sense, PAM does not simply represent raw data 434 but expresses actionable information. In conjunction with proper 435 instrumentation, PAM can thus help avoid expensive postprocessing. 437 7. Discussion Items 439 The following items require further discussion: 441 * Terminology - "Errored" vs. "Violated". The key metrics defined 442 in this draft refer to intervals during which violations of 443 objectives for service level parameters occur as "errored". The 444 term "errored" was chosen in continuity with the concept of 445 "errored seconds", often used in transmission systems.
However, 446 "violated" may be a more accurate term, as the metrics defined 447 here do not reflect "errors" in an absolute sense, but deviations relative to a set 448 of defined objectives. 450 * Metrics. The foundational metrics defined in this draft refer to 451 errored/violated intervals. In addition, counts of errors/ 452 violations related to individual packets may also need to be 453 maintained. Metrics referring to violated/errored packets, i.e., 454 packets that on an individual basis miss a performance objective, 455 may be added in a later revision of this document. 457 The following is a list of items for which further discussion is 458 needed as to whether they should be included in the scope of this 459 specification: 461 * A YANG data model. 463 * A set of IPFIX Information Elements. 465 * Statistical metrics: e.g., histograms/buckets. 467 * Policies regarding the definition of "errored" and "severely 468 errored" time intervals. 470 * Additional second-order metrics, such as "longest disruption of 471 service time" (measuring consecutive time units with SEIs). 473 8. IANA Considerations 475 TBA 477 9. Security Considerations 479 Instrumentation for metrics that are used to assess compliance with 480 SLOs constitutes an attractive target for an attacker. By interfering 481 with the maintenance of such metrics, services could be falsely 482 identified as compliant (when they are not) or, vice versa, flagged as 483 non-compliant (when indeed they are). While this document does 484 not specify how networks should be instrumented to maintain the 485 identified metrics, such instrumentation needs to be adequately 486 secured to ensure accurate measurements and to prohibit tampering with the 487 metrics being kept. 489 Where metrics are defined relative to an SLO, the configuration 490 of those SLOs needs to be adequately secured. Likewise, where SLOs 491 can be adjusted, the correlation between any metrics instance and a 492 particular SLO must be clear.
The same service levels that 493 constitute SLO violations for one flow, and that should be maintained as 494 part of the "errored time units" and related metrics, may be 495 perfectly compliant for another flow. In cases where it is impossible 496 to properly tie together SLOs and PAM, it is preferable to 497 merely maintain statistics about service levels delivered (for 498 example, overall histograms of end-to-end latency) without assessing 499 what constitutes a violation. 501 By the same token, where the definition of what constitutes a 502 "severe" or a "significant" error depends on policy or context, the 503 configuration of that policy or context needs to be specially 504 secured. Also, the configuration of this policy must be bound to the 505 metrics being maintained. This way, it will be clear which policy 506 was in effect when those metrics were being assessed. An attacker 507 that can tamper with such policies will render the corresponding 508 metrics useless (in the best case) or misleading (in the worst case). 510 10. Acknowledgments 512 TBA 514 11. References 515 11.1. Informative References 517 [BEREC-CP] Body of European Regulators for Electronic Communications, 518 "BEREC Common Position on information to consumers on 519 mobile coverage", Common Approaches/Positions BoR (18) 520 237, June 2018, . 525 [HighAvailability-WP] 526 Avi Freedman, Server Central, "High Availability in Cloud 527 and Dedicated Infrastructure", . 531 [ITU.G.826] 532 ITU-T, "End-to-end error performance parameters and 533 objectives for international, constant bit-rate digital 534 paths and connections", ITU-T G.826, December 2002. 536 [RFC2863] McCloghrie, K. and F. Kastenholz, "The Interfaces Group 537 MIB", RFC 2863, DOI 10.17487/RFC2863, June 2000, 538 . 540 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P.
Aitken, 541 "Specification of the IP Flow Information Export (IPFIX) 542 Protocol for the Exchange of Flow Information", STD 77, 543 RFC 7011, DOI 10.17487/RFC7011, September 2013, 544 . 546 [RFC7012] Claise, B., Ed. and B. Trammell, Ed., "Information Model 547 for IP Flow Information Export (IPFIX)", RFC 7012, 548 DOI 10.17487/RFC7012, September 2013, 549 . 551 [RFC8343] Bjorklund, M., "A YANG Data Model for Interface 552 Management", RFC 8343, DOI 10.17487/RFC8343, March 2018, 553 . 555 Authors' Addresses 557 Greg Mirsky 558 Ericsson 559 Email: gregimirsky@gmail.com 560 Joel Halpern 561 Ericsson 562 Email: joel.halpern@ericsson.com 564 Xiao Min 565 ZTE Corp. 566 Email: xiao.min2@zte.com.cn 568 Alexander Clemm 569 Futurewei 570 2330 Central Expressway 571 Santa Clara, CA 95050 572 United States of America 573 Email: ludwig@clemm.org 575 John Strassner 576 Futurewei 577 2330 Central Expressway 578 Santa Clara, CA 95050 579 United States of America 580 Email: strazpdj@gmail.com 582 Jerome Francois 583 Inria 584 615 Rue du Jardin Botanique 585 54600 Villers-les-Nancy 586 France 587 Email: jerome.francois@inria.fr 589 Liuyan Han 590 China Mobile 591 32 XuanWuMenXi Street 592 Beijing 593 100053 594 China 595 Email: hanliuyan@chinamobile.com