Internet Engineering Task Force                             R. Geib, Ed.
Internet-Draft                                          Deutsche Telekom
Intended status: Informational                                 A. Morton
Expires: April 29, 2010                                        AT&T Labs
                                                               R. Fardid
                                                    Covad Communications
                                                        October 26, 2009

                    IPPM standard compliance testing
                      draft-geib-ippm-metrictest-01

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 29, 2010.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info).  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

This document specifies tests to determine if multiple, independent, and interoperable implementations of a metrics specification document are at hand, so that the metrics specification can be advanced to an Internet standard.  Results of different IPPM implementations can be compared if they measure under the same underlying network conditions.
Results are compared using state of the art statistical methods.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Basic idea
   3.  Verification of conformance to a metric specification
     3.1.  Tests of an individual implementation against a metric specification
     3.2.  Test set up resulting in identical live network testing conditions
     3.3.  Tests of two or more different implementations against a metric specification
     3.4.  Clock synchronisation
     3.5.  Recommended Metric Verification Measurement Process
   4.  Acknowledgements
   5.  Contributors
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Further ideas on statistical tests
   Appendix B.  Verification of measurement precision by statistical methods
   Authors' Addresses

1.  Introduction

Draft bradner-metrictest [bradner-metrictest] states:

The Internet Standards Process RFC2026 [RFC2026] requires that for an IETF specification to advance beyond the Proposed Standard level, at least two genetically unrelated implementations must be shown to interoperate correctly with all features and options.
There are two distinct reasons for this requirement.

In the case of a protocol specification, the notion of "interoperability" is reasonably intuitive - the implementations must successfully "talk to each other", while exercising all features and options.

In the case of a specification for a performance metric, network latency for example, exactly what constitutes "interoperation" is less obvious.  The IESG has not yet decided how to judge "metric specification interoperability" in the context of the IETF Standards Process, and this new draft suggests a methodology which (hopefully) is suitable for IPPM metrics.  General applicability of the methods proposed in the following should however not be excluded.

A metric specification describes a method of testing and a way to report the results of this testing.  One example of such a metric would be a way to test and report the latency that data packets incur while being sent from one network location to another.

Since implementations of testing metrics are by their nature stand-alone and do not interact with each other, the level of interoperability called for in the IETF standards process cannot simply be determined by seeing that the implementations interact properly.  Instead, verifying that different implementations give statistically equivalent results may take the place of interoperability.

This document defines the process of verifying equivalence: a specified test set up is used to create the required separate data sets (which may be seen as samples taken from the same underlying distribution), and then state of the art statistical methods are applied to verify equivalence of the results.  To illustrate application of the process defined here, validating compliance with RFC2679 [RFC2679] is picked as an example.
While test set ups may vary with the metrics to be validated, the statistical methods will not.  Documents defining test setups to validate other metrics should be created by the IPPM WG, once the process proposed here has been agreed upon.

Changes from -00 to -01 version

o  Addition of a comparison of individual metric implementations against the metric specification (picking up problems and solutions for metric advancement [morton-advance-metrics]).

o  More emphasis on the requirement to carefully design and document the measurement set up of the metric comparison.

o  Proposal of testing under identical WAN network conditions using IP in IP tunneling or Pseudo Wires and parallel measurement streams.

o  Proposal of the requirement to document the smallest resolution at which an ADK test was passed with 95% confidence.  As no minimum resolution is specified, IPPM metric compliance is not linked to a particular performance of an implementation.

o  Reference to RFC 2330 and RFC 2679 for the 95% confidence interval as the preferred criterion to decide on statistical equivalence.

o  Reduction of the proposed statistical test to ADK with 95% confidence.

1.1.
Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Basic idea

The Framework for IP Performance Metrics (RFC 2330, [RFC2330]) expects that a "methodology for a metric should have the property that it is repeatable: if the methodology is used multiple times under identical conditions, it should result in consistent measurements."  This means that an IPPM implementation is expected to measure a metric with high precision.  The metric compliance test specified in the following emphasises precision over accuracy.  Further, the methodology and test methods proposed by RFC 2330 are used by this document too.

The implementation of a standard compliant metric is expected to meet the requirements of the related metric specification.  So before two metric implementations are compared, each metric implementation is individually compared against the metric specification.  As an example, an implementation of the OWD metric must be calibrated.  Calibration results of a standard conformant metric implementation must then be published.

Most metric specifications leave freedom to implementors on those aspects which aren't fundamental for an individual metric implementation.  Calibrating individual metric implementations and comparing different ones requires a careful design and documentation of the metric implementation and of the testing conditions.

The IPPM framework expects repeated measurements to lead to the same results if the conditions under which these measurements have been collected are identical.  Small deviations are expected to lead to small deviations in results only.
To characterise statistical equivalence in the case of small deviations, RFC 2330 and RFC 2679 suggest applying a 95% confidence interval.  Quoting RFC 2679, "95 percent was chosen because ... a particular confidence level should be specified so that the results of independent implementations can be compared."

Two different IPPM implementations are expected to measure statistically equivalent results if they both measure a metric under the same networking conditions.  Formulated in statistical terms: separate samples are collected (by separate metric implementations) from the same underlying statistical process (the same network conditions).  The "statistical hypothesis" to be tested is the expectation that both samples do not expose statistically different properties.  This requires careful test design:

o  The error induced by the sample size must be small enough to minimize its influence on the test result.  This must especially be taken into account if two implementations measure with different average probing rates.

o  If statistics of time series are compared, the implementation with the lowest probing frequency determines the smallest temporal interval for which results can be compared.

o  Every comparison must be repeated several times based on different measurement data to avoid random indications of compatibility (or the lack of it).

o  The measurement test set up must be self-consistent to the largest possible extent.  This means that network conditions, paths and IPPM metric implementations SHOULD be identical for the compared implementations to the largest possible degree, to minimize the influence of the test and measurement set up on the result.  This includes e.g. aspects of the stability and non-ambiguity of routes taken by the measurement packets.  See RFC 2330 [RFC2330] for a discussion of self-consistency.
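The second bullet above can be illustrated with a short, non-normative sketch: when two implementations probe at different rates, results can only be compared per interval of the slower stream.  The helper below is hypothetical and written in plain Python:

```python
from collections import defaultdict

def bucket_means(samples, interval):
    """Average (timestamp, value) singletons into buckets of the given
    interval length; returns {bucket_index: mean_value}."""
    acc = defaultdict(lambda: [0.0, 0])
    for t, v in samples:
        b = int(t // interval)
        acc[b][0] += v
        acc[b][1] += 1
    return {b: total / n for b, (total, n) in sorted(acc.items())}

# Implementation A probes at 1 Hz, implementation B at 0.2 Hz.  The
# slower stream (one probe every 5 s) fixes the smallest comparable
# interval: 5 seconds.
a = [(t, 10.0) for t in range(10)]        # (seconds, delay in ms)
b = [(t, 12.0) for t in range(0, 10, 5)]
a_means = bucket_means(a, 5.0)   # {0: 10.0, 1: 10.0}
b_means = bucket_means(b, 5.0)   # {0: 12.0, 1: 12.0}
```

Only the per-interval statistics (here, the means) of the two streams are then compared; singletons of the faster stream within one interval cannot be paired with individual probes of the slower one.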
As addressed by "problems and solutions for metric advancement" [morton-advance-metrics], documentation of the metric test will indicate which requirements and options of a metric specification are specified clearly enough for an implementation, or will uncover gaps in the metric specification.  The final step in advancing a metric specification to a standard is improving unclear specifications and removing unsupported options.

3.  Verification of conformance to a metric specification

This section specifies how to verify compliance of two or more IPPM implementations with a metric specification.  This document only proposes a general methodology.  Compliance criteria for a specific metric implementation are expected to be drafted for each individual metric specification.  The only exception is the statistical test comparing two metric implementations which are tested simultaneously.  This test is applicable without metric specific decision criteria.

3.1.  Tests of an individual implementation against a metric specification

A metric implementation MUST support the requirements classified as "MUST" and "REQUIRED" of the related metric specification to be compliant with the latter.

Further, supported options of a metric implementation SHOULD be documented in sufficient detail, either to validate and improve the underlying metric specification options, or to remove from the metric specification to be promoted to a standard those options which saw no implementation or which are badly specified.

RFC2330 and RFC2679 emphasise precision as an aim of IPPM metric implementations.  A single IPPM conformant implementation MUST, under otherwise identical network conditions, produce precise results for repeated measurements of the same metric.

RFC 2330 prefers the "empirical distribution function" (EDF) to describe collections of measurements.
RFC 2330 states that "unless otherwise stated, IPPM goodness-of-fit tests are done using 5% significance."  The goodness-of-fit test required to determine the precision of a metric implementation consists of testing whether two or more samples belong to the same underlying distribution (of measured network performance events).  The goodness-of-fit test to be applied is the Anderson-Darling K-sample test (ADK test, where K stands for the number of samples to be compared).  Please note that RFC 2330 and RFC 2679 apply an Anderson-Darling goodness-of-fit test too.

The results of repeated tests with a single implementation MUST pass an ADK sample test with a confidence level of 95%.  The resolution for which the ADK test has been passed with the specified confidence level MUST be documented.  To put it differently: the requirement is to document the smallest resolution at which the results of the tested metric implementation pass an ADK test with a confidence level of 95%.

As an example, a one-way delay measurement may pass an ADK test with a timestamp resolution of 1 ms.  The same test may fail if timestamps with a resolution of 100 microseconds are evaluated.  The implementation is then conformant with the metric specification up to a timestamp resolution of 1 ms.

3.2.  Test set up resulting in identical live network testing conditions

Two major issues complicate tests for metric compliance across live networks under identical testing conditions.  One of these is the general posit that "metric definition implementations cannot be conveniently examined in field measurement scenarios".  The other, more specifically addressing "parallelism in devices and networks", concerns mechanisms like load balancing.  As a reference for the latter, [RFC 4814] is given.

This section proposes two measures to deal with both issues.
Tunneling mechanisms can be used to avoid parallel processing of different flows in the network.  Measuring by separate parallel probe flows results in repeated collection of data.  In both cases, the WAN network conditions are identical, no matter what they are in detail.

Any measurement set up MUST be made such that the probing traffic itself does not impede the metric measurement.  The created measurement load MUST NOT result in congestion at the access link connecting the measurement implementation to the WAN.  The created measurement load MUST NOT overload the measurement implementation itself, e.g. by causing a high CPU load or by creating imprecisions due to internal send/receive probe packet collisions.

IP in IP tunnels can be used to avoid ECMP routing of different measurement streams if they allow carrying inner IP packets from different senders in a single tunnel with the same outer origin and destination address as well as the same port numbers.  The author is not an expert on tunneling and appreciates guidance on the applicability of one or more of the following protocols: IP in IP [RFC2003], GRE [RFC2784] or L2TP [RFC2661] or [RFC3931].  RFC 4928 [RFC4928] proposes measures to avoid ECMP treatment in MPLS networks.  Applying Pseudo Wires for a metric implementation test is one way to avoid MPLS based ECMP treatment.  If tunneling is applied, a single tunnel MUST carry all test traffic in one direction.  If e.g. Ethernet Pseudo Wires are applied and the measurement streams are carried in different VLANs, the Pseudo Wires MUST be set up in physical port mode to avoid set up of Pseudo Wires per VLAN (which may see different paths due to ECMP routing), see RFC 4448 [RFC4448].

To have statistical significance, a test MUST be repeated at least 5 times (see below).  WAN conditions may change over time, so sequential testing is not a useful metric test option.
However, tests can be carried out by applying 5 or more different parallel measurement flows.  The author takes no position on whether such a test is carried out by sending e.g. a single CBR flow and defining every n-th (n = 1..5) packet to belong to a specific measurement flow, or whether multiple network cards are applied to create several distinct flows from a single implementation.  In the latter case, three different cards of one implementation at a single test site will do, if tunneling set ups like the one proposed by GRE encapsulated multicast probing [GU&Duffield] are applied (note that one or more remote tunnel end points and the same number of routers are required).

Some additional rules for calculating and comparing samples have to be respected.  The following rules are of importance for the IPPM metric test:

o  Comparing different probes of a common underlying distribution in terms of metrics characterising a communication network requires respecting the temporal nature for which the assumption of a common underlying distribution may hold.  Any singletons or samples to be compared MUST be captured within the same time interval.

o  Whenever statistical events like singletons or rates are used to characterise measured metrics of a time interval, at least 5 events of the relevant metric MUST be present to ensure a minimum confidence in the reported value (see Wikipedia on confidence [Rule of thumb]).  Note that this criterion also has to be respected e.g. when comparing packet loss metrics.  Any packet loss measurement interval to be compared with the results of another implementation needs to contain at least five lost packets to have a minimum confidence that the observed loss rate wasn't caused by a small number of random packet drops.
o  The minimum number of singletons or samples to be compared by an Anderson-Darling test is 100 per tested metric implementation.  Note that the Anderson-Darling test detects small differences in distributions fairly well and will fail for a high number of compared results (RFC2330 mentions an example with 8192 measurements to guarantee a failure of an Anderson-Darling test).

o  The Anderson-Darling test is sensitive to differing accuracy or bias of different implementations.  These differences result in differing averages of the compared samples.  In general, differences in sample averages may also result from differing test conditions.  An example is different packet sizes, resulting in a constant delay difference between the compared samples.  Therefore samples to be compared by an Anderson-Darling test MAY be calibrated by subtracting the difference of the average values of the samples.

3.3.  Tests of two or more different implementations against a metric specification

RFC2330 expects that "a methodology for a given metric exhibits continuity if, for small variations in conditions, it results in small variations in the resulting measurements.  Slightly more precisely, for every positive epsilon, there exists a positive delta, such that if two sets of conditions are within delta of each other, then the resulting measurements will be within epsilon of each other."  A small variation in conditions in the context of a metric comparison can be seen as different implementations measuring the same metric along the same path.

RFC2679 comments that a "95 percent [confidence level for an Anderson-Darling goodness of fit test] was chosen because....a particular confidence level should be specified so that the results of independent implementations can be compared."
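The procedure called for in Sections 3.1 and 3.2 - optionally calibrating out the constant offset between two samples, quantising to a candidate resolution, and then applying the ADK test at 95% confidence - might be sketched as follows.  This is a non-normative sketch: it assumes SciPy's anderson_ksamp as the ADK implementation, and the candidate resolution list is illustrative:

```python
from scipy.stats import anderson_ksamp

def calibrate(sample_a, sample_b):
    """Shift sample_b by the difference of the sample means, as the ADK
    test is sensitive to a constant bias between implementations."""
    offset = sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)
    return sample_a, [v + offset for v in sample_b]

def smallest_passing_resolution(sample_a, sample_b, resolutions):
    """Return the smallest resolution (e.g. in seconds) at which the two
    samples pass the ADK test at 95% confidence, or None if none does."""
    a, b = calibrate(sample_a, sample_b)
    for res in sorted(resolutions):
        qa = [round(v / res) * res for v in a]   # quantise timestamps
        qb = [round(v / res) * res for v in b]
        result = anderson_ksamp([qa, qb])
        if result.significance_level > 0.05:     # cannot reject H0 at 5%
            return res
    return None
```

Note that anderson_ksamp caps the reported significance level at 0.25, which does not affect the pass/fail decision at the 5% level.  In line with the bullets above, each sample should contain at least 100 singletons.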
While the RFC 2679 statement refers to calibration, it expresses the expectation that the methodology allows for comparisons between different implementations.

IPPM metric specifications however allow for implementor options to the largest possible degree.  It can't be expected that two implementors pick identical options for their implementations.  Implementors SHOULD, to the highest degree possible, pick the same configurations for their systems when comparing their implementations in a metric test.

In some cases, a goodness-of-fit test may not be possible or may show disappointing results.  To clarify the difficulties arising from different implementation options, the individual options picked for every compared implementation SHOULD be documented in sufficient detail.  Based on this documentation, the underlying metric specification should be improved before it is promoted to a standard.

The same statistical test as is applicable to quantify the precision of a single metric implementation MUST be passed to compare the metric conformance of different implementations.  To document compatibility, the smallest measurement resolution at which the compared implementations passed the ADK sample test MUST be documented.

For different implementations of the same metric, "variations in conditions" are reasonably expected.  The ADK test comparing samples of the different implementations may result in a lower precision than the test for precision of each implementation individually.

3.4.  Clock synchronisation

Clock synchronization effects require special attention.  The accuracy of one-way active delay measurements for any metric implementation depends on clock synchronization between the source and destination of tests.
Ideally, one-way active delay measurement (RFC 2679, [RFC2679]) test endpoints either have direct access to independent GPS or CDMA-based time sources or indirect access to nearby NTP primary (stratum 1) time sources equipped with GPS receivers.  Access to these time sources may not be available at all test locations associated with different Internet paths, for a variety of reasons out of scope of this document.

When secondary (stratum 2 and above) time sources are used with NTP running across the same network whose metrics are subject to comparative implementation tests, network impairments can affect clock synchronization and distort sample one-way values and their interval statistics.  It is RECOMMENDED to discard sample one-way delay values for any implementation when one of the following reliability conditions is met:

o  Delay is measured and is finite in one direction, but not the other.

o  The absolute value of the difference between the sum of one-way measurements in both directions and the round-trip measurement is greater than X% of the latter value.

Examination of the second condition requires an RTT measurement for reference, e.g. based on TWAMP (RFC 5357 [RFC5357]), in conjunction with the one-way delay measurement.

Specification of X% to strike a balance between identification of unreliable one-way delay samples and misidentification of reliable samples under a wide range of Internet path RTTs probably requires further study.

An IPPM compliant metric implementation whose measurement requires synchronized clocks is however expected to provide precise measurement results.  Any IPPM metric implementation MUST be of a precision of 1 ms (+/- 500 us) with a confidence of 95% if the metric is captured along an Internet path which is stable and not congested during a measurement duration of an hour or more.
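The two discard conditions above can be expressed directly in code.  The sketch below is non-normative and illustrative; X is left as a parameter because, as noted, its value requires further study:

```python
import math

def one_way_sample_is_reliable(owd_fwd, owd_rev, rtt, x_percent):
    """Apply the two reliability conditions to a one-way delay sample:
    discard when delay is finite in one direction but not the other, or
    when |(fwd + rev) - rtt| exceeds X% of the reference RTT."""
    finite_fwd = math.isfinite(owd_fwd)
    finite_rev = math.isfinite(owd_rev)
    if finite_fwd != finite_rev:          # finite one way, not the other
        return False
    if not (finite_fwd and finite_rev):   # nothing measured at all
        return False
    return abs((owd_fwd + owd_rev) - rtt) <= (x_percent / 100.0) * rtt

# RTT reference e.g. from TWAMP; delays in seconds, X = 5 (%):
# one_way_sample_is_reliable(0.050, 0.050, 0.101, 5)    -> True
# one_way_sample_is_reliable(0.080, 0.050, 0.100, 5)    -> False
# one_way_sample_is_reliable(0.050, math.inf, 0.100, 5) -> False
```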
[Editor: the 1 ms precision requirement above may prevent NTP (stratum 2 or worse) synchronized IPPM implementations from becoming IPPM compliant.  However, implementations synchronized by internal PC clocks can't be rejected that way.  Ideas on criteria to deal with the latter are welcome.  Clock drift may be one, as GPS synched implementations shouldn't show drift, either at origin or destination.]

3.5.  Recommended Metric Verification Measurement Process

The proposal made by the authors of bradner-metrictest [bradner-metrictest] is picked up and slightly enhanced:

"In order to meet their obligations under the IETF Standards Process the IESG must be convinced that each metric specification advanced to Draft Standard or Internet Standard status is clearly written, that there are the required multiple verifiably equivalent implementations, and that all options have been implemented."

"In the context of this memo, metrics are designed to measure some characteristic of a data network.  An aim of any metric definition should be that it should be specified in a way that can reliably measure the specific characteristic in a repeatable way."

Each metric, statistic or option of those to be validated must be compared against a reference measurement or another implementation by at least 5 different basic data sets, each one of sufficient size to reach the specified level of confidence.

"In the same way, sequentially running different implementations of software that perform the tests described in the metric document on a stable network, or simultaneously on a network that may or may not be stable, should produce essentially the same results."

Following these assumptions, any recommendation for the advancement of a metric specification needs to be accompanied by an implementation report, as is the case with all requests for the advancement of IETF specifications.
The implementation report needs to include a specific plan to test the specific metrics in the RFC in lab or real-world networks, and reports of the tests performed with two or more implementations of the software.  The test plan should cover key parts of the specification, specify the precision reached for each measured metric and thus define the meaning of "statistically equivalent" for the specific metrics being tested.  Ideally, the test plan would co-evolve with the development of the metric, since that's when people have the most context in their thinking regarding the different subtleties that can arise.

In particular, the implementation report MUST as a minimum document:

o  The metric compared and the RFC specifying it, including the chosen options (like e.g. the implemented selection function in the case of IPDV).

o  A complete specification of the measurement stream (mean rate, statistical distribution of packets, packet size (or mean packet size and distribution), DSCP, and any other measurement stream property which could result in deviating results).  Deviations in results can also be caused if the chosen IP addresses and ports of different implementations result in different layer 2 or layer 3 paths due to the operation of Equal Cost Multi-Path routing in an operational network.

o  The duration of each measurement to be used for a metric validation, the number of measurement points collected for each metric during each measurement interval (i.e. the probe size) and the level of confidence derived from this probe size for each measurement interval.

o  The result of the statistical tests performed for each metric validation.

o  The measurement configuration and set up.
o  A parameterization of laboratory conditions and of the applied traffic and network conditions, allowing readers of the implementation report to reproduce these laboratory conditions.

All of the tests for each set MUST be run in a test set up as specified in the section "Test set up resulting in identical live network testing conditions".

It is RECOMMENDED to avoid effects which falsify results if validation measurements are taken over real data networks.  Obviously, the conditions met there can't be reproduced.  As the measurement equipment compared is designed to reliably quantify real network performance, validating metrics under real network conditions is of course desirable.

Data networks may forward packets differently in the case of:

o  Different packet sizes chosen for different metric implementations.  A proposed countermeasure is selecting the same packet size when validating results of two samples or a sample against an original distribution.

o  Selection of differing IP addresses and ports used by different metric implementations during metric validation tests.  If ECMP is applied on the IP or MPLS level, different paths can result (note that it may be impossible to detect an MPLS ECMP path from an IP endpoint).  A proposed countermeasure is to connect the measurement equipment to be compared via a NAT device, or to establish a single tunnel to transport all measurement traffic.  The aim is to have the same IP address and port for all measurement packets, or to avoid ECMP based local routing diversion by using a layer 2 tunnel.

o  Different IP options.

o  Different DSCP.

4.  Acknowledgements

Gerhard Hasslinger commented on a first version of this document, suggested statistical tests and the evaluation of time series information.  Henk Uijterwaal pushed this work and Mike Hamilton reviewed the document before publication.

5.
Contributors 596 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- 597 metrictest [bradner-metrictest], and major parts of it are quoted in 598 this document. Scott Bradner and Emile Stephan commented on this draft 599 before publication. 601 6. IANA Considerations 603 This memo includes no request to IANA. 605 7. Security Considerations 607 This draft does not raise any specific security issues. 609 8. References 611 8.1. Normative References 613 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 614 October 1996. 616 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 617 3", BCP 9, RFC 2026, October 1996. 619 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 620 Requirement Levels", BCP 14, RFC 2119, March 1997. 622 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, 623 "Framework for IP Performance Metrics", RFC 2330, 624 May 1998. 626 [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, 627 G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", 628 RFC 2661, August 1999. 630 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 631 Delay Metric for IPPM", RFC 2679, September 1999. 633 [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. 634 Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, 635 March 2000. 637 [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling 638 Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. 640 [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, 641 "Encapsulation Methods for Transport of Ethernet over MPLS 642 Networks", RFC 4448, April 2006. 644 [RFC4928] Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal 645 Cost Multipath Treatment in MPLS Networks", BCP 128, 646 RFC 4928, June 2007. 648 8.2. Informative References 650 [Autocorrelation] 651 N., N., "Autocorrelation", December 2008. 653 [Correlation] 654 N., N., "Correlation", June 2009.
656 [GU&Duffield] 657 Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE 658 Encapsulated Multicast Probing: A Scalable Technique for 659 Measuring One-Way Loss", SIGMETRICS'07 San Diego, 660 California, USA, June 2007. 662 [Precision] 663 N., N., "Accuracy and precision", June 2009. 665 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 666 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 667 RFC 5357, October 2008. 669 [Rule of thumb] 670 N., N., "Confidence interval", October 2008. 672 [bradner-metrictest] 673 Bradner, S., Mankin, A., and V. Paxson, "Advancement of 674 metrics specifications on the IETF Standards Track", 675 draft-bradner-metricstest-03 (work in progress), 676 July 2007. 678 [morton-advance-metrics] 679 Morton, A., "Problems and Possible Solutions for Advancing 680 Metrics on the Standards Track", draft-morton-ippm- 681 advance-metrics-00 (work in progress), July 2009. 683 Appendix A. Further ideas on statistical tests 685 IPPM metrics are captured by time series. Time series can be checked 686 for correlation. There are two expectations on statistical time 687 series properties which should be met by separate measurements 688 probing the same underlying network performance distribution: 690 o The autocorrelation indicates whether there are any repeating 691 patterns within a time series. For the purpose of this document, 692 it does not matter whether there is autocorrelation in a 693 measurement. It is, however, expected that two measurements expose 694 the same autocorrelation at identical "lag" intervals. If 695 calculable, the autocorrelation lies within the interval [-1;1] 696 (see Wikipedia on autocorrelation [Autocorrelation]). 698 o The correlation coefficient "indicates the strength of a linear 699 relationship between two random variables." The two random 700 variables in the case of this document are the measurement time 701 series of the IPPM implementations to be compared.
The 702 expectation is that both are strongly correlated and that the 703 resulting correlation coefficient is close to 1 (see Wikipedia on 704 correlation [Correlation]). 706 A metric test can derive additional statistics from time series 707 analysis. Further, formulation of a test hypothesis is possible for 708 autocorrelation and the correlation coefficient. It is, however, not 709 clear whether an appropriate statistical test to validate the 710 hypothesis at 95% significance exists. Applicability of time series 711 analysis for a metric test requires further input from statisticians. 713 In the absence of any metric test on time series, any test result 714 SHOULD provide the autocorrelation of the compared metric time 715 series for lags from 1 to 10. In addition, the value of the 716 correlation coefficient SHOULD be provided. Both the autocorrelation and the 717 correlation coefficient are expected to be rather close to the value 718 1. 720 As mentioned earlier, the time series analysis requires application 721 of identical time intervals to allow a comparison. In our delay 722 example, single sample delay metric values are calculated for 9 723 minute intervals. If 200 consecutive sample delay metrics with the 724 same start and end interval are available for each implementation, 725 autocorrelation can be calculated for different n * 9 minute lags. 726 The autocorrelation calculated for the time series of each 727 implementation should be very close to the autocorrelation of the 728 other implementation for the same time lag. Further, the correlation 729 coefficient of both time series should be close to 1. 731 Proving that two IPPM metric measurements provide compatible 732 results could then be done stepwise: 734 o First, prove that the two compared implementations have the same 735 precision by comparing statistics of the distribution of 736 singletons (or samples) of a metric, i.e. by comparing the EDF of the 737 samples captured by the two implementations.
739 o Second, indicate that the two compared implementations produce strongly 740 correlated time series, each of which individually has the same 741 autocorrelation as the other one. 743 Comparing the "accuracy" of IPPM implementations based on averages and 744 variations may require prior checks for the absence of long-range 745 dependency within the compared measurements. Large outliers, as 746 typically occur in the case of long-range dependency, can have a 747 serious impact on mean values. The median or percentiles may be more 748 robust measures on which to compare the accuracy of different IPPM 749 implementations. An idea may be to consider data up to a certain 750 percentile, calculate the mean for data up to this percentile and 751 then compare the means of the two implementations. This could be 752 repeated for different percentiles. If the impact of long-range 753 dependencies is limited to large outliers, the method may work for lower 754 percentiles. Whether this makes sense must be confirmed by a 755 statistician, so this attempt requires further study. 757 Appendix B. Verification of measurement precision by statistical 758 methods 760 Following the definition of statistical precision [Precision], a 761 measurement process can be characterised by two properties: 763 o Accuracy, which is the degree of conformity of a measured quantity 764 to its actual (true) value. 766 o Precision, also called reproducibility or repeatability, the 767 degree to which repeated measurements show the same or similar 768 results. 770 Figure 1 further clarifies the difference between accuracy and 771 precision of a measurement.

773            Probability ^
774            Density     |
775                        |  Reference value  Measured Value
776                        |       |                 |
777                        |       |<----Accuracy--->|
778                        |       |                _|_
779                        |       |               / | \
780                        |       |              /  |  \
781                        |       |             /   |   \
782                        |       |            /    |    \
783                        |       |           /     |     \
784                        |       |          /      |      \
785            Measured    |       |         /<- Precision ->\
786            Value      -|-------|------------------|---------->
787                        |

789 Measurement accuracy and precision [Precision].
791 Figure 1 793 The Framework for IP Performance Metrics (RFC 2330, [RFC2330]) 794 expects that a "methodology for a metric should have the property 795 that it is repeatable: if the methodology is used multiple times 796 under identical conditions, it should result in consistent 797 measurements." This means an IPPM implementation is expected to 798 measure a metric with high precision. 800 A guideline for an IPPM conformant metric implementation can be taken 801 from these principles: 803 Two different implementations measuring the same IPPM metric must 804 produce results with a limited difference when measuring under 805 network conditions that are identical to the largest extent possible. 807 In a metric test, both conditions are expected to hold, meaning that 808 repeated tests of two implementations MUST produce precise results 809 for all repetition intervals. 811 A suitable statistical test and a level of confidence to define 812 whether differences are sufficiently limited and whether a measurement is 813 highly precise are specified below. 815 Let us assume a one way delay measurement comparison between system A, 816 probing with a frequency of 2 probes per second, and system B, probing 817 at a rate of 2 probes every 3 minutes. To ensure reasonable 818 confidence in results, sample metrics are calculated from at least 5 819 singletons per compared time interval. This means sample delay 820 values are calculated for each system for identical 6 minute 821 intervals for the whole test duration. Per 6 minute interval, the 822 sample metric is calculated from 720 singletons for system A and from 823 6 singletons for system B. Note that, if outliers are not filtered, 824 moving averages are an option for an evaluation too. The minimum 825 shift of an averaging interval is three minutes in our example.
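The interval pairing and time-series comparison described above (and in Appendix A) can be sketched in a few lines of Python. This is an illustrative sketch only: the probe rates are taken from the example above, but the synthetic delay model, the random seed, and all function names are assumptions introduced here, not part of any measurement methodology specified by this document.

```python
# Hypothetical sketch: pair singleton delay streams from two
# implementations into identical 6-minute intervals and compute the
# correlation coefficient of the resulting sample-metric time series.
import math
import random

INTERVAL = 360  # seconds; identical 6-minute evaluation intervals

def sample_metric(singletons, timestamps, t0, t1):
    """Mean delay of all singletons captured in [t0, t1)."""
    values = [d for d, t in zip(singletons, timestamps) if t0 <= t < t1]
    return sum(values) / len(values) if values else None

def pearson(x, y):
    """Correlation coefficient of two equally long time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic singleton streams standing in for real measurements:
# system A probes at 2 per second, system B at 2 per 3 minutes (one
# probe every 90 s); both observe the same underlying delay.
random.seed(1)
duration = 10 * INTERVAL                              # ten 6-minute intervals
base = lambda t: 20.0 + 5.0 * math.sin(t / 900.0)     # assumed delay in ms
ts_a = [i / 2 for i in range(duration * 2)]
ts_b = [i * 90 for i in range(duration // 90)]
da = [base(t) + random.gauss(0, 0.2) for t in ts_a]
db = [base(t) + random.gauss(0, 0.2) for t in ts_b]

# Per 6-minute interval: 720 singletons for A, 4 for B in this toy setup.
series_a = [sample_metric(da, ts_a, k, k + INTERVAL)
            for k in range(0, duration, INTERVAL)]
series_b = [sample_metric(db, ts_b, k, k + INTERVAL)
            for k in range(0, duration, INTERVAL)]
r = pearson(series_a, series_b)
```

In a real metric test, the two singleton streams would come from the compared implementations measuring over the same tunnel, and the autocorrelation of each series at lags 1 to 10 would be reported alongside the correlation coefficient, as suggested in Appendix A.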
827 The test set up for the delay measurement is chosen to minimize 828 errors by co-locating one system of each implementation at each of the 829 two separate sites between which delay is measured for the metric 830 test. Both measurement sites are connected by one IPsec tunnel, so 831 that all measurement packets cross the Internet with the same IP 832 addresses. Both measurement systems measure simultaneously and the 833 local links are dimensioned to avoid congestion caused by the probing 834 traffic itself. 836 The measured delay values are reported with a resolution above the 837 measurement error and above the synchronisation error. This is done 838 to avoid comparing these errors of two different metric 839 implementations instead of comparing the IPPM metric implementations 840 themselves. 842 The overall duration of the test is chosen so that more than 1000 six 843 minute measurement intervals are collected. The amount of data 844 collected allows separate comparisons for, e.g., 200 consecutive 6 845 minute intervals. Intervals during which routes were unstable are 846 discarded prior to evaluation. 848 The captured delays may comprise singletons ranging from an 849 absolute minimum delay Dmin to values up to Dmin + 5 ms. To compare 850 distributions, the set of singletons of a chosen evaluation interval 851 (e.g. the data of one of the five 1200 minute capture sequences, see 852 above) is sorted by the frequency of singletons per Dmin + N * 0.5 853 ms (N = 1, 2, ...). After that, a comparison of the two probe sets 854 with any of the mentioned tests may be applied. 856 Authors' Addresses 858 Ruediger Geib (editor) 859 Deutsche Telekom 860 Heinrich Hertz Str.
3-7 861 Darmstadt, 64295 862 Germany 864 Phone: +49 6151 628 2747 865 Email: Ruediger.Geib@telekom.de 867 Al Morton 868 AT&T Labs 869 200 Laurel Avenue South 870 Middletown, NJ 07748 871 USA 873 Phone: +1 732 420 1571 874 Fax: +1 732 368 1192 875 Email: acmorton@att.com 876 URI: http://home.comcast.net/~acmacm/ 878 Reza Fardid 879 Covad Communications 880 2510 Zanker Road 881 San Jose, CA 95131 882 USA 884 Phone: +1 408 434-2042 885 Email: RFardid@covad.com