Internet Engineering Task Force                             R. Geib, Ed.
Internet-Draft                                          Deutsche Telekom
Intended status: Standards Track                               A. Morton
Expires: April 27, 2011                                         AT&T Labs
                                                               R. Fardid
                                                    Cariden Technologies
                                                            A. Steinmitz
                                                                HS Fulda
                                                        October 24, 2010

                   IPPM standard advancement testing
                     draft-ietf-ippm-metrictest-01

Abstract

   This document specifies tests to determine if multiple independent
   instantiations of a performance metric RFC have implemented the
   specifications in the same way.  This is the performance metric
   equivalent of interoperability, required to advance RFCs along the
   standards track.  Results from different implementations of metric
   RFCs will be collected under the same underlying network conditions
   and compared using state of the art statistical methods.
   The goal is an evaluation of the metric RFC itself: whether its
   definitions are clear and unambiguous to implementors, and therefore
   whether the RFC is a candidate for advancement on the IETF standards
   track.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 27, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Requirements Language  . . . . . . . . . . . . . . . . . .  6
   2.  Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . .  6
   3.  Verification of conformance to a metric specification . . . .  8
     3.1.  Tests of an individual implementation against a metric
           specification  . . . . . . . . . . . . . . . . . . . . . .  9
     3.2.  Test setup resulting in identical live network testing
           conditions . . . . . . . . . . . . . . . . . . . . . . . . 11
     3.3.  Tests of two or more different implementations against
           a metric specification . . . . . . . . . . . . . . . . . . 15
     3.4.  Clock synchronisation  . . . . . . . . . . . . . . . . . . 16
     3.5.  Recommended Metric Verification Measurement Process  . . . 17
     3.6.  Miscellaneous  . . . . . . . . . . . . . . . . . . . . . . 20
     3.7.  Proposal to determine an "equivalence" threshold for
           each metric evaluated  . . . . . . . . . . . . . . . . . . 21
   4.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22
   5.  Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 22
   6.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 22
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 22
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 23
     8.1.  Normative References . . . . . . . . . . . . . . . . . . . 23
     8.2.  Informative References . . . . . . . . . . . . . . . . . . 24
   Appendix A.  An example on a One-way Delay metric validation . . . 25
     A.1.  Compliance to Metric specification requirements  . . . . . 25
     A.2.  Examples related to statistical tests for One-way Delay  . 26
   Appendix B.  Anderson-Darling 2 sample C++ code  . . . . . . . . . 28
   Appendix C.  A tunneling set up for remote metric
                implementation testing . . . . . . . . . . . . . . . . 36
   Appendix D.  Glossary  . . . . . . . . . . . . . . . . . . . . . . 38
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38

1.  Introduction

   The Internet Standards Process, RFC 2026 [RFC2026], requires that
   for an IETF specification to advance beyond the Proposed Standard
   level, at least two genetically unrelated implementations must be
   shown to interoperate correctly with all features and options.
   This requirement can be met by supplying:

   o  evidence that (at least a sub-set of) the specification has been
      implemented by multiple parties, thus indicating adoption by the
      IETF community and the extent of feature coverage.

   o  evidence that each feature of the specification is sufficiently
      well-described to support interoperability, as demonstrated
      through testing and/or user experience with deployment.

   In the case of a protocol specification, the notion of
   "interoperability" is reasonably intuitive - the implementations
   must successfully "talk to each other", while exercising all
   features and options.  To achieve interoperability, two implementors
   need to interpret the protocol specifications in equivalent ways.
   In the case of IP Performance Metrics (IPPM), this definition of
   interoperability is only useful for test and control protocols like
   the One-Way Active Measurement Protocol, OWAMP [RFC4656], and the
   Two-Way Active Measurement Protocol, TWAMP [RFC5357].

   A metric specification RFC describes one or more metric definitions,
   methods of measurement and a way to report the results of
   measurement.  One example is the One-way Delay Metric: a way to
   measure and report the One-way Delay that data packets incur while
   being sent from one network location to another.

   In the case of metric specifications, the conditions that satisfy
   the "interoperability" requirement are less obvious, and there was a
   need for IETF agreement on practices to judge metric specification
   "interoperability" in the context of the IETF Standards Process.
   This memo provides methods which should be suitable to evaluate
   metric specifications for standards track advancement.  The methods
   proposed here MAY be generally applicable to metric specification
   RFCs beyond those developed under the IPPM Framework [RFC2330].
   Since many implementations of IP metrics are embedded in measurement
   systems that do not interact with one another (they were built
   before OWAMP and TWAMP), the interoperability evaluation called for
   in the IETF standards process cannot be determined by observing that
   independent implementations interact properly for various protocol
   exchanges.  Instead, verifying that different implementations give
   statistically equivalent results under controlled measurement
   conditions takes the place of interoperability observations.  Even
   when evaluating OWAMP and TWAMP RFCs for standards track
   advancement, the methods described here are useful to evaluate the
   measurement results, because their validity would not be ascertained
   in typical interoperability testing.

   The standards advancement process aims at producing confidence that
   the metric definitions and supporting material are clearly worded
   and unambiguous, or it reveals ways in which the metric definitions
   can be revised to achieve clarity.  The process also permits
   identification of options that were not implemented, so that they
   can be removed from the advancing specification.  Thus, the product
   of this process is information about the metric specification RFC
   itself: determination of the specifications or definitions that are
   clear and unambiguous and those that are not (as opposed to an
   evaluation of the implementations which assist in the process).

   This document defines a process to verify that implementations (or
   practically, measurement systems) have interpreted the metric
   specifications in equivalent ways, and produce equivalent results.

   Testing for statistical equivalence requires ensuring identical test
   setups (or awareness of differences) to the best possible extent.
   Thus, producing identical test conditions is a core goal of this
   memo.
   Another important aspect of this process is to test individual
   implementations against specific requirements in the metric
   specifications, using customized tests for each requirement.  These
   tests can distinguish equivalent interpretations of each specific
   requirement.

   Conclusions on equivalence are reached by two measures.

   First, implementations are compared against individual metric
   specifications to make sure that differences in implementation are
   minimised or at least known.

   Second, a test setup is proposed ensuring identical networking
   conditions, so that unknowns are minimized and comparisons are
   simplified.  The resulting separate data sets may be seen as samples
   taken from the same underlying distribution.  Using state of the art
   statistical methods, the equivalence of the results is verified.  To
   illustrate application of the process and methods defined here, an
   evaluation of the One-way Delay Metric [RFC2679] is provided in an
   Appendix.  While test setups will vary with the metrics to be
   validated, the general methodology of determining equivalent results
   will not.  Documents defining test setups to evaluate other metrics
   should be developed once the process proposed here has been agreed
   and approved.

   The metric RFC advancement process begins with a request for
   protocol action accompanied by a memo that documents the supporting
   tests and results.  The procedures of [RFC2026] are expanded in
   [RFC5657], including sample implementation and interoperability
   reports.  Section 3 of [morton-advance-metrics-01] can serve as a
   template for a metric RFC report which accompanies the protocol
   action request to the Area Director, including a description of the
   test set-up, procedures, results for each implementation, and
   conclusions.
Changes from WG -00 to WG -01 draft

   o  Discussion on merits and requirements of a distributed lab test
      using only local load generators.

   o  Proposal of metrics suitable for tests using the proposed
      measurement configuration.

   o  Hint on delay caused by software based L2TPv3 implementations.

   o  Added an appendix with a test configuration allowing remote tests
      comparing different implementations across the network.

   o  Proposal for a maximum error of "equivalence", based on
      performance comparison of identical implementations.  This may be
      useful for both ADK and non-ADK comparisons.

Changes from prior ID -02 to WG -00 draft

   o  Incorporation of aspects of reporting to support the protocol
      action request in the Introduction and section 3.5.

   o  Overhaul of section 3.2 regarding tunneling: Added generic
      tunneling requirements and L2TPv3 as an example tunneling
      mechanism fulfilling the tunneling requirements.  Removed and
      adapted some of the prior references to other tunneling
      protocols.

   o  Softened a requirement within section 3.4 (MUST to SHOULD on
      precision) and removed some comments of the authors.

   o  Updated contact information of one author and added a new author.

   o  Added example C++ code of an Anderson-Darling two sample test
      implementation.

Changes from ID -01 to ID -02 version

   o  Major editorial review, rewording and clarifications on all
      contents.

   o  Additional text on parallel testing using VLANs and GRE or
      Pseudowire tunnels.

   o  Additional examples and a glossary.

Changes from ID -00 to ID -01 version

   o  Addition of a comparison of individual metric implementations
      against the metric specification (trying to pick up problems and
      solutions for metric advancement [morton-advance-metrics]).

   o  More emphasis on the requirement to carefully design and document
      the measurement setup of the metric comparison.
   o  Proposal of testing conditions under identical WAN network
      conditions using IP in IP tunneling or Pseudo Wires and parallel
      measurement streams.

   o  Proposal of the requirement to document the smallest resolution
      at which an ADK test was passed at the 95% confidence level.  As
      no minimum resolution is specified, IPPM metric compliance is not
      linked to a particular performance of an implementation.

   o  Reference to RFC 2330 and RFC 2679 for the 95% confidence
      interval as the preferred criterion to decide on statistical
      equivalence.

   o  Reduction of the proposed statistical tests to ADK with 95%
      confidence.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Basic idea

   The implementation of a standard compliant metric is expected to
   meet the requirements of the related metric specification.  So
   before comparing two metric implementations, each metric
   implementation is individually compared against the metric
   specification.

   Most metric specifications leave freedom to implementors on non-
   fundamental aspects of an individual metric (or options).  Comparing
   different measurement results using a statistical test, under the
   assumption of identical test paths and testing conditions, requires
   knowledge of all differences in the overall test setup.  Metric
   specification options chosen by implementors have to be documented.
   It is REQUIRED to use identical implementation options wherever
   possible for any test proposed here.  Calibrations proposed by
   metric standards should be performed to further identify (and
   possibly reduce) potential sources of errors in the test setup.
   The Framework for IP Performance Metrics [RFC2330] expects that a
   "methodology for a metric should have the property that it is
   repeatable: if the methodology is used multiple times under
   identical conditions, it should result in consistent measurements."
   This means an implementation is expected to repeatedly measure a
   metric with consistent results (repeatability with the same result).
   Small deviations in the test setup are expected to lead to small
   deviations in results only.  To characterise statistical equivalence
   in the case of small deviations, RFC 2330 and [RFC2679] suggest
   applying a 95% confidence interval.  Quoting RFC 2679, "95 percent
   was chosen because ... a particular confidence level should be
   specified so that the results of independent implementations can be
   compared."

   Two different implementations are expected to produce statistically
   equivalent results if they both measure a metric under the same
   networking conditions.  Formulated in statistical terms: separate
   metric implementations collect separate samples from the same
   underlying statistical process (the same network conditions).  The
   statistical hypothesis to be tested is the expectation that both
   samples do not expose statistically different properties.  This
   requires careful test design:

   o  The measurement test setup must be self-consistent to the largest
      possible extent.  To minimize the influence of the test and
      measurement setup on the result, network conditions and paths
      MUST be identical for the compared implementations to the largest
      possible degree.  This includes both the stability and non-
      ambiguity of routes taken by the measurement packets.  See RFC
      2330 for a discussion on self-consistency.

   o  The error induced by the sample size must be small enough to
      minimize its influence on the test result.
      This must be taken into account especially if two implementations
      measure with different average probing rates.

   o  Every comparison must be repeated several times, based on
      different measurement data, to avoid random indications of
      compatibility (or the lack of it).

   o  To minimize the influence of implementation options on the
      result, metric implementations SHOULD use identical options and
      parameters for the metric under evaluation.

   o  The implementation with the lowest probing frequency determines
      the smallest temporal interval for which samples can be compared.

   The metric specifications themselves are the primary focus of
   evaluation, rather than the implementations of metrics.  The
   documentation produced by the advancement process should identify
   which metric definitions and supporting material were found to be
   clearly worded and unambiguous, OR, it should identify ways in which
   the metric specification text should be revised to achieve clarity
   and unified interpretation.

   The process should also permit identification of options that were
   not implemented, so that they can be removed from the advancing
   specification (this is an aspect more typical of protocol
   advancement along the standards track).

   Note that this document does not propose to base interoperability
   indications of performance metric implementations on comparisons of
   individual singletons.  Individual singletons may be impacted by
   many statistical effects while they are measured.  Comparing two
   singletons of different implementations may result in failures with
   higher probability than comparing samples.

3.  Verification of conformance to a metric specification

   This section specifies how to verify compliance of two or more IPPM
   implementations against a metric specification.  This document only
   proposes a general methodology.
   Compliance criteria for a specific metric implementation need to be
   defined for each individual metric specification.  The only
   exception is the statistical test comparing two metric
   implementations which are simultaneously tested.  This test is
   applicable without metric specific decision criteria.

   Several testing options exist to compare two or more
   implementations:

   o  Use a single test lab to compare the implementations and emulate
      the Internet with an impairment generator.

   o  Use a single test lab to compare the implementations and measure
      across the Internet.

   o  Use remotely separated test labs to compare the implementations
      and emulate the Internet with two "identically" configured
      impairment generators.

   o  Use remotely separated test labs to compare the implementations
      and measure across the Internet.

   o  Use remotely separated test labs to compare the implementations,
      measure across the Internet, and include a single impairment
      generator to impact all measurement flows in a non-discriminatory
      way.

   The first two approaches work, but cause higher expenses than the
   other ones (due to travel and/or shipping and installation).  For
   the third option, ensuring two identically configured impairment
   generators requires well defined test cases and possibly identical
   hard- and software.  [Note: for some specific tests, impairment
   generator accuracy requirements are less demanding than others, and
   in such cases there is more flexibility in impairment generator
   configuration.]

   It is a fair question whether the last two options can result in an
   applicable test set up at all.  While an experimental approach is
   given in Appendix C, the tradeoff that measurement packets of
   different sites pass the same path segments, but always in a
   different order of segments, probably can't be avoided.
   The question of which option above results in identical networking
   conditions and is broadly accepted can't be answered without more
   practical experience in comparing implementations.  The last
   proposal has the advantage that, while the measurement equipment is
   remotely distributed, a single network impairment generator and the
   Internet can be used in combination to impact all measurement flows.

3.1.  Tests of an individual implementation against a metric
      specification

   A metric implementation MUST support the requirements classified as
   "MUST" and "REQUIRED" of the related metric specification to be
   compliant with the latter.

   Further, supported options of a metric implementation SHOULD be
   documented in sufficient detail.  The documentation of chosen
   options is RECOMMENDED to minimise (and recognise) differences in
   the test setup if two metric implementations are compared.  Further,
   this documentation is used to validate and improve the underlying
   metric specification options, and to remove options which saw no
   implementation or which are badly specified from the metric
   specification to be promoted to a standard.  This documentation
   SHOULD be made for all implementation relevant specifications of a
   metric picked for a comparison which aren't explicitly marked as
   "MUST" or "REQUIRED" in the metric specification.  This applies to
   the following sections of all metric specifications:

   o  Singleton Definition of the Metric.

   o  Sample Definition of the Metric.

   o  Statistics Definition of the Metric.  As statistics are compared
      by the test specified here, this documentation is required even
      in the case that the metric specification does not contain a
      Statistics Definition.

   o  Timing and Synchronisation related specification (if relevant for
      the Metric).
   o  Any other technical part present or missing in the metric
      specification which is relevant for the implementation of the
      Metric.

   RFC 2330 and RFC 2679 emphasise precision as an aim of IPPM metric
   implementations.  A single IPPM conformant implementation MUST,
   under otherwise identical network conditions, produce precise
   results for repeated measurements of the same metric.

   RFC 2330 prefers the "empirical distribution function" (EDF) to
   describe collections of measurements.  RFC 2330 states that, "unless
   otherwise stated, IPPM goodness-of-fit tests are done using 5%
   significance."  The goodness of fit test determines the precision
   with which two or more samples of a metric implementation belong to
   the same underlying distribution (of measured network performance
   events).  The goodness of fit test to be applied is the Anderson-
   Darling K sample test (ADK sample test, where K stands for the
   number of samples to be compared) [ADK].  Please note that RFC 2330
   and RFC 2679 apply an Anderson-Darling goodness of fit test too.

   The results of a repeated test with a single implementation MUST
   pass an ADK sample test at a confidence level of 95%.  The
   resolution for which the ADK test has been passed with the specified
   confidence level MUST be documented.  To formulate this differently:
   the requirement is to document the smallest resolution at which the
   results of the tested metric implementation pass an ADK test with a
   confidence level of 95%.  The minimum resolution available in the
   reported results from each implementation MUST be taken into account
   in the ADK test.

3.2.  Test setup resulting in identical live network testing conditions

   Two major issues complicate tests for metric compliance across live
   networks under identical testing conditions.
   One is the general point that metric definition implementations
   cannot be conveniently examined in field measurement scenarios.  The
   other one is more broadly described as "parallelism in devices and
   networks", including mechanisms like those that achieve load
   balancing (see [RFC4928]).

   This section proposes two measures to deal with both issues.
   Tunneling mechanisms can be used to avoid parallel processing of
   different flows in the network.  Measuring by separate parallel
   probe flows results in repeated collection of data.  If both
   measures are combined, WAN network conditions are identical for a
   number of independent measurement flows, no matter what the network
   conditions are in detail.

   Any measurement setup MUST be designed so that the probing traffic
   itself does not impede the metric measurement.  The created
   measurement load MUST NOT result in congestion at the access link
   connecting the measurement implementation to the WAN.  The created
   measurement load MUST NOT overload the measurement implementation
   itself, e.g. by causing a high CPU load or by creating imprecisions
   due to internal transmit (or receive, respectively) probe packet
   collisions.

   Tunneling multiple flows that reach a network element on a single
   physical port may allow all packets of the tunnel to be transmitted
   via the same path.  Applying tunnels to avoid undesired influence of
   standard routing for measurement purposes is a concept known from
   the literature, see e.g. GRE encapsulated multicast probing
   [GU+Duffield].  An existing IP in IP tunnel protocol can be applied
   to avoid Equal-Cost Multi-Path (ECMP) routing of different
   measurement streams if it meets the following criteria:

   o  Inner IP packets from different measurement implementations are
      mapped into a single tunnel, with a single outer IP origin and
      destination address as well as origin and destination port
      numbers which are identical for all packets.
   o  An easily accessible commodity tunneling protocol allows a metric
      test to be carried out from more test sites.

   o  A low operational overhead may enable a broader audience to set
      up a metric test with the desired properties.

   o  The tunneling protocol should be reliable and stable in set up
      and operation to avoid disturbances of, or influence on, the test
      results.

   o  The tunneling protocol should not incur any extra cost for those
      interested in setting up a metric test.

   An illustration of a test setup with two tunnels and two flows
   between two linecards of one implementation is given in Figure 1.

   Implementation              ,---.              +--------+
              +~~~~~~~~~~~/          \~~~~~~      | Remote |
   +------->-----F2->-|  /            \    |->---+|        |
   |  +---------+     | Tunnel 1(      )   |    | |        |
   |  | transmit|-F1->-|      (         )  |->+ | |        |
   |  | LC1     |  +~~~~~~~~~|           |~~~~| | |        |
   |  | receive |-<--+       (           ) | F1  F2        |
   |  +---------+    |       |Internet   | |   | |         |
   *-------<-----+  F2       |           | |   | |         |
   +---------+   |   | +~~~~~~~~~|      |~~~~| | |         |
   | transmit|-*  *-|  |          |         |--+<-*        |
   | LC2     |      | Tunnel 2(    )       |    |          |
   | receive |-<-F1-|  \          /        |<-*            |
   +---------+     +~~~~~~~~~~~\ /~~~~~~   | Router |
                                `-+-'      +--------+

   Illustration of a test setup with two tunnels.  For simplicity, only
   two linecards of one implementation and two flows F between them are
   shown.

                                 Figure 1

   Figure 2 shows the network elements required to set up GRE tunnels
   or Pseudowires as shown by figure 1.

   Implementation

   +-----+                     ,---.
   | LC1 |                    /     \
   +-----+                   /       \                    +------+
      |        +-------+    (         )    +-------+      |Remote|
   +--------+  |       |    |         |    |       |      |      |
   |Ethernet|  | Tunnel|    |Internet |    | Tunnel|      |      |
   |Switch  |--| Head  |----|         |----| Head  |------|      |
   +--------+  | Router|    |         |    | Router|      |      |
      |        |       |    (         )    |       |      |Router|
   +-----+     +-------+     \       /     +-------+      +------+
   | LC2 |                    \     /
   +-----+                     `-+-'

   Illustration of a hardware setup to realise the test setup
   illustrated by figure 1 with GRE tunnels or Pseudowires.
                                 Figure 2

   If tunneling is applied, two tunnels MUST carry all test traffic
   between the test site and the remote site.  For example, if 802.1Q
   Ethernet Virtual LANs (VLANs) are applied and the measurement
   streams are carried in different VLANs, the IP tunnel or Pseudo
   Wires, respectively, MUST be set up in physical port mode to avoid
   the set up of Pseudo Wires per VLAN (which may see different paths
   due to ECMP routing), see RFC 4448.  The remote router and the
   Ethernet switch shown in figure 2 must support 802.1Q in this set
   up.

   The IP packet size of the metric implementation SHOULD be chosen
   small enough to avoid fragmentation due to the added Ethernet and
   tunnel headers.  Otherwise, the impact of tunnel overhead on
   fragmentation and interface MTU size MUST be understood and taken
   into account (see [RFC4459]).

   An Ethernet port mode IP tunnel carrying several 802.1Q VLANs, each
   containing measurement traffic of a single measurement system, was
   set up as a proof of concept using RFC 4719 [RFC4719], Transport of
   Ethernet Frames over L2TPv3.  Ethernet over L2TPv3 seems to fulfill
   most of the desired tunneling protocol criteria mentioned above.

   The following headers may have to be accounted for when calculating
   the total packet length if VLANs and Ethernet over L2TPv3 tunnels
   are applied:

   o  Ethernet 802.1Q: 22 Bytes.

   o  L2TPv3 Header: 4-16 Bytes for L2TPv3 data messages over IP; 16-28
      Bytes for L2TPv3 data messages over UDP.

   o  IPv4 Header (outer IP header): 20 Bytes.

   o  MPLS Labels may be added by a carrier.  Each MPLS Label has a
      length of 4 Bytes.  At the time of writing, between 1 and 4
      labels seems to be a fair guess of what can be expected.
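   As a rough illustration of the overhead arithmetic, the sketch below
   computes the largest inner IP packet that avoids fragmentation of
   the tunnel packet.  The 1500-byte path MTU, the worst-case L2TPv3
   over UDP header, and the four MPLS labels are illustrative
   assumptions, not requirements of this memo:

```python
# Worst-case encapsulation overhead for a measurement packet carried in
# an 802.1Q Ethernet frame over an L2TPv3/UDP tunnel, using the header
# sizes listed above.  MTU and label count are illustrative assumptions.
ETH_8021Q = 22    # Ethernet 802.1Q framing
L2TPV3_UDP = 28   # worst case: L2TPv3 data message over UDP
OUTER_IPV4 = 20   # outer IPv4 header
MPLS_LABEL = 4    # per label; a carrier may add 1 to 4 labels

def max_inner_packet(path_mtu=1500, mpls_labels=4):
    """Largest inner IP packet that avoids fragmenting the tunnel packet."""
    overhead = ETH_8021Q + L2TPV3_UDP + OUTER_IPV4 + mpls_labels * MPLS_LABEL
    return path_mtu - overhead

print(max_inner_packet())  # 1500 - (22 + 28 + 20 + 16) = 1414
```

   With L2TPv3 carried directly over IP (4-byte minimum header) and no
   MPLS labels, the same arithmetic leaves correspondingly more room
   for the inner packet.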
   The applicability of one or more of the following tunneling
   protocols may be investigated by interested parties if Ethernet over
   L2TPv3 is felt to be not suitable: IP in IP [RFC2003] or Generic
   Routing Encapsulation (GRE) [RFC2784].  RFC 4928 [RFC4928] proposes
   measures to avoid ECMP treatment in MPLS networks.

   L2TP is a commodity tunneling protocol [RFC2661].  At the time of
   writing, L2TPv3 [RFC3931] is the latest version of L2TP.  If L2TPv3
   is applied, software based implementations of this protocol are not
   suitable for the test set up, as such implementations may cause
   incalculable delay shifts.

   Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448].
   While there's no technical issue with this solution, MPLS interfaces
   are mostly found in the network provider domain.  Hence not all of
   the above tunneling criteria are met.

   Appendix C provides an experimental tunneling set up for metric
   implementation testing between two (or more) remote sites.

   Each test is repeated several times.  WAN conditions may change over
   time.  Sequential testing is desirable, but may not be a useful
   metric test option.  It is RECOMMENDED that tests be carried out by
   establishing N different parallel measurement flows.  Two or three
   linecards per implementation serving to send or receive measurement
   flows should be sufficient to create 5 or more parallel measurement
   flows.  If three linecards are used, each card sends and receives
   two flows.  Other options are to separate flows by DiffServ marks
   (without deploying any QoS in the inner or outer tunnel) or to use a
   single CBR flow and evaluate every n-th singleton as belonging to a
   specific measurement flow.
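   The last option, deriving logical flows from one CBR stream by
   taking every n-th singleton, can be sketched as follows (a minimal
   illustration; the function name and sample data are hypothetical):

```python
def split_cbr_flow(singletons, n_flows):
    """Assign every n-th singleton of a single CBR measurement stream
    to the same logical flow, yielding n_flows interleaved sub-samples
    that all experienced the same network conditions."""
    return [singletons[i::n_flows] for i in range(n_flows)]

# Example: 12 consecutive singletons (here just sequence numbers)
# split into 3 logical measurement flows.
flows = split_cbr_flow(list(range(12)), 3)
# flows == [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

   Each sub-sample then enters the statistical comparison as if it were
   a separate measurement flow.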
636 Some additional rules to calculate and compare samples have to be 637 respected to perform a metric test: 639 o Comparing different probes of a common underlying distribution in 640 terms of metrics characterising a communication network requires 641 respecting the temporal conditions under which the assumption of a common 642 underlying distribution may hold. Any singletons or samples to be 643 compared MUST be captured within the same time interval. 645 o Whenever statistical events like singletons or rates are used to 646 characterise measured metrics of a time interval, at least 5 647 singletons of a relevant metric SHOULD be present to ensure a 648 minimum confidence in the reported value (see Wikipedia on 649 confidence [Rule of thumb]). Note that this criterion is also to 650 be respected, e.g., when comparing packet loss metrics. Any packet 651 loss measurement interval to be compared with the results of 652 another implementation SHOULD contain at least five lost packets 653 to have a minimum confidence that the observed loss rate wasn't 654 caused by a small number of random packet drops. 656 o The minimum number of singletons or samples to be compared by an 657 Anderson-Darling test SHOULD be 100 per tested metric 658 implementation. Note that the Anderson-Darling test detects small 659 differences in distributions fairly well and will fail for a high 660 number of compared results (RFC2330 mentions an example with 8192 661 measurements where an Anderson-Darling test always failed). 663 o Generally, the Anderson-Darling test is sensitive to differences 664 in the accuracy or bias associated with varying implementations or 665 test conditions. These dissimilarities may result in differing 666 averages of the samples to be compared. An example may be different 667 packet sizes, resulting in a constant delay difference between 668 compared samples.
Therefore, samples to be compared by an Anderson- 669 Darling test MAY be calibrated by the difference of the average 670 values of the samples. Any calibration of this kind MUST be 671 documented in the test result. 673 3.3. Tests of two or more different implementations against a metric 674 specification 676 RFC2330 expects "a methodology for a given metric [to] exhibit 677 continuity if, for small variations in conditions, it results in 678 small variations in the resulting measurements. Slightly more 679 precisely, for every positive epsilon, there exists a positive delta, 680 such that if two sets of conditions are within delta of each other, 681 then the resulting measurements will be within epsilon of each 682 other." A small variation in conditions in the context of the metric 683 test proposed here can be seen as different implementations measuring 684 the same metric along the same path. 686 IPPM metric specifications, however, allow for implementor options to 687 the largest possible degree. It can't be expected that two 688 implementors pick identical options for their implementations. 689 Implementors SHOULD, to the highest degree possible, pick the same 690 configurations for their systems when comparing their implementations 691 by a metric test. 693 In some cases, a goodness-of-fit test may not be possible or may show 694 disappointing results. To clarify the difficulties arising from 695 different implementation options, the individual options picked for 696 every compared implementation SHOULD be documented in sufficient 697 detail. Based on this documentation, the underlying metric 698 specification should be improved before it is promoted to a standard. 700 The same statistical test as is applicable to quantify the precision of a 701 single metric implementation MUST be passed to compare metric 702 conformance of different implementations.
To document compatibility, 703 the smallest measurement resolution at which the compared 704 implementations passed the ADK sample test MUST be documented. 706 For different implementations of the same metric, "variations in 707 conditions" are reasonably expected. The ADK test comparing samples 708 of the different implementations may result in a lower precision than 709 the test for precision of each implementation individually. 711 3.4. Clock synchronisation 713 Clock synchronization effects require special attention. The accuracy of 714 one-way active delay measurements for any metric implementation 715 depends on clock synchronization between the source and destination 716 of tests. Ideally, one-way active delay measurement (RFC 2679 717 [RFC2679]) test endpoints either have direct access to independent 718 GPS- or CDMA-based time sources or indirect access to nearby NTP 719 primary (stratum 1) time sources, equipped with GPS receivers. 720 Access to these time sources may not be available at all test 721 locations associated with different Internet paths, for a variety of 722 reasons out of the scope of this document. 724 When secondary (stratum 2 and above) time sources are used with NTP 725 running across the same network whose metrics are subject to 726 comparative implementation tests, network impairments can affect 727 clock synchronization and distort sample one-way values and their 728 interval statistics. It is RECOMMENDED to discard sample one-way 729 delay values for any implementation when one of the following 730 reliability conditions is met: 732 o Delay is measured and is finite in one direction, but not the 733 other. 735 o The absolute value of the difference between the sum of one-way 736 measurements in both directions and the round-trip measurement is 737 greater than X% of the latter value.
739 Examination of the second condition requires an RTT measurement for 740 reference, e.g., based on TWAMP (RFC 5357 [RFC5357]), in 741 conjunction with the one-way delay measurement. 743 Specification of X% to strike a balance between identification of 744 unreliable one-way delay samples and misidentification of reliable 745 samples under a wide range of Internet path RTTs probably requires 746 further study. 748 An IPPM compliant metric implementation whose measurement requires 749 synchronized clocks is, however, expected to provide precise 750 measurement results. Any IPPM metric implementation SHOULD be of a 751 precision of 1 ms (+/- 500 us) with a confidence of 95% if the metric 752 is captured along an Internet path which is stable and not congested 753 during a measurement duration of an hour or more. 755 3.5. Recommended Metric Verification Measurement Process 757 In order to meet their obligations under the IETF Standards Process, 758 the IESG must be convinced that each metric specification advanced to 759 Draft Standard or Internet Standard status is clearly written, that 760 there are the required multiple verifiably equivalent 761 implementations, and that all options have been implemented. 763 In the context of this document, metrics are designed to measure some 764 characteristic of a data network. An aim of any metric definition 765 should be that it is specified in a way that allows the 766 specific characteristic to be measured reliably and repeatably across 767 multiple independent implementations. 769 Each metric, statistic or option of those to be validated MUST be 770 compared against a reference measurement or another implementation by 771 at least 5 different basic data sets, each one with sufficient size 772 to reach the specified level of confidence, as specified by this 773 document.
775 Finally, the metric definitions, embodied in the text of the RFCs, 776 are the objects that require evaluation and possible revision in 777 order to advance to the next step on the standards track. 779 IF two (or more) implementations do not measure an equivalent metric 780 as specified by this document, 782 AND sources of measurement error do not adequately explain the lack 783 of agreement, 785 THEN the details of each implementation should be audited along with 786 the exact definition text, to determine if there is a lack of clarity 787 that has caused the implementations to vary in a way that affects the 788 correspondence of the results. 790 IF there was a lack of clarity or multiple legitimate interpretations 791 of the definition text, 793 THEN the text should be modified and the resulting memo proposed for 794 consensus and (possible) advancement along the standards track. 796 Finally, all the findings MUST be documented in a report that can 797 support advancement on the standards track, similar to those 798 described in [RFC5657]. The list of measurement devices used in 799 testing satisfies the implementation requirement, while the test 800 results provide information on the quality of each specification in 801 the metric RFC (the surrogate for feature interoperability). 803 The complete process of advancing a metric specification to a 804 standard as defined by this document is illustrated in Figure 3. 806 ,---. 807 / \ 808 ( Start ) 809 \ / Implementations 810 `-+-' +-------+ 811 | /| 1 `. 812 +---+----+ / +-------+ `.-----------+ ,-------. 813 | RFC | / |Check for | ,' was RFC `. YES 814 | | / |Equivalence.... clause x ------+ 815 | |/ +-------+ |under | `. clear? ,' | 816 | Metric \.....| 2 ....relevant | `---+---' +----+-----+ 817 | Metric |\ +-------+ |identical | No | |Report | 818 | Metric | \ |network | +--+----+ |results + | 819 | ... 
| \ |conditions | |Modify | |Advance | 820 | | \ +-------+ | | |Spec +--+RFC | 821 +--------+ \| n |.'+-----------+ +-------+ |request(?)| 822 +-------+ +----------+ 824 Illustration of the metric standardisation process 826 Figure 3 828 Any recommendation for the advancement of a metric specification MUST 829 be accompanied by an implementation report, as is the case with all 830 requests for the advancement of IETF specifications. The 831 implementation report needs to include the tests performed, the 832 applied test setup, the specific metrics in the RFC and reports of 833 the tests performed with two or more implementations. The test plan 834 needs to specify the precision reached for each measured metric and 835 thus define the meaning of "statistically equivalent" for the 836 specific metrics being tested. 838 Ideally, the test plan would co-evolve with the development of the 839 metric, since that's when people have the most context in their 840 thinking regarding the different subtleties that can arise. 842 In particular, the implementation report MUST, as a minimum, document: 844 o The metric compared and the RFC specifying it. This includes 845 statements as required by the section "Tests of an individual 846 implementation against a metric specification" of this document. 848 o The measurement configuration and setup. 850 o A complete specification of the measurement stream (mean rate, 851 statistical distribution of packets, packet size or mean packet 852 size and their distribution), DSCP and any other measurement 853 stream properties which could result in deviating results. 854 Deviations in results can also be caused if the chosen IP addresses 855 and ports of different implementations result in different 856 layer 2 or layer 3 paths due to the operation of Equal Cost Multi-Path 857 routing in an operational network.
859 o The duration of each measurement to be used for a metric 860 validation, the number of measurement points collected for each 861 metric during each measurement interval (i.e., the probe size) and 862 the level of confidence derived from this probe size for each 863 measurement interval. 865 o The result of the statistical tests performed for each metric 866 validation as required by the section "Tests of two or more 867 different implementations against a metric specification" of this 868 document. 870 o A parameterization of laboratory conditions and applied traffic 871 and network conditions allowing reproduction of these laboratory 872 conditions for readers of the implementation report. 874 o The documentation helping to improve metric specifications defined 875 by this section. 877 All of the tests for each set SHOULD be run in a test setup as 878 specified in the section "Test setup resulting in identical live 879 network testing conditions". 881 If a different test set up is chosen, it is RECOMMENDED to avoid 882 effects caused by real data networks (like parallelism in devices and 883 networks) which falsify results of validation measurements. Data 884 networks may forward packets differently in the case of: 886 o Different packet sizes chosen for different metric 887 implementations. A proposed countermeasure is selecting the same 888 packet size when validating results of two samples or a sample 889 against an original distribution. 891 o Selection of differing IP addresses and ports used by different 892 metric implementations during metric validation tests. If ECMP is 893 applied at the IP or MPLS level, different paths can result (note that 894 it may be impossible to detect an MPLS ECMP path from an IP 895 endpoint).
A proposed countermeasure is to connect the 896 measurement equipment to be compared by a NAT device, or to 897 establish a single tunnel to transport all measurement traffic. 898 The aim is to have the same IP addresses and ports for all 899 measurement packets or to avoid ECMP-based local routing diversion 900 by using a layer 2 tunnel. 902 o Different IP options. 904 o Different DSCPs. 906 o If the N measurements are captured using sequential measurements 907 instead of simultaneous ones, then the following factors come into 908 play: time-varying paths and load conditions. 910 3.6. Miscellaneous 912 A minimum number of singletons per metric is required if results are 913 to be compared. To avoid accidental singletons from impacting a 914 metric comparison, a minimum number of 5 singletons per compared 915 interval was proposed above. Commercial Internet service is not 916 operated to reliably create enough rare singleton events to 917 characterize bad measurement engineering or bad implementations. In 918 the case that a metric validation requires capturing rare events, an 919 impairment generator may have to be added to the test set up. 920 Inclusion of an impairment generator and the parameterisation of the 921 impairments generated MUST be documented. 923 A metric characterising a common impairment condition would be one 924 which, by expectation, creates a singleton result for each measured 925 packet. Delay or Delay Variation are examples of this type, and in 926 such cases, the Internet may be used to compare metric 927 implementations. 929 Rare events are those where, by expectation, no or a rather low number 930 of "event is present" singletons are captured during a measurement 931 interval. Packet duplications, packet loss rates above one-digit 932 percentages, loss patterns and packet reordering are examples.
Note 933 especially that a packet reordering or loss pattern metric 934 implementation comparison may require a more sophisticated test set 935 up than described here. Spatial and temporal effects combine in the 936 case of packet re-ordering, and measurements with different packet 937 rates may always lead to different results. 939 As specified above, 5 singletons are the recommended basis to 940 minimise interference of random events with the statistical test 941 proposed by this document. In the case of ratio measurements (like 942 packet loss), the underlying sum of basic events, against which 943 the metric's monitored singletons are "rated", determines the 944 resolution of the test. A packet loss statistic with a resolution of 945 1% requires one packet loss statistic data point to consist of 500 946 delay singletons (of which at least 5 were lost). Comparing EDFs of 947 packet loss requires one hundred such statistics per flow. That 948 means, all in all, at least 50 000 delay singletons are required per 949 single measurement flow. Live network packet loss is assumed to be 950 present during main traffic hours only. Let this interval be 5 951 hours. The required minimum rate of a single measurement flow in 952 that case is 2.8 packets/sec (assuming a loss of 1% during 5 hours). 953 If this measurement is too demanding under live network conditions, 954 an impairment generator should be used. 956 3.7. Proposal to determine an "equivalence" threshold for each metric 957 evaluated 959 This section describes a proposal for a maximum error of "equivalence", 960 based on performance comparison of identical implementations. This 961 comparison may be useful for both ADK and non-ADK comparisons. 963 Each metric is tested by two or more implementations (cross- 964 implementation testing).
966 Each metric is also tested twice simultaneously by the *same* 967 implementation, using different Src/Dst Address pairs and other 968 differences such that the connectivity differences of the cross- 969 implementation tests are also experienced and measured by the same 970 implementation. 972 Comparative results for the same implementation represent a bound on 973 cross-implementation equivalence. This should be particularly useful 974 when the metric does *not* produce a continuous distribution of 975 singleton values, such as with a loss metric or a duplication 976 metric. Appendix A indicates how the ADK will work for one-way 977 delay, and should be likewise applicable to distributions of delay 978 variation. 980 Proposal: the implementation with the largest difference in 981 homogeneous comparison results is the lower bound on the equivalence 982 threshold, noting that there may be other systematic errors to 983 account for when comparing between implementations. 985 Thus, when evaluating equivalence in cross-implementation results: 987 Maximum_Error = Same_Implementation_Error + Systematic_Error 989 and only the systematic error need be decided beforehand. 991 In the case of ADK comparison, the largest same-implementation 992 resolution of distribution equivalence can be used as a limit on 993 cross-implementation resolutions (at the same confidence level). 995 4. Acknowledgements 997 Gerhard Hasslinger commented on a first version of this document, 998 suggested statistical tests and the evaluation of time series 999 information. Henk Uijterwaal and Lars Eggert have encouraged and 1000 helped to organize this work. Mike Hamilton, Scott Bradner, David 1001 Mcdysan and Emile Stephan commented on this draft. Carol Davids 1002 reviewed the 01 version of the ID before it was promoted to a WG draft. 1004 5.
Contributors 1006 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- 1007 metrictest [bradner-metrictest], and major parts of it are included 1008 in this document. 1010 6. IANA Considerations 1012 This memo includes no request to IANA. 1014 7. Security Considerations 1016 This draft does not raise any specific security issues. 1018 8. References 1019 8.1. Normative References 1021 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 1022 October 1996. 1024 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 1025 3", BCP 9, RFC 2026, October 1996. 1027 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1028 Requirement Levels", BCP 14, RFC 2119, March 1997. 1030 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, 1031 "Framework for IP Performance Metrics", RFC 2330, 1032 May 1998. 1034 [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, 1035 G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", 1036 RFC 2661, August 1999. 1038 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1039 Delay Metric for IPPM", RFC 2679, September 1999. 1041 [RFC2680] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1042 Packet Loss Metric for IPPM", RFC 2680, September 1999. 1044 [RFC2681] Almes, G., Kalidindi, S., and M. Zekauskas, "A Round-trip 1045 Delay Metric for IPPM", RFC 2681, September 1999. 1047 [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. 1048 Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, 1049 March 2000. 1051 [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling 1052 Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. 1054 [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, 1055 "Encapsulation Methods for Transport of Ethernet over MPLS 1056 Networks", RFC 4448, April 2006. 1058 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 1059 Network Tunneling", RFC 4459, April 2006. 
1061 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1062 Zekauskas, "A One-way Active Measurement Protocol 1063 (OWAMP)", RFC 4656, September 2006. 1065 [RFC4719] Aggarwal, R., Townsley, M., and M. Dos Santos, "Transport 1066 of Ethernet Frames over Layer 2 Tunneling Protocol Version 1067 3 (L2TPv3)", RFC 4719, November 2006. 1069 [RFC4928] Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal 1070 Cost Multipath Treatment in MPLS Networks", BCP 128, 1071 RFC 4928, June 2007. 1073 [RFC5657] Dusseault, L. and R. Sparks, "Guidance on Interoperation 1074 and Implementation Reports for Advancement to Draft 1075 Standard", BCP 9, RFC 5657, September 2009. 1077 8.2. Informative References 1079 [ADK] Scholz, F. and M. Stephens, "K-sample Anderson-Darling 1080 Tests of fit, for continuous and discrete cases", 1081 University of Washington, Technical Report No. 81, 1082 May 1986. 1084 [GU+Duffield] 1085 Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE 1086 Encapsulated Multicast Probing: A Scalable Technique for 1087 Measuring One-Way Loss", SIGMETRICS'07 San Diego, 1088 California, USA, June 2007. 1090 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1091 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1092 RFC 5357, October 2008. 1094 [Rule of thumb] 1095 Hardy, M., "Confidence interval", March 2010. 1097 [bradner-metrictest] 1098 Bradner, S., Mankin, A., and V. Paxson, "Advancement of 1099 metrics specifications on the IETF Standards Track", 1100 draft-bradner-metricstest-03, (work in progress), 1101 July 2007. 1103 [morton-advance-metrics] 1104 Morton, A., "Problems and Possible Solutions for Advancing 1105 Metrics on the Standards Track", draft-morton-ippm- 1106 advance-metrics-00, (work in progress), July 2009. 1108 [morton-advance-metrics-01] 1109 Morton, A., "Lab Test Results for Advancing Metrics on the 1110 Standards Track", draft-morton-ippm-advance-metrics-01, 1111 (work in progress), June 2010.
1113 Appendix A. An example on a One-way Delay metric validation 1115 The text of this appendix is not binding. It is an example of how 1116 parts of a One-way Delay metric test could look. 1119 A.1. Compliance to Metric specification requirements 1121 One-way Delay, Loss threshold, RFC 2679 1123 This test determines if implementations use the same configured 1124 maximum waiting time delay from one measurement to another under 1125 different delay conditions, and correctly declare packets arriving in 1126 excess of the waiting time threshold as lost. See Section 3.5 of 1127 RFC2679, 3rd bullet point and also Section 3.8.2 of RFC2679. 1129 (1) Configure a path with 1 sec one-way constant delay. 1131 (2) Measure one-way delay with 2 or more implementations, using 1132 identical waiting time thresholds for loss set at 2 seconds. 1134 (3) Configure the path with 3 sec one-way delay. 1136 (4) Repeat measurements. 1138 (5) Observe that the increase measured in step 4 caused all packets 1139 to be declared lost, and that all packets that arrive 1140 successfully in step 2 are assigned a valid one-way delay. 1142 One-way Delay, First-bit to Last-bit, RFC 2679 1144 This test determines if implementations register the same relative 1145 increase in delay from one measurement to another under different 1146 delay conditions. This test tends to cancel the sources of error 1147 which may be present in an implementation. See Section 3.7.2 of 1148 RFC2679, and Section 10.2 of RFC2330. 1150 (1) Configure a path with X ms one-way constant delay, ideally 1151 including a low-speed link. 1153 (2) Measure one-way delay with 2 or more implementations, using 1154 identical options and equal size small packets (e.g., 100 octet 1155 IP payload). 1157 (3) Maintain the same path with X ms one-way delay.
1159 (4) Measure one-way delay with 2 or more implementations, using 1160 identical options and equal size large packets (e.g., 1500 octet 1161 IP payload). 1163 (5) Observe that the increase measured in steps 2 and 4 is 1164 equivalent to the increase in ms expected due to the larger 1165 serialization time for each implementation. Most of the 1166 measurement errors in each system should cancel, if they are 1167 stationary. 1169 One-way Delay, RFC 2679 1171 This test determines if implementations register the same relative 1172 increase in delay from one measurement to another under different 1173 delay conditions. This test tends to cancel the sources of error 1174 which may be present in an implementation. This test is intended to 1175 evaluate measurements in sections 3 and 4 of RFC2679. 1177 (1) Configure a path with X ms one-way constant delay. 1179 (2) Measure one-way delay with 2 or more implementations, using 1180 identical options. 1182 (3) Configure the path with X+Y ms one-way delay. 1184 (4) Repeat measurements. 1186 (5) Observe that the increase measured in steps 2 and 4 is ~Y ms for 1187 each implementation. Most of the measurement errors in each 1188 system should cancel, if they are stationary. 1190 Error Calibration, RFC 2679 1192 This is a simple check to determine if an implementation reports the 1193 error calibration as required in Section 4.8 of RFC2679. Note that 1194 the context (Type-P) must also be reported. 1196 A.2. Examples related to statistical tests for One-way Delay 1198 A one-way delay measurement may pass an ADK test with a timestamp 1199 resolution of 1 ms. The same test may fail if timestamps with a 1200 resolution of 100 microseconds are evaluated. The implementation 1201 is then conforming to the metric specification up to a timestamp 1202 resolution of 1 ms.
1204 Let's assume another one-way delay measurement comparison between 1205 implementation 1, probing with a frequency of 2 probes per second, and 1206 implementation 2, probing at a rate of 2 probes every 3 minutes. To 1207 ensure reasonable confidence in results, sample metrics are 1208 calculated from at least 5 singletons per compared time interval. 1209 This means, sample delay values are calculated for each system for 1210 identical 6 minute intervals for the whole test duration. Per 6 1211 minute interval, the sample metric is calculated from 720 singletons 1212 for implementation 1 and from 6 singletons for implementation 2. 1213 Note that, if outliers are not filtered, moving averages are an 1214 option for an evaluation too. The minimum move of an averaging 1215 interval is three minutes in this example. 1217 The data in table 1 may result from measuring One-Way Delay with 1218 implementation 1 (see column Implemnt_1) and implementation 2 (see 1219 column Implemnt_2). Each data point in the table represents a 1220 (rounded) average of the sampled delay values per interval. The 1221 resolution of the clock is one microsecond. The difference in the 1222 delay values may result, e.g., from different probe packet sizes.
1224 +------------+------------+-----------------------------+
1225 | Implemnt_1 | Implemnt_2 | Implemnt_2 - Delta_Averages |
1226 +------------+------------+-----------------------------+
1227 | 5000 | 6549 | 4997 |
1228 | 5008 | 6555 | 5003 |
1229 | 5012 | 6564 | 5012 |
1230 | 5015 | 6565 | 5013 |
1231 | 5019 | 6568 | 5016 |
1232 | 5022 | 6570 | 5018 |
1233 | 5024 | 6573 | 5021 |
1234 | 5026 | 6575 | 5023 |
1235 | 5027 | 6577 | 5025 |
1236 | 5029 | 6580 | 5028 |
1237 | 5030 | 6585 | 5033 |
1238 | 5032 | 6586 | 5034 |
1239 | 5034 | 6587 | 5035 |
1240 | 5036 | 6588 | 5036 |
1241 | 5038 | 6589 | 5037 |
1242 | 5039 | 6591 | 5039 |
1243 | 5041 | 6592 | 5040 |
1244 | 5043 | 6599 | 5047 |
1245 | 5046 | 6606 | 5054 |
1246 | 5054 | 6612 | 5060 |
1247 +------------+------------+-----------------------------+
1249 Table 1
1251 Average values of sample metrics captured during identical time 1252 intervals are compared. This excludes random differences caused by 1253 differing probing intervals or differing temporal distance of 1254 singletons resulting from their Poisson distributed sending times. 1256 In the example, 20 values have been picked (note that at least 100 1257 values are recommended for a single run of a real test). Data must 1258 be ordered by ascending rank. The data of Implemnt_1 and Implemnt_2 1259 as shown in the first two columns of table 1 clearly fails an ADK 1260 test with 95% confidence. 1262 The results of Implemnt_2 are now reduced by the difference of the 1263 averages of column 2 (rounded to 6581 us) and column 1 (rounded to 1264 5029 us), which is 1552 us. The result may be found in column 3 of 1265 table 1. Comparing column 1 and column 3 of the table by an ADK test 1266 shows that the data contained in these columns passes an ADK test 1267 with 95% confidence. 1269 >>> Comment: Extensive averaging was used in this example, because of 1270 the vastly different sampling frequencies.
As a result, the 1271 distributions compared do not exactly align with a metric in 1272 [RFC2679], but illustrate the ADK process adequately.
1274 Appendix B. Anderson-Darling 2 sample C++ code
1276 /* Routines for computing the Anderson-Darling 2 sample
1277 * test statistic.
1278 *
1279 * Implemented based on the description in
1280 * "Anderson-Darling K Sample Test" Heckert, Alan and
1281 * Filliben, James, editors, Dataplot Reference Manual,
1282 * Chapter 15 Auxiliary, NIST, 2004.
1283 * Official Reference by 2010
1284 * Heckert, N. A. (2001). Dataplot website at the
1285 * National Institute of Standards and Technology:
1286 * http://www.itl.nist.gov/div898/software/dataplot.html/
1287 * June 2001.
1288 */
1290 #include <iostream>
1291 #include <vector>
1292 #include <cmath>
1293 #include <cstdlib>
1295 using namespace std;
1297 vector<double> *vec1, *vec2;
1298 double adk_result;
1299 double adk_criterium = 1.993;
1301 /* vec1 and vec2 to be initialised with sample 1 and
1302 * sample 2 values in ascending order.
1303 */
1305 /* example for iterating the vectors
1306 * for(vector<double>::iterator it = vec1->begin();
1307 * it != vec1->end(); it++)
1308 * {
1309 * cout << *it << endl;
1310 * }
1311 */
1313 static int k, val_st_z_samp1, val_st_z_samp2,
1314 val_eq_z_samp1, val_eq_z_samp2,
1315 j, n_total, n_sample1, n_sample2, L,
1316 max_number_samples, line, maxnumber_z;
1317 static int column_1, column_2;
1318 static double adk, n_value, z, sum_adk_samp1,
1319 sum_adk_samp2, z_aux;
1320 static double H_j, F1j, hj, F2j, denom_1_aux, denom_2_aux;
1321 static bool next_z_sample2, equal_z_both_samples;
1322 static int stop_loop1, stop_loop2, stop_loop3, old_eq_line2,
1323 old_eq_line1;
1327 k = 2;
1328 n_sample1 = vec1->size() - 1;
1329 n_sample2 = vec2->size() - 1;
1331 // -1 because vec[0] is a dummy value
1333 n_total = n_sample1 + n_sample2;
1335 /* value equal to the line with a value = zj in sample 1.
1336 * Here j=1, so the line is 1.
1337 */
1339 val_eq_z_samp1 = 1;
1341 /* value equal to the line with a value = zj in sample 2.
1342 * Here j=1, so the line is 1.
1343 */
1345 val_eq_z_samp2 = 1;
1347 /* value equal to the last line with a value < zj
1348 * in sample 1. Here j=1, so the line is 0.
1349 */
1351 val_st_z_samp1 = 0;
1353 /* value equal to the last line with a value < zj
1354 * in sample 2. Here j=1, so the line is 0.
1355 */
1357 val_st_z_samp2 = 0;
1359 sum_adk_samp1 = 0;
1360 sum_adk_samp2 = 0;
1361 j = 1;
1363 // as mentioned above, j=1
1365 equal_z_both_samples = false;
1366 next_z_sample2 = false;
1368 //assuming the next z to be of sample 1
1370 stop_loop1 = n_sample1 + 1;
1372 // + 1 because vec[0] is a dummy, see n_sample1 declaration
1374 stop_loop2 = n_sample2 + 1;
1375 stop_loop3 = n_total + 1;
1377 /* The required z values are calculated until all values
1378 * of both samples have been taken into account. See the
1379 * lines above for the stoploop values. Construct required
1380 * to avoid a mathematical operation in the While condition
1381 */
1383 while (((stop_loop1 > val_eq_z_samp1)
1384 || (stop_loop2 > val_eq_z_samp2)) && stop_loop3 > j)
1385 {
1386 if(val_eq_z_samp1 < n_sample1+1)
1387 {
1389 /* here, a preliminary zj value is set.
1390 * See below how to calculate the actual zj.
1391 */
1393 z = (*vec1)[val_eq_z_samp1];
1395 /* this while sequence calculates the number of values
1396 * equal to z.
1397             */
1398            while ((val_eq_z_samp1+1 < n_sample1)
1399                && z == (*vec1)[val_eq_z_samp1+1] )
1400            {
1401                val_eq_z_samp1++;
1402            }
1403        }
1404        else
1405        {
1406            val_eq_z_samp1 = 0;
1407            val_st_z_samp1 = n_sample1;

1409            // this should be val_eq_z_samp1 - 1 = n_sample1
1410        }

1412        if(val_eq_z_samp2 < n_sample2+1)
1413        {
1414            z_aux = (*vec2)[val_eq_z_samp2];

1416            /* this while sequence calculates the number of values
1417             * equal to z_aux
1418             */

1420            while ((val_eq_z_samp2+1 < n_sample2)
1421                && z_aux == (*vec2)[val_eq_z_samp2+1] )
1422            {
1423                val_eq_z_samp2++;
1424            }

1426            /* the smaller of the two actual data values is picked
1427             * as the next zj.
1428             */

1430            if(z > z_aux)
1431            {
1432                z = z_aux;
1433                next_z_sample2 = true;
1434            }
1435            else
1436            {
1437                if (z == z_aux)
1438                {
1439                    equal_z_both_samples = true;
1440                }

1442                /* This is the case if the last value of column1 is
1443                 * smaller than the remaining values of column2.
1444                 */
1445                if (val_eq_z_samp1 == 0)
1446                {
1447                    z = z_aux;
1448                    next_z_sample2 = true;
1449                }
1450            }
1451        }
1452        else
1453        {
1454            val_eq_z_samp2 = 0;
1455            val_st_z_samp2 = n_sample2;

1457            // this should be val_eq_z_samp2 - 1 = n_sample2

1459        }

1461        /* in the following, sum j = 1 to L is calculated for
1462         * sample 1 and sample 2.
1463         */

1465        if (equal_z_both_samples)
1466        {

1468            /* hj is the number of values in the combined sample
1469             * equal to zj
1470             */
1471            hj = val_eq_z_samp1 - val_st_z_samp1
1472                + val_eq_z_samp2 - val_st_z_samp2;

1474            /* H_j is the number of values in the combined sample
1475             * smaller than zj plus one half the number of
1476             * values in the combined sample equal to zj
1477             * (that's hj/2).
1478             */

1480            H_j = val_st_z_samp1 + val_st_z_samp2
1481                + hj / 2;

1483            /* F1j is the number of values in the 1st sample
1484             * which are less than zj plus one half the number
1485             * of values in this sample which are equal to zj.
1486             */

1488            F1j = val_st_z_samp1 + (double)
1489                (val_eq_z_samp1 - val_st_z_samp1) / 2;

1491            /* F2j is the number of values in the 2nd sample
1492             * which are less than zj plus one half the number
1493             * of values in this sample which are equal to zj.
1495             */
1496            F2j = val_st_z_samp2 + (double)
1497                (val_eq_z_samp2 - val_st_z_samp2) / 2;

1499            /* set the line of values equal to zj to the
1500             * actual line of the last value picked for zj.
1501             */
1502            val_st_z_samp1 = val_eq_z_samp1;

1504            /* Set the line of values equal to zj to the actual
1505             * line of the last value picked for zj of each
1506             * sample. This is required as data smaller than zj
1507             * is accounted differently than values equal to zj.
1508             */

1510            val_st_z_samp2 = val_eq_z_samp2;

1512            /* next the lines of the next values z, i.e. zj+1,
1513             * are addressed.
1514             */

1516            val_eq_z_samp1++;

1518            /* next the lines of the next values z, i.e.
1519             * zj+1, are addressed
1520             */

1522            val_eq_z_samp2++;
1523        }
1524        else
1525        {

1527            /* the smaller z value was contained in sample 2,
1528             * hence this value is the zj to base the following
1529             * calculations on.
1530             */
1531            if (next_z_sample2)
1532            {

1534                /* hj is the number of values in the combined
1535                 * sample equal to zj, in this case these are
1536                 * within sample 2 only.
1537                 */
1538                hj = val_eq_z_samp2 - val_st_z_samp2;

1540                /* H_j is the number of values in the combined sample
1541                 * smaller than zj plus one half the number of
1542                 * values in the combined sample equal to zj
1543                 * (that's hj/2).
1544                 */

1546                H_j = val_st_z_samp1 + val_st_z_samp2
1547                    + hj / 2;

1549                /* F1j is the number of values in the 1st sample which
1550                 * are less than zj plus one half the number of values in
1551                 * this sample which are equal to zj. As
1552                 * val_eq_z_samp2 < val_eq_z_samp1, these are the
1553                 * val_st_z_samp1 only.
1554                 */
1555                F1j = val_st_z_samp1;

1557                /* F2j is the number of values in the 2nd sample which
1558                 * are less than zj plus one half the number of values in
1559                 * this sample which are equal to zj. The latter are from
1560                 * sample 2 only in this case.
1561                 */

1563                F2j = val_st_z_samp2 + (double)
1564                    (val_eq_z_samp2 - val_st_z_samp2) / 2;

1566                /* Set the line of values equal to zj to the actual line
1567                 * of the last value picked for zj of sample 2 only in
1568                 * this case.
1569                 */
1570                val_st_z_samp2 = val_eq_z_samp2;

1572                /* next the line of the next value z, i.e. zj+1, is
1573                 * addressed. Here, only sample 2 must be addressed.
1574                 */

1576                val_eq_z_samp2++;
1577                if (val_eq_z_samp1 == 0)
1578                {
1579                    val_eq_z_samp1 = stop_loop1;
1580                }
1581            }

1583            /* the smaller z value was contained in sample 1,
1584             * hence this value is the zj to base the following
1585             * calculations on.
1586             */

1588            else
1589            {

1591                /* hj is the number of values in the combined
1592                 * sample equal to zj, in this case these are
1593                 * within sample 1 only.
1594                 */
1595                hj = val_eq_z_samp1 - val_st_z_samp1;

1597                /* H_j is the number of values in the combined
1598                 * sample smaller than zj plus one half the number
1599                 * of values in the combined sample equal to zj
1600                 * (that's hj/2).
1601                 */

1603                H_j = val_st_z_samp1 + val_st_z_samp2
1604                    + hj / 2;

1606                /* F1j is the number of values in the 1st sample which
1607                 * are less than zj plus one half the number of values
1608                 * in this sample which are equal to zj. The latter are
1609                 * from sample 1 only in this case.
1611                 */

1613                F1j = val_st_z_samp1 + (double)
1614                    (val_eq_z_samp1 - val_st_z_samp1) / 2;

1616                /* F2j is the number of values in the 2nd sample which
1617                 * are less than zj plus one half the number of values
1618                 * in this sample which are equal to zj. As
1619                 * val_eq_z_samp1 < val_eq_z_samp2, these are the
1620                 * val_st_z_samp2 only.
1621                 */

1623                F2j = val_st_z_samp2;

1625                /* Set the line of values equal to zj to the actual line
1626                 * of the last value picked for zj of sample 1 only in
1627                 * this case.
1628                 */

1630                val_st_z_samp1 = val_eq_z_samp1;

1632                /* next the line of the next value z, i.e. zj+1, is
1633                 * addressed. Here, only sample 1 must be addressed.
1634                 */
1635                val_eq_z_samp1++;

1637                if (val_eq_z_samp2 == 0)
1638                {
1639                    val_eq_z_samp2 = stop_loop2;
1640                }
1641            }
1642        }

1644        denom_1_aux = n_total * F1j - n_sample1 * H_j;
1645        denom_2_aux = n_total * F2j - n_sample2 * H_j;

1647        sum_adk_samp1 = sum_adk_samp1 + hj
1648            * (denom_1_aux * denom_1_aux) /
1649            (H_j * (n_total - H_j)
1650            - n_total * hj / 4);
1651        sum_adk_samp2 = sum_adk_samp2 + hj
1652            * (denom_2_aux * denom_2_aux) /
1653            (H_j * (n_total - H_j)
1654            - n_total * hj / 4);

1656        next_z_sample2 = false;
1657        equal_z_both_samples = false;

1659        /* index to count the z. It is only required to prevent
1660         * the while loop from executing endlessly.
1661         */
1662        j++;
1663    }

1665    // calculating the adk value is the final step.

1667    adk_result = (double) (n_total - 1) / (n_total
1668        * n_total * (k - 1))
1669        * (sum_adk_samp1 / n_sample1
1670        + sum_adk_samp2 / n_sample2);

1672    /* if(adk_result <= adk_criterium)
1673     * the adk_2_sample test is passed
1674     */

1676                                 Figure 4

1678 Appendix C.  A tunneling set up for remote metric implementation testing

1680    For parties interested in metric compliance testing, it is most
1681    convenient if all involved parties can stay in their local test
1682    laboratories.  Figure 5 shows a test configuration which may enable
1683    remote metric compliance testing.

1685 +----+ +----+ +----+ +----+ 1686 |LC10| |LC11| ,---.
|LC20| |LC21| 1687 +----+ +----+ / \ +-------+ +----+ +----+ 1688 | V10 | V11 / \ | Tunnel| | V20 | V21 1689 | | ( ) | Head | | | 1690 +--------+ +------+ | | | Router|__+----------+ 1691 |Ethernet| |Tunnel| |Internet | +---B---+ |Ethernet | 1692 |Switch |--|Head |-| | | |Switch | 1693 +-+--+---+ |Router| | | +---+---+ +--+--+----+ 1694 |__| +--A---+ ( )--|Option.| |__| 1695 \ / |Impair.| 1696 Bridge \ / |Gener. | Bridge 1697 V20 to V21 `-+-' +-------+ V10 to V11

1699                                  Figure 5

1701    LC10 and the other LCs identify measurement clients / line cards.
1702    V10 and the others denote VLANs.  All VLANs use the same tunnel from
1703    A to B and in the reverse direction.  The remote site VLANs are
1704    U-bridged at the local site Ethernet switch.  The measurement
1705    packets of site 1 travel tunnel A->B first, are U-bridged at site 2
1706    and travel tunnel B->A second.  Measurement packets of site 2 travel
1707    tunnel B->A first, are U-bridged at site 1 and travel tunnel A->B
1708    second.  So all measurement packets pass the same tunnel segments,
1709    but in a different segment order.  An experiment to prove or reject
1710    the test setup shown in Figure 5 has been agreed but not yet
1711    scheduled between Deutsche Telekom and RIPE.

1713    Figure 5 includes an optional impairment generator.  If this
1714    impairment generator is inserted in the IP path between the tunnel
1715    head end routers, it equally impacts all measurement packets and
1716    flows.  This avoids the difficulty of ensuring an identical test
1717    setup by configuring two separate impairment generators identically
1718    (which was another proposal to allow remote metric compliance
1719    testing).

1721 Appendix D.  Glossary

1723    +-------------+-----------------------------------------------------+
1724    | ADK         | Anderson-Darling K-Sample test, a test used to      |
1725    |             | check whether two samples have the same statistical |
1726    |             | distribution.
|
1727    | ECMP        | Equal Cost Multipath, a load balancing mechanism    |
1728    |             | evaluating MPLS label stacks, IP addresses and      |
1729    |             | ports.                                              |
1730    | EDF         | The "Empirical Distribution Function" of a set of   |
1731    |             | scalar measurements is a function F(x) which for    |
1732    |             | any x gives the fractional proportion of the total  |
1733    |             | measurements that were smaller than or equal to x.  |
1734    | Metric      | A measured quantity related to the performance and  |
1735    |             | reliability of the Internet, expressed by a value.  |
1736    |             | This could be a singleton (single value), a sample  |
1737    |             | of single values or a statistic based on a sample   |
1738    |             | of singletons.                                      |
1739    | OWAMP       | One-way Active Measurement Protocol, a protocol for |
1740    |             | communication between IPPM measurement systems      |
1741    |             | specified by IPPM.                                  |
1742    | OWD         | One-Way Delay, a performance metric specified by    |
1743    |             | IPPM.                                               |
1744    | Sample      | A sample metric is derived from a given singleton   |
1745    | metric      | metric by evaluating a number of distinct instances |
1746    |             | together.                                           |
1747    | Singleton   | A singleton metric is, in a sense, one atomic       |
1748    | metric      | measurement of this metric.                         |
1749    | Statistical | A 'statistical' metric is derived from a given      |
1750    | metric      | sample metric by computing some statistic of the    |
1751    |             | values defined by the singleton metric on the       |
1752    |             | sample.                                             |
1753    | TWAMP       | Two-way Active Measurement Protocol, a protocol for |
1754    |             | communication between IPPM measurement systems      |
1755    |             | specified by IPPM.                                  |
1756    +-------------+-----------------------------------------------------+

1758                                  Table 2

1760 Authors' Addresses

1762    Ruediger Geib (editor)
1763    Deutsche Telekom
1764    Heinrich Hertz Str.
3-7
1765    Darmstadt  64295
1766    Germany

1768    Phone: +49 6151 628 2747
1769    Email: Ruediger.Geib@telekom.de

1771    Al Morton
1772    AT&T Labs
1773    200 Laurel Avenue South
1774    Middletown, NJ  07748
1775    USA

1777    Phone: +1 732 420 1571
1778    Fax:   +1 732 368 1192
1779    Email: acmorton@att.com
1780    URI:   http://home.comcast.net/~acmacm/

1782    Reza Fardid
1783    Cariden Technologies
1784    888 Villa Street, Suite 500
1785    Mountain View, CA  94041
1786    USA

1788    Phone:
1789    Email: rfardid@cariden.com

1791    Alexander Steinmitz
1792    HS Fulda
1793    Marquardstr. 35
1794    Fulda  36039
1795    Germany

1797    Phone:
1798    Email: steinionline@gmx.de