1 Internet Engineering Task Force R. Geib, Ed. 2 Internet-Draft Deutsche Telekom 3 Intended status: Informational R.
Fardid 4 Expires: January 7, 2010 Covad Communications 5 July 6, 2009 7 IPPM standard compliance testing 8 draft-geib-ippm-metrictest-00 10 Status of this Memo 12 This Internet-Draft is submitted to IETF in full conformance with the 13 provisions of BCP 78 and BCP 79. 15 Internet-Drafts are working documents of the Internet Engineering 16 Task Force (IETF), its areas, and its working groups. Note that 17 other groups may also distribute working documents as Internet- 18 Drafts. 20 Internet-Drafts are draft documents valid for a maximum of six months 21 and may be updated, replaced, or obsoleted by other documents at any 22 time. It is inappropriate to use Internet-Drafts as reference 23 material or to cite them other than as "work in progress." 25 The list of current Internet-Drafts can be accessed at 26 http://www.ietf.org/ietf/1id-abstracts.txt. 28 The list of Internet-Draft Shadow Directories can be accessed at 29 http://www.ietf.org/shadow.html. 31 This Internet-Draft will expire on January 7, 2010. 33 Copyright Notice 35 Copyright (c) 2009 IETF Trust and the persons identified as the 36 document authors. All rights reserved. 38 This document is subject to BCP 78 and the IETF Trust's Legal 39 Provisions Relating to IETF Documents in effect on the date of 40 publication of this document (http://trustee.ietf.org/license-info). 41 Please review these documents carefully, as they describe your rights 42 and restrictions with respect to this document. 44 Abstract 46 This document specifies tests to determine if multiple, independent, 47 and interoperable implementations of a metrics specification document 48 are at hand so that the metrics specification can be advanced to an 49 Internet standard. Results of different IPPM implementations can be 50 compared if they measure under the same underlying network 51 conditions. Results are compared using state of the art statistical 52 methods. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . 
. . . . . 3
57 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4
58 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 4
59 3. Verification of equivalence by statistical measurements . . . 5
60 4. Recommended Metric Verification Measurement Process . . . . . 12
61 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 14
62 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 14
63 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15
64 8. Security Considerations . . . . . . . . . . . . . . . . . . . 15
65 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 15
66 9.1. Normative References . . . . . . . . . . . . . . . . . . . 15
67 9.2. Informative References . . . . . . . . . . . . . . . . . . 15
68 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 16
70 1. Introduction
72 Draft bradner-metrictest [bradner-metrictest] states:
74 The Internet Standards Process RFC2026 [RFC2026] requires that for an
75 IETF specification to advance beyond the Proposed Standard level, at
76 least two genetically unrelated implementations must be shown to
77 interoperate correctly with all features and options. There are two
78 distinct reasons for this requirement.
80 In the case of a protocol specification, the notion of
81 "interoperability" is reasonably intuitive - the implementations must
82 successfully "talk to each other", while exercising all features and
83 options.
85 In the case of a specification for a performance metric, network
86 latency for example, exactly what constitutes "interoperation" is
87 less obvious. The IESG has not yet decided how to judge "metric
88 specification interoperability" in the context of the IETF Standards
89 Process, and this new draft suggests a methodology which (hopefully)
90 is suitable for IPPM metrics. General applicability of the methods
91 proposed in the following should however not be excluded.
93 A metric specification describes a method of testing and a way to
94 report the results of this testing. One example of such a metric
95 would be a way to test and report the latency that data packets would
96 incur while being sent from one network location to another.
98 Since implementations of testing metrics are by their nature stand-
99 alone and do not interact with each other, the level of
100 interoperability called for in the IETF standards process cannot be
101 simply determined by seeing that the implementations interact
102 properly. Instead, verifiable equivalence, proven by showing that
103 different implementations give statistically equivalent results,
104 may take the place of interoperability.
106 This document defines the process of verifying equivalence by using a
107 specified test set up to create the required separate data sets
108 (which may be seen as samples taken from the same underlying
109 distribution) and then applying state of the art statistical methods
110 to verify equivalence of the results. To illustrate application of the
111 process defined here, validating compliance with RFC2679 [RFC2679] is
112 picked as an example. While test set ups may vary with the metrics
113 to be validated, the statistical methods will not. Documents
114 defining test setups to validate other metrics should be created by
115 the IPPM WG, once the process proposed here has been agreed upon.
117 1.1. Requirements Language
119 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
120 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
121 document are to be interpreted as described in RFC 2119 [RFC2119].
123 2. Basic idea
125 Two different IPPM implementations are expected to measure
126 statistically equivalent results if they both measure a metric under
127 the same networking conditions.
Formulating the measurement in
128 statistical terms: separate samples are collected (by separate metric
129 implementations) from the same underlying statistical process (the
130 same network conditions). The "statistical hypothesis" to be tested
131 is the expectation that both samples expose statistically equivalent
132 properties. This requires careful test design:
134 o The error induced by the sample size must be small enough to
135 minimize its influence on the test result. This must be taken into
136 account especially if two implementations measure with
137 different average probing rates.
139 o If time series are compared, the implementation with the lowest
140 probing frequency determines the smallest temporal interval for
141 which results can be compared.
143 o Every comparison must be repeated several times based on different
144 measurement data to avoid random indications of compatibility (or
145 the lack of it).
147 o The measurement test set up must be self-consistent to the largest
148 possible extent. This means that network conditions, paths and IPPM
149 metric implementations SHOULD be identical for the compared
150 implementations to the largest possible degree to minimize the
151 influence of the test and measurement set up on the result. This
152 includes e.g. aspects of the stability and non-ambiguity of routes
153 taken by the measurement packets. See RFC 2330 [RFC2330] for a
154 discussion of self-consistency.
156 State of the art statistical methods are proposed for a comparison of
157 measurement results in the hope that user friendly tools required to
158 perform the necessary statistical analysis are easily accessible.
159 [editor: this sentence may be reworded or deleted, if the expectation
160 doesn't hold].
162 Let's assume a one way delay measurement comparison between system A,
163 probing with a frequency of 2 probes per second, and system B probing
164 at a rate of 2 probes every 3 minutes.
To ensure reasonable
165 confidence in results, sample metrics are calculated from at least 5
166 singletons per compared time interval. This means that sample delay
167 values are calculated for each system for identical 6 minute
168 intervals for the whole test duration. Per 6 minute interval, the
169 sample metric is calculated from 720 singletons for system A and from
170 6 singletons for system B. Note that if outliers are not filtered,
171 moving averages are an option for an evaluation too. The minimum
172 move of an averaging interval is three minutes in our example.
174 The test set up for the delay measurement is chosen to minimize
175 errors by locating one system of each implementation at the same end
176 of two separate sites, between which delay is measured for the metric
177 test. Both measurement sites are connected by one IPsec tunnel, so
178 that all measurement packets cross the Internet with the same IP
179 addresses. Both measurement systems measure simultaneously and the
180 local links are dimensioned to avoid congestion caused by the probing
181 traffic itself.
183 The measured delay values are reported with a resolution above the
184 measurement error and above the synchronisation error. This is done
185 to avoid comparing these errors between two different metric
186 implementations instead of comparing the IPPM metric implementations
187 themselves.
189 The overall duration of the test is chosen so that more than 1000 six
190 minute measurement intervals are collected. The amount of data
191 collected allows separate comparisons for e.g. 200 consecutive 6
192 minute intervals. Intervals during which routes were unstable are
193 discarded prior to evaluation.
195 3.
Verification of equivalence by statistical measurements
197 Following the definition of statistical precision [Precision], a
198 measurement process can be characterised by two properties:
200 o Accuracy, which is the degree of conformity of a measured quantity
201 to its actual (true) value.
203 o Precision, also called reproducibility or repeatability, the
204 degree to which repeated measurements show the same or similar
205 results.
207 Figure 1 further clarifies the difference between accuracy and
208 precision of a measurement.
210 Probability ^
211 Density |
212 | Reference value Measured Value
213 | | |
214 | |<---Accuracy---->|
215 | | _|_
216 | | / | \
217 | | / | \
218 | | / | \
219 | | / | \
220 | | / | \
221 | | / | \
222 Measured | | /<- Precision ->\
223 Value -|---------|-----------------|---------->
224 |
226 Measurement accuracy and precision [Precision].
228 Figure 1
230 The Framework for IP Performance Metrics (RFC 2330, [RFC2330])
231 expects that a "methodology for a metric should have the property
232 that it is repeatable: if the methodology is used multiple times
233 under identical conditions, it should result in consistent
234 measurements." This means that an IPPM implementation is expected to
235 measure a metric with high precision.
237 Further, RFC2330 expects that "a methodology for a given metric
238 exhibits continuity if, for small variations in conditions, it
239 results in small variations in the resulting measurements. Slightly
240 more precisely, for every positive epsilon, there exists a positive
241 delta, such that if two sets of conditions are within delta of each
242 other, then the resulting measurements will be within epsilon of each
243 other." A small variation in conditions in the context of a metric
244 comparison can be seen as two implementations measuring the same
245 metric along the same path.
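To make the distinction in Figure 1 concrete, accuracy can be estimated as the offset of the sample mean from the reference value, and precision as the spread of repeated measurements. A minimal sketch; the delay values and the reference value below are invented for illustration:

```python
import statistics

# Hypothetical repeated one-way delay measurements (ms) of the same path by
# one implementation; the true (reference) delay is assumed to be known.
reference_ms = 50.0
measured_ms = [50.8, 51.2, 50.9, 51.1, 51.0]

# Accuracy: degree of conformity of the measured quantity to its actual
# (true) value, here the offset of the sample mean from the reference.
accuracy_offset = statistics.mean(measured_ms) - reference_ms

# Precision: repeatability, here the sample standard deviation of the
# repeated measurements.
precision_spread = statistics.stdev(measured_ms)

print(accuracy_offset, precision_spread)
```

An implementation can thus be precise (small spread) yet inaccurate (a systematic 1 ms offset in this invented data), which is exactly the situation depicted in Figure 1.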
247 Two guidelines for an IPPM conformant metric implementation can be
248 taken from these principles:
250 o A single IPPM conformant implementation MUST under otherwise
251 identical network conditions produce highly precise results for
252 repeated measurements of the same metric.
254 o Two different implementations measuring the same IPPM metric MUST
255 produce results with a rather limited difference if measuring
256 under network conditions which are identical to the largest extent
possible.
258 In a metric test, both conditions must hold, meaning that repeated
259 tests of two implementations MUST produce precise results for all
260 repetition intervals.
262 A suitable statistical test and a level of confidence to define
263 whether differences are rather limited and whether a measurement is
264 highly precise are specified below.
266 RFC 2330 prefers the "empirical distribution function" EDF to
267 describe collections of measurements. RFC 2330 uses the EDF to test
268 goodness of fit of an IPPM flow's inter packet spacing to a Poisson
269 process. To do that, RFC 2330 uses the Anderson-Darling test with a
270 5% significance. RFC 2330 further determines that "unless otherwise
271 stated, IPPM goodness-of-fit tests are done using 5% significance."
273 The principles suggested by RFC 2330 are applied to compare the
274 implementation of IPPM metrics as follows:
276 o The empirical distribution function of the singletons or samples
277 resulting from the measurement of a particular metric forms
278 the basis of a comparison of two IPPM implementations. Note that
279 a parametric description of this distribution is not required.
281 o The hypothesis to be validated by an IPPM metric test is that two
282 implementations of an IPPM metric draw probes from the same
283 underlying distribution. The hypothesis is true if samples of
284 two tested metric implementations follow the same distribution by
285 a significance of 95%.
Note that the distribution function from
286 which the probes are drawn is itself irrelevant.
288 o The samples taken by two implementations to be tested are compared
289 by an Anderson-Darling k sample test. The Anderson-Darling k
290 sample test is the generalization of the classical Anderson-
291 Darling goodness of fit test, and it is used to test the
292 hypothesis that k independent samples belong to the same
293 population without specifying their common distribution function.
294 [Editor: I couldn't find complete documentation of that test on
295 the web by a quick search, but a reference to a publication is
296 there and code seems to be available too. Other tests which are
297 documented in Wikipedia for that purpose are Kolmogorov-Smirnov
298 and Chi-Square. It is proposed to make Anderson-Darling k sample
299 obligatory/a MUST if code can be appended to this draft. If not,
300 Anderson-Darling k sample is recommended and Kolmogorov-Smirnov or
301 Chi-Square are optional].
303 Getting back to the chosen example delay measurement, the captured
304 delay singletons may range from an absolute
305 minimum delay Dmin to values of Dmin + 5 ms. To compare distributions,
306 the set of singletons of a chosen evaluation interval (e.g. the data
307 of one of the five 1800 minute capture sequences, see above) is
308 sorted by the frequency of singletons per Dmin + N * 0.5 ms (N = 1,
309 2, ...). After that, a comparison of the two probe sets with any of
310 the mentioned tests may be applied.
312 While constructing the example, some additional rules to calculate
313 and compare samples have been respected. The following rules are
314 of importance for the IPPM metric tests:
316 o To compare different probes of a common underlying distribution in
317 terms of metrics characterising a communication network requires
318 respecting the temporal nature for which the assumption of a common
319 underlying distribution may hold.
Any singletons or samples to be
320 compared MUST be captured within the same time interval.
322 o Whenever sample metrics, samples of singletons or rates are used
323 to characterise measured metrics of a time-interval, at least 5
324 events of a relevant metric MUST be present to ensure a minimum
325 confidence in the reported value (see Wikipedia on confidence
326 [Rule of thumb]). Note that this criterion is to be respected
327 e.g. when comparing packet loss metrics. Any packet loss
328 measurement interval to be compared with the results of another
329 implementation needs to contain at least five lost packets to have
330 a minimum confidence that these losses didn't happen randomly.
332 o The minimum number of singletons or samples to be compared by an
333 Anderson-Darling test is 100 per tested metric implementation.
334 Note that the Anderson-Darling test detects small differences in
335 distributions fairly well and will fail for a high number of
336 compared results (RFC2330 mentions an example with 8192
337 measurements to guarantee a failure of an Anderson-Darling test).
339 Comparing "Accuracy" of IPPM implementations based on averages and
340 variations may require prior checks for the absence of long range
341 dependency within the compared measurements. Large outliers, as
342 typically occur in the case of long range dependency, can have a
343 serious impact on mean values. The median or percentiles may be more
344 robust measures on which to compare the accuracy of different IPPM
345 implementations. An idea may be to consider data up to a certain
346 percentile, calculate the mean for data up to this percentile and
347 then compare the means of the two implementations. This could be
348 repeated for different percentiles. If the impact of long range
349 dependency is limited to large outliers, the method may work for lower
350 percentiles. Whether this makes sense must be confirmed by a
351 statistician, so this attempt requires further study.
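The Anderson-Darling k sample comparison described above is available in common statistics packages; a minimal sketch using scipy's anderson_ksamp follows. The synthetic delay samples and their distribution parameters are invented for illustration:

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(42)

# Synthetic one-way delay samples (ms): two "implementations" drawing from
# the same underlying distribution, and a third drawing from a clearly
# different one. Each sample exceeds the 100-singleton minimum given above.
impl_a = rng.normal(loc=50.0, scale=1.0, size=200)
impl_b = rng.normal(loc=50.0, scale=1.0, size=200)
impl_c = rng.normal(loc=55.0, scale=1.0, size=200)

# k sample test of the hypothesis that the samples belong to the same
# population, without specifying their common distribution function.
same = anderson_ksamp([impl_a, impl_b])
diff = anderson_ksamp([impl_a, impl_c])

# A larger statistic is stronger evidence against the common-population
# hypothesis; a significance level below 0.05 rejects it at the 95% level.
print(same.statistic, same.significance_level)
print(diff.statistic, diff.significance_level)
```

Note that scipy caps the reported significance level to a fixed range and may emit a warning when the true value lies outside it; for the pass/fail decision at 5% significance this does not matter.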
353 IPPM metrics are captured by time series. Time series can be checked
354 for correlation. There are two expectations on statistical time
355 series properties which should be met by separate measurements
356 probing the same underlying network performance distribution:
358 o The autocorrelation indicates whether there are any repeating
359 patterns within a time series. For the purpose of this document,
360 it does not matter whether there is autocorrelation in a
361 measurement. It is however expected that two measurements expose
362 the same autocorrelation on identical "lag" intervals. If
363 calculable, the autocorrelation lies within the interval [-1;1]
364 (see Wikipedia on autocorrelation [Autocorrelation]).
366 o The correlation coefficient "indicates the strength of a linear
367 relationship between two random variables." The two random
368 variables in the case of this document are the measurement time
369 series of the IPPM implementations to be compared. The
370 expectation is that both are strongly correlated and the
371 resulting correlation coefficient is close to 1 (see Wikipedia on
372 correlation [Correlation]).
374 A metric test can derive additional statistics from time series
375 analysis. Further, formulation of a test hypothesis is possible for
376 autocorrelation and the correlation coefficient. It is however not
377 clear whether an appropriate statistical test to validate the
378 hypothesis by 95% significance exists. Applicability of time series
379 analysis for a metric test requires further input from statisticians.
381 In the absence of any metric test on time series, any test result
382 SHOULD provide the autocorrelation of the compared metrics time
383 series by lags from 1 to 10. In addition, the value of the
384 correlation coefficient SHOULD be provided. Autocorrelation and
385 correlation coefficient are expected to be rather close to the value
386 1.
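The two time series checks above can be sketched with numpy. The lag-k autocorrelation estimator and the synthetic delay series below are assumptions made for illustration:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of series x at the given lag; lies in [-1, 1]."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(7)

# Synthetic sample-delay time series (ms) over identical time intervals: a
# common underlying delay pattern plus independent noise per implementation.
pattern = 50.0 + 2.0 * np.sin(np.linspace(0.0, 20.0, 200))
series_a = pattern + rng.normal(scale=0.1, size=200)
series_b = pattern + rng.normal(scale=0.1, size=200)

# Expectation: nearly identical autocorrelation at identical lags (reported
# here for lags 1 to 10) and a correlation coefficient close to 1.
acf_a = [autocorr(series_a, lag) for lag in range(1, 11)]
acf_b = [autocorr(series_b, lag) for lag in range(1, 11)]
corr = float(np.corrcoef(series_a, series_b)[0, 1])
print(corr)
```

Each lag here corresponds to one sample interval of the compared time series, so both implementations must use the same interval length for the lags to be comparable.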
388 As mentioned earlier, the time series analysis requires application
389 of identical time intervals to allow a comparison. In our delay
390 example, single sample delay metric values are calculated for 6
391 minute intervals. If 200 consecutive sample delay metrics with the
392 same start and end interval are available for each implementation,
393 autocorrelation can be calculated for different n * 6 minute lags.
394 The autocorrelation calculated for the time series of each
395 implementation should be very close to the autocorrelation of the
396 other implementation for the same time lag. Further, the correlation
397 coefficient for both time series should be close to 1.
399 Proving that two IPPM metric measurements provide compatible
400 results could then be done stepwise:
402 o First, prove that the two compared implementations have the same
403 precision by comparing statistics of the distribution of
404 singletons (or samples) of a metric, i.e. by comparing the EDF of
405 the samples captured by the two implementations.
407 o Second, indicate that the two compared implementations produce
408 strongly correlated time series, of which each one individually has
409 the same autocorrelation as the other one.
411 Clock synchronization effects require special attention. Accuracy of
412 one-way active delay measurements for any metrics implementation
413 depends on clock synchronization between the source and destination
414 of tests. Ideally, one-way active delay measurement (RFC 2679,
415 [RFC2679]) test endpoints either have direct access to independent
416 GPS or CDMA-based time sources or indirect access to nearby NTP
417 primary (stratum 1) time sources, equipped with GPS receivers.
418 Access to these time sources may not be available at all test
419 locations associated with different Internet paths, for a variety of
420 reasons out of scope of this document.
422 When secondary (stratum 2 and above) time sources are used with NTP
423 running across the same network whose metrics are subject to
424 comparative implementation tests, network impairments can affect
425 clock synchronization and distort sample one-way values and their
426 interval statistics. It is RECOMMENDED to discard sample one-way
427 delay values for any implementation when one of the following
428 reliability conditions is met:
430 o Delay is measured and is finite in one direction, but not the
431 other.
433 o The absolute value of the difference between the sum of one-way
434 measurements in both directions and the round-trip measurement is
435 greater than X% of the latter value.
437 Examination of the second condition requires an RTT measurement for
438 reference, e.g., based on TWAMP (RFC 5357 [RFC5357]), in
439 conjunction with the one-way delay measurement.
441 Specification of X% to strike a balance between identification of
442 unreliable one-way delay samples and misidentification of reliable
443 samples under a wide range of Internet path RTTs probably requires
444 further study.
446 An IPPM compliant metric implementation whose measurement requires
447 synchronized clocks is however expected to provide precise measurement
448 results. Any IPPM metric implementation MUST be of a precision of 1
449 ms (+/- 500 us) with a confidence of 95% if the metric is captured
450 along an Internet path which is stable and not congested during a
451 measurement duration of an hour or more. [Editor: this latter
452 definition may prevent NTP (stratum 2 or worse) synchronized IPPM
453 implementations from becoming IPPM compliant. However, internal PC
454 clock synched implementations can't be rejected that way. Ideas on
455 criteria to deal with the latter are welcome. Clock drift may be one,
456 as GPS synched implementations shouldn't have one or the same on origin
457 and destination, respectively].
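The two RECOMMENDED discard conditions above can be sketched as a simple filter. The function name and the 10% default for X are assumptions made for illustration, since the document leaves the choice of X for further study:

```python
import math

def owd_sample_reliable(owd_fwd_ms, owd_rev_ms, rtt_ms, x_percent=10.0):
    """Apply the two discard conditions to a pair of one-way delay samples.

    owd_fwd_ms / owd_rev_ms: one-way delay in each direction; math.inf if
    delay could not be measured (was not finite) in that direction.
    rtt_ms: reference round-trip time, e.g. from a TWAMP measurement.
    """
    # Condition 1: delay is measured and finite in one direction, but not
    # the other.
    if math.isfinite(owd_fwd_ms) != math.isfinite(owd_rev_ms):
        return False
    # Condition 2: |(sum of one-way delays) - RTT| greater than X% of RTT.
    if abs((owd_fwd_ms + owd_rev_ms) - rtt_ms) > (x_percent / 100.0) * rtt_ms:
        return False
    return True

print(owd_sample_reliable(25.0, 26.0, 50.0))      # sum within 10% of the RTT
print(owd_sample_reliable(25.0, 40.0, 50.0))      # sum deviates too far
print(owd_sample_reliable(25.0, math.inf, 50.0))  # finite in one way only
```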
459 Metric tests should be executed under conditions which are identical
460 to the largest possible or necessary extent. As "identical network
461 conditions" are fundamental to the methodology proposed by this
462 document, more input and a thorough discussion are needed to define
463 these. Some thoughts are:
465 o In a laboratory environment, NTP synchronisation may have a less
466 serious impact. In a real network, improper synchronisation will
467 be harder to conceal.
469 o OWD measurements are of highest precision with well synchronized
470 measurement systems measuring delays along a stable, uncongested
471 path. Care must be taken to avoid comparing noise and the
472 measurement error respectively instead of the delay.
474 o Packet loss, delay variation and packet reordering require a
475 sufficient number of these events to allow for a metric test with
476 the desired confidence. While one could wait for congestion or
477 execute the test across known bottlenecks, this may incur some
478 effort. A question is whether to test these metrics under
479 laboratory conditions. To generalise this question: can
480 laboratory metric tests be tolerated for metrics whose precision
481 doesn't depend on synchronized clocks?
483 o Packet loss and delay variation probably allow for a relaxed
484 definition of "identical test conditions", as it may be sufficient
485 for test packets to share the congested interface or paths to test
486 for these metrics.
488 o In a laboratory environment, "stationary" networking conditions
489 can be produced without having to care about parallel resources,
490 applied by carriers to increase capacity. In a commercial
491 network, hashing functions (on addresses and ports) determine
492 which set of resources all the packets in a flow will traverse.
493 Testing in the lab may not remove the parallel resources, but it
494 can provide some time stability that's never assured in live
495 network testing.
497 o Applicability of tunnels to avoid the impact of unknown parallel
498 resources applied by networks traversed by measurement packets
499 during a test should be investigated.
501 o To determine if some aspects of the metric specifications are
502 clear and unambiguous, some specific conditions in the lab may be
503 simulated to determine if implementations measure them as
504 expected. Thus it should be tested whether all implementors read
505 the spec the same way. Further, reducing some sources of
506 variation right at the start will make the job of statistical
507 comparison simpler.
509 o Getting access to operator information like load and packet loss
510 counters of a network which was used during a metric test is
511 improbable. But testing across a real network is still desirable
512 for a metric test.
514 4. Recommended Metric Verification Measurement Process
516 The proposal made by the authors of bradner-metrictest
517 [bradner-metrictest] is picked up and slightly enhanced:
519 "In order to meet their obligations under the IETF Standards Process
520 the IESG must be convinced that each metric specification advanced to
521 Draft Standard or Internet Standard status is clearly written, that
522 there are the required multiple verifiably equivalent
523 implementations, and that all options have been implemented.
525 "In the context of this memo, metrics are designed to measure some
526 characteristic of a data network. An aim of any metric definition
527 should be that it should be specified in a way that can reliably
528 measure the specific characteristic in a repeatable way."
530 Each metric, statistic or option of those to be validated must be
531 compared against a reference measurement or another implementation by
532 at least 5 different basic data sets, each one with sufficient size to
533 reach the specified level of confidence.
535 "In the same way, sequentially running different implementations of
536 software that perform the tests described in the metric document on a
537 stable network, or simultaneously on a network that may or may not be
538 stable, should produce essentially the same results."
540 Following these assumptions, any recommendation for the advancement of
541 a metric specification needs to be accompanied by an implementation
542 report, as is the case with all requests for the advancement of IETF
543 specifications. The implementation report needs to include a
544 specific plan to test the specific metrics in the RFC in lab or real-
545 world networks and reports of the tests performed with two or more
546 implementations of the software. The test plan should cover key
547 parts of the specification, specify the accuracy required for each
548 measured metric and thus define the meaning of "statistically
549 equivalent" for the specific metrics being tested. Ideally, the test
550 plan would co-evolve with the development of the metric, since that's
551 when people have the most context in their thinking regarding the
552 different subtleties that can arise.
554 In particular, the implementation report MUST as a minimum document:
556 o The metric compared and the RFC specifying it, including the
557 chosen options (like e.g. the implemented selection function in
558 the case of IPDV).
560 o A complete specification of the measurement stream (mean rate,
561 statistical distribution of packets, packet size (or mean packet
562 size and their distribution), DSCP and any other measurement
563 stream property which could result in deviating results).
564 Deviations in results can also be caused if the chosen IP addresses
565 and ports of different implementations result in different
566 layer 2 or layer 3 paths due to operation of Equal Cost Multi-Path
567 routing in an operational network.
569 o The duration of each measurement to be used for a metric
570 validation, the number of measurement points collected for each
571 metric during each measurement interval (i.e. the probe size) and
572 the level of confidence derived from this probe size for each
573 measurement interval.
575 o The result of the statistical tests performed for each metric
576 validation.
578 o The measurement configuration and set up.
580 o A parameterization of laboratory conditions and applied traffic
581 and network conditions allowing reproduction of these laboratory
582 conditions for readers of the implementation report.
584 "All of the tests for each set MUST be run in the same direction
585 between the same two points on the same network. The tests SHOULD be
586 run simultaneously unless the network is stable enough to ensure that
587 the path the data takes through the network will not change between
588 tests."
590 It is RECOMMENDED to avoid effects which falsify results if
591 validation measurements are taken over real data networks. Obviously,
592 the conditions met there can't be reproduced. As the measurement
593 equipment compared is designed to reliably quantify real network
594 performance, validating metrics under real network conditions is
595 desirable of course.
597 Data networks may forward packets differently in the case of:
599 o Different packet sizes chosen for different metric
600 implementations. A proposed countermeasure is selecting the same
601 packet size when validating results of two samples or a sample
602 against an original distribution.
604 o Selection of differing IP addresses and ports used by different
605 metric implementations during metric validation tests.
If ECMP is
606 applied on IP or MPLS level, different paths can result (note that
607 it may be impossible to detect an MPLS ECMP path from an IP
608 endpoint). A proposed countermeasure is to connect the
609 measurement equipment to be compared by a NAT device, or to
610 establish a single tunnel to transport all measurement traffic.
611 The aim is to have the same IP addresses and port for all
612 measurement packets or to avoid ECMP by a layer 2 tunnel.
614 o Different IP options.
616 o Different DSCP.
618 The test design may have to be adapted for the purpose of the
619 measurement. Creation of delay and delay variation probes is simple
620 and straightforward, even if the measurement runs across a real data
621 network. Collecting a large number of packet loss samples on a real
622 data network while being sure that operational conditions are stable
623 may not be feasible. Further discussion on test designs to verify
624 specific metrics may indeed be required.
626 5. Acknowledgements
628 Gerhard Hasslinger commented on a first version of this document,
629 suggested statistical tests and the evaluation of time series
630 information. Henk Uijterwaal pushed this work and Mike Hamilton
631 reviewed the document before publication.
633 6. Contributors
635 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner-
636 metrictest [bradner-metrictest], and major parts of it are quoted in
637 this document. Al Morton and Scott Bradner commented on this draft
638 before publication.
640 7. IANA Considerations
642 This memo includes no request to IANA.
644 8. Security Considerations
646 This draft does not raise any specific security issues.
648 9. References
650 9.1. Normative References
652 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
653 3", BCP 9, RFC 2026, October 1996.
655 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
656 Requirement Levels", BCP 14, RFC 2119, March 1997.
658 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis,
659 "Framework for IP Performance Metrics", RFC 2330,
660 May 1998.
662 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way
663 Delay Metric for IPPM", RFC 2679, September 1999.
665 9.2. Informative References
667 [Autocorrelation]
668 Wikipedia, "Autocorrelation", December 2008.
670 [Correlation]
671 Wikipedia, "Correlation", June 2009.
673 [Precision]
674 Wikipedia, "Accuracy and precision", June 2009.
676 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
677 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
678 RFC 5357, October 2008.
680 [Rule of thumb]
681 Wikipedia, "Confidence interval", October 2008.
683 [bradner-metrictest]
684 Bradner, S., Mankin, A., and V. Paxson, "Advancement of
685 metrics specifications on the IETF Standards Track",
686 draft-bradner-metricstest-03 (work in progress),
687 July 2007.
689 Authors' Addresses
691 Ruediger Geib (editor)
692 Deutsche Telekom
693 Heinrich Hertz Str. 3-7
694 Darmstadt, 64295
695 Germany
697 Phone: +49 6151 628 2747
698 Email: Ruediger.Geib@telekom.de
700 Reza Fardid
701 Covad Communications
702 2510 Zanker Road
703 San Jose, CA 95131
704 USA
706 Phone: +1 408 434-2042
707 Email: RFardid@covad.com