idnits 2.17.00 (12 Aug 2021) /tmp/idnits7840/draft-niccolini-hash-descr-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == There are 3 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 9) being 60 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 16 instances of too long lines in the document, the longest one being 2 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 294 has weird spacing: '...ng long stm...' == Line 295 has weird spacing: '...ng long utmp;...' == Line 297 has weird spacing: '...ng long sum =...' == Line 374 has weird spacing: '...typedef unsig...' == Line 375 has weird spacing: '...typedef unsig...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 2003) is 6792 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '40' on line 267 -- Looks like a reference, but probably isn't: '0' on line 486 -- Looks like a reference, but probably isn't: '1' on line 464 -- Looks like a reference, but probably isn't: '2' on line 464 -- Looks like a reference, but probably isn't: '3' on line 464 -- Looks like a reference, but probably isn't: '4' on line 482 -- Looks like a reference, but probably isn't: '5' on line 465 -- Looks like a reference, but probably isn't: '6' on line 465 -- Looks like a reference, but probably isn't: '7' on line 465 -- Looks like a reference, but probably isn't: '8' on line 466 -- Looks like a reference, but probably isn't: '9' on line 466 -- Looks like a reference, but probably isn't: '10' on line 466 -- Looks like a reference, but probably isn't: '11' on line 466 -- Possible downref: Non-RFC (?) normative reference: ref. 'DuFw' -- Possible downref: Non-RFC (?) normative reference: ref. 'DuGr01' == Outdated reference: draft-ietf-psamp-sample-tech has been published as RFC 5475 -- Possible downref: Non-RFC (?) normative reference: ref. 'DiRoCla' -- Possible downref: Non-RFC (?) normative reference: ref. 'Jenk97' -- Possible downref: Non-RFC (?) normative reference: ref. 'HaKr97' -- Possible downref: Non-RFC (?) normative reference: ref. 'NiTa03' Summary: 5 errors (**), 0 flaws (~~), 10 warnings (==), 22 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft 3 Document: 4 Expires: April 2004 5 S. Niccolini 6 University of Pisa 7 M. Molina 8 NEC Europe Ltd. 9 Nick Duffield 10 AT&T Labs - 11 - Research 13 October 2003 15 Hash functions description for packet selection 17 Status of this Memo 19 This document is an Internet-Draft and is in full conformance with 20 all provisions of Section 10 of RFC2026. 22 Internet-Drafts are working documents of the Internet Engineering 23 Task Force (IETF), its areas, and its working groups. Note that 24 other groups may also distribute working documents as Internet- 25 Drafts. Internet-Drafts are draft documents valid for a maximum of 26 six months and may be updated, replaced, or obsoleted by other 27 documents at any time. It is inappropriate to use Internet-Drafts 28 as reference material or to cite them other than as "work in 29 progress." 31 The list of current Internet-Drafts can be accessed at 32 http://www.ietf.org/ietf/1id-abstracts.txt 34 The list of Internet-Draft Shadow Directories can be accessed at 35 http://www.ietf.org/shadow.html. 37 Abstract 39 This document is concerned with the specification and properties of 40 Hash Functions for packet selection in PSAMP. Having a detailed and 41 exhaustive description of the Hash Function, in terms of input 42 parameters, output value, selection range and internal operations 43 is fundamental for coherent configuration of the same hash based 44 selection at different points in the network. This coherency is 45 fundamental for all network measurement applications needing 46 consistent packet selection. 48 Table of Contents 50 1. Introduction.................................................2 51 1.1 Hash-based Packet Selection..................................2 52 1.2 Consistent Packet Selection..................................3 54 October 2003 56 2. Criteria for PSAMP Hash Functions............................3 57 3. Hash function description....................................4 58 3.1 MMH (Multilinear Modular Hashing) function...................4 59 3.1.1 Input parameters.............................................5 60 3.1.2 Built in parameters..........................................5 61 3.1.3 Output.......................................................5 62 3.1.4 Functioning..................................................5 63 3.2 "Bob" hash function..........................................7 64 3.2.1 Input Parameters.............................................7 65 3.2.2 Built in parameters..........................................7 66 3.2.3 Output.......................................................8 67 3.2.4 Functioning..................................................8 68 4. References..................................................10 69 5. Author's Addresses..........................................11 70 6. Intellectual Property Statement.............................11 71 7. Full Copyright Statement....................................11 73 1. Introduction 75 This document is concerned with the specification and properties of 76 Hash Functions for packet selection in PSAMP. Having a detailed and 77 exhaustive description of the Hash Function, in terms of input 78 parameters, output value, selection range and internal operations 79 is fundamental for coherent configuration of the same hash based 80 selection at different points in the network. This coherency is 81 fundamental for all network measurement applications needing 82 consistent packet selection. 84 1.1 Hash-based Packet Selection 86 The PSAMP framework document for passive packet measurement [DuFw] 87 divides packet selection techniques in two broad classes: sampling 88 and filtering. Filtering is defined as follows: 90 ææa filter is a selection operation that selects a packet 91 deterministically based on the packet content, its treatment, 92 and functions of these occurring in the selection state. 93 Examples include match/mask filtering, and hash-based 94 selection.ÆÆ 96 The deterministic property is of particular importance for 97 measurement applications that require the consistent selection of 98 packet at multiple points in the network. This is described more 99 fully in the next section. 101 This document is concerned with the specification and properties of 102 hash functions to be used for packet selection in PSAMP. We start 104 October 2003 106 by recalling the following definitions of the components of hash- 107 based selection from [DuFw]. 109 * Hash-based selection: a filter specified by a hash domain, a 110 hash function, and hash range and a hash selection range. 112 * Hash domain: a subset of the packet content and the packet 113 treatment, viewed as an N-bit string for some positive integer N. 115 * Hash range: a set of M-bit strings for some positive integer M. 117 * Hash function: a deterministic map from the hash domain into 118 the hash range. 120 * Hash selection range: a subset of the hash range. The packet is 121 selected if the action of the hash function on the hash domain 122 for the packet yields a result in the hash selection range. 124 1.2 Consistent Packet Selection 126 Hash functions are deterministic in nature, in the sense that they 127 always produce the same output from a given input. This is a useful 128 property for some network measurement applications, since it 129 enables the consistent selection of a given packet at different 130 points in the network, in a sense we now describe. 131 Suppose that a hash-based selector operates at multiple points on a 132 packetÆs path. Consider the following conditions: 134 1. The same hash function is employed at each point 135 2. The hash domain is the same at each point and contains only 136 invariant portions of the packet header and/or payload, 137 i.e., those portions that do not change from point to point 138 3. the hash selection range is the same at each point. 140 If, and only if, conditions 1,2, and 3 are all satisfied, then the 141 hash-based selectors at each point are consistent in the sense that 142 a given packet is selected at either all the points or at none. 143 Consistent packet selection using hash functions was proposed in 144 [DuGr01]. 146 This document will specify (in its final form) certain hash 147 functions suitable to be used in the PSAMP protocol. 149 2. Criteria for PSAMP Hash Functions. 151 The number of hash functions that could conceivably be used for 152 consistent packet selection is virtually infinite. However, for the 153 applications considered in [DuFw], two properties are particularly 154 important: computation speed and mixing strength. The latter means 155 that small changing in the input should produce big changes in the 157 October 2003 159 output (the hash value). Jointly satisfying these two properties is 160 difficult, therefore the choice of a good hash function often 161 implies a trade off decision. 163 The purpose of this document is to give a formal description of 164 some hash function rather that indicate which is the best one. 165 Nevertheless, we have included some that, according to tests 166 performed [NiTa03], are candidates to satisfy both properties. 167 Other hash functions may be added in later version of this draft, 168 along with better indications about their relative performances. 170 The description of each hash functions that we propose in this 171 draft comprises the following items: 173 1. input parameters; 174 2. built-in parameters; 175 3. output (including selection range); 176 4. hash function operation; 178 The first three items need to be formalized in the information 179 model for packet selection [ZeTech] and the PSAMP MIB [DiRoCla], in 180 order that the hash function, consistently configured, can be 181 activated in the desired measurement points. The last item provides 182 a consistent reference for the implementation of the hash function 183 by defining the precise mapping that the hash function makes from 184 its input to its output. Some reference code is included for this 185 purpose. 187 3. Hash function description 189 3.1 MMH (Multilinear Modular Hashing) function 191 MMH is a hash function suitable for fast software implementation 192 able to hash a string of variable size [HaKr97]. 193 In order to achieve fast software implementation while preserving 194 the low collision probability, the MMH hash function is built 195 implementing a division-less modular reduction. For sake of 196 simplicity, we refer to the implementation done on 32-bit word 197 machines. The generalization of such implementation may be 198 considered for adapting the MMH to different word length machines, 199 but the reference code presented in 3.1.4 is optimized for 32-bit 200 word machines and requires a compiler that supports the 201 multiplication of two 32-bit words and feeds the result into a 64 202 bit string (long long type) 204 October 2003 206 3.1.1 Input parameters 208 The only input parameter of the MMH is: 210 - the length of the input string (key) to be hashed, in bytes. The 211 MMH operates on elementary blocks of 32 bits, therefore the key 212 length must be a multiple of 4 (if the key is not a multiple of 213 4, it must be zero-padded) 215 3.1.2 Built in parameters 217 The MMH uses the following built-in parameters: 219 - ro: a prime number in the range of [2^32 ; 2^32 + 2^16]. It is 220 used to implement the division-less modular reduction; 221 - A vector of (different) 32-bit numbers used to implement the so 222 called inner-product operation. A good choice is to take only 223 prime numbers; 224 - The number of elements of such a vector, which must be at least 225 equal to the key length divided by 4. 227 3.1.3 Output 229 The output of the MMH function is a 32-bit number. It should be 230 specified: 231 - A 32 bit mask to apply to the output 232 - The selection range as a list of non overlapping intervals [start 233 value, end value] where value is in [0,2^32] 235 3.1.4 Functioning 237 To obtain the hash value, MMH first computes two quantities called 238 inner-product-low and inner-product-high, respectively. These are 239 obtained adding together the results of the multiplication of the 240 words of the input key by the built-in vector. 241 The two inner products computed are 64-bit long, they need to be 242 firstly modular reduced to the prime number range ([0 ; ro]) and 243 then to the 32-bit range ([0 ; 2^32]). The output of the reduction 244 (and of the function itself) is the hash value. 246 Reference code: 248 /*----------------------mmh hash function ---------------------------- 249 -*/ 250 October 2003 252 /* 253 */ 254 /* The mmh_table size limits the maximum key length that can be hashed 255 */ 256 /* With a table of size T we can hash keys up to 4*T bytes long keys 257 */ 258 /* 259 */ 260 /*-------------------------------------------------------------------- 261 -*/ 263 #define DO(i) sum += key2[i] * (unsigned long long) mmh_table[i] 265 /* mmh signature table with the first 40 prime numbers: */ 266 /* we can hash string up to 160 char long */ 267 static unsigned long mmh_table[40] = 268 { 2, 3, 5, 7, 11, 269 13, 17, 19, 23, 29, 270 31, 37, 41, 43, 47, 271 53, 59, 61, 67, 71, 272 73, 79, 83, 89, 97, 273 101, 103, 107, 109, 113, 274 127, 131, 137, 139, 149, 275 151, 157, 163, 167, 173}; 277 /** 278 ** FUNCTION: mmh 279 ** PARAMETERS: 280 ** - key: pointer to the char string to be hashed 281 ** - len: length of the char string to be hashed 282 ** RETURN: hashvalue in an unsigned long variable 283 ** PURPOSE: Accepts a pointer to a string to be hashed 284 ** and returns an unsigned long. 285 ** NOTE: using this function requires that 286 ** the key is a 4-byte multiple otherwise 287 ** it gives erroneus results. The padding with 0x0 bytes 288 ** of keys non multiple of 4 must be performed 289 ** before calling this function! 290 **/ 291 unsigned long mmh(char *key, unsigned short len) 292 { 293 unsigned short i; 294 signed long long stmp; 295 unsigned long long utmp; 297 unsigned long long sum = 0LL; /*running sum*/ 299 unsigned long ret; /*return value*/ 301 unsigned long *key2 = (unsigned long *)key; 303 October 2003 305 for (i=0; i<(len/4); i++) 306 DO(i); 308 /*********return (sum % 0x10000000fLL);***************/ 310 stmp = (sum & 0xffffffffLL) - ((sum >> 32) * 15); /*lo - hi 311 * 15 */ 312 utmp = (stmp & 0xffffffffLL) - ((stmp >> 32) * 15); /*lo - hi 313 * 15 */ 315 ret = utmp & 0xffffffff; 317 if (utmp > 0x10000000fLL) /* if larger than p - subtract 15 318 again */ 319 ret -=15; 321 return ret; 322 } 324 3.2 "Bob" hash function 326 "Bob" hash function is a hash function designed for having each bit 327 of the input affecting every bit of the return value and using both 328 1-bit and 2-bit deltas to achieve the so called avalanche effect 329 [Jenk97]. This function was originally built for hash table lookup 330 with fast software implementation. 332 3.2.1 Input Parameters 334 The input parameters of such a function are: 335 - the length of the input string (key) to be hashed, in bytes. The 336 elementary input blocks of Bob hash are the single bytes, 337 therefore no padding is needed. 338 - an init value (an arbitrary 32-bit number). 340 3.2.2 Built in parameters 342 The Bob Hash uses the following built-in parameter: 344 - the golden ratio (an arbitrary 32-bit number used in the hash 345 function computation: its purpose is to avoid mapping all zeros 346 to all zeros); 348 Note: the mix sub-function (see mix (a,b,c) macro in the reference 349 code in 3.2.4) has a number of parameters governing the shifts in 350 the registers. The one presented is not the only possible choice. 352 October 2003 354 It is an open point whether these may be considered additional 355 built-in parameters to specify at function configuration. 357 3.2.3 Output 359 The output of the MMH function is a 32-bit number. It should be 360 specified: 361 - A 32 bit mask to apply to the output 362 - The selection range as a list of non overlapping intervals [start 363 value, end value] where value is in [0,2^32] 365 3.2.4 Functioning 367 The hash value is obtained computing first an initialization of an 368 internal state (composed of 3 32-bit numbers, called a, b, c in the 369 reference code below), then, for each input byte of the key the 370 internal state is combined by addition and mixed using the mix sub- 371 function. Finally, the internal state mixed one last time and the 372 third number of the state (c) is chosen as the return value. 374 typedef unsigned long int ub4; /* unsigned 4-byte quantities */ 375 typedef unsigned char ub1; /* unsigned 1-byte quantities */ 377 #define hashsize(n) ((ub4)1<<(n)) 378 #define hashmask(n) (hashsize(n)-1) 380 /* 381 -------------------------------------------------------------------- 382 mix -- mix 3 32-bit values reversibly. 383 For every delta with one or two bits set, and the deltas of all three 384 high bits or all three low bits, whether the original value of a,b,c 385 is almost all zero or is uniformly distributed, 386 * If mix() is run forward or backward, at least 32 bits in a,b,c 387 have at least 1/4 probability of changing. 388 * If mix() is run forward, every bit of c will change between 1/3 and 389 2/3 of the time. (Well, 22/100 and 78/100 for some 2-bit deltas.) 390 mix() was built out of 36 single-cycle latency instructions in a 391 structure that could supported 2x parallelism, like so: 392 a -= b; 393 a -= c; x = (c>>13); 394 b -= c; a ^= x; 395 b -= a; x = (a<<8); 396 c -= a; b ^= x; 397 c -= b; x = (b>>13); 398 ... 399 Unfortunately, superscalar Pentiums and Sparcs can't take advantage 400 of that parallelism. They've also turned some of those single-cycle 401 latency instructions into multi-cycle latency instructions 402 -------------------------------------------------------------------- 403 */ 404 October 2003 406 #define mix(a,b,c) \ 407 { \ 408 a -= b; a -= c; a ^= (c>>13); \ 409 b -= c; b -= a; b ^= (a<<8); \ 410 c -= a; c -= b; c ^= (b>>13); \ 411 a -= b; a -= c; a ^= (c>>12); \ 412 b -= c; b -= a; b ^= (a<<16); \ 413 c -= a; c -= b; c ^= (b>>5); \ 414 a -= b; a -= c; a ^= (c>>3); \ 415 b -= c; b -= a; b ^= (a<<10); \ 416 c -= a; c -= b; c ^= (b>>15); \ 417 } 419 /* 420 -------------------------------------------------------------------- 421 hash() -- hash a variable-length key into a 32-bit value 422 k : the key (the unaligned variable-length array of bytes) 423 len : the length of the key, counting by bytes 424 initval : can be any 4-byte value 425 Returns a 32-bit value. Every bit of the key affects every bit of 426 the return value. Every 1-bit and 2-bit delta achieves avalanche. 427 About 6*len+35 instructions. 429 The best hash table sizes are powers of 2. There is no need to do 430 mod a prime (mod is sooo slow!). If you need less than 32 bits, 431 use a bitmask. For example, if you need only 10 bits, do 432 h = (h & hashmask(10)); 433 In which case, the hash table should have hashsize(10) elements. 435 If you are hashing n strings (ub1 **)k, do it like this: 436 for (i=0, h=0; i= 12) 463 { 464 a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16) +((ub4)k[3]<<24)); 465 b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16) +((ub4)k[7]<<24)); 466 c += (k[8] +((ub4)k[9]<<8) +((ub4)k[10]<<16)+((ub4)k[11]<<24)); 467 mix(a,b,c); 468 k += 12; len -= 12; 469 } 471 /*---------------------------- handle the last 11 bytes */ 472 c += length; 473 switch(len) /* all the case statements fall through */ 474 { 475 case 11: c+=((ub4)k[10]<<24); 476 case 10: c+=((ub4)k[9]<<16); 477 case 9 : c+=((ub4)k[8]<<8); 478 /* the first byte of c is reserved for the length */ 479 case 8 : b+=((ub4)k[7]<<24); 480 case 7 : b+=((ub4)k[6]<<16); 481 case 6 : b+=((ub4)k[5]<<8); 482 case 5 : b+=k[4]; 483 case 4 : a+=((ub4)k[3]<<24); 484 case 3 : a+=((ub4)k[2]<<16); 485 case 2 : a+=((ub4)k[1]<<8); 486 case 1 : a+=k[0]; 487 /* case 0: nothing left to add */ 488 } 489 mix(a,b,c); 490 /*-------------------------------- report the result */ 491 return c; 492 } 494 4. References 496 [DuFw] N. Duffield: A Framework for Passive Packet Measurement, 497 Internet Draft, , work 498 in progress, June 2003 500 [DuGr01] N. G. Duffield and M. Grossglauser, Trajectory Sampling 501 for Direct Traffic Observation, IEEE/ACM Trans. on 502 Networking, 9(3), 280-292, June 2001. 504 [ZeTech] T.Zseby, M.Molina, F.Raspall, N.Duffield: Sampling and 505 Filtering Techniques for IP Packet Selection, Internet 506 Draft < draft-ietf-psamp-sample-tech-02.txt>, work in 507 progress, June 2003 509 [DiRoCla] T.Dietz, D.Romascanu, B. Claise: Definitions of Managed 510 Objects for Packet Sampling, Internet Draft , work in progress, June 2003 513 October 2003 515 [Jenk97] B. Jenkins: Algorithm Alley, Dr. Dobb's Journal, September 516 1997. http://burtleburtle.net/bob/hash/doobs.html 518 [HaKr97] S. Halevi, H. Krawczyk : MMH: Software Message 519 Authentication in the Gbit/second Rates, LNCS vol. 1267, 520 Springer, 1997. Pages 172-189 522 [NiTa03] S. Niccolini, S. Tartarelli, F. Raspall, M. Molina : On 523 time synchronization and hashing for passive One-Way Delay 524 measurements, work in progress, available at 525 http://www.ccrle.nec.de/mmolina/papers/paper_hash_sync.pdf 527 5. Author's Addresses 529 Saverio Niccolini 530 University of Pisa 531 Via Caruso 1 532 56122 Pisa 533 Italy 534 Phone: +39 050 2217-592 535 EMail: s.niccolini@iet.unipi.it 537 Maurizio Molina 538 NEC Europe Ltd., Network Laboratories 539 Adenauerplatz 6 540 69115 Heidelberg 541 Germany 542 Phone: +49 6221 90511-18 543 Email: molina@ccrle.nec.de 545 Nick Duffield 546 AT&T Labs - Research 547 Room B-139 548 180 Park Ave 549 Florham Park NJ 07932, USA 550 Phone: +1 973-360-8726 551 Email: duffield@research.att.com 553 6. Intellectual Property Statement 555 AT&T Corporation may own intellectual property applicable to this 556 contribution. The IETF has been notified of AT&T's licensing intent 557 for the specification contained in this document. See 558 http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR 560 statement. 561 7. Full Copyright Statement 563 Copyright (C) The Internet Society (2002). All Rights Reserved. 564 This document and translations of it may be copied and furnished to 565 others, and derivative works that comment on or otherwise explain 567 October 2003 569 it or assist in its implementation may be prepared, copied, 570 published and distributed, in whole or in part, without restriction 571 of any kind, provided that the above copyright notice and this 572 paragraph are included on all such copies and derivative works. 573 However, this document itself may not be modified in any way, such 574 as by removing the copyright notice or references to the Internet 575 Society or other Internet organizations, except as needed for the 576 purpose of developing Internet standards in which case the 577 procedures for copyrights defined in the Internet Standards process 578 must be followed, or as required to translate it into languages 579 other than 580 English. 582 The limited permissions granted above are perpetual and will not be 583 revoked by the Internet Society or its successors or assigns. 585 This document and the information contained herein is provided on 586 an 587 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 588 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 589 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 590 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 591 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.