AQM working group                              T. Hoeiland-Joergensen
Internet-Draft                                    Karlstad University
Intended status: Experimental                             P. McKenney
Expires: September 19, 2016               IBM Linux Technology Center
                                                              D. Taht
                                                             Teklibre
                                                            J. Gettys

                                                           E. Dumazet
                                                         Google, Inc.
                                                       March 18, 2016


     The FlowQueue-CoDel Packet Scheduler and Active Queue Management
                                 Algorithm
                       draft-ietf-aqm-fq-codel-06

Abstract

   This memo presents the FQ-CoDel hybrid packet scheduler/Active Queue
   Management algorithm, a powerful tool for fighting bufferbloat and
   reducing latency.

   FQ-CoDel mixes packets from multiple flows and reduces the impact of
   head of line blocking from bursty traffic.  It provides isolation
   for low-rate traffic such as DNS, web, and videoconferencing
   traffic.  It improves utilisation across the networking fabric,
   especially for bidirectional traffic, by keeping queue lengths
   short; and it can be implemented in a memory- and CPU-efficient
   fashion across a wide range of hardware.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 19, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Conventions used in this document
     1.2.  Terminology and concepts
     1.3.  Informal summary of FQ-CoDel
   2.  CoDel
   3.  Flow Queueing
   4.  The FQ-CoDel scheduler
     4.1.  Enqueue
       4.1.1.  Alternative classification schemes
     4.2.  Dequeue
   5.  Implementation considerations
     5.1.  Data structures
     5.2.  Parameters
       5.2.1.  Interval
       5.2.2.  Target
       5.2.3.  Packet limit
       5.2.4.  Quantum
       5.2.5.  Flows
       5.2.6.  Explicit Congestion Notification (ECN)
       5.2.7.  CE threshold
     5.3.  Probability of hash collisions
     5.4.  Memory Overhead
     5.5.  Per-Packet Timestamping
     5.6.  Limiting queueing in lower layers
     5.7.  Other forms of "Fair Queueing"
     5.8.  Differences between CoDel and FQ-CoDel behaviour
   6.  Limitations of flow queueing
     6.1.  Fairness between things other than flows
     6.2.  Flow bunching by opaque encapsulation
     6.3.  Low-priority congestion control algorithms
   7.  Deployment status and future work
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The FlowQueue-CoDel (FQ-CoDel) algorithm is a combined packet
   scheduler and Active Queue Management (AQM) [RFC3168] algorithm
   developed as part of the bufferbloat-fighting community effort
   [BLOATWEB].  It is based on a modified Deficit Round Robin (DRR)
   queue scheduler [DRR][DRRPP], with the CoDel AQM [I-D.ietf-aqm-codel]
   algorithm operating on each queue.  This document describes the
   combined algorithm; reference implementations are available for the
   ns2 [NS2] and ns3 [NS3] network simulators, and it is included in
   the mainline Linux kernel as the fq_codel queueing discipline
   [LINUXSRC].

   FQ-CoDel is a general, efficient, nearly parameterless queue
   management approach combining flow queueing with CoDel.  It is a
   powerful tool for solving bufferbloat [BLOAT], and we believe it to
   be safe to turn on by default, as has already happened in a number
   of Linux distributions.
   In this document we describe the Linux implementation in sufficient
   detail for an independent implementation, to enable deployment
   outside of the Linux ecosystem.

   Since the FQ-CoDel algorithm was originally developed in the Linux
   kernel, that implementation is still considered canonical.  This
   document strives to describe the algorithm in the abstract in the
   first sections and separate out most implementation details in
   subsequent sections, but does use the Linux implementation as the
   reference for default behaviour in the algorithm description
   itself.

   The rest of this document is structured as follows: This section
   gives some concepts and terminology used in the rest of the
   document, and gives a short informal summary of the FQ-CoDel
   algorithm.  Section 2 gives an overview of the CoDel algorithm.
   Section 3 covers the flow hashing and DRR portion.  Section 4 then
   describes the working of the algorithm in detail, while Section 5
   describes implementation details and considerations.  Section 6
   lists some of the limitations of using flow queueing.  Finally,
   Section 7 outlines the current status of FQ-CoDel deployment and
   lists some possible future areas of inquiry, and Section 8
   reiterates some important security points that must be observed in
   the implementation.

1.1.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

1.2.  Terminology and concepts

   Flow: A flow is typically identified by a 5-tuple of source IP,
   destination IP, source port, destination port, and protocol number.
   It can also be identified by a superset or subset of those
   parameters, or by media access control (MAC) address, or by other
   means.  FQ-CoDel hashes flows into a configurable number of buckets
   to assign packets to internal queues.

   Queue: A queue of packets represented internally in FQ-CoDel.  In
   most instances each flow gets its own queue; however, because of
   the possibility of hash collisions, this is not always the case.
   In an attempt to avoid confusion, the word 'queue' is used to refer
   to the internal data structure, and 'flow' to refer to the actual
   stream of packets being delivered to the FQ-CoDel algorithm.

   Scheduler: A mechanism to select which queue a packet is dequeued
   from.

   CoDel AQM: The Active Queue Management algorithm employed by
   FQ-CoDel [I-D.ietf-aqm-codel].

   DRR: Deficit round-robin scheduling [DRR].

   Quantum: The maximum number of bytes to be dequeued from a queue at
   once.

   Interval: Characteristic time period used by the control loop of
   CoDel to detect when a persistent queue is developing (see
   Section 4.3 of [I-D.ietf-aqm-codel]).

   Target: Setpoint value of the minimum sojourn time of packets in a
   queue, used as the target of the control loop in CoDel (see
   Section 4.4 of [I-D.ietf-aqm-codel]).

1.3.  Informal summary of FQ-CoDel

   FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [I-D.ietf-aqm-codel],
   with an optimisation for sparse flows similar to Shortest Queue
   First (SQF) [SQF] and DRR++ [DRRPP].  We call this "Flow Queueing"
   rather than "Fair Queueing", as flows that build a queue are
   treated differently from flows that do not.
   By default, FQ-CoDel stochastically classifies incoming packets
   into different queues by hashing the 5-tuple of IP protocol number
   and source and destination IP and port numbers, perturbed with a
   random number selected at initiation time (although other flow
   classification schemes can optionally be configured instead; see
   Section 4.1.1).  Each queue is managed by the CoDel AQM algorithm
   [CODEL].  Packet ordering within a queue is preserved, since queues
   have FIFO ordering.

   The FQ-CoDel algorithm consists of two logical parts: the
   scheduler, which selects which queue to dequeue a packet from, and
   the CoDel AQM, which works on each of the queues.  The subtleties
   of FQ-CoDel are mostly in the scheduling part, whereas the
   interaction between the scheduler and the CoDel algorithm is fairly
   straightforward:

   At initialisation, each queue is set up to have a separate set of
   CoDel state variables.  By default, 1024 queues are created.  The
   Linux implementation at the time of writing supports anywhere from
   one to 64K separate queues.  Each queue maintains its state
   variables throughout its lifetime, and so acts the same as the
   non-FQ CoDel variant would.  This means that with only one queue,
   FQ-CoDel behaves essentially the same as CoDel by itself.

   On dequeue, FQ-CoDel selects a queue from which to dequeue by a
   two-tier round-robin scheme, in which each queue is allowed to
   dequeue up to a configurable quantum of bytes for each iteration.
   Deviations from this quantum are maintained as byte credits for the
   queue, which serves to make the fairness scheme byte-based rather
   than packet-based.  The two-tier round-robin mechanism
   distinguishes between "new" queues (which don't build up a standing
   queue) and "old" queues, which have queued enough data to be around
   for more than one iteration of the round-robin scheduler.
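As an illustration of the classification step summarised above, the following C sketch hashes a 5-tuple together with a startup-time perturbation and reduces the result modulo the number of queues.  The mixer is a toy FNV-1a-style hash chosen only for brevity; the Linux implementation uses a Jenkins hash (see Section 4.1), and all names and constants here are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy sketch of FQ-CoDel's stochastic classification: hash the 5-tuple
 * together with a random perturbation chosen at startup, then take the
 * result modulo the number of queues.  NOT the Jenkins hash Linux uses. */

#define NUM_QUEUES 1024  /* the Linux default */

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint32_t toy_hash(const uint8_t *data, size_t len, uint32_t perturb)
{
    uint32_t h = 2166136261u ^ perturb;  /* seed mixed with perturbation */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

static unsigned classify(const struct five_tuple *ft, uint32_t perturb)
{
    uint8_t key[13];                     /* serialise fields to avoid */
    memcpy(key,      &ft->src_ip,   4);  /* hashing struct padding    */
    memcpy(key + 4,  &ft->dst_ip,   4);
    memcpy(key + 8,  &ft->src_port, 2);
    memcpy(key + 10, &ft->dst_port, 2);
    key[12] = ft->proto;
    return toy_hash(key, sizeof(key), perturb) % NUM_QUEUES;
}
```

Because the perturbation is fixed for the lifetime of the instance, packets of the same flow always land in the same queue, while an outside attacker cannot predict which flows will collide.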
   This new/old queue distinction has a particular consequence for
   queues that don't build up more than a quantum of bytes before
   being visited by the scheduler: Such queues are removed from the
   list, and then re-added as a new queue each time a packet arrives
   for them, and so will get priority over queues that do not empty
   out each round (except for a minor modification to protect against
   starvation, detailed below).  Exactly how little data a flow has to
   send to keep its queue in this state is somewhat difficult to
   reason about, because it depends on both the egress link speed and
   the number of concurrent flows.  However, in practice many things
   that are beneficial to have prioritised for typical internet use
   (ACKs, DNS lookups, interactive SSH, HTTP requests, VoIP) _tend_ to
   fall in this category, which is why FQ-CoDel performs so well for
   many practical applications.  However, the implicitness of the
   prioritisation means that for applications that require guaranteed
   priority (for instance multiplexing the network control plane over
   the network itself), explicit classification is still needed.

   This scheduling scheme has some subtlety to it, which is explained
   in detail in the remainder of this document.

2.  CoDel

   CoDel is described in the ACM Queue paper [CODEL] and the IETF
   document [I-D.ietf-aqm-codel].  The basic idea is to control queue
   length, maintaining sufficient queueing to keep the outgoing link
   busy, but avoiding building up the queue beyond that point.  This
   is done by preferentially dropping packets that remain in the queue
   for "too long".  Packets are dropped by head drop, which lowers the
   time for the drop signal to propagate back to the sender by the
   length of the queue, and helps trigger TCP fast retransmit sooner.

   The CoDel algorithm itself will not be described here; instead we
   refer the reader to the CoDel draft [I-D.ietf-aqm-codel].
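For orientation only, the shape of CoDel's control law (referred to again in Section 4.2) can be illustrated with a few lines of C: while CoDel remains in its dropping state, the spacing between successive drops shrinks with the inverse square root of the drop count, so the drop rate slowly rises until the sojourn time falls back below target.  This is a sketch of that one relationship, not an implementation of CoDel; see [I-D.ietf-aqm-codel] for the authoritative description.

```c
#include <assert.h>

/* Sketch of CoDel's control law: interval / sqrt(count), in microseconds.
 * INTERVAL_US is the 100 ms default interval. */

#define INTERVAL_US 100000.0

static double sqrt_newton(double x)   /* plain Newton iteration, to keep */
{                                     /* this sketch dependency-free     */
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 64; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Spacing until the next drop, after n drops in the current dropping
 * state. */
static double drop_spacing_us(unsigned n)
{
    return INTERVAL_US / sqrt_newton((double)n);
}
```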
3.  Flow Queueing

   The intention of FQ-CoDel's scheduler is to give each _flow_ its
   own queue, hence the term _Flow Queueing_.  Rather than a perfect
   realisation of this, a hashing-based scheme is used, where flows
   are hashed into a number of buckets, each of which has its own
   queue.  The number of buckets is configurable, and presently
   defaults to 1024 in the Linux implementation.  This is enough to
   avoid hash collisions for a moderate number of flows, as seen for
   instance in a home gateway.  Depending on the characteristics of
   the link, this can be tuned to trade off memory for a lower
   probability of hash collisions.  See Section 6 for a more in-depth
   discussion of this.

   By default, the flow hashing is performed on the 5-tuple of source
   and destination IP addresses and port numbers and IP protocol
   number.  While the hashing can be customised to match on arbitrary
   packet bytes, care should be taken when doing so: Much of the
   benefit of the FQ-CoDel scheduler comes from this per-flow
   distinction.  However, the default hashing does have some
   limitations, as discussed in Section 6.

   FQ-CoDel's DRR scheduler is byte-based, employing a deficit round-
   robin mechanism between queues.  This works by keeping track of the
   current number of _byte credits_ for each queue.  This number is
   initialised to the configurable quantum; each time a queue gets a
   dequeue opportunity, it gets to dequeue packets, decreasing the
   number of credits by the packet size for each packet.  This
   continues until the value of the byte credits becomes zero or less,
   at which point it is increased by one quantum, and the dequeue
   opportunity ends.

   This means that if one queue contains packets of, for instance,
   size quantum/3, and another contains quantum-sized packets, the
   first queue will dequeue three packets each time it gets a turn,
   whereas the second dequeues only one.
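The worked example above can be checked with a small sketch of the credit mechanism.  A round quantum of 1500 bytes is used here, rather than the 1514-byte Linux default, so that the quantum/3 case divides evenly; `packets_per_turn` is an illustrative helper, not part of any implementation.

```c
#include <assert.h>

/* Sketch of the deficit mechanism: a queue dequeues packets while it
 * has positive byte credits; once the credits reach zero or less, one
 * quantum is added back and its turn ends. */

#define QUANTUM 1500  /* round number for illustration; Linux default is 1514 */

static int packets_per_turn(int *credits, int pkt_size)
{
    int sent = 0;
    while (*credits > 0) {    /* dequeue while credits remain */
        *credits -= pkt_size;
        sent++;
    }
    *credits += QUANTUM;      /* replenish by one quantum; turn ends */
    return sent;
}
```

With both queues starting at one quantum of credits, the quantum/3 queue sends three packets per turn and the quantum-sized queue sends one, so both move the same number of bytes per round.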
   This means that flows that send small packets are not penalised by
   the difference in packet sizes; rather, the DRR scheme approximates
   a (single-)byte-based fairness queueing scheme.  The size of the
   quantum determines the scheduling granularity, with the cost of too
   small a quantum being increased scheduling overhead.  For small
   bandwidths, lowering the quantum from the default MTU size can be
   advantageous.

   Unlike plain DRR, there are two sets of flows - a "new" list for
   flows that have not built a queue recently, and an "old" list for
   queues that build a backlog.  This distinction is an integral part
   of the FQ-CoDel scheduler and is described in more detail in
   Section 4.

4.  The FQ-CoDel scheduler

   To make its scheduling decisions, FQ-CoDel maintains two ordered
   lists of active queues, called "new" and "old" queues.  When a
   packet is added to a queue that is not currently active, that queue
   becomes active by being added to the list of new queues.  Later on,
   it is moved to the list of old queues, from which it is removed
   when it is no longer active.  This behaviour is the source of some
   subtlety in the packet scheduling at dequeue time, explained below.

4.1.  Enqueue

   The packet enqueue mechanism consists of three stages:
   classification into a queue, timestamping and bookkeeping, and
   optionally dropping a packet when the total number of enqueued
   packets goes over the maximum.

   When a packet is enqueued, it is first classified into the
   appropriate queue.  By default, this is done by hashing (using a
   Jenkins hash function [JENKINS]) on the 5-tuple of IP protocol, and
   source and destination IP addresses and port numbers (if they
   exist), and taking the hash value modulo the number of queues.
   The hash is salted by modulo addition of a random value selected at
   initialisation time, to prevent possible DoS attacks if the hash is
   predictable ahead of time (see Section 8).  The Linux kernel
   implements the Jenkins hash function by mixing three 32-bit values
   into a single 32-bit output value.  Inputs larger than 96 bits are
   reduced by additional mixing steps, 96 bits at a time.

   Once the packet has been successfully classified into a queue, it
   is handed over to the CoDel algorithm for timestamping.  It is then
   added to the tail of the selected queue, and the queue's byte count
   is updated by the packet size.  Then, if the queue is not currently
   active (i.e., if it is not in either the list of new or the list of
   old queues), it is added to the end of the list of new queues, and
   its number of credits is initialised to the configured quantum.
   Otherwise, the queue is left in its current queue list.

   Finally, the total number of enqueued packets is compared with the
   configured limit, and if it is _above_ this value (which can happen
   since a packet was just enqueued), a packet is dropped from the
   head of the queue with the largest current byte count.  Note that
   this in most cases means that the packet that gets dropped is
   different from the one that was just enqueued, and may even be from
   a different queue.

4.1.1.  Alternative classification schemes

   As mentioned previously, it is possible to modify the
   classification scheme to provide a different notion of a 'flow'.
   The Linux implementation provides this option in the form of the
   "tc filter" command.
   While this can add capabilities (for instance, matching on other
   possible parameters such as MAC address, diffserv code point
   values, firewall rules, flow-specific markings, IPv6 flow label,
   etc.), care should be taken to preserve the notion of 'flow', as
   much of the benefit of the FQ-CoDel scheduler comes from keeping
   flows in separate queues.

   For protocols that do not contain a port number (such as ICMP), the
   Linux implementation simply sets the port numbers to zero and
   performs the hashing as usual.  In practice, this results in each
   such protocol getting its own queue (except in the case of hash
   collisions).  An implementation can perform other classifications
   for protocols that have their own notion of a flow, but SHOULD fall
   back to simply hashing on source and destination IP address and IP
   protocol number in the absence of other information.

   The default classification scheme can additionally be improved by
   performing decapsulation of tunnelled packets prior to hashing on
   the 5-tuple in the encapsulated payload.  The Linux implementation
   does this for common encapsulations known to the kernel, such as
   6in4 [RFC4213], IP-in-IP [RFC2003] and GRE (Generic Routing
   Encapsulation) [RFC2890].  This helps to distinguish between flows
   that share the same (outer) 5-tuple, but of course is limited to
   unencrypted tunnels (see Section 6.2).

4.2.  Dequeue

   Most of FQ-CoDel's work is done at packet dequeue time.  It
   consists of three parts: selecting a queue from which to dequeue a
   packet, actually dequeuing it (employing the CoDel algorithm in the
   process), and some final bookkeeping.
   For the first part, the scheduler first looks at the list of new
   queues; for the queue at the head of that list, if that queue has a
   negative number of credits (i.e., it has already dequeued at least
   a quantum of bytes), it is given an additional quantum of credits,
   the queue is put onto _the end of_ the list of old queues, and the
   routine selects the next queue and starts again.

   Otherwise, that queue is selected for dequeue.  If the list of new
   queues is empty, the scheduler proceeds down the list of old queues
   in the same fashion (checking the credits, and either selecting the
   queue for dequeuing, or adding credits and putting the queue back
   at the end of the list).

   After having selected a queue from which to dequeue a packet, the
   CoDel algorithm is invoked on that queue.  This applies the CoDel
   control law, which is the mechanism CoDel uses to determine when to
   drop packets (see [I-D.ietf-aqm-codel]).  As a result of this, one
   or more packets may be discarded from the head of the selected
   queue, before the packet that should be dequeued is returned (or
   nothing is returned if the queue is or becomes empty while being
   handled by the CoDel algorithm).

   Finally, if the CoDel algorithm does not return a packet, then the
   queue must be empty, and the scheduler does one of two things: if
   the queue selected for dequeue came from the list of new queues, it
   is moved to _the end of_ the list of old queues.  If instead it
   came from the list of old queues, that queue is removed from the
   list, to be added back (as a new queue) the next time a packet
   arrives that hashes to that queue.  Then (since no packet was
   available for dequeue), the whole dequeue process is restarted from
   the beginning.
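The queue-selection part of this procedure can be sketched as follows.  This is an illustration of the logic, not the kernel code; the names are invented for this sketch, lists are simple arrays rather than kernel linked lists, and (as in the Linux implementation's `deficit <= 0` check) zero credits are treated as exhausted.

```c
#include <assert.h>
#include <string.h>

/* Sketch of dequeue-time queue selection: a queue at the head of a
 * list whose credits are exhausted gets one quantum added and moves to
 * the END of the old list; otherwise it is selected for dequeue.  The
 * new list is always considered before the old list. */

#define QUANTUM 1514
#define MAXQ 16

struct sched {
    int credits[MAXQ];
    int new_list[MAXQ], new_len;
    int old_list[MAXQ], old_len;
};

static int pop_front(int *list, int *len)
{
    int q = list[0];
    memmove(list, list + 1, (size_t)(--*len) * sizeof(int));
    return q;
}

static void push_back(int *list, int *len, int q)
{
    list[(*len)++] = q;
}

/* Returns the queue selected for dequeue, or -1 if no queue is active. */
static int select_queue(struct sched *s)
{
    for (;;) {
        int *list, *len;
        if (s->new_len > 0)      { list = s->new_list; len = &s->new_len; }
        else if (s->old_len > 0) { list = s->old_list; len = &s->old_len; }
        else return -1;

        int q = list[0];
        if (s->credits[q] <= 0) {       /* quantum exhausted: refill and */
            s->credits[q] += QUANTUM;   /* demote to end of old list     */
            pop_front(list, len);
            push_back(s->old_list, &s->old_len, q);
        } else {
            return q;
        }
    }
}
```

Note that an exhausted queue is always demoted to the old list, never recycled through the new list; this is what gives sparse flows their implicit priority while keeping the scheme starvation-free.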
   If, instead, the scheduler _did_ get a packet back from the CoDel
   algorithm, it subtracts the size of the packet from the byte
   credits for the selected queue and returns the packet as the result
   of the dequeue operation.

   The step that moves an empty queue from the list of new queues to
   _the end of_ the list of old queues before it is removed is crucial
   to prevent starvation.  Otherwise, the queue could reappear (the
   next time a packet arrives for it) before the list of old queues is
   visited; this can go on indefinitely, even with a small number of
   active flows, if the flow providing packets to the queue in
   question transmits at just the right rate.  This is prevented by
   first moving the queue to _the end of_ the list of old queues,
   forcing a pass through that list, and thus preventing starvation.
   Moving it to the end of the list, rather than the front, is crucial
   for this to work.

   The resulting migration of queues between the different states is
   summarised in the following state diagram:

   +-----------------+                +------------------+
   |                 |     Empty      |                  |
   |      Empty      |<---------------+       Old        +----+
   |                 |                |                  |    |
   +-------+---------+                +------------------+    |
           |                            ^          ^          |Credits
           |Arrival                     |          |          |Exhausted
           v                            |          |          |
   +-----------------+                  |          |          |
   |                 |    Empty or      |          +----------+
   |       New       +------------------+
   |                 | Credits exhausted
   +-----------------+

   Figure 1: Partial state diagram for queues between different
   states.  Both the new and old queue states can additionally have
   arrival and dequeue events that do not change the state; these are
   omitted here.

5.  Implementation considerations

   This section contains implementation details for the FQ-CoDel
   algorithm.  This includes the data structures and parameters used
   in the Linux implementation, as well as discussion of some required
   features of the target platform and other considerations.

5.1.  Data structures

   The main data structure of FQ-CoDel is the array of queues, which
   is instantiated with the number of queues specified by the _flows_
   parameter at instantiation time.  Each queue consists simply of an
   ordered list of packets with FIFO semantics, two state variables
   tracking the queue credits and the total number of bytes enqueued,
   and the set of CoDel state variables.  Other state variables to
   track queue statistics can also be included: for instance, the
   Linux implementation keeps a count of dropped packets.

   In addition to the queue structures themselves, FQ-CoDel maintains
   two ordered lists containing references to the subset of queues
   that are currently active.  These are the list of 'new' queues and
   the list of 'old' queues, as explained in Section 4 above.

   In the Linux implementation, queue space is shared: there is a
   global limit on the number of packets the queues can hold, but not
   one per queue.

5.2.  Parameters

   The following are the user configuration parameters exposed by the
   Linux implementation of FQ-CoDel.

5.2.1.  Interval

   The _interval_ parameter has the same semantics as in CoDel and is
   used to ensure that the minimum sojourn time of packets in a queue,
   used as an estimator by the CoDel control algorithm, is a
   relatively up-to-date value.  That is, CoDel only reacts to delay
   experienced in the last epoch of length interval.  It SHOULD be set
   to be on the order of the worst-case RTT through the bottleneck, to
   give end-points sufficient time to react.

   The default interval value is 100 ms.

5.2.2.  Target

   The _target_ parameter has the same semantics as in CoDel.  It is
   the acceptable minimum standing/persistent queue delay for each
   FQ-CoDel queue.  This minimum delay is identified by tracking the
   local minimum queue delay that packets experience.
   The default target value is 5 ms, but this value should be tuned to
   be at least the transmission time of a single MTU-sized packet at
   the prevalent egress link speed (which for, e.g., 1 Mbps and MTU
   1500 is ~12 ms), to prevent CoDel from being too aggressive at low
   bandwidths.  It should otherwise be set to on the order of 5-10% of
   the configured interval.

5.2.3.  Packet limit

   Routers do not have infinite memory, so some packet limit MUST be
   enforced.

   The _limit_ parameter is the hard limit on the real queue size,
   measured in number of packets.  This limit is a global limit on the
   number of packets in all queues; each individual queue does not
   have an upper limit.  When the limit is reached and a new packet
   arrives for enqueue, a packet is dropped from the head of the
   largest queue (measured in bytes) to make room for the new packet.

   In Linux, the default packet limit is 10240 packets, which is
   suitable for up to 10 Gigabit Ethernet speeds.  In practice, the
   hard limit is rarely, if ever, hit, as drops are performed by the
   CoDel algorithm long before the limit is reached.  For platforms
   that are severely memory constrained, a lower limit can be used.

5.2.4.  Quantum

   The _quantum_ parameter is the number of bytes each queue gets to
   dequeue on each round of the scheduling algorithm.  The default is
   set to 1514 bytes, which corresponds to the Ethernet MTU plus the
   hardware header length of 14 bytes.

   In systems employing TCP Segmentation Offload (TSO), where a
   "packet" consists of an offloaded packet train, such a packet can
   presently be as large as 64K bytes.  In systems using Generic
   Receive Offload (GRO), packets can be up to 17 times the TCP
   maximum segment size (or 25K bytes).  These mega-packets severely
   impact FQ-CoDel's ability to schedule traffic, and hurt latency
   needlessly.  There is ongoing work in Linux to make smarter use of
   offload engines.

5.2.5.  Flows

   The _flows_ parameter sets the number of queues into which the
   incoming packets are classified.  Due to the stochastic nature of
   hashing, multiple flows may end up being hashed into the same slot.

   This parameter can be set only at initialisation time in the
   current implementation, since memory has to be allocated for the
   hash table.

   The default value is 1024 in the current Linux implementation.

5.2.6.  Explicit Congestion Notification (ECN)

   ECN is _enabled_ by default.  Rather than doing anything special
   with misbehaved ECN flows, FQ-CoDel relies on the packet scheduling
   system to minimise their impact; thus, the number of unresponsive
   packets in a flow being marked with ECN can grow to the overall
   packet limit, but will not otherwise affect the performance of the
   system.

   ECN can be disabled by specifying the _noecn_ parameter.

5.2.7.  CE threshold

   This parameter enables Data Center TCP (DCTCP)-like processing,
   resulting in CE (Congestion Experienced) marking on ECN-Capable
   Transport (ECT) packets [RFC3168] starting at a lower sojourn delay
   setpoint than the default CoDel target.  Details of DCTCP can be
   found in [I-D.ietf-tcpm-dctcp].

   The parameter, _ce_threshold_, is disabled by default and can be
   set to a number of microseconds to enable this behaviour.

5.3.  Probability of hash collisions

   Since the Linux FQ-CoDel implementation by default uses 1024 hash
   buckets, the probability that (say) 100 flows will all hash to the
   same bucket is something like ten to the power of minus 300.  Thus,
   at least one of the flows will almost certainly hash to some other
   queue.
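This can be made concrete with a back-of-the-envelope model: treat each of the other 99 flows as landing independently and uniformly in one of the 1024 buckets, so the number of flows sharing a given flow's bucket is binomially distributed with p = 1/1024.  The helpers below are illustrative, not code from any implementation.

```c
#include <assert.h>

/* Per-flow collision probabilities under ideal stochastic hashing:
 * binomial model with p = 1/NBUCKETS over the n other flows. */

#define NBUCKETS 1024.0

static double powi(double x, int n)         /* x^n for small integer n */
{
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r *= x;
    return r;
}

static double binom_pmf(int n, int k, double p)
{
    double c = 1.0;                         /* binomial coefficient C(n,k) */
    for (int i = 0; i < k; i++)
        c = c * (double)(n - i) / (double)(i + 1);
    return c * powi(p, k) * powi(1.0 - p, n - k);
}

/* Probability that a given flow shares its bucket with at most k of
 * the n other flows. */
static double p_collide_at_most(int n, int k, double p)
{
    double sum = 0.0;
    for (int i = 0; i <= k; i++)
        sum += binom_pmf(n, i, p);
    return sum;
}
```

For 100 flows this gives roughly 90.8% for a flow getting a bucket to itself, 99.6% for sharing with at most one other flow, and 99.99% for at most two others.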
   Expanding on this, based on analytical equations for hash collision
   probabilities: for 100 flows, the probability that a given flow
   will not collide with any of the others is 90.78%; the probability
   that no more than two of the 100 flows will be involved in any
   given collision is 99.57%; and the probability that no more than
   three of the 100 flows will be involved in any given collision is
   99.99%.  These probabilities assume a hypothetical perfect hashing
   function, so in practice they may be a bit lower.  We have not
   found this difference to matter in practice.

   These probabilities can be improved upon by using set-associative
   hashing, a technique used in the Cake algorithm currently being
   developed as a further development of the FQ-CoDel principles.  For
   a 4-way associative hash with the same total number of queues, the
   probability of no collisions for 100 flows is 99.93%, while for an
   8-way associative hash it is ~100%.

5.4.  Memory Overhead

   FQ-CoDel can be implemented with a low memory footprint (less than
   64 bytes per queue on 64-bit systems).  These are the data
   structures used in the Linux implementation:

   struct codel_vars {
           u32            count;            /* number of dropped packets */
           u32            lastcount;        /* count at entry to dropping state */
           bool           dropping;         /* currently dropping? */
           u16            rec_inv_sqrt;     /* reciprocal sqrt computation */
           codel_time_t   first_above_time; /* when delay above target */
           codel_time_t   drop_next;        /* next time to drop */
           codel_time_t   ldelay;           /* sojourn time of last dequeued packet */
   };

   struct fq_codel_flow {
           struct sk_buff    *head;
           struct sk_buff    *tail;
           struct list_head  flowchain;
           int               credits;  /* current number of queue credits */
           u32               dropped;  /* # of drops (or ECN marks) on flow */
           struct codel_vars cvars;
   };

   The master table managing all queues looks like this:

   struct fq_codel_sched_data {
           struct tcf_proto     *filter_list; /* optional external classifier */
           struct fq_codel_flow *flows;       /* flows table [flows_cnt] */
           u32                  *backlogs;    /* backlog table [flows_cnt] */
           u32                  flows_cnt;    /* number of flows */
           u32                  perturbation; /* hash perturbation */
           u32                  quantum;      /* psched_mtu(qdisc_dev(sch)); */
           struct codel_params  cparams;
           struct codel_stats   cstats;
           u32                  drop_overlimit;
           u32                  new_flow_count;

           struct list_head     new_flows;    /* list of new flows */
           struct list_head     old_flows;    /* list of old flows */
   };

5.5.  Per-Packet Timestamping

   The CoDel portion of the algorithm requires that per-packet
   timestamps be stored along with the packet.  While this approach
   works well for software-based routers, it may be impossible to
   retrofit devices that do most of their processing in silicon and
   lack the space or mechanism for timestamping.

   Also, while perfect resolution is not needed, timestamp resolution
   finer than the CoDel target setting is necessary.  Furthermore,
   timestamping functions in the core OS need to be efficient, as they
   are called at least once on each packet enqueue and dequeue.

5.6.  Limiting queueing in lower layers

   When deploying a queue management algorithm such as FQ-CoDel, it is
   important to ensure that the algorithm actually runs in the right
   place to control the queue.
In particular, lower layers of the 678 operating system networking stack can have queues of their own, as 679 can device drivers and hardware. Thus, it is desirable that the 680 queue management algorithm runs as close to the hardware as possible. 681 However, scheduling such complexity at interrupt time is difficult, 682 so a small standing queue between the algorithm and the wire is often 683 needed at higher transmit rates. 685 In Linux, the mechanism to ensure these different needs are balanced 686 is called "Byte Queue Limits" [BQL], which controls the device driver 687 ring buffer (for physical line rates). For cases where this 688 functionality is not available, the queue can be controlled by means 689 of a software rate limiter such as Hierarchical Token Bucket [HTB] or 690 Hierarchical Fair-Service Curve [HFSC]. The Cake algorithm [CAKE] 691 integrates a software rate limiter for this purpose. 693 Other issues with queues at lower layers are described in [CODEL]. 695 5.7. Other forms of "Fair Queueing" 697 Much of the scheduling portion of FQ-CoDel is derived from DRR [DRR] and is 698 substantially similar to DRR++ [DRRPP]. Versions based on Stochastic Fair 699 Queueing [SFQ] have also been produced and tested in ns2 [NS2]. Other 700 forms of Fair Queueing, such as Weighted Fair Queueing [WFQ] or Quick 701 Fair Queueing [QFQ], have not been thoroughly explored, but there's 702 no a priori reason why the round-robin scheduling of FQ-CoDel 703 couldn't be replaced with something else. 705 For a comprehensive discussion of fairness queueing algorithms and 706 their combination with AQM, see [I-D.ietf-aqm-fq-implementation]. 708 5.8. Differences between CoDel and FQ-CoDel behaviour 710 CoDel can be applied to a single-queue system as a straight AQM, 711 where it converges towards an "ideal" drop rate (i.e., one that 712 minimises delay while keeping a high link utilisation), and then 713 optimises around that control point.
715 The scheduling of FQ-CoDel mixes packets of competing flows, which 716 acts to pace bursty flows to better fill the pipe. Additionally, a 717 new flow gets substantial leeway over other flows until CoDel finds 718 an ideal drop rate for it. However, for a new flow that exceeds the 719 configured quantum, more time passes before all of its data is 720 delivered (as packets from it, too, are mixed across the other 721 existing queue-building flows). Thus, FQ-CoDel takes longer (as 722 measured in time) to converge towards an ideal drop rate for a given 723 new flow, but does so within fewer delivered _packets_ from that 724 flow. 726 Finally, the flow isolation FQ-CoDel provides means that the CoDel 727 drop mechanism operates on the flows actually building queues, which 728 results in packets being dropped more accurately from the largest 729 flows than CoDel alone manages. Additionally, flow isolation 730 radically improves the transient behaviour of the network when 731 traffic or link characteristics change (e.g., when new flows start up 732 or the link bandwidth changes); while CoDel itself can take a while 733 to respond, FQ-CoDel reacts almost immediately. 735 6. Limitations of flow queueing 737 While FQ-CoDel has been shown in many scenarios to offer significant 738 performance gains compared to alternative queue management 739 strategies, there are some scenarios where the scheduling algorithm 740 in particular is not a good fit. This section documents some of the 741 known cases which either may require tweaking the default behaviour, 742 or where alternatives to flow queueing should be considered. 744 6.1. Fairness between things other than flows 746 In some parts of the network, enforcing flow-level fairness may not 747 be desirable, or some other form of fairness may be more important. 748 An example of this can be an Internet Service Provider that may be 749 more interested in ensuring fairness between customers than between 750 flows. 
Or it can be a hosting or transit provider that wishes to ensure 751 fairness between connecting Autonomous Systems or networks. Another 752 issue can be that the number of simultaneous flows experienced at a 753 particular link can be too high for flow-based fairness queueing to 754 be effective. 756 Whatever the reason, in a scenario where fairness between flows is 757 not desirable, reconfiguring FQ-CoDel to match on a different 758 characteristic can be a way forward. The implementation in Linux can 759 leverage the packet matching mechanism of the _tc_ subsystem to use 760 any available packet field to partition packets into virtual queues, 761 matching, for instance, on address or subnet source/destination pairs, 762 application layer characteristics, etc. 764 Furthermore, as commonly deployed today, FQ-CoDel is used with three 765 or more tiers of service classification: priority, best effort and 766 background, based on diffserv markings. Some products do more 767 detailed classification, including deep packet inspection and 768 destination-specific filters, to achieve their desired result. 770 6.2. Flow bunching by opaque encapsulation 772 Where possible, FQ-CoDel will attempt to decapsulate packets before 773 matching on the header fields for the flow hashing. However, for 774 some encapsulation techniques, most notably encrypted VPNs, this is 775 not possible. If several flows are bunched into one such 776 encapsulated tunnel, they will be seen as one flow by the FQ-CoDel 777 algorithm. This means that they will share a queue and drop 778 behaviour, and so flows inside the encapsulation will not benefit 779 from the implicit prioritisation of FQ-CoDel, but will continue to 780 benefit from the reduced overall queue length from the CoDel 781 algorithm operating on the queue.
In addition, when such an 782 encapsulated bunch competes against other flows, it will count as one 783 flow and will not be assigned a share of the bandwidth based on how many 784 flows are inside the encapsulation. 786 Depending on the application, this may or may not be desirable 787 behaviour. In cases where it is not, changing FQ-CoDel's matching to 788 not be flow-based (as detailed in the previous subsection) can 789 be a mitigation. Going forward, having some mechanism for opaque 790 encapsulations to express to the outer layer which flow a packet 791 belongs to could be a way to mitigate this. Naturally, care needs 792 to be taken when designing such a mechanism to ensure no new privacy 793 and security issues are raised by exposing information from inside 794 the encapsulation to the outside world. Keeping the extra 795 information out-of-band and dropping it before it hits the network 796 could be one way to achieve this. 798 6.3. Low-priority congestion control algorithms 800 In the presence of queue management schemes that limit latency under 801 load, low-priority congestion control algorithms such as LEDBAT 802 [RFC6817] (or, in general, algorithms that try to voluntarily use up 803 less than their fair share of bandwidth) experience little added 804 latency when the link is congested. Thus, they lack the signal to 805 back off that added latency previously afforded them. This effect is 806 seen with FQ-CoDel as well as with any effective AQM [GONG2014]. 808 As such, these delay-based algorithms tend to revert to loss-based 809 congestion control and will consume the fair share of bandwidth 810 afforded to them by the FQ-CoDel scheduler. However, low-priority 811 congestion control mechanisms may be able to take steps to continue 812 to be low priority, for instance by taking into account the vastly 813 reduced level of delay afforded by an AQM, or by using a coupled 814 approach to observing the behaviour of multiple flows. 816 7.
Deployment status and future work 818 The FQ-CoDel algorithm as described in this document has been shipped 819 as part of the Linux kernel since version 3.5, released on the 21st 820 of July, 2012, with the ce_threshold parameter being added in version 4.2. The 821 algorithm has seen widespread testing in a variety of contexts and is 822 configured as the default queueing discipline in a number of mainline 823 Linux distributions (as of this writing at least OpenWRT, Arch Linux 824 and Fedora). We believe it to be a safe default and encourage people 825 running Linux to turn it on: it is a massive improvement over the 826 previous default FIFO queue. 828 Of course, there is always room for improvement, and this document has 829 listed some of the known limitations of the algorithm. As such, we 830 encourage further research into refining the algorithm and addressing 831 its limitations. One such effort is undertaken by the bufferbloat 832 community in the form of the Cake queue management scheme [CAKE]. In 833 addition to this, we believe the following (non-exhaustive) list of 834 issues to be worthy of further enquiry: 836 o Variations on the flow classification mechanism to fit different 837 notions of flows. For instance, an ISP might want to deploy per- 838 subscriber scheduling, while in other cases several flows can 839 share a 5-tuple, as exemplified by the RTCWEB QoS recommendations 840 [I-D.ietf-tsvwg-rtcweb-qos]. 842 o Interactions between flow queueing and delay-based congestion 843 control algorithms and scavenger protocols. 845 o Other scheduling mechanisms to replace the DRR portion of the 846 algorithm, e.g., QFQ or WFQ. 848 o Sensitivity of parameters, most notably the number of queues and 849 the CoDel parameters. 851 8. Security Considerations 853 There are no specific security exposures associated with FQ-CoDel 854 that are not also present in current FIFO systems.
On the contrary, 855 some vulnerabilities of FIFO systems are reduced with FQ-CoDel (e.g., 856 simple-minded packet floods). However, some care is needed in the 857 implementation to ensure this is the case. These caveats are included in the 858 description above; however, we reiterate them here: 860 o To prevent packets in the new queues from starving old queues, it 861 is important that when a queue on the list of new queues empties, 862 it is moved to _the end of_ the list of old queues. This is 863 described at the end of Section 4.2. 865 o To prevent an attacker from targeting a specific flow for a denial- 866 of-service attack, the hash that maps packets to queues should not be 867 predictable. To achieve this, FQ-CoDel salts the hash, as 868 described in the beginning of Section 4.1. The size of the salt 869 and the strength of the hash function are obviously a tradeoff 870 between performance and security. The Linux implementation uses a 871 32-bit random value as the salt and a Jenkins hash function [JENKINS]. This 872 makes it possible to achieve high throughput, and we consider it 873 sufficient to ward off the most obvious attacks. 875 o Packet fragments without a layer 4 header can be hashed into 876 different bins than the first fragment with the header intact. 877 This can cause reordering and/or adversely affect the performance 878 of the flow. Keeping state to match the fragments to the 879 beginning of the packet, or simply putting all packet fragments 880 (including the first fragment of each fragmented packet) into the 881 same queue, are two ways to alleviate this. 883 9. IANA Considerations 885 This document has no actions for IANA. 887 10. Acknowledgements 889 Our deepest thanks to Kathie Nichols, Van Jacobson, and all the 890 members of the bufferbloat.net effort for all the help on developing 891 and testing the algorithm. In addition, our thanks to Anil Agarwal 892 for his help with getting the hash collision probabilities in this 893 document right. 895 11.
References 897 11.1. Normative References 899 [I-D.ietf-aqm-codel] 900 Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar, 901 "Controlled Delay Active Queue Management", 902 draft-ietf-aqm-codel-03 (work in progress), March 2016. 904 [I-D.ietf-aqm-fq-implementation] 905 Baker, F. and R. Pan, "On Queuing, Marking, and Dropping", 906 draft-ietf-aqm-fq-implementation-05 (work in progress), 907 November 2015. 909 [I-D.ietf-tcpm-dctcp] 910 Bensley, S., Eggert, L., Thaler, D., Balasubramanian, P., 911 and G. Judd, "Datacenter TCP (DCTCP): TCP Congestion 912 Control for Datacenters", draft-ietf-tcpm-dctcp-01 (work 913 in progress), November 2015. 915 [I-D.ietf-tsvwg-rtcweb-qos] 916 Jones, P., Dhesikan, S., Jennings, C., and D. Druta, "DSCP 917 and other packet markings for WebRTC QoS", 918 draft-ietf-tsvwg-rtcweb-qos-14 (work in progress), March 2016. 920 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 921 DOI 10.17487/RFC2003, October 1996, 922 <http://www.rfc-editor.org/info/rfc2003>. 924 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 925 Requirement Levels", BCP 14, RFC 2119, 926 DOI 10.17487/RFC2119, March 1997, 927 <http://www.rfc-editor.org/info/rfc2119>. 929 [RFC2890] Dommety, G., "Key and Sequence Number Extensions to GRE", 930 RFC 2890, DOI 10.17487/RFC2890, September 2000, 931 <http://www.rfc-editor.org/info/rfc2890>. 933 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 934 of Explicit Congestion Notification (ECN) to IP", 935 RFC 3168, DOI 10.17487/RFC3168, September 2001, 936 <http://www.rfc-editor.org/info/rfc3168>. 938 [RFC4213] Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms 939 for IPv6 Hosts and Routers", RFC 4213, 940 DOI 10.17487/RFC4213, October 2005, 941 <http://www.rfc-editor.org/info/rfc4213>. 943 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 944 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 945 DOI 10.17487/RFC6817, December 2012, 946 <http://www.rfc-editor.org/info/rfc6817>. 948 11.2. Informative References 950 [BLOAT] Gettys, J., "Bufferbloat: Dark buffers in the Internet.", 951 in IEEE Internet Comput. 15, 3, 952 DOI http://dx.doi.org/10.1109/MIC.2011.56, 2011, 953 .
956 [BLOATWEB] 957 "Bufferbloat web site", . 959 [BQL] Herbert, T., "Network Byte Queue Limits", August 2011, 960 . 962 [CAKE] "Cake comprehensive queue management system", 963 . 965 [CODEL] Nichols, K. and V. Jacobson, "Controlling Queue Delay", 966 July 2012, . 968 [DRR] Shreedhar, M. and G. Varghese, "Efficient Fair Queueing 969 Using Deficit Round Robin", in IEEE/ACM Trans. Netw. 4, 3, 970 June 1996, 971 . 974 [DRRPP] MacGregor, M. and W. Shi, "Deficits for Bursty Latency- 975 critical Flows: DRR++", in Proceedings IEEE International 976 Conference on Networks 2000 (ICON 2000), 2000, 977 . 980 [GONG2014] 981 Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht, 982 "Fighting the bufferbloat: on the coexistence of AQM and 983 low priority congestion control", in 2013 IEEE Conference 984 on Computer Communications Workshops (INFOCOM WKSHPS), 985 July 2014, . 988 [HFSC] Stoica, I., Zhang, H., and T. Eugene, "Hierarchical fair- 989 service curve", in Sigcomm 1997 proceedings, 1997, 990 . 993 [HTB] "Hierarchical Token Bucket", 994 . 996 [JENKINS] Jenkins, B., "A Hash Function for Hash Table Lookup", 997 1996, . 999 [LINUXSRC] 1000 "Current FQ-CoDel Linux source code", . 1004 [NS2] "NS2 web site", . 1006 [NS3] "NS3 web site", . 1008 [QFQ] Checconi, F., Rizzo, L., and P. Valente, "QFQ: Efficient 1009 packet scheduling with tight guarantees", in IEEE/ACM 1010 Transactions on Networking (TON), 2013, 1011 . 1013 [SFQ] McKenney, P., "Stochastic fairness queueing", published as 1014 technical report, 2002, 1015 . 1019 [SQF] Bonald, T., Muscariello, L., and N. Ostallo, "On the 1020 impact of TCP and per-flow scheduling on Internet 1021 Performance", in IEEE/ACM transactions on Networking, 1022 April 2012, . 1025 [WFQ] Demers, A., Keshav, S., and S. Shenker, "Analysis and 1026 simulation of a fair queueing algorithm", in SIGCOMM 1027 Comput. Commun. Rev., September 1989, 1028 . 1030 Authors' Addresses 1032 Toke Hoeiland-Joergensen 1033 Karlstad University 1034 Dept. 
of Computer Science 1035 Karlstad 65188 1036 Sweden 1038 Email: toke.hoiland-jorgensen@kau.se 1039 Paul McKenney 1040 IBM Linux Technology Center 1041 1385 NW Amberglen Parkway 1042 Hillsboro, OR 97006 1043 USA 1045 Email: paulmck@linux.vnet.ibm.com 1046 URI: http://www2.rdrop.com/~paulmck/ 1048 Dave Taht 1049 Teklibre 1050 2104 W First street 1051 Apt 2002 1052 FT Myers, FL 33901 1053 USA 1055 Email: dave.taht@gmail.com 1056 URI: http://www.teklibre.com/ 1058 Jim Gettys 1059 21 Oak Knoll Road 1060 Carlisle, MA 993 1061 USA 1063 Email: jg@freedesktop.org 1064 URI: https://en.wikipedia.org/wiki/Jim_Gettys 1066 Eric Dumazet 1067 Google, Inc. 1068 1600 Amphitheater Pkwy 1069 Mountain View, CA 94043 1070 USA 1072 Email: edumazet@gmail.com