idnits 2.17.00 (12 Aug 2021)

/tmp/idnits26246/draft-stenberg-httpbis-tcp-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 31, 2016) is 2021 days in the past.  Is this
     intentional?

  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 456

  -- Looks like a reference, but probably isn't: '2' on line 458

  -- Obsolete informational reference (is this intentional?): RFC 896
     (Obsoleted by RFC 7805)

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2     httpbis                                                      D. Stenberg
3     Internet-Draft                                                   Mozilla
4     Intended status: Best Current Practice                                T.
Wicinski
5     Expires: May 4, 2017                                          Salesforce
6                                                             October 31, 2016

8                               TCP Tuning for HTTP
9                          draft-stenberg-httpbis-tcp-03

11    Abstract

13       This document records current best practice for using all versions of
14       HTTP over TCP.

16    Status of This Memo

18       This Internet-Draft is submitted in full conformance with the
19       provisions of BCP 78 and BCP 79.

21       Internet-Drafts are working documents of the Internet Engineering
22       Task Force (IETF).  Note that other groups may also distribute
23       working documents as Internet-Drafts.  The list of current Internet-
24       Drafts is at http://datatracker.ietf.org/drafts/current/.

26       Internet-Drafts are draft documents valid for a maximum of six months
27       and may be updated, replaced, or obsoleted by other documents at any
28       time.  It is inappropriate to use Internet-Drafts as reference
29       material or to cite them other than as "work in progress."

31       This Internet-Draft will expire on May 4, 2017.

33    Copyright Notice

35       Copyright (c) 2016 IETF Trust and the persons identified as the
36       document authors.  All rights reserved.

38       This document is subject to BCP 78 and the IETF Trust's Legal
39       Provisions Relating to IETF Documents
40       (http://trustee.ietf.org/license-info) in effect on the date of
41       publication of this document.  Please review these documents
42       carefully, as they describe your rights and restrictions with respect
43       to this document.  Code Components extracted from this document must
44       include Simplified BSD License text as described in Section 4.e of
45       the Trust Legal Provisions and are provided without warranty as
46       described in the Simplified BSD License.

48    Table of Contents

50       1.  Introduction
51         1.1.  Notational Conventions
52       2.  Socket planning
53         2.1.  Number of open files
54         2.2.  Number of concurrent network messages
55         2.3.
Number of incoming TCP SYNs allowed to backlog
56         2.4.  Use the whole port range for local ports
57         2.5.  Lower the TCP FIN timeout
58         2.6.  Reuse sockets in TIME_WAIT state
59         2.7.  TCP socket buffer sizes and Window Scaling
60         2.8.  Set maximum allowed TCP window sizes
61         2.9.  Timers and timeouts
62       3.  TCP handshake
63         3.1.  TCP Fast Open
64         3.2.  Initial Congestion Window
65         3.3.  TCP SYN flood handling
66       4.  TCP transfers
67         4.1.  Packet Pacing
68         4.2.  Explicit Congestion Control
69         4.3.  Nagle's Algorithm
70         4.4.  Delayed ACKs
71         4.5.  Keep-alive
72       5.  Re-using connections
73         5.1.  Slow Start after Idle
74         5.2.  TCP-Bound Authentications
75       6.  Closing connections
76         6.1.  Half-close
77         6.2.  Abort
78         6.3.  Close Idle Connections
79         6.4.  Tail Loss Probes
80       7.  IANA Considerations
81       8.  Security Considerations
82       9.  References
83         9.1.  Normative References
84         9.2.  Informative References
85         9.3.  URIs
86       Appendix A.  Acknowledgments
87       Appendix B.  Operating System Settings for Linux
88       Authors' Addresses

90    1.  Introduction

92       HTTP version 1.1 [RFC7230] as well as HTTP version 2 [RFC7540] are
93       defined to use TCP [RFC0793], and their performance can depend
94       greatly upon how TCP is configured.  This document records the best
95       current practice for using HTTP over TCP, with a focus on improving
96       end-user perceived performance.

98       These practices are generally applicable to HTTP/1 as well as HTTP/2,
99       although some may note particular impact or nuance regarding a
100      particular protocol version.

102      There are countless scenarios, roles and setups where HTTP is being
103      used, so there can be no single specific "Right Answer" to most TCP
104      questions.  This document intends only to cover the most important
105      areas of concern and suggest possible actions.

107   1.1.  Notational Conventions

109      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
110      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
111      document are to be interpreted as described in [RFC2119].

113   2.  Socket planning

115      Your HTTP server or intermediary may need configuration changes to
116      some system tunables and timeout periods to perform optimally.
117      Actual values will depend on how you are scaling the platform,
118      horizontally or vertically, and other connection semantics.  Changing
119      system limits and altering thresholds will change the behavior of
120      your web service and its dependencies.  These dependencies are
121      usually common to other services running on the same system, so good
122      planning and testing are advised.

124      This is a list of values to consider and some general advice on how
125      those values can be modified on Linux systems.

127   2.1.
Number of open files

129      A modern HTTP server will serve a large number of TCP connections and
130      in most systems each open socket equals an open file.  Make sure that
131      the open-file limit isn't a bottleneck.

133   2.2.  Number of concurrent network messages

135      Raise the number of packets allowed to get queued when a particular
136      interface receives packets faster than the kernel can process them.

138   2.3.  Number of incoming TCP SYNs allowed to backlog

140      This is the number of new connection requests that are allowed to
141      queue up in the kernel.  These can be connections that are in SYN
142      RECEIVED or ESTABLISHED states.  Historically, operating systems used
143      a single backlog queue for both of these states.  Newer
144      implementations use two separate queues: one for connections in SYN
145      RECEIVED and one for those in the ESTABLISHED state (better known as
         the accept queue).

147   2.4.  Use the whole port range for local ports

149      To make sure the TCP stack can take full advantage of the entire set
150      of possible sockets, give it a larger range of local port numbers to
151      use.

153   2.5.  Lower the TCP FIN timeout

155      High connection completion rates will consume ephemeral ports
156      quickly.  Lower the time during which connections are in FIN-WAIT-2/
157      TIME_WAIT states so that they can be purged faster and thus maintain
158      a maximal number of available sockets.  The primitives for the
159      assignment of these values were described in [RFC0793]; however,
160      significantly lower values are commonly used.

162   2.6.  Reuse sockets in TIME_WAIT state

164      When running backend servers on a managed, low latency network you
165      might allow the reuse of sockets in TIME_WAIT state for new
166      connections when a protocol-complete termination has occurred.  There
167      is no RFC that covers this behaviour.

169   2.7.
TCP socket buffer sizes and Window Scaling

171      Systems meant to handle and serve a huge number of TCP connections at
172      high speeds need a significant amount of memory for TCP socket
173      buffers.  On some systems you can tell the TCP stack what default
174      buffer sizes to use and how much they are allowed to dynamically grow
175      and shrink.  Window Scaling is typically linked to socket buffer
176      sizes.

178      The minimum and default tend to require less proactive amendment than
179      the maximum value.  When deriving maximum values for use, you should
180      consider the BDP (Bandwidth Delay Product) of the target environment
181      and clients.  Consider also that 'read' and 'write' values need not
182      be synchronised, as the BDP requirements for a load
183      balancer or middle-box might be very different when acting as a
184      sender or receiver.

186      Allowing needlessly high values beyond the expected limitations of
187      the platform might increase the probability of retransmissions and
188      buffer induced delays within the path.  Extensions such as ECN
189      coupled with AQM can help mitigate this undesirable behaviour
190      [RFC7141].

192      [RFC7323] covers Window Scaling in greater detail.

194   2.8.  Set maximum allowed TCP window sizes

196      You may have to increase the largest allowed window size.  Window
197      scaling must be accommodated within the maximal values; however, it
198      is not uncommon to see the maximum definable higher than the scalable
199      limit; these values can be statically defined within socket
200      parameters (SO_RCVBUF, SO_SNDBUF).

202   2.9.  Timers and timeouts

204      On a modern shared platform it can be common to plan for both long
205      and short lived connections on the same implementation.  However, the
206      delivery of static assets and a 'web push' or 'long poll' service
207      provide very different quality of service promises.

209      Fail 'fast': TCP resources can be highly contended.
For fault
210      tolerance reasons a server needs to be able to determine within a
211      reasonable time frame whether a connection is still active or
212      required.  For example, if static assets typically return in hundreds
213      of milliseconds and users 'switch off' after <10s, keeping timeouts
214      of >30s makes little sense; defining a 'quality of service'
215      appropriate to the target platform is encouraged.  On a shared
216      platform with mixed session lifetimes, applications that require
217      longer render times have various options to ensure the underlying
218      service and upstream servers in the path can identify the session as
219      not failed: HTTP continuations, redirects, 202s or sending data.

221      Clients and servers typically have many timeout options; a few
222      notable ones are: connect (client), time to request (server), time
223      to first byte (client), between bytes (server/client), and total
224      connection time (server/client).  Some implementations merge these
225      values into a single 'timeout' definition even when statistics are
226      reported individually.  All should be considered, as the defaults in
227      many implementations are highly undesirable; even infinite timeouts
228      have been observed.

230   3.  TCP handshake

232   3.1.  TCP Fast Open

234      TCP Fast Open (a.k.a. TFO, [RFC7413]) allows data to be sent in the
235      TCP handshake, thereby allowing a request to be sent without any
236      delay if a connection is not open.

238      TFO requires both client and server support, and additionally
239      requires application knowledge, because the data sent on the SYN
240      needs to be idempotent.  Therefore, TFO can only be used on
241      idempotent, safe HTTP methods (e.g., GET and HEAD), or with
242      intervening negotiation (e.g., using TLS).  Note that TFO requires a
243      secret to be defined on the server to mitigate the security
244      vulnerabilities it introduces.  TFO therefore requires more
245      server-side deployment planning than other enhancements.
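
   As a rough illustration of the two requirements above (server
   support and application knowledge), the following C sketch shows one
   way a Linux server might opt a listening socket in to TFO, together
   with a check that restricts TFO to the safe methods named above.
   This is a sketch under stated assumptions, not a definitive
   implementation: the helper names enable_tfo_listener() and
   method_is_tfo_safe() are hypothetical, the TCP_FASTOPEN socket option
   is Linux-specific and also depends on the net.ipv4.tcp_fastopen
   sysctl (which this document does not cover), and the queue length is
   an arbitrary choice.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the draft): ask the kernel to accept
 * TFO data on a listening socket.  qlen bounds the number of pending
 * TFO requests.  Returns 0 on success, or -1 on failure or when the
 * platform headers do not define TCP_FASTOPEN. */
static int enable_tfo_listener(int listen_fd, int qlen)
{
#ifdef TCP_FASTOPEN
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                      &qlen, sizeof(qlen));
#else
    (void)listen_fd;
    (void)qlen;
    return -1;
#endif
}

/* Hypothetical helper: returns 1 when the request method is one of the
 * idempotent, safe methods the text names (GET, HEAD), so its bytes may
 * be carried on the SYN; returns 0 otherwise. */
static int method_is_tfo_safe(const char *method)
{
    return strcmp(method, "GET") == 0 || strcmp(method, "HEAD") == 0;
}
```

   A server would call enable_tfo_listener() on its listening socket,
   and a client would consult method_is_tfo_safe() before placing
   request bytes in the SYN; anything else falls back to an ordinary
   handshake.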

247      Support for TFO is growing in client platforms, especially mobile,
248      due to the significant performance advantage it gives.

250   3.2.  Initial Congestion Window

252      [RFC6928] specifies an initcwnd (initial congestion window) of 10
253      segments, and is now fairly widely deployed server-side.  There has
254      been experimentation with larger initial windows, in combination with
255      packet pacing.  Many implementations allow initcwnd to be applied to
256      specific routes, which allows a greater degree of flexibility than
257      some other TCP parameters.

259      IW10 has been reported to perform fairly well even in high volume
260      servers.

262   3.3.  TCP SYN flood handling

264      TCP SYN Flood mitigations [RFC4987] are necessary and there will be
265      thresholds to tweak.

267   4.  TCP transfers

269   4.1.  Packet Pacing

271      TBD

273   4.2.  Explicit Congestion Control

275      Apple is deploying Explicit Congestion Notification (ECN) in iOS and
         OS X [1].

277   4.3.  Nagle's Algorithm

279      Nagle's Algorithm [RFC0896] is the mechanism that makes the TCP stack
280      hold (small) outgoing packets for a short period of time so that it
281      can potentially merge them with the next outgoing data.  It is
282      optimized for throughput at the expense of latency.

284      HTTP/2 in particular requires that the client can send a packet back
285      fast even during transfers that are perceived as single direction
286      transfers.  Even small delays in those sends can cause a significant
287      performance loss.

289      HTTP/1.1 is also affected, especially when sending off a full request
290      in a single write() system call.

292      In POSIX systems you switch it off like this:

294      int one = 1;
295      setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

297   4.4.  Delayed ACKs

299      Delayed ACK [RFC1122] is a mechanism enabled in most TCP stacks that
300      causes the stack to delay sending acknowledgement packets in response
301      to data.
The ACK is delayed up until a certain threshold, or until
302      the peer has some data to send, in which case the ACK will be sent
303      along with that data.  Depending on the traffic flow and TCP stack,
304      this delay can be as long as 500ms.

306      This interacts poorly with peers that have Nagle's Algorithm enabled.
307      Because Nagle's Algorithm delays sending until either one MSS of data
308      is provided _or_ until an ACK is received for all sent data, delaying
309      ACKs can force Nagle's Algorithm to buffer packets when it doesn't
310      need to (that is, when the other peer has already processed the
311      outstanding data).

313      Delayed ACKs can be useful in situations where it is reasonable to
314      assume that a data packet will almost immediately (within 500ms)
315      cause data to be sent in the other direction.  In general, in both
316      HTTP/1.1 and HTTP/2 this is unlikely; therefore, disabling Delayed
317      ACKs can provide an improvement in latency.

319      However, the TLS handshake is a clear exception to this case.  For
320      the duration of the TLS handshake it is likely to be useful to keep
321      Delayed ACKs enabled.

323      Additionally, for low-latency servers that can guarantee responses to
324      requests within 500ms, on long-running connections (such as HTTP/2),
325      and when requests are small enough to fit within a small packet,
326      leaving delayed ACKs turned on may provide minor performance
327      benefits.

329      Effective use of switching off delayed ACKs requires extensive
330      profiling.

332   4.5.  Keep-alive

334      TCP keep-alive is likely to be disabled, at least on mobile clients,
335      for energy-saving purposes.  Application-level keep-alive is then
336      required for long-lived requests to detect failed peers or
337      connections reset by stateful firewalls etc.

339   5.  Re-using connections

341   5.1.  Slow Start after Idle

343      Slow-start is one of the algorithms that TCP uses to control
344      congestion inside the network.  It is also known as the exponential
345      growth phase.
Each TCP connection will start off in slow-start but
346      will also go back to slow-start after a certain amount of idle time.

348   5.2.  TCP-Bound Authentications

350      There are several HTTP authentication mechanisms in use today that
351      are used or can be used to authenticate a connection rather than a
352      single HTTP request.  Two popular ones are NTLM and Negotiate.

354      If such an authentication has been negotiated on a TCP connection,
355      that connection can remain authenticated throughout the rest of its
356      lifetime.  This discrepancy with how other HTTP authentications work
357      makes it important to handle these connections with care.

359   6.  Closing connections

361   6.1.  Half-close

363      The client or server is free to half-close after a request or
364      response has been completed, or when there is no pending stream in
365      HTTP/2.

367      Half-closing is sometimes the only way for a server to make sure it
368      closes down connections cleanly so that it doesn't accept more
369      requests while still allowing clients to receive the ongoing
370      responses.

372   6.2.  Abort

374      No client abort for HTTP/1.1 after the request body has been sent.
375      A delayed full close is expected following an error response, to
376      avoid an RST on the client.

378   6.3.  Close Idle Connections

380      Keeping open connections around for subsequent connection reuse is
381      key for many HTTP clients' performance.  The value of an existing
382      connection degrades quickly, and after only a few minutes the chance
383      that a connection will successfully get reused by a web browser is
384      slim.

386   6.4.  Tail Loss Probes

388      draft [2]

390   7.  IANA Considerations

392      This document does not require action from IANA.

394   8.  Security Considerations

396      TBD

398   9.  References

400   9.1.  Normative References

402      [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
403                 RFC 793, DOI 10.17487/RFC0793, September 1981.

406      [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
407                 Requirement Levels", BCP 14, RFC 2119,
408                 DOI 10.17487/RFC2119, March 1997.

411      [RFC7230]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
412                 Protocol (HTTP/1.1): Message Syntax and Routing",
413                 RFC 7230, DOI 10.17487/RFC7230, June 2014.

416      [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
417                 Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
418                 DOI 10.17487/RFC7540, May 2015.

421   9.2.  Informative References

423      [RFC0896]  Nagle, J., "Congestion Control in IP/TCP Internetworks",
424                 RFC 896, DOI 10.17487/RFC0896, January 1984.

427      [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
428                 Communication Layers", STD 3, RFC 1122,
429                 DOI 10.17487/RFC1122, October 1989.

432      [RFC4987]  Eddy, W., "TCP SYN Flooding Attacks and Common
433                 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007.

436      [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
437                 "Increasing TCP's Initial Window", RFC 6928,
438                 DOI 10.17487/RFC6928, April 2013.

441      [RFC7141]  Briscoe, B. and J. Manner, "Byte and Packet Congestion
442                 Notification", BCP 41, RFC 7141, DOI 10.17487/RFC7141,
443                 February 2014.

445      [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
446                 Scheffenegger, Ed., "TCP Extensions for High Performance",
447                 RFC 7323, DOI 10.17487/RFC7323, September 2014.

450      [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
451                 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

454   9.3.  URIs

456      [1] https://developer.apple.com/videos/wwdc/2015/?id=719

458      [2] http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

460   Appendix A.  Acknowledgments

462      This specification builds upon previous work and help from Mark
463      Nottingham and Craig Taylor.

465   Appendix B.
Operating System Settings for Linux

467      Here are some sample operating system settings for the Linux
468      operating system, along with the section each one refers to.

470      Section 2.1

472      fs.file-max =

474      Section 2.2

476      net.core.netdev_max_backlog =

478      Section 2.3
479      net.core.somaxconn =

481      Section 2.4

483      net.ipv4.ip_local_port_range = 1024 65535

485      Section 2.5

487      net.ipv4.tcp_fin_timeout =

489      Section 2.6

491      net.ipv4.tcp_tw_reuse = 1

493      Section 2.7

495      net.ipv4.tcp_wmem =

497      Section 2.7

499      net.ipv4.tcp_rmem =

501      Section 2.8

503      net.core.rmem_max =

505      Section 2.8

507      net.core.wmem_max =

509      Section 5.1

511      net.ipv4.tcp_slow_start_after_idle = 0

513      Section 4.3 Turning off Nagle's Algorithm:

515      int one = 1;
516      setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

518      Section 4.4

520      On recent Linux kernels (since Linux 2.4.4), Delayed ACKs can be
521      disabled like this:

523      int one = 1;
524      setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
525      Unlike disabling Nagle's Algorithm, disabling Delayed ACKs on Linux
526      is not a one-time operation: processing within the TCP stack can
527      cause Delayed ACKs to be re-enabled.  As a result, using
528      "TCP_QUICKACK" effectively requires setting and unsetting the socket
529      option during the life of the connection.

531   Authors' Addresses

533      Daniel Stenberg
534      Mozilla

536      Email: daniel@haxx.se
537      URI:   http://daniel.haxx.se

539      Tim Wicinski
540      Salesforce

542      Email: tjw.ietf@gmail.com