Network Working Group                                            R. Miao
Internet-Draft                                                    H. Liu
Intended status: Experimental                              Alibaba Group
Expires: October 24, 2021                                         R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                               Mellanox Technologies, Inc.
                                                          April 22, 2021


           HPCC++: Enhanced High Precision Congestion Control
                      draft-miao-iccrg-hpccplus-00

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent limitations
   for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion and
   overreaction to the congestion signaling using inband and granular
   telemetry, HPCC++ can quickly converge to utilize all the available
   bandwidth while avoiding congestion, and can maintain near-zero in-
   network queues for ultra-low latency.  HPCC++ is also fair and easy
   to deploy in hardware, implementable with commodity NICs and
   switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 24, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   4
   3.  System Overview . . . . . . . . . . . . . . . . . . . . . . .   4
   4.  HPCC++ Algorithm  . . . . . . . . . . . . . . . . . . . . . .   5
     4.1.  Notations . . . . . . . . . . . . . . . . . . . . . . . .   5
     4.2.  Design Functions and Procedures . . . . . . . . . . . . .   6
   5.  Configuration Parameters  . . . . . . . . . . . . . . . . . .   8
   6.  Design Enhancement and Implementation . . . . . . . . . . . .   8
     6.1.  HPCC++ Guidelines . . . . . . . . . . . . . . . . . . . .   9
     6.2.  Receiver-based HPCC . . . . . . . . . . . . . . . . . . .   9
   7.  Reference Implementations . . . . . . . . . . . . . . . . . .  10
     7.1.  Inband telemetry padding at the network elements  . . . .  10
     7.2.  Congestion control at NICs  . . . . . . . . . . . . . . .  10
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
   9.  Discussion  . . . . . . . . . . . . . . . . . . . . . . . . .  12
     9.1.  Internet Deployment . . . . . . . . . . . . . . . . . . .  12
     9.2.  Switch-assisted congestion control  . . . . . . . . . . .  12
     9.3.  Work with transport protocols . . . . . . . . . . . . . .  13
     9.4.  Work with QoS queuing . . . . . . . . . . . . . . . . . .  13
   10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  14
   11. Contributors  . . . . . . . . . . . . . . . . . . . . . . . .  14
   12. References  . . . . . . . . . . . . . . . . . . . . . . . . .  14
     12.1.  Normative References . . . . . . . . . . . . . . . . . .  14
     12.2.  Informative References . . . . . . . . . . . . . . . . .  14
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-low
   latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory access)
   often use hardware-offloading solutions.  In some cases, the RDMA
   networks still face fundamental challenges to reconcile low latency,
   high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.  The key idea behind HPCC++ is to leverage the
   precise link load information signaled through inband telemetry to
   compute accurate flow rate updates.  Unlike existing approaches that
   often require a large number of iterations to find the proper flow
   rates, HPCC++ requires only one rate update step in most cases.
   Using precise information from inband telemetry enables HPCC++ to
   address the limitations in current congestion control schemes.
   First, HPCC++ senders can quickly ramp up flow rates for high
   utilization and ramp down flow rates for congestion avoidance.
   Second, HPCC++ senders can quickly adjust the flow rates to keep each
   link's output rate slightly lower than the link's capacity,
   preventing queues from building up while preserving high link
   utilization.  Finally, since sending rates are computed precisely
   based on direct measurements at switches, HPCC++ requires merely
   three independent parameters that are used to tune fairness and
   efficiency.

   The base form of HPCC++ is the original HPCC algorithm, whose full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes into
   account system constraints, aims to reduce the design overhead, and
   further improves the performance.  Section 6 describes these
   proposed design enhancements and guidelines in detail.

   This document describes the architecture changes in switches and
   end-hosts to support the needed transmission of inband telemetry and
   its consumption, which improves the efficiency in handling network
   congestion.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.
   When the receiver gets the packet, it may copy all the inband
   telemetry recorded from the network into the ACK message it sends
   back to the sender, and then the sender decides how to adjust its
   flow rate each time it receives an ACK with network load
   information.  Alternatively, the receiver may calculate the flow
   rate based on the inband telemetry information and feed the
   calculated rate back to the sender.  The notification packets would
   include delayed ACK information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the feedback reports
   (notification packets/ACKs) traverse.  Those network nodes are not
   shown in the figure for the sake of brevity.

    +---------+  pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
    |  Data   |------->|       |-------->|       |-------->|   Data   |
    | Sender  |========|Switch1|=========|Switch2|=========| Receiver |
    +---------+ Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
        /|\                                                      |
         |                                                       |
         +-------------------------------------------------------+
                          Notification Packets/ACKs

              Figure 1: System Overview (tlm=inband telemetry)

   o  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based congestion control scheme that controls the
      number of inflight bytes.  The inflight bytes mean the amount of
      data that has been sent but not yet acknowledged to the sender.
      Controlling inflight bytes has an important advantage compared to
      controlling rates.  In the absence of congestion, the inflight
      bytes and rate are interchangeable via the equation inflight =
      rate * T, where T is the base propagation RTT.  The rate can be
      calculated locally or obtained from the notification packet.  The
      sender may further use a data pacing mechanism, potentially
      implemented in hardware, to limit the rate accordingly.

   o  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.
      The inband telemetry information reports the current load of the
      packet's egress port, including timestamp (ts), queue length
      (qLen), transmitted bytes (txBytes), and link bandwidth capacity
      (B).  In addition, the inband telemetry contains switch_ID and
      port_ID to identify a link.

   o  Data receiver: responsible for either reflecting back the inband
      telemetry information in the data packet or calculating the
      proper flow rate based on the network congestion information in
      the inband telemetry, and sending notification packets back to
      the sender.

4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the core
   congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used in
   the HPCC++ algorithm.  Figure 3 also includes the default values for
   the algorithm parameters, chosen either to represent a typical
   setting in practical applications or based on theoretical and
   simulation studies.
   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for link j                            |
   | I_j          | Estimated inflight bytes for link j             |
   | U_j          | Normalized inflight bytes for link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                       Figure 2: List of variables.

   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

     Figure 3: List of algorithm parameters and their default values.

4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:              ack.L[i].txBytes - L[i].txBytes
         txRate = --------------------------------- ;
                       ack.L[i].ts - L[i].ts
    5:        min(ack.L[i].qlen, L[i].qlen)     txRate
         u' = ----------------------------- + ------------ ;
                     ack.L[i].B*T              ack.L[i].B
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:          Wc
         W = ------- + W_ai;
             U/eta
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W;

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Lines 14-15
   and 18-19).  The sender also remembers the pacing rate and current
   inband telemetry information at Line 27.  The sender computes a new
   window size W at Line 23 or Line 26, depending on whether to update
   Wc, with the functions MeasureInflight and ComputeWind.  The
   function MeasureInflight estimates normalized inflight bytes with
   Eqn (2) at Line 5.
   First, it computes the txRate of each link from the current and last
   accumulated transferred bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to 7 selects max_i(U_i)
   in Eqn (3).  Instead of directly using max_i(U_i), we use an EWMA
   (Exponentially Weighted Moving Average) to filter out the noise from
   timer inaccuracy and transient queues (Line 9).  The function
   ComputeWind combines multiplicative increase/decrease (MI/MD) and
   additive increase (AI) to balance the reaction speed and fairness.
   If a sender finds it should increase the window size, it first tries
   AI for maxStage times with the step W_ai (Line 17).  If it still
   finds room to increase after maxStage times of AI, or the normalized
   inflight bytes is above eta, it calls Eqn (4) once to quickly ramp
   up or ramp down the window size (Lines 12-13).

5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals), so we set it to 95% by default, which only
   loses 5% bandwidth but achieves almost zero queue.  maxStage
   controls a simple tradeoff between steady-state stability and the
   speed to reclaim free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high-bandwidth networks.  W_ai controls the
   tradeoff between the maximum number of concurrent flows on a link
   that can sustain near-zero queues and the speed of convergence to
   fairness.  Note that none of the three parameters are
   reliability-critical.
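   The procedures of Section 4.2 and the three parameters above can be
   exercised together in a short Python sketch.  The sketch below
   mirrors MeasureInflight, ComputeWind, and NewAck line for line; the
   LinkState container, the class name, and the unit choices are
   illustrative assumptions for this document, not part of any
   specification:

```python
from dataclasses import dataclass

# Illustrative sketch of the HPCC++ sender algorithm (Section 4.2).
# Variable names follow Figures 2 and 3; the LinkState container and
# the choice of bytes/seconds as units are assumptions for clarity.

@dataclass
class LinkState:
    qlen: float     # queue length (bytes)
    txBytes: float  # accumulated transmitted bytes
    ts: float       # timestamp (seconds)
    B: float        # link bandwidth capacity (bytes/second)

class HpccSender:
    def __init__(self, W_init, T, eta=0.95, maxStage=5, W_ai=0.0):
        self.W = self.Wc = W_init   # window and reference window
        self.T = T                  # known baseline RTT
        self.eta = eta              # target link utilization
        self.maxStage = maxStage    # max additive-increase stages
        self.W_ai = W_ai            # additive increase amount
        self.incStage = 0
        self.U = 0.0                # EWMA of normalized inflight bytes
        self.L = None               # last telemetry snapshot per link
        self.lastUpdateSeq = 0
        self.snd_nxt = 0            # next sequence number to send

    def measure_inflight(self, ackL):            # Lines 1-10
        u, tau = 0.0, self.T
        for cur, last in zip(ackL, self.L):
            txRate = (cur.txBytes - last.txBytes) / (cur.ts - last.ts)
            u_i = (min(cur.qlen, last.qlen) / (cur.B * self.T)
                   + txRate / cur.B)
            if u_i > u:                          # select max_i(U_i)
                u, tau = u_i, cur.ts - last.ts
        tau = min(tau, self.T)
        self.U = (1 - tau / self.T) * self.U + (tau / self.T) * u
        return self.U

    def compute_wind(self, U, update_wc):        # Lines 11-20
        if U >= self.eta or self.incStage >= self.maxStage:
            W = self.Wc / (U / self.eta) + self.W_ai   # MI/MD step
            if update_wc:
                self.incStage, self.Wc = 0, W
        else:
            W = self.Wc + self.W_ai                    # AI step
            if update_wc:
                self.incStage += 1
                self.Wc = W
        return W

    def new_ack(self, ack_seq, ackL):            # Lines 21-27
        update = ack_seq > self.lastUpdateSeq
        self.W = self.compute_wind(self.measure_inflight(ackL), update)
        if update:
            self.lastUpdateSeq = self.snd_nxt
        self.L = ackL           # remember telemetry for the next round
        return self.W / self.T  # pacing rate R = W/T
```

   For example, with a single bottleneck reporting twice the target
   load (U = 2), one ACK multiplicatively drops the window to
   Wc*eta/2, illustrating the single-step reaction described above;
   an underloaded link instead triggers the W_ai additive step.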
   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the workload
   of datacenter applications, where flows are usually short and
   latency-sensitive.  Normally we set a very small W_ai to support a
   large number of concurrent flows on a link, because slower fairness
   is not critical.  A rule of thumb is to set W_ai = W_init*(1-eta)/N,
   where N is the expected or receiver-reported maximum number of
   concurrent flows on a link.  The intuition is that the total
   additive increase every round (N*W_ai) should not exceed the
   bandwidth headroom, and thus no queue forms.  Even if the actual
   number of concurrent flows on a link exceeds N, the CC is still
   stable and achieves full utilization, but just cannot maintain zero
   queues.

6.  Design Enhancement and Implementation

   The basic design of HPCC++, i.e., HPCC, as described above is to add
   inband telemetry information into every data packet, so as to react
   to congestion as soon as the very first packet observes the network
   congestion.  This is especially helpful to reduce the risk of severe
   congestion in incast scenarios within the first round-trip time.  In
   addition, the original HPCC algorithm introduces Wc for the purpose
   of solving the over-reaction issue arising from this per-packet
   response.

   Alternatively, the inband telemetry information need not be added to
   every data packet, in order to reduce the overhead.  Switches can
   attach inband telemetry less frequently, e.g., once per RTT or upon
   congestion occurrence.

6.1.  HPCC++ Guidelines

   To ensure network stability, HPCC++ establishes a few guidelines for
   different implementations:

   o  The algorithm should commit the window/rate update at most once
      per round-trip time, similar to the procedure of updating Wc.

   o  To support different workloads and to properly set W_ai, HPCC++
      allows the option to incorporate mechanisms to speed up the
      fairness convergence.

   o  The switch should capture inband telemetry information that
      includes the link load (txBytes, qlen, ts) and the link spec
      (switch_ID, port_ID, B) at the egress port.  Note that each
      switch should record all of this information in a single snapshot
      to achieve a precise link load estimate.

   o  HPCC++ can use a probe packet to query the inband telemetry
      information.  In that case, the probe packets should take the
      same routing path and QoS queueing as the data packets.

   As long as the above guidelines are met, this document does not
   mandate a particular inband telemetry header format or
   encapsulation, which are orthogonal to the HPCC++ algorithm
   described in this document.  The algorithm can be implemented with a
   choice of inband telemetry protocols, such as in-band network
   telemetry [P4-INT], IOAM [I-D.ietf-ippm-ioam-data], IFA
   [I-D.ietf-kumar-ippm-ifa], and others.  In fact, the emerging inband
   telemetry protocols can inform the evolution of a broader range of
   protocols and network functions; this document leverages that trend
   to propose the architecture changes needed to support the HPCC++
   algorithm.

6.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If ACK packets already exist
   for reliability purposes, the inband telemetry information can be
   echoed back to the sender via ACK self-clocking.  Not all ACK
   packets need to carry the inband telemetry information.
   To reduce the Packet Per Second (PPS) overhead, the receiver may
   examine the inband telemetry information and adopt the technique of
   delayed ACKs that only sends out an ACK for every few received
   packets.  To reduce the PPS even further, one may implement the
   algorithm at the receiver and feed back the calculated window in the
   ACK packet once every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The receiver
   performs the same functions except using int.L instead of ack.L.
   The new function NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W)
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out the ACK packet, which includes the W
   information.  Otherwise, it just updates the W information locally.
   This reduces the amount of traffic that needs to be fed back to the
   data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   This improvement allows flows to quickly converge to fairness
   without causing large swings under heavy load.

7.  Reference Implementations

   A prototype of HPCC++ is implemented in NICs to realize the
   congestion control algorithm and in switches to realize the inband
   telemetry feature.

7.1.  Inband telemetry padding at the network elements

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  HPCC++ is open to a variety of inband
   telemetry format standards.  Inside a data center, the path length
   is often no more than 5 hops.  The overhead of the inband telemetry
   padding for HPCC++ is therefore considered to be low.

7.2.  Congestion control at NICs

   Figure 4 shows the HPCC++ implementation on a NIC.  The NIC provides
   an HPCC++ module that resides on the data path of the NIC; HPCC++
   modules realize both the sender and receiver roles.

 +------------------------------------------------------------------+
 | +---------+ window update +-----------+  PktSend  +-----------+  |
 | |         |-------------->| Scheduler |---------->|Tx pipeline|--+->
 | |         |  rate update  +-----------+           +-----------+  |
 | | HPCC++  |                              ^                       |
 | |         |            inband telemetry  |                       |
 | | module  |                              |                       |
 | |         |                        +-----+-----+                 |
 | |         |<-----------------------|Rx pipeline|<----------------+--
 | +---------+ telemetry response     +-----------+                 |
 |                  event                                           |
 +------------------------------------------------------------------+

                Figure 4: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC congestion control algorithm on the
   sender side for every flow in the NIC.  A flow can be defined by
   transport parameters including the 5-tuple, destination QP (queue
   pair), etc.  The module receives inband telemetry response events
   per flow, which are generated by the RX pipeline, adjusts the
   sending window and rate, and updates the scheduler with the rate and
   window of the flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value it got from the algorithm.  It also maintains
   the current sending window size for active flows.  If the pacing
   mechanism and the flow's sending window permit, the scheduler
   invokes a PktSend command to the TX pipeline for the flow.
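   The gating rule just described -- release a packet only when both
   the pacer and the sending window allow it -- can be sketched as
   follows.  The class and field names are hypothetical, chosen for
   illustration rather than taken from any NIC's actual interfaces:

```python
# Sketch of the per-flow scheduler gating described above.  A packet
# is released to the TX pipeline only if both the pacing mechanism and
# the flow's sending window permit.  All names are illustrative.

class FlowScheduler:
    def __init__(self, rate, window):
        self.rate = rate        # pacing rate from the HPCC++ module (bytes/s)
        self.window = window    # sending window W (bytes)
        self.inflight = 0.0     # bytes sent but not yet acknowledged
        self.next_send = 0.0    # earliest time the pacer allows a send

    def update(self, rate, window):
        # Called by the HPCC++ module on each window/rate update.
        self.rate, self.window = rate, window

    def try_send(self, now, pkt_bytes):
        # Gate on pacing first, then on the sending window.
        if now < self.next_send:
            return False        # pacer not ready yet
        if self.inflight + pkt_bytes > self.window:
            return False        # sending window is full
        self.inflight += pkt_bytes
        self.next_send = now + pkt_bytes / self.rate  # pace next packet
        return True             # caller issues PktSend to the TX pipeline

    def on_ack(self, acked_bytes):
        # ACK arrivals shrink the inflight count, reopening the window.
        self.inflight = max(0.0, self.inflight - acked_bytes)
```

   The point of the sketch is that the two checks are independent: a
   flow can be blocked by pacing while its window has room, or blocked
   by the window while the pacer is idle, matching the "pacing
   mechanism and sending window permit" condition above.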
   The TX pipeline implements packet processing.  Once it receives a
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response
   probe packet.

   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data packet
   containing inband telemetry, or a dedicated telemetry request probe
   packet.  The TX pipeline may process and edit the telemetry data,
   and then sends the data back to the sender using either an ACK
   packet of the flow or a dedicated telemetry response packet.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Discussion

9.1.  Internet Deployment

   Although the discussion above mainly focuses on the data center
   environment, HPCC++ can be adopted on the Internet at large.  There
   are several security considerations one should be aware of.

   Privacy concerns may arise when the telemetry information is
   conveyed across Autonomous Systems (ASes) and back to end-users.
   The link load information captured in telemetry can potentially
   reveal the provider's network capacity, route utilization,
   scheduling policy, etc.  These are usually considered to be
   sensitive data of the network providers.  Hence, certain actions
   may be taken to anonymize the telemetry data and only convey the
   relative ratio in rate adaptation across ASes without revealing the
   actual network load.

   Another consideration is the security of the received telemetry
   information.  The rate adaptation mechanism in HPCC++ relies on
   feedback from the network.  As such, it is vulnerable to attacks
   where feedback messages are hijacked, replaced, or intentionally
   injected with misleading information resulting in denial of
   service, similar to those that can affect TCP.  It is therefore
   RECOMMENDED that the notification feedback message is at least
   integrity checked.  In addition,
   [I-D.ietf-avtcore-cc-feedback-message] discusses the potential risk
   of a receiver providing misleading congestion feedback information
   and the mechanisms for mitigating such risks.

9.2.  Switch-assisted congestion control

   HPCC++ falls in the general category of switch-assisted congestion
   control.  However, HPCC++ includes a few unique design choices that
   are different from other switch-assisted approaches.

   o  First, HPCC++ implements a primal-mode algorithm that requires
      only the "write-to-packet" operation from switches, which has
      already been supported by telemetry protocols like INT [P4-INT]
      or IOAM [I-D.ietf-ippm-ioam-data].  Please note that this is
      very different from dual-mode algorithms such as XCP
      [Katabi-SIGCOMM2002] and RCP [Dukkipati-RCP], where switches
      take an active role in determining flows' rates.

   o  Second, HPCC++ achieves fast utilization convergence by
      decoupling it from fairness convergence, which is inspired by
      XCP.
   o  Third, HPCC++ enables switch-guided multiplicative increase (MI)
      by defining the "inflight bytes" to quantify the link load.  The
      inflight bytes tell both the underload and the overload of the
      link precisely, and thus allow the flow to increase/decrease the
      rate multiplicatively and safely.  By contrast, traditional
      approaches of using the queue length or RTT as the feedback
      cannot guide the rate increase and instead have to rely on
      additive increase (AI) with heuristics.  As the link speed
      continues to grow, this becomes increasingly slow in reclaiming
      the unused bandwidth.  Besides, queue-based feedback mechanisms
      are subject to latency inflation.

   o  Last, HPCC++ uses the TX rate instead of the RX rate used by XCP
      and RCP.  As detailed in [SIGCOMM-HPCC], we view the TX rate as
      more precise, because the RX rate and queue length overlap and
      thus cause oscillation.

9.3.  Work with transport protocols

   HPCC++ can be adopted as the CC algorithm by a wide range of
   transport protocols such as TCP and UDP, as well as others that may
   run on top of them, such as iWARP, RoCE, etc.  It requires a window
   limit and congestion feedback through ACK self-clocking, which
   naturally conforms to the paradigm of TCP design.  With that,
   HPCC++ introduces a scheme to measure the total inflight bytes for
   more precise congestion control.  To run over UDP, some
   modifications need to be made to enforce the window limit and
   collect congestion feedback via probing packets, which is an
   incremental change.

9.4.  Work with QoS queuing

   Under the use of QoS (Quality of Service) priority queuing in
   switches, the length of a flow's own queue cannot tell the actual
   queuing time and the exact extent of congestion.
   Although general approaches for running congestion control with QoS
   queuing are out of the scope of this document, we provide a few
   hints for running HPCC++ in harmony with QoS queuing.  In this
   case, HPCC++ can leverage the packet sojourn time (the egress
   timestamp minus the ingress timestamp) instead of the queue length
   to quantify the packet's actual queuing delay.  In addition,
   operators typically use Deficit Weighted Round Robin (DWRR) instead
   of strict priority (SP) as their QoS scheduling to prevent traffic
   starvation.  DWRR provides a minimum bandwidth guarantee for each
   queue, so that HPCC++ can leverage it for precise rate updates to
   avoid congestion.

10.  Acknowledgments

   The authors would like to thank ICCRG members for their valuable
   review comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation
   and evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

12.2.  Informative References

   [Dukkipati-RCP]
              Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              control to make flows complete quickly", Stanford
              University, 2008.

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP
              Control Protocol (RTCP) Feedback for Congestion Control",
              draft-ietf-avtcore-cc-feedback-message-09 (work in
              progress), November 2020.
   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", February 2019.

   [Katabi-SIGCOMM2002]
              Katabi, D., Handley, M., and C. Rohrs, "Congestion
              Control for High Bandwidth-Delay Product Networks", ACM
              SIGCOMM, Pittsburgh, Pennsylvania, USA, October 2002.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane
              Specification, v2.0", February 2020.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn,
              M., Liron, Y., Padhye, J., Raindel, S., Yahia, M., and
              M. Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom,
              August 2015.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   USA

   Email: miao.rui@alibaba-inc.com


   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   USA

   Email: hongqiang.liu@alibaba-inc.com


   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   USA

   Email: rong.pan@intel.com


   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: jk.lee@intel.com


   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: chang.kim@intel.com


   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   USA

   Email: gbarak@mellanox.com


   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com