Network Working Group                                         Y(J) Stein
Internet-Draft                                   RAD Data Communications
Expires: April 2, 2006                                     Sept 29, 2005


                   Great Real-Time Problem Statement
                        draft-stein-great-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 2, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   VoIP is commonly perceived to be a low quality, but low cost,
   alternative to standard telephony. This poor perception is often
   well deserved, being fueled by implementations designed without
   regard to the characteristics of IP networks. This problem statement
   attempts to catalog the shortcomings of current implementations, in
   order to explore the IETF community's interest in working to improve
   this situation.

1. Introduction

   Consider the placing of a phone call over the PSTN. The end-user
   terminal is extremely simple and inexpensive, the scalability of
   the PSTN being based on 'dumb' terminals at end-points, with all the
   intelligence concentrated in the core. From the moment the user
   requests service by off-hooking, an imperceptible amount of time
   passes before the network indicates that it is ready to receive
   signaling by delivering audible dial-tone. Since the service
   availability is 'five nines' (i.e. 99.999 percent) the user will
   probably not remember an event where dial-tone was not heard
   immediately after off-hooking. The user then enters the required
   part of a hierarchical destination address, and will then receive
   feedback as to the usage status of the destination terminal, in the
   form of ringback or busy tone, usually within seconds. Assuming that
   the destination terminal is not in use and the called party is
   present and decides to accept the call, a session is established an
   imperceptible time after off-hooking of the destination terminal.

   For the duration of the conversation the voice is guaranteed to be
   'toll quality', defined to be at least 4 on a Mean Opinion Score
   (MOS) scale from 1 to 5.
   This quality is admittedly imperfect, due to the
   audio spectrum being truncated at 4 kHz (thus making differentiation
   of various unvoiced fricatives impossible, and distorting music) but
   preserves speaker identity and does not impede understandability for
   native speakers of the language spoken. There will, in general, be
   no unusual noises or audible artifacts (unless due to sources
   radiating close to the end-user terminals), and no gaps or
   discontinuities in the received information. Furthermore, the
   one-way propagation delay is usually close to the physical minimum
   (i.e. the time taken for light to travel between the two points)
   and no perceivable echo is introduced due to the telephone
   electronics. With extremely high probability the session will only
   be terminated when either the originator or the called party decides
   to terminate.

   Now, for comparison, let us consider a typical VoIP call over the
   public Internet. The end-user terminal may either be a personal
   computer (PC) or an IP-phone, the former being a multifunctional
   computational device and the latter a smaller and less
   computationally able, but relatively expensive, terminal. Assuming
   a PC as terminal, the user initiates a call by typing an identifier
   or IP address, or by choosing the desired destination from a list.
   Thereafter follows a rather prolonged period during which the user
   has no call progress feedback; the duration is usually longer for
   peer-to-peer systems, but is often considerable even for systems
   with centralized registries. Afterwards a simulation of ringback or
   busy-tone is commonly played, and assuming the destination terminal
   is powered and the called party is willing to take the call, a
   bi-directional session is set up.

   Once the session commences the voice quality will usually not be as
   good as that experienced in the PSTN (see Section 3 for a
   discussion).
   In fact, the quality may be variable, ranging from
   telephone-like to incomprehensible. Depending on network
   characteristics there will often be periods when the sound
   completely disappears, becomes metallic, or sounds like Martians are
   speaking. At times artifacts such as beeps may be heard. When using
   the public Internet the round-trip delay will often be so high (over
   one half second) that free conversation is impossible, and the
   parties to the conversation may repeatedly speak at the same time, or
   may purposely leave long pauses (for the other to interrupt or to
   aver that the connection is still operational), or say 'over' as in
   push-to-talk systems. The session may also terminate unexpectedly,
   and then may or may not be restored by reconnecting.

   The sum total of the user's perception of the audio quality, delay,
   reliability and other factors is sometimes called the 'user
   experience'. Why would anyone use a VoIP system if the user
   experience is significantly inferior to that of the standard
   telephone system? The situation is analogous to that of cellular
   phones, which also have noticeably lower audio quality and may
   unexpectedly disconnect, but the mediocre user experience is
   tolerated due to a new feature, namely mobility. Here some
   enthusiasts have suggested that the attraction of VoIP is due to the
   additional functionality that is, or will be, available (e.g. instant
   messaging, video). However, in most cases it is probably either the
   economics (free calls) or the ready accessibility for people already
   seated at a PC (along with presence indications) that induces
   people to tolerate the poor quality. In fact, many times the latter
   type of user will start a conversation on VoIP in order to ask
   whether they can call over the PSTN.
   The feeling of most users is
   that the quality is good enough for casual, hobby-type conversations
   (reminding some of us of our ham radio origins), and thus such users
   are willing to use it to speak with remote acquaintances,
   mothers-in-law, etc. They might not, however, choose to use VoIP to
   call their bank branch, an important client, or their boss.

   Of course, much of what was said above is specific to the present
   state of the public Internet, while well engineered, highly
   overprovisioned networks suffer much less from these troubles.
   However, this does not mean that the public Internet is inherently
   unsuitable for quality transport of voice traffic, nor that it is
   imperative to make major changes in order for it to become suitable
   (although such changes may help). Many of the above problems can be
   mitigated, although not completely solved, by taking the
   characteristics of the Internet into account at all stages of the
   VoIP implementation. We call such an implementation, and its
   components, 'PSN-aware'.

   The above discussion focused on VoIP, but similar statements could be
   made concerning other forms of real-time traffic transported over the
   Internet, such as videoconferencing. On the other hand, not all
   real-time traffic is as problematic. For example, streaming audio
   that can be delivered after a certain delay may be able to exploit
   retransmission mechanisms, and thus be immunized to many of the above
   hindrances. The essential ingredients are real-time constraints and
   delay sensitivity, characteristics present in interactive real-time
   applications.

2. Characteristics of PSNs

   The design philosophy of the Public Switched Telephone Network (PSTN)
   presumes that routing is expensive but bandwidth plentiful, while
   that of Packet Switched Networks (PSNs), such as the Internet,
   presupposes bandwidth to be dear while routing is affordable.
   The former tenets lead to a circuit switched network that naturally
   supports reliable and high quality interactive audio sessions, while
   the resource sharing required by the latter postulates makes
   providing such services a challenge.

   The very fact that PSN users share bandwidth means that no user
   traffic receives treatment identical to that of a PSTN circuit. The
   major sources of performance degradation for real-time
   delay-sensitive PSN traffic can be identified as follows:
   *  packet creation time
   *  network propagation delay
   *  packet delay variation
   *  packet loss and mis-ordering
   *  congestion events
   *  lack of inherent timing transport
   *  bandwidth conservation algorithms
   *  emulation mechanisms

   Unlike PSTN traffic, PSN traffic is sent in packets. The first byte
   of data placed in the packet experiences latency corresponding to the
   time required to fill the packet at the source. Although the last
   byte placed in the packet experiences only minimal delay, it is the
   last to be played out, and thus all data experiences latency equal to
   the packet creation time (PCT). In VoIP systems this may be less
   than 1 millisecond (for example, when using the G.728 LD-CELP
   encoder), but it is typically tens of milliseconds (for example, 10
   milliseconds for G.729, or 60 milliseconds for a two-frame
   superpacket of G.723.1). PCT is a frame-size related latency
   introduced by the source, but additional delay is usually added at
   the destination. Most speech decoders require 'lookahead', and (as
   will be discussed below) jitter buffer based systems require storing
   of packets. These additional delays may greatly increase the overall
   one-way delay.

   While TDM switches typically add 1/8000 of a second latency per
   switch, queuing delay in IP routers may be orders of magnitude
   higher.
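
   The PCT figures quoted above can be reproduced with a few lines of
   arithmetic. The following is only a minimal sketch: the function
   and constant names are hypothetical, the G.728 frame duration of
   0.625 milliseconds is an assumption consistent with "less than 1
   millisecond", and the lookahead and jitter buffer values in the
   usage note are purely illustrative.

```python
"""Back-of-the-envelope packet creation time (PCT) arithmetic."""

# Frame durations in milliseconds.  The G.729 and G.723.1 values follow
# the figures quoted in the text; the G.728 value is an assumption.
FRAME_MS = {"G.728": 0.625, "G.729": 10.0, "G.723.1": 30.0}

# One 8 kHz sample period: the per-switch TDM latency quoted in the text.
TDM_SWITCH_MS = 1000.0 / 8000.0


def packet_creation_time_ms(codec: str, frames_per_packet: int = 1) -> float:
    """Time needed to fill one packet at the source."""
    if frames_per_packet < 1:
        raise ValueError("a packet must carry at least one frame")
    return FRAME_MS[codec] * frames_per_packet


def endpoint_delay_ms(codec: str, frames_per_packet: int = 1,
                      lookahead_ms: float = 0.0,
                      jitter_buffer_ms: float = 0.0) -> float:
    """PCT plus the coder lookahead and jitter buffer delays."""
    return (packet_creation_time_ms(codec, frames_per_packet)
            + lookahead_ms + jitter_buffer_ms)


if __name__ == "__main__":
    for codec, frames in (("G.728", 1), ("G.729", 1), ("G.723.1", 2)):
        print(codec, frames, packet_creation_time_ms(codec, frames), "ms")
```

   For instance, endpoint_delay_ms("G.723.1", 2, lookahead_ms=7.5,
   jitter_buffer_ms=40.0) adds an assumed lookahead and an assumed
   jitter buffer depth to the 60 millisecond PCT of the two-frame
   superpacket, before any network delay is counted.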

   The aforementioned latency is not constant from packet to packet,
   and successive packets do not even necessarily follow the same route.
   For these reasons packets injected into the PSN at a constant rate
   exit it at stochastic intervals. As we wish to play out audio at a
   constant rate, this packet delay variation (PDV) must be compensated.
   There are two ways this may be accomplished. In jitter buffer based
   systems, incoming packets are not directly played out, but rather
   placed in a 'jitter buffer' and later played out at a constant rate.
   The jitter buffer is usually configured to be able to absorb the
   maximum expected PDV, and thus introduces a significant amount of
   delay. In 'shock absorber' based systems packets are played out as
   they arrive, and when a packet is not yet available, a signal
   processing algorithm is employed to extrapolate based on previous
   packets, until such time as a packet arrives. These systems
   introduce only minimal additional latency, but require considerably
   more computational power.

   IP networks are intrinsically best-effort, and thus there is no
   guarantee that a packet injected into the PSN is actually received.
   In fact, all PSNs introduce some percentage of packet loss (PL),
   caused by packets rejected due to detectable errors, packets dropped
   due to congested resources, and packets dropped due to policy
   decisions. Packet loss due to random errors will be independently
   distributed, but other types may cause bursts of lost packets. In
   addition, when parallel paths exist, packets may be received
   out-of-order, and must be either reordered (which may be possible in
   jitter buffer based systems) or treated as lost. When a packet has
   not been received a decision must be made as to what to play out.
   One possibility is silence, but this will lead to reduced perceived
   audio quality.
   Depending on the
   expected percentage of packet loss, packet loss concealment (PLC)
   mechanisms may need to be employed.

   Another consequence of the bandwidth sharing of PSNs is the
   possibility of congestion events, statistically infrequent peaks of
   activity during which there is insufficient bandwidth or processing
   power to transport all packets. For non-real-time traffic there are
   self-regulating rate control mechanisms, but for real-time traffic it
   is not clear that such mechanisms can be useful.

   The PSTN is based on TDM networks that inherently transport timing
   information in the physical layer along with the data. PSNs do not
   include such a physical layer clock, and when such a clock is
   required, an appropriate mechanism must be supplied. This mechanism
   may rely on a clock source external to the PSN (e.g. GPS
   satellites), or may involve clock recovery over the PSN itself (e.g.
   NTP).

   By bandwidth conservation algorithms we mean all source codings
   employed to bring the data rate closer to the Shannon rate.
   These range from lossless data compression, through speech encoding
   and fax image encoding, to video encoding. Except for lossless
   compression, all such mechanisms introduce some quality reduction,
   and all (including lossless compression) reduce robustness to errors
   and packet loss.

   The final source of degradation is emulation mechanisms internal to
   gateways that enable access to the PSN. These mechanisms may try to
   simulate behavior of a PSTN system, to terminate or relay
   PSTN-specific signaling, or to optimize operation of interactive
   real-time traffic over the PSN. These mechanisms are typically
   required to detect various characteristics of the incoming real-time
   signals, and need to do so rapidly, with high probability of
   detection, and with low false alarm rate.
   When such a mechanism fails, the gateway may
   enter a state from which it may take time to exit, creating a severe
   anomaly in user perceived performance.

3. Bandwidth and Audio Quality Problems

   Even assuming a perfect PSN, i.e. one with no packet loss (PL) or
   packet mis-ordering and only minimal packet delay variation (PDV),
   the perceived voice quality of VoIP calls is highly dependent on
   bandwidth reduction mechanisms. First, in order to minimize
   bandwidth consumption, speech encoding algorithms are employed that
   reduce the MOS to somewhere between 3.5 and 3.8. Second, voice
   activity detection (VAD) is typically employed to mute (or replace
   with locally generated 'comfort noise') one direction of the
   conversation; this VAD is never perfect and may clip the start of
   voice spurts. Since speech compression does not pass various tones
   (e.g. DTMF) faithfully, these are passed using special relay
   functions; false alarms in such detection produce annoying beeps
   known as 'talk-offs'.

   When the present generation of speech encoders was developed, the
   only design criteria were compression ratio, speech quality (MOS),
   and to a certain degree delay (although G.723.1 was supposedly
   designed with VoIP in mind, its round-trip combined delay of 75
   milliseconds is not conducive to use over the public Internet). At
   about the same time speech encoders were developed for satellite
   applications that were built to be robust to individual bit errors;
   but no encoders were built to be robust to loss of entire packets.
   Indeed, even the common event of the loss of a single packet may
   cause a disruption to the decoded audio that may last for a long
   time. Later the iLBC speech coder (described in RFC 3952) was
   designed to eliminate this problem (and today other encoder
   techniques are known that are inherently insensitive to missing
   data).
   When the packet loss problem was better understood, PLC
   mechanisms were added to speech encoders used over PSNs, but these
   PLCs helped mainly for loss of isolated packets. Typical PL patterns
   of IP networks (e.g. loss bursts) were not taken into account.

   As the development of speech encoding algorithms has in general
   proceeded without detailed knowledge of PSN characteristics, required
   functionality, such as PLC, has been added on a posteriori. Higher
   efficiency and performance may be gained by a priori design of
   PSN-aware speech and other audio (and later video) encoders and
   PSN-aware PLC mechanisms.

   In addition, when the end-user terminals are no longer POTS phones,
   one may ask why we are still limiting ourselves to 4 kHz bandwidth.
   Wideband telephony (8 kHz bandwidth) speech is noticeably superior,
   and may go far toward convincing users that VoIP quality may actually
   exceed that of the PSTN. Design of standardized PSN-aware wideband
   encoders is a worthwhile task waiting to be tackled.

   Most speech encoders used today take in a constant number of bytes of
   uncompressed audio, and produce a constant number of compressed
   bytes. Some speech coders are called adaptive multirate, in that
   they may be configured to produce a specified number of compressed
   bytes. Truly variable rate compression techniques vary in output
   rate according to the character of the input sounds. While the use
   of constant rate transport infrastructures dictates constant rate
   encoders, PSN packets may vary in size from packet to packet, and
   thus variable rate encoders may be used. It is an open question as
   to how to match these encoder parameters to PSN characteristics.

4. Delay and Delay Variation Problems

   Standard PSTN practice places tight constraints on the tolerable
   end-to-end and round-trip delays.
   Although the more modern approach is
   to consider the effect of delay along with other degradations,
   one-way transmission times of up to 150 milliseconds are considered
   universally acceptable, assuming adequate echo control is provided.
   Echo cancellation is required when the delay exceeds about 20
   milliseconds.

   The one-way delay in PSNs is greater than that of the PSTN, due at
   the very least to PCT and lookahead, and often to queuing delays and
   jitter buffer latency. Indeed, network propagation times alone may
   be in the 100 millisecond range, and thus incompatible with the
   minimum delay introduced by G.723.1. Thus a sensible approach would
   be to start with a specification of the network delay, and to derive
   allowable buffering and processing budgets. This would probably
   require smaller frame sizes and minimization of lookahead, and
   innovative designs would be needed to keep bit rates reasonable.

   More attention should be paid to perfecting shock absorber
   based systems. These may need to be more fully integrated into the
   encoder, perhaps more specifically into the PLC mechanism.

5. Congestion Problems

   When congestion is detected, either by explicit notification or via
   detection of packet loss, even real-time systems should heed the
   network's warning of imminent trouble. In addition to applying PLC
   to any missing packet, rate cutback needs to be attempted in the
   other direction, e.g. by lowering VAD thresholds, via adaptation of
   the rate of adaptive multirate encoders or the average output rate
   parameter of variable rate encoders, and in extreme cases by
   deliberate dropping of packets that are likely to be more effectively
   concealed by the PLC. Although all these activities reduce the
   user's perception of voice quality, they do so less drastically than
   complete loss of all audio.
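
   One of the cutback options above, stepping an adaptive multirate
   encoder down when congestion is signalled, can be sketched as
   follows. This is an illustration under stated assumptions rather
   than a proposed mechanism: the bit rate ladder, the loss threshold,
   the recovery rule and all names are hypothetical, and the feedback
   report is assumed to carry a loss fraction and an optional explicit
   congestion indication.

```python
from dataclasses import dataclass

# Hypothetical encoder modes in kbit/s, highest rate first.
MODES_KBPS = (12.2, 7.95, 5.9, 4.75)


@dataclass
class RateController:
    """Steps the encoder rate down on congestion, up after quiet periods."""
    mode: int = 0                  # index into MODES_KBPS
    clean_reports: int = 0         # consecutive reports without congestion
    loss_threshold: float = 0.02   # assumed loss fraction treated as congestion
    recover_after: int = 5         # assumed clean reports before stepping up

    def on_report(self, loss_fraction: float, ecn_marked: bool = False) -> float:
        """Process one feedback report and return the rate to use next."""
        if ecn_marked or loss_fraction > self.loss_threshold:
            # Congestion: move one step down the ladder, if possible.
            self.mode = min(self.mode + 1, len(MODES_KBPS) - 1)
            self.clean_reports = 0
        else:
            self.clean_reports += 1
            if self.clean_reports >= self.recover_after and self.mode > 0:
                # Sustained quiet: cautiously move one step back up.
                self.mode -= 1
                self.clean_reports = 0
        return MODES_KBPS[self.mode]
```

   The asymmetry (immediate step down, delayed step up) is a design
   choice made for this sketch only; whether such self-regulation is
   appropriate for real-time traffic is exactly the kind of question
   this section raises.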

   Adaptive multirate encoders can generally change rate on a packet by
   packet basis in 'hitless' fashion, but it is unknown how to do this
   when changing encoders. There has not been sufficient study of how
   to identify packets that may be less harmful to discard.

6. Emulation Problems

   The lack of precise clock synchronization between source and
   destination (play out) clocks is usually considered unimportant for
   voice. This is because even a missed or extra speech sample every
   few minutes is undetectable to the ear. The situation is different
   when the system is used to transport non-speech data, such as fax and
   data modem transfer without appropriate relays. In such cases it is
   necessary to match the destination clock to that of the source in
   order to eliminate sample slips.

   Accurate (line or acoustic) echo cancellation is essential for high
   ratings of user experience. At present echo cancellation is
   typically performed where its computational cost is minimized, i.e.
   close to the place where the echo is generated, rather than where it
   would be heard. It would be useful to be able to employ an echo
   cancellation server anywhere in the network, but there are problems
   that need to be solved before this can be accomplished. For example,
   the relative timing of the signals flowing in opposite directions
   needs to be determined (including clock synchronization), and the
   fact that neither signal may be echo-free needs to be taken into
   account.

   Real-time monitoring of voice quality has been previously considered.
   Such measures may be based on acoustic models or on measurement of
   network degradations and use of previously determined calibrations.
   Timely feedback of such end-to-end quality information may be useful
   in improving the audio quality, but the precise mechanisms need to be
   worked out.

   Another problem that may be addressed concerns multi-user
   conferencing.
   Many present-day systems choose a single dominant
   speaker, squelching others desiring to talk. This introduces various
   perceived quality degradations, in addition to giving a bad
   impression to the user wanting to 'break in'. Complete summing of
   audio from all users is problematic for several reasons. It requires
   decompression and recompression of user audio, and rescaling to avoid
   excessive signal levels. Advances would be welcome here.

   Reduction of the connection setup delay, and of the related delays
   for entering/exiting fax-relay and modem-relay modes, is an important
   signalling problem to be solved.

   Integration of real-time delay-sensitive traffic along a time line
   with other applications may be interesting. The most important
   application here is lip syncing, but syncing text for karaoke,
   whiteboard motions to spoken words, etc. may need to be addressed.

7. Security Considerations

   Although not directly related to the real-time character of the
   traffic, authentication, encryption, and methods for lawful
   interception (CALEA) need to be integrated in a standard way into
   VoIP systems.

8. IANA Considerations

   This Internet-Draft does not propose a protocol, nor a change to any
   existing protocol, and thus no IANA considerations are raised.

Author's Address

   Yaakov (J) Stein
   RAD Data Communications
   24 Raoul Wallenberg St., Bldg C
   Tel Aviv 69719
   ISRAEL

   Phone: +972 3 645-5389
   Email: yaakov_s@rad.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.
   Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.