idnits 2.17.00 (12 Aug 2021) /tmp/idnits24858/draft-hoffman-utf16-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Found some kind of copyright notice around line 28 but it does not match any copyright boilerplate known by this tool. Expected boilerplate is as follows today (2022-05-20) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. 
** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 485 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an Abstract section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 85 instances of too long lines in the document, the longest one being 3 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'MIME' is mentioned on line 407, but not defined ** Obsolete normative reference: RFC 2278 (ref. 'CHARSET-REG') (Obsoleted by RFC 2978) -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO-10646' ** Obsolete normative reference: RFC 2279 (ref. 'UTF-8') (Obsoleted by RFC 3629) -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE' Summary: 12 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft Paul Hoffman 3 Internet Mail Consortium 4 November 12, 1998 Francois Yergeau 5 Alis Technologies 7 UTF-16, an encoding of ISO 10646 9 Status of this Memo 11 This document is an Internet-Draft. Internet-Drafts are working documents 12 of the Internet Engineering Task Force (IETF), its areas, and its working 13 groups. Note that other groups may also distribute working documents as 14 Internet-Drafts. 16 Internet-Drafts are draft documents valid for a maximum of six months. 17 Internet-Drafts may be updated, replaced, or obsoleted by other documents 18 at any time. It is not appropriate to use Internet-Drafts as reference 19 material or to cite them other than as a "working draft" or "work in 20 progress". 22 To view the entire list of current Internet-Drafts, please check the 23 "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow 24 Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Europe), 25 ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim), 26 ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). 28 Copyright (C) The Internet Society (1998). 
All Rights Reserved. 30 1. Introduction 32 This document specifies the UTF-16 encoding of Unicode/ISO-10646 and 33 contains the registration for three MIME charset parameter values: 34 UTF-16BE, UTF-16LE, and UTF-16. 36 1.1 Background 38 The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly 39 define a character set (hereafter referred to as Unicode) which encompasses 40 most of the world's writing systems. UTF-16, the object of this 41 specification, is an encoding scheme of this character set that has the 42 characteristics of encoding the vast majority of currently-defined 43 characters in exactly two octets and of being able to encode all other 44 characters that will be defined in exactly four octets. 46 The Unicode Standard further defines additional character properties and 47 other application details of great interest to implementors. Up to the 48 present time, changes in Unicode and amendments to ISO/IEC 10646 have 49 tracked each other, so that the character repertoires and code point 50 assignments have remained in sync. The relevant standardization committees 51 have committed to maintaining this very useful synchronism. 53 1.2 Motivation 55 The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF 56 policy on character sets, [CHARPOLICY], says that IETF protocols MUST be 57 able to use the UTF-8 charset. However, relative to UTF-16, UTF-8 imposes a 58 space penalty for characters whose values are between 0x0800 and 0xFFFF. Also, 59 characters represented in UTF-8 have varying sizes. Using UTF-16 provides a 60 way to transmit character data that is mostly uniform in size. Some 61 products and network standards already specify UTF-16. (Note, however, that 62 UTF-8 has many other advantages over UTF-16 in many protocols, such as the 63 direct encoding of US-ASCII characters.) 65 UTF-16 is a format that allows encoding the first 17 planes of ISO 10646 as 66 a sequence of 16-bit quantities. 
This document addresses the issues of 67 serializing UTF-16 as an octet stream for transmission over the Internet 68 and of MIME charset naming as described in [CHARSET-REG]. 70 1.3 Terminology 72 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 73 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 74 document are to be interpreted as described in RFC 2119 [MUSTSHOULD]. 76 Throughout this document, character values are shown in hexadecimal 77 notation. For example, "0x013C" is the character whose value is at the 78 codepoint that is 316 (decimal) positions from the base of the character 79 set. 81 2. UTF-16 definition 83 In ISO 10646, each character is assigned a number, which Unicode calls the 84 Unicode scalar value. This number is the same as the UCS-4 value of the 85 character, and this document will refer to it as the "character value" for 86 brevity. In the UTF-16 encoding, characters are represented using either 87 one or two unsigned 16-bit integers, depending on the character value. 88 Serialization of these integers for transmission as a byte stream is 89 discussed in Section 3. 91 The rules for how characters are encoded in UTF-16 are: 93 - Characters with values less than 0x10000 are represented as a single 94 integer with a value equal to that of the character number. 96 - Characters with values between 0x10000 and 0x10FFFF are represented by 97 an integer with a value between 0xD800 and 0xDBFF (within the so-called 98 high-half zone or high surrogate area) followed by an integer with a 99 value between 0xDC00 and 0xDFFF (within the so-called low-half zone or 100 low surrogate area). 102 - Characters with values greater than 0x10FFFF cannot be encoded in 103 UTF-16. 105 2.1 Encoding UTF-16 107 Encoding of a single character proceeds as follows. Let U be the character 108 number, no greater than 0x10FFFF. 110 1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. 112 2) Let U' = U - 0x10000. 
Note that because U <= 0x10FFFF, U' <= 0xFFFFF, 113 that is, U' can be represented in 20 bits. 115 3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 116 0xDC00, respectively. These integers each have 10 bits free to encode the 117 character value, for a total of 20 bits. 119 4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits 120 of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. 121 Terminate. 123 Graphically, steps 2 through 4 look like: 124 U' = yyyyyyyyyyxxxxxxxxxx 125 W1 = 110110yyyyyyyyyy 126 W2 = 110111xxxxxxxxxx 128 2.2 Decoding UTF-16 130 Decoding of a single character proceeds as follows. Let W1 be the next 131 16-bit integer in the sequence of integers representing the text. Let W2 be 132 the (eventual) next integer following W1. 134 1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of W1. 135 Terminate. 137 2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in 138 error and no valid character can be obtained using W1. Terminate. 140 3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not 141 between 0xDC00 and 0xDFFF, the sequence is in error. Terminate. 143 4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of 144 W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 145 low-order bits. 147 5) Add 0x10000 to U' to obtain the character value U. Terminate. 149 Note that steps 2 and 3 indicate errors. Error recovery is not specified by 150 this document. 152 3. Serialization of characters 154 3.1 Definition of big-endian and little-endian 156 Historically, computer hardware has processed two-octet entities such as 157 16-bit integers in one of two ways. 
So-called "big-endian" hardware handles 158 two-octet entities with the higher-order octet first, that is at the lower 159 address in memory; when written out to disk or to a network interface 160 (serializing), the high-order octet thus appears first in the data stream. 161 "Little-endian" hardware handles two-octet entities with the lower-order 162 octet first. Most modern hardware is little-endian, but there are many 163 current examples of big-endian hardware. 165 For example, the unsigned 16-bit integer that represents the decimal number 166 258 is 0x0102. The big-endian serialization of that number is the octet 167 0x01 followed by the octet 0x02. The little-endian serialization of that 168 number is the octet 0x02 followed by the octet 0x01. 170 The term "network byte order" has been used in many RFCs to indicate 171 big-endian serialization, although that term has never been formally 172 defined in a standards-track document. ISO 10646 prefers big-endian 173 serialization (section 6.3 of [ISO-10646]), but it is nonetheless 174 considered likely that little-endian order will also be used on the 175 Internet. 177 This specification thus contains registration for three charset parameter 178 values: "UTF-16BE", "UTF-16LE", and "UTF-16". The three character encodings 179 are identical except for the serialization order of the octets in each 180 character, and the external determination of which serialization is used. 182 The Unicode Standard defines the character "ZERO WIDTH NON-BREAKING SPACE" 183 (0xFEFF) which is also known as the "BYTE ORDER MARK", abbreviated "BOM". 184 All BOM characters MUST be considered to be characters of the text object 185 that is labelled with the "UTF-16BE", "UTF-16LE", or "UTF-16" charset 186 parameter values. The BOM characters MUST be included when performing 187 MIME-related operations over the entire text, such as in hash algorithms 188 and length calculations. 
After the text has been processed, the BOM MAY be 189 removed, although this will prevent later comparison with the original MIME 190 object. 192 3.2 Serialization in UTF-16BE 194 Text labelled with the "UTF-16BE" charset parameter value MUST be 195 serialized with the octets which make up a single 16-bit UTF-16 value in 196 big-endian order. The detection of an initial BOM or a reversed BOM does 197 not affect de-serialization of text labelled as UTF-16BE. Finding a 198 reversed BOM (that is, the octet 0xFF followed by the octet 0xFE) is an 199 error since there is no Unicode character 0xFFFE. 201 3.3 Serialization in UTF-16LE 203 Text labelled with the "UTF-16LE" charset parameter value MUST be 204 serialized with the octets which make up a single 16-bit UTF-16 value in 205 little-endian order. The detection of an initial BOM or a reversed BOM does 206 not affect de-serialization of text labelled as UTF-16LE. Finding a 207 non-reversed BOM (that is, the octet 0xFE followed by the octet 0xFF) is an 208 error since there is no Unicode character 0xFFFE, which is the 209 interpretation of the non-reversed BOM under little-endian order. 211 3.4 Serialization in UTF-16 213 Text labelled with the "UTF-16" charset parameter value MAY be serialized 214 in either big-endian or little-endian order. Text labelled as UTF-16 MUST 215 be big-endian unless the first two octets of the text are the sequence of octets 216 0xFF 0xFE, in which case the serialization MUST be little-endian. 218 Big-endian text labelled with the "UTF-16" charset parameter value MAY 219 start with the big-endian BOM (the character 0xFEFF), but the BOM is not 220 required. BOM characters other than the first character of a body part are 221 not interpreted as BOMs. 
223 All applications that process text that uses the "UTF-16" charset parameter 224 value MUST be able to read at least the first two octets of the text and be 225 able to process those octets in order to determine the serialization of the 226 text. 
Applications that use the "UTF-16" charset parameter value MUST NOT 227 assume the serialization without first checking the first two octets to see 228 if they are a big-endian BOM or a little-endian BOM or not a BOM. 230 4. Choosing a charset 232 Any labelling application that uses UTF-16 character encoding, and puts an 233 explicit charset label on the text, and knows the serialization of the 234 characters in text, MUST label the text with the "UTF-16BE" or the 235 "UTF-16LE" charset parameter values. This allows applications that are 236 processing the text that are not able to look inside the text to know the 237 serialization definitively. 239 Any labelling application that uses UTF-16 character encoding, and puts an 240 explicit charset label on the text, and does not know the serialization of 241 the characters in text, MUST label the text with the "UTF-16" charset 242 parameter value, and SHOULD be sure the text starts with a BOM. An 243 application processing text that is labelled with the "UTF-16" charset 244 parameter value knows that the serialization cannot be determined without 245 looking inside the text itself. Fortunately, the processing application 246 needs only look at the first character (the first two octets) of the text 247 to determine the serialization. 249 Because creating text that uses the "UTF-16" charset parameter value forces 250 the recipient to read and understand the first character of the text 251 object, a text-creating program SHOULD create text labelled with the 252 "UTF-16BE" or the "UTF-16LE" charset parameter values if possible. 253 Text-creating programs that create text using UTF-16 encoding SHOULD emit 254 big-endian text if possible. 256 5. Examples 258 For the sake of example, let's suppose that there is a hieroglyphic 259 character representing the Egyptian god Ra with character value 0x00012345 260 (this character does not exist at present in Unicode). 
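The encoding steps of Section 2.1 can be traced on this hypothetical character value. The following Python fragment is a minimal sketch and is not part of the specification; the function names utf16_encode_char and serialize are illustrative assumptions. Like the steps in Section 2.1, it does not treat input values in the range 0xD800 to 0xDFFF specially.

```python
def utf16_encode_char(u: int) -> list:
    """Return the one or two 16-bit integers for character value u (Section 2.1)."""
    if u < 0 or u > 0x10FFFF:
        # Values above 0x10FFFF cannot be encoded in UTF-16 (Section 2).
        raise ValueError("character value cannot be encoded in UTF-16")
    if u < 0x10000:
        # Step 1: a single 16-bit integer equal to the character value.
        return [u]
    # Step 2: U' = U - 0x10000, which fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: the 10 high-order bits of U' fill W1,
    # the 10 low-order bits of U' fill W2.
    w1 = 0xD800 | (u_prime >> 10)
    w2 = 0xDC00 | (u_prime & 0x3FF)
    return [w1, w2]


def serialize(units: list, big_endian: bool = True) -> bytes:
    """Write each 16-bit integer as two octets in the requested order (Section 3)."""
    order = "big" if big_endian else "little"
    return b"".join(w.to_bytes(2, order) for w in units)


# The hypothetical character 0x12345 followed by "=Ra".
units = []
for value in (0x12345, 0x3D, 0x52, 0x61):
    units.extend(utf16_encode_char(value))

print([f"{w:04X}" for w in units])
print(serialize(units, big_endian=True).hex(" "))
print(serialize(units, big_endian=False).hex(" "))
```

For U = 0x12345, U' is 0x02345; its 10 high-order bits are 0x008 and its 10 low-order bits are 0x345, so W1 = 0xD808 and W2 = 0xDF45.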
262 The examples here all evaluate to the phrase: 264 *=Ra 266 where the "*" represents the Ra hieroglyph (0x00012345). 268 Text that is labelled with UTF-16BE, with no BOM: 269 D8 08 DF 45 00 3D 00 52 00 61 271 Text that is labelled with UTF-16BE, with a BOM: 272 FE FF D8 08 DF 45 00 3D 00 52 00 61 274 Text that is labelled with UTF-16LE, with no BOM: 275 08 D8 45 DF 3D 00 52 00 61 00 277 Little-endian text that is labelled with UTF-16: 278 FF FE 08 D8 45 DF 3D 00 52 00 61 00 280 6. Versions of the standards 282 ISO/IEC 10646 is updated from time to time by published amendments; 283 similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, 284 and 2.1 as of this writing. Each new version obsoletes and replaces the 285 previous one, but implementations, and more significantly data, are not 286 updated instantly. 288 In general, the changes amount to adding new characters, which does not 289 pose particular problems with old data. Amendment 5 to ISO/IEC 10646, 290 however, has moved and expanded the Korean Hangul block, thereby making any 291 previous data containing Hangul characters invalid under the new version. 292 Unicode 2.0 has the same difference from Unicode 1.1. The official 293 justification for allowing such an incompatible change was that no 294 implementations and no data containing Hangul existed, a statement that is 295 likely to be true but remains unprovable. The incident has been dubbed the 296 "Korean mess", and the relevant committees have pledged to never, ever 297 again make such an incompatible change. 299 New versions, and in particular any incompatible changes, have consequences 300 regarding MIME character encoding labels, to be discussed in Appendix A. 302 7. Security considerations 304 UTF-16 is based on the ISO 10646 character set, which is frequently being 305 added to, as described in Section 6 and Appendix A of this document. 
306 Processors must be able to handle characters that are not defined at the 307 time that the processor was created in such a way as to not allow an 308 attacker to harm a recipient by including unknown characters. 310 Processors that handle any type of text, including text encoded as UTF-16, 311 must be vigilant for control characters that might reprogram a display 312 terminal or keyboard. Similarly, processors that interpret text entities 313 (such as looking for embedded programming code), must be careful not to 314 execute the code without first alerting the recipient. 316 Text in UTF-16 may contain special characters, such as the OBJECT 317 REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, 318 depending on the interpretation of the processing program and the 319 availability of an external data stream that would be executed. This 320 external processing may have side-effects that allow the sender of a 321 message to attack the receiving system. 323 Implementors of UTF-16 need to consider the security aspects of how they 324 handle illegal UTF-16 sequences (that is, sequences involving surrogate 325 pairs that have illegal values). It is conceivable that in some 326 circumstances an attacker would be able to exploit an incautious UTF-16 327 parser by sending it an octet sequence that is not permitted by the UTF-16 328 syntax. 330 8. References 332 [CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration 333 Procedures", BCP 19, RFC 2278, January 1998. 335 [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information 336 technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: 337 Architecture and Basic Multilingual Plane. Twelve amendments and two 338 technical corrigenda have been published up to now. UTF-16 is described in 339 Annex Q, published as Amendment 1. Many other amendments are currently at 340 various stages of standardization. 
342 [MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate 343 Requirement Levels", BCP 14, RFC 2119, March 1997. 345 [CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", 346 BCP 18, RFC 2277, January 1998. 348 [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 349 2279, January 1998. 351 [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.1", 352 Unicode Technical Report #8. 354 9. Acknowledgments 356 David Goldsmith wrote a great deal of the initial wording for this 357 specification. Other significant contributors include: 359 Mati Allouche 360 Walt Daniels 361 Mark Davis 362 Martin Duerst 363 Asmus Freytag 364 Lloyd Honomichl 365 Murata Makoto 366 Ken Whistler 368 Some of the text in this specification was copied from [UTF-8], and that 369 document was worked on by many people. Please see the acknowledgements 370 section in that document for more people who may have contributed 371 indirectly to this document. 373 10. Authors' address 375 Paul Hoffman 376 Internet Mail Consortium 377 127 Segre Place 378 Santa Cruz, CA 95060 USA 379 phoffman@imc.org 381 Francois Yergeau 382 Alis Technologies 383 100, boul. Alexis-Nihon, Suite 600 384 Montreal QC H4M 2P2 Canada 385 fyergeau@alis.com 387 A. Charset registrations 389 This memo is meant to serve as the basis for registration of three MIME 390 character set parameters (charsets) [CHARSET-REG]. The proposed charset 391 parameter values are "UTF-16BE", "UTF-16LE", and "UTF-16". These strings 392 label media types containing text consisting of characters from the 393 repertoire of ISO/IEC 10646 including all amendments at least up to 394 amendment 5 (Korean block), encoded to a sequence of octets using the 395 encoding and serialization schemes outlined above. 
397 Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in 398 MIME content types under the "text" top-level type, because they do not 399 encode line endings in the way required for MIME "text" media types. 401 It is noteworthy that the labels described here do not contain a version 402 identification, referring generically to ISO/IEC 10646. This is 403 intentional, the rationale being as follows: 405 A MIME charset label is designed to give just the information needed to 406 interpret a sequence of bytes received on the wire into a sequence of 407 characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as 408 a character set standard does not change incompatibly, version numbers 409 serve no purpose, because one gains nothing by learning from the tag that 410 newly assigned characters may be received that one doesn't know about. The 411 tag itself doesn't teach anything about the new characters, which are going 412 to be received anyway. 414 Hence, as long as the standards evolve compatibly, the apparent advantage 415 of having labels that identify the versions is only that, apparent. But 416 there is a disadvantage to such version-dependent labels: when an older 417 application receives data accompanied by a newer, unknown label, it may 418 fail to recognize the label and be completely unable to deal with the data, 419 whereas a generic, known label would have triggered mostly correct 420 processing of the data, which may well not contain any new characters. 422 The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in 423 principle contradicting the appropriateness of a version independent MIME 424 charset label as described above. 
But the compatibility problem can only 425 appear with data containing Korean Hangul characters encoded according to 426 Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there 427 is arguably no such data to worry about, this being the very reason the 428 incompatible change was deemed acceptable. 430 In practice, then, a version-independent label is warranted, provided the 431 label is understood to refer to all versions after Amendment 5, and 432 provided no incompatible change actually occurs. Should incompatible 433 changes occur in a later version of ISO/IEC 10646, the MIME charset labels 434 defined here will stay aligned with the previous version until and unless 435 the IETF specifically decides otherwise. 437 A.1 Registration for UTF-16BE 439 To: ietf-charsets@iana.org 440 Subject: Registration of new charset 442 Charset name(s): UTF-16BE 444 Published specification(s): This specification 446 Suitable for use in MIME content types under the 447 "text" top-level type: No 449 Person & email address to contact for further information: 450 Paul Hoffman 451 Francois Yergeau 453 A.2 Registration for UTF-16LE 455 To: ietf-charsets@iana.org 456 Subject: Registration of new charset 458 Charset name(s): UTF-16LE 460 Published specification(s): This specification 462 Suitable for use in MIME content types under the 463 "text" top-level type: No 465 Person & email address to contact for further information: 466 Paul Hoffman 467 Francois Yergeau 469 A.3 Registration for UTF-16 471 To: ietf-charsets@iana.org 472 Subject: Registration of new charset 474 Charset name(s): UTF-16 476 Published specification(s): This specification 478 Suitable for use in MIME content types under the 479 "text" top-level type: No 481 Person & email address to contact for further information: 482 Paul Hoffman 483 Francois Yergeau
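As an illustration of Section 2.2 combined with the byte-order rule of Section 3.4, the following Python fragment is a minimal sketch and is not part of the specification. The function names are illustrative assumptions. Raising ValueError on an invalid sequence is also an assumption, since error recovery is not specified by this document. In keeping with Section 3.1, an initial BOM is kept as a character of the decoded text.

```python
def decode_units(units: list) -> list:
    """Turn a sequence of 16-bit integers into character values (Section 2.2)."""
    chars = []
    i = 0
    while i < len(units):
        w1 = units[i]
        if w1 < 0xD800 or w1 > 0xDFFF:
            # Step 1: the character value is the value of W1.
            chars.append(w1)
            i += 1
            continue
        if w1 > 0xDBFF:
            # Step 2: W1 is not between 0xD800 and 0xDBFF.
            raise ValueError("sequence in error: unexpected low-half integer")
        if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Step 3: W2 is missing or not between 0xDC00 and 0xDFFF.
            raise ValueError("sequence in error: missing low-half integer")
        # Step 4: 10 low-order bits of W1, then 10 low-order bits of W2.
        u_prime = ((w1 & 0x3FF) << 10) | (units[i + 1] & 0x3FF)
        # Step 5: add 0x10000 to obtain the character value.
        chars.append(u_prime + 0x10000)
        i += 2
    return chars


def decode_utf16_labelled(octets: bytes) -> list:
    """Decode text labelled "UTF-16": big-endian unless it starts with FF FE."""
    if len(octets) % 2 != 0:
        raise ValueError("odd number of octets")
    order = "little" if octets[:2] == b"\xff\xfe" else "big"
    units = [int.from_bytes(octets[i:i + 2], order)
             for i in range(0, len(octets), 2)]
    return decode_units(units)


# Little-endian input: BOM, the pair D808 DF45 (character value 0x12345), "=Ra".
text = bytes.fromhex("FFFE 08D8 45DF 3D00 5200 6100")
print([hex(c) for c in decode_utf16_labelled(text)])
```

Only the first two octets are examined to choose the byte order. A BOM found anywhere else is decoded as an ordinary character.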