Internet Draft                                          Paul Hoffman
draft-hoffman-utf16-02.txt                  Internet Mail Consortium
February 10, 1999                                   Francois Yergeau
                                                   Alis Technologies

                  UTF-16, an encoding of ISO 10646

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright (C) The Internet Society (1999). All Rights Reserved.

1. Introduction

This document describes the UTF-16 encoding of Unicode/ISO-10646 and
contains the registration for three MIME charset parameter values:
UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.

1.1 Background

The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [WORKSHOP].
UTF-16, the object of this specification, is a way to encode Unicode
characters that encodes the vast majority of currently-defined
characters in exactly two octets and can encode all other characters
that will be defined in exactly four octets.

The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up to
the present time, changes in Unicode and amendments to ISO/IEC 10646
have tracked each other, so that the character repertoires and code
point assignments have remained in sync. The relevant standardization
committees have committed to maintain this very useful synchronism.

1.2 Motivation

The UTF-8 transformation of Unicode is described in [UTF-8]. The IETF
policy on character sets and languages, [CHARPOLICY], says that IETF
protocols MUST be able to use the UTF-8 charset.
However, relative to UTF-16, UTF-8 imposes a space penalty for
characters whose values are between 0x0800 and 0xFFFF. Also, characters
represented in UTF-8 have varying sizes. Using UTF-16 provides a way to
transmit character data that is mostly uniform in size. Some products
and network standards already specify UTF-16. (Note, however, that
UTF-8 has many other advantages over UTF-16 in many protocols, such as
the direct encoding of US-ASCII characters and re-synchronization after
loss of octets.)

UTF-16 is a format that allows encoding the first 17 planes of ISO
10646 as a sequence of 16-bit quantities. This document addresses the
issues of serializing UTF-16 as an octet stream for transmission over
the Internet and of MIME charset naming as described in [CHARSET-REG].

1.3 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" is the character assigned the integer
value 316 (decimal) in the CCS.

2. UTF-16 definition

In ISO 10646, each character is assigned a number, which Unicode calls
the Unicode scalar value. This number is the same as the UCS-4 value of
the character, and this document will refer to it as the "character
value" for brevity. In the UTF-16 encoding, characters are represented
using either one or two unsigned 16-bit integers, depending on the
character value. Serialization of these integers for transmission as a
byte stream is discussed in Section 3.

The rules for how characters are encoded in UTF-16 are:

- Characters with values less than 0x10000 are represented as a single
  16-bit integer with a value equal to that of the character value.

- Characters with values between 0x10000 and 0x10FFFF are represented
  by a 16-bit integer with a value between 0xD800 and 0xDBFF (within
  the so-called high-half zone or high surrogate area) followed by a
  16-bit integer with a value between 0xDC00 and 0xDFFF (within the
  so-called low-half zone or low surrogate area).

- Characters with values greater than 0x10FFFF cannot be encoded in
  UTF-16.

2.1 Encoding UTF-16

Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character value, no greater
than 0x10FFFF.

1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2) Let U' = U - 0x10000. Note that because U <= 0x10FFFF, U' <= 0xFFFFF,
   that is, U' can be represented in 20 bits.

3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
   0xDC00, respectively. These integers each have 10 bits free to
   encode the character value, for a total of 20 bits.

4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
   bits of W1 and the 10 low-order bits of U' to the 10 low-order bits
   of W2. Terminate.

Graphically, steps 2 through 4 look like:

   U' = yyyyyyyyyyxxxxxxxxxx
   W1 = 110110yyyyyyyyyy
   W2 = 110111xxxxxxxxxx

2.2 Decoding UTF-16

Decoding of a single character from UTF-16 to an ISO 10646 character
value proceeds as follows. Let W1 be the next 16-bit integer in the
sequence of integers representing the text. Let W2 be the (eventual)
next integer following W1.

1) If W1 < 0xD800 or W1 > 0xDFFF, the character value is the value of
   W1. Terminate.

2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
   is in error and no valid character can be obtained using W1.
   Terminate.

3) If there is no W2 (that is, the sequence ends with W1), or if W2 is
   not between 0xDC00 and 0xDFFF, the sequence is in error. Terminate.

4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits
   of W1 as its 10 high-order bits and the 10 low-order bits of W2 as
   its 10 low-order bits.

5) Add 0x10000 to U' to obtain the character value U. Terminate.

Note that steps 2 and 3 indicate errors. Error recovery is not
specified by this document. When terminating with an error in step 2 or
step 3, it may be wise to set U to the value of W1 to help the caller
diagnose the error and not lose information.

3. Labelling UTF-16 text

This specification contains registration for three MIME charsets:
"UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the
combination of a CCS and a CES. Here the CCS is Unicode/ISO 10646 and
the CES is the same in all three cases, except for the serialization
order of the octets in each character, and the external determination
of which serialization is used.

This section describes which of the three labels to apply to a stream
of text.

3.1 Definition of big-endian and little-endian

Historically, computer hardware has processed two-octet entities such
as 16-bit integers in one of two ways. So-called "big-endian" hardware
handles two-octet entities with the higher-order octet first, that is,
at the lower address in memory; when written out to disk or to a
network interface (serializing), the high-order octet thus appears
first in the data stream. On the other hand, "little-endian" hardware
handles two-octet entities with the lower-order octet first. Hardware
of both kinds is common today.

For example, the unsigned 16-bit integer that represents the decimal
number 258 is 0x0102. The big-endian serialization of that number is
the octet 0x01 followed by the octet 0x02. The little-endian
serialization of that number is the octet 0x02 followed by the octet
0x01.
The following C code fragment demonstrates a way to write 16-bit
quantities to a file in big-endian order, irrespective of the
hardware's native byte order.

   #include <stdio.h>

   void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
   {
       putc(u >> 8, f);     /* output high-order byte */
       putc(u & 0xFF, f);   /* then low-order */
   }

The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has yet to be formally
defined in a standards-track document. ISO 10646 prefers big-endian
serialization (section 6.3 of [ISO-10646]), but it is nonetheless
considered likely that little-endian order will also be used on the
Internet.

3.2 Byte order mark (BOM)

The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
possible usage of the character, in addition to its normal use as a
genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage,
suggested by Unicode section 2.4 and ISO 10646 Annex F (informative),
is to prepend a 0xFEFF character to a stream of Unicode characters as a
"signature"; a receiver of such a serialized stream may then use the
initial character both as a hint that the stream consists of Unicode
characters and as a way to recognize the serialization order. In
serialized UTF-16 prepended with such a signature, the order is
big-endian if the first two octets are 0xFE followed by 0xFF; if they
are 0xFF followed by 0xFE, the order is little-endian. Note that 0xFFFE
is not a Unicode character, precisely to preserve the usefulness of
0xFEFF as a byte-order mark.
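
The check a receiver makes on the first two octets can be sketched in C
as follows. This is only an illustration: the enum and the function name
detect_signature are hypothetical and are not defined by this
specification or by Unicode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names used only for this sketch. */
enum utf16_signature { SIG_NONE, SIG_BIG_ENDIAN, SIG_LITTLE_ENDIAN };

/* Classify the first two octets of a serialized stream of len octets. */
enum utf16_signature detect_signature(const unsigned char *octets,
                                      size_t len)
{
    assert(octets != NULL || len == 0);
    if (len < 2)
        return SIG_NONE;              /* too short to hold a signature */
    if (octets[0] == 0xFE && octets[1] == 0xFF)
        return SIG_BIG_ENDIAN;        /* 0xFEFF, high-order octet first */
    if (octets[0] == 0xFF && octets[1] == 0xFE)
        return SIG_LITTLE_ENDIAN;     /* 0xFEFF, low-order octet first */
    return SIG_NONE;                  /* no signature present */
}
```

The sketch reports only what the first two octets look like; whether an
initial 0xFEFF really is a signature remains a decision for the caller.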

It is important to understand that the character 0xFEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST NOT
be interpreted as a byte-order mark. The converse of that statement is
not always true: the character 0xFEFF in the first position of a stream
MAY be interpreted as a zero-width non-breaking space, and is not
always a byte-order mark. For example, if a process splits a UTF-16
string into many parts, a part might begin with 0xFEFF because there
was a zero-width non-breaking space at the beginning of that substring.

The Unicode standard further suggests that an initial 0xFEFF character
may be stripped before processing the text, the rationale being that
such a character in initial position may be an artifact of the encoding
(an encoding signature), not a genuine intended "ZERO WIDTH
NON-BREAKING SPACE". Note that such stripping might affect an external
process at a different layer (such as a digital signature or a count of
the characters) that is relying on the presence of all characters in
the stream.

In particular, in UTF-16 plain text it is likely, but not certain, that
an initial 0xFEFF is a signature; when concatenating two strings, it is
important to strip out those signatures, for otherwise the resulting
string may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the
connection point. Also, some specifications mandate an initial 0xFEFF
character in objects encoded in UTF-16 and specify that this signature
is not part of the object.
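
As an illustration of the concatenation case, the following C sketch
joins two sequences of 16-bit integers and drops an initial 0xFEFF from
each. It assumes that the caller has already decided that any initial
0xFEFF is a signature rather than text, which, as noted above, is
likely but not certain. The function name concat_without_signatures is
hypothetical, and unsigned short is assumed to be 16 bits.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a, then b, into out, skipping an initial 0xFEFF in each input.
   ASSUMPTION: an initial 0xFEFF is a signature, not a genuine
   zero-width non-breaking space.  out must have room for alen + blen
   values.  Returns the number of 16-bit values written. */
size_t concat_without_signatures(const unsigned short *a, size_t alen,
                                 const unsigned short *b, size_t blen,
                                 unsigned short *out)
{
    size_t n = 0, i;

    assert(out != NULL);
    i = (alen > 0 && a[0] == 0xFEFF) ? 1 : 0;   /* skip signature of a */
    for (; i < alen; i++)
        out[n++] = a[i];
    i = (blen > 0 && b[0] == 0xFEFF) ? 1 : 0;   /* skip signature of b */
    for (; i < blen; i++)
        out[n++] = b[i];
    return n;
}
```

A caller that wants the result to carry a signature can prepend a
single 0xFEFF afterwards.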

3.3 Choosing a label for UTF-16 text

Any labelling application that uses UTF-16 character encoding, and puts
an explicit charset label on the text, and knows the serialization
order of the characters in text, SHOULD label the text as either
"UTF-16BE" or "UTF-16LE", whichever is appropriate based on the
endianness of the text. This allows applications processing the text,
but unable to look inside the text, to know the serialization
definitively.

Text in the "UTF-16BE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in big-endian order. Systems
labelling UTF-16BE text MUST NOT prepend a BOM to the text.

Text in the "UTF-16LE" charset MUST be serialized with the octets which
make up a single 16-bit UTF-16 value in little-endian order. Systems
labelling UTF-16LE text MUST NOT prepend a BOM to the text.

Any labelling application that uses UTF-16 character encoding, and puts
an explicit charset label on the text, and does not know the
serialization order of the characters in text, MUST label the text as
"UTF-16", and SHOULD make sure the text starts with 0xFEFF.

An (unfortunate) exception to the "SHOULD" rule of using "UTF-16BE" or
"UTF-16LE" is that some document formats mandate a BOM in UTF-16 text,
thereby requiring the use of the "UTF-16" tag only.

4. Interpreting text labels

When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
"UTF-16", it can make some assumptions, based on the labelling rules
given in the previous section. These assumptions allow the program to
then process the text.

4.1 Interpreting text labelled as UTF-16BE

Text labelled "UTF-16BE" can always be interpreted as being big-endian.
The detection of an initial BOM does not affect de-serialization of
text labelled as UTF-16BE. Finding 0xFF followed by 0xFE is an error
since there is no Unicode character 0xFFFE.
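
A minimal C sketch of de-serializing text labelled "UTF-16BE" is shown
below. The function name deserialize_utf16be and its return convention
are hypothetical. Treating an odd number of octets as an error is an
assumption of the sketch, not a rule stated in this document, and
unsigned short is assumed to be 16 bits.

```c
#include <assert.h>
#include <stddef.h>

/* Turn len octets of big-endian text into 16-bit integers in out,
   which must have room for len / 2 values.  Returns the number of
   values written, or -1 on error. */
long deserialize_utf16be(const unsigned char *octets, size_t len,
                         unsigned short *out)
{
    size_t i;

    assert(out != NULL);
    if (len % 2 != 0)
        return -1;        /* ASSUMPTION: an odd octet count is an error */
    for (i = 0; i < len; i += 2) {
        /* high-order octet comes first in big-endian order */
        unsigned short w =
            (unsigned short)((octets[i] << 8) | octets[i + 1]);
        if (w == 0xFFFE)
            return -1;    /* there is no Unicode character 0xFFFE */
        out[i / 2] = w;   /* an initial 0xFEFF is kept like any value */
    }
    return (long)(len / 2);
}
```

Note that the octet order never depends on whether the first value is
0xFEFF; the label alone determines it.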

4.2 Interpreting text labelled as UTF-16LE

Text labelled "UTF-16LE" can always be interpreted as being
little-endian. The detection of an initial BOM does not affect
de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by
0xFF is an error since there is no Unicode character 0xFFFE, which
would be the interpretation of those octets under little-endian order.

4.3 Interpreting text labelled as UTF-16

Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the text
are 0xFE followed by 0xFF, then the text can be interpreted as being
big-endian. If the first two octets of the text are 0xFF followed by
0xFE, then the text can be interpreted as being little-endian. If the
first two octets of the text are neither 0xFE followed by 0xFF nor 0xFF
followed by 0xFE, then the text SHOULD be interpreted as being
big-endian.

All applications that process text with the "UTF-16" charset label MUST
be able to read at least the first two octets of the text and be able
to process those octets in order to determine the serialization order
of the text. Applications that process text with the "UTF-16" charset
label MUST NOT assume the serialization without first checking the
first two octets to see if they are a big-endian BOM, a little-endian
BOM, or not a BOM.

5. Examples

For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value
0x00012345 (this character does not exist at present in Unicode).

The examples here all evaluate to the phrase:

   *=Ra

where the "*" represents the Ra hieroglyph (0x00012345).
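
To see how the Ra character value maps to 16-bit integers, the steps of
Sections 2.1 and 2.2 can be sketched in C roughly as follows. The
function names utf16_encode and utf16_decode and their return
conventions are hypothetical, and unsigned short is assumed to be 16
bits. For U = 0x12345, U' is 0x2345, so W1 = 0xD808 and W2 = 0xDF45.

```c
#include <assert.h>

/* Section 2.1: encode character value u (no greater than 0x10FFFF)
   into w.  Returns the number of 16-bit integers written (1 or 2). */
int utf16_encode(unsigned long u, unsigned short w[2])
{
    unsigned long v;

    assert(u <= 0x10FFFFUL);
    if (u < 0x10000UL) {                            /* step 1 */
        w[0] = (unsigned short)u;
        return 1;
    }
    v = u - 0x10000UL;                              /* step 2: U' */
    w[0] = (unsigned short)(0xD800 | (v >> 10));    /* steps 3-4: W1 */
    w[1] = (unsigned short)(0xDC00 | (v & 0x3FF));  /* steps 3-4: W2 */
    return 2;
}

/* Section 2.2: decode one character from the n (>= 1) integers at w.
   Returns the number of integers consumed (1 or 2), or 0 on error.
   On error *u is set to W1, as the note in Section 2.2 suggests. */
int utf16_decode(const unsigned short *w, int n, unsigned long *u)
{
    assert(n >= 1);
    *u = w[0];
    if (w[0] < 0xD800 || w[0] > 0xDFFF)
        return 1;                                   /* step 1 */
    if (w[0] > 0xDBFF)
        return 0;                                   /* step 2: error */
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;                                   /* step 3: error */
    *u = ((((unsigned long)w[0] & 0x3FF) << 10)     /* step 4: U' */
          | ((unsigned long)w[1] & 0x3FF))
         + 0x10000UL;                               /* step 5 */
    return 2;
}
```

The remaining characters of the phrase, "=", "R", and "a", have values
below 0x10000 and are each encoded as a single 16-bit integer.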

Text labelled with UTF-16BE, without a BOM:
   D8 08 DF 45 00 3D 00 52 00 61

Text labelled with UTF-16LE, without a BOM:
   08 D8 45 DF 3D 00 52 00 61 00

Big-endian text labelled with UTF-16, with a BOM:
   FE FF D8 08 DF 45 00 3D 00 52 00 61

Little-endian text labelled with UTF-16, with a BOM:
   FF FE 08 D8 45 DF 3D 00 52 00 61 00

6. Versions of the standards

ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0, 1.1,
2.0, and 2.1 as of this writing. Each new version replaces the previous
one, but implementations, and more significantly data, are not updated
instantly.

In general, the changes amount to adding new characters, which does not
pose particular problems with old data. Amendment 5 to ISO/IEC 10646,
however, has moved and expanded the Korean Hangul block, thereby making
any previous data containing Hangul characters invalid under the new
version. Unicode 2.0 has the same difference from Unicode 1.1. The
official justification for allowing such an incompatible change was
that no significant implementations and data containing Hangul existed,
a statement that is likely to be true but remains unprovable. The
incident has been dubbed the "Korean mess", and the relevant committees
have pledged to never, ever again make such an incompatible change.

New versions, and in particular any incompatible changes, have
consequences regarding MIME character encoding labels, to be discussed
in Appendix A.

7. Security considerations

UTF-16 is based on the ISO 10646 character set, which is frequently
being added to, as described in Section 6 and Appendix A of this
document. Processors must be able to handle characters that are not
defined at the time that the processor was created in such a way as to
not allow an attacker to harm a recipient by including unknown
characters.

Processors that handle any type of text, including text encoded as
UTF-16, must be vigilant in checking for control characters that might
reprogram a display terminal or keyboard. Similarly, processors that
interpret text entities (such as looking for embedded programming code)
must be careful not to execute the code without first alerting the
recipient.

Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a
message to attack the receiving system.

Implementors of UTF-16 need to consider the security aspects of how
they handle illegal UTF-16 sequences (that is, sequences involving
surrogate pairs that have illegal values or unpaired surrogates). It is
conceivable that in some circumstances an attacker would be able to
exploit an incautious UTF-16 parser by sending it an octet sequence
that is not permitted by the UTF-16 syntax, causing it to behave in
some anomalous fashion.

8. References

[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998.

[CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2278, January 1998.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Twelve amendments
and two technical corrigenda have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. Many other amendments
are currently at various stages of standardization.

[MIME] Freed, N., and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
2045, November 1996.

[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version
2.1", Unicode Technical Report #8.

[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998.

[WORKSHOP] Weider, C., et al., "Report of the IAB Character Set
Workshop", RFC 2130, April 1997.

9. Acknowledgments

Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Martin Duerst contributed numerous significant changes.
Other significant contributors include:

   Mati Allouche
   Walt Daniels
   Mark Davis
   Ned Freed
   Asmus Freytag
   Lloyd Honomichl
   Dan Kegel
   Murata Makoto
   Larry Masinter
   Ken Whistler

Some of the text in this specification was copied from [UTF-8], and
that document was worked on by many people. Please see the
acknowledgments section in that document for more people who may have
contributed indirectly to this document.

10. Authors' addresses

Paul Hoffman
Internet Mail Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org

Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2 Canada
fyergeau@alis.com

11. Changes between draft -01 and -02

Fixed some spelling mistakes throughout.

Updated the status boilerplate.

Clarified the parameter values in 1.

Added [WORKSHOP] reference in 1.1 and 8. Also fuzzified the description
of what UTF-16 is (instead of getting into hair-splitting on CESs,
CCSs, and so on).

Corrected 1.2 on the characters for which UTF-8 incurs a space penalty.

Added "from ISO 10646 to UTF-16" to the beginning of 2.1.

Added "from UTF-16 to ISO 10646" to the beginning of 2.2.

Added text to the end of the note at the end of 2.2 about possibly
emitting the ill-formed characters when decoding.

Rearranged much of sections 3 and 4. This makes the following changes
hard to follow; the references refer to the *old* section numbers, not
necessarily the ones as they exist in this draft. Sorry about that...

Changed the end of the first paragraph of 3.1 to get out of the
which-endian-has-most debate.

Clarified the fourth paragraph of 3.1 (the one that begins "This
specification thus...") about the use of "UTF-16" as both a sequencing
mechanism and a charset label.

Added Martin Duerst's C code fragment for big-endian order.

Added the sentence to the end of the sixth paragraph of 3.1 (the one
that begins "It is important...") with the example of substrings and
ZWNBSs.

Added text about SHOULD NOT put an initial BOM in both 3.2 and 3.3.

Clarified the last clause in section 3.3.

Removed the last paragraph of 4 (the paragraph that used to start
"Because creating text labelled...") because it related to
text-creating programs instead of text-labelling programs.

Rearranged and relabelled some of the examples in 5.

Removed "obsoletes" from the first paragraph of 6. Slightly fuzzified
the "no implementations" sentence in the second paragraph.

Alphabetized the references in 8.

Added Larry Masinter to section 9. Gave Martin Duerst more credit.

A. Charset registrations

This memo is meant to serve as the basis for registration of three MIME
charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE",
"UTF-16LE", and "UTF-16". These strings label objects containing text
consisting of characters from the repertoire of ISO/IEC 10646 including
all amendments at least up to amendment 5 (Korean block), encoded to a
sequence of octets using the encoding and serialization schemes
outlined above.

Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use
in media types under the "text" top-level type, because they do not
encode line endings in the way required for MIME "text" media types.

It is noteworthy that the labels described here do not contain a
version identification, referring generically to ISO/IEC 10646. This is
intentional, the rationale being as follows:

A MIME charset is designed to give just the information needed to
interpret a sequence of bytes received on the wire into a sequence of
characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As
long as a character set standard does not change incompatibly, version
numbers serve no purpose, because one gains nothing by learning from
the tag that newly assigned characters may be received that one doesn't
know about. The tag itself doesn't teach anything about the new
characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent
advantage of having labels that identify the versions is only that,
apparent. But there is a disadvantage to such version-dependent labels:
when an older application receives data accompanied by a newer, unknown
label, it may fail to recognize the label and be completely unable to
deal with the data, whereas a generic, known label would have triggered
mostly correct processing of the data, which may well not contain any
new characters.

The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
change, in principle contradicting the appropriateness of a version
independent MIME charset as described above.
But the compatibility problem can only appear with data containing
Korean Hangul characters encoded according to Unicode 1.1 (or
equivalently ISO/IEC 10646 before amendment 5), and there is arguably
no such data to worry about, this being the very reason the
incompatible change was deemed acceptable.

In practice, then, a version-independent label is warranted, provided
the label is understood to refer to all versions after Amendment 5, and
provided no incompatible change actually occurs. Should incompatible
changes occur in a later version of ISO/IEC 10646, the MIME charsets
defined here will stay aligned with the previous version until and
unless the IETF specifically decides otherwise.

A.1 Registration for UTF-16BE

To: ietf-charsets@iana.org
Subject: Registration of new charset

Charset name(s): UTF-16BE

Published specification(s): This specification

Suitable for use in MIME content types under the
"text" top-level type: No

Person & email address to contact for further information:
Paul Hoffman (phoffman@imc.org)
Francois Yergeau (fyergeau@alis.com)

A.2 Registration for UTF-16LE

To: ietf-charsets@iana.org
Subject: Registration of new charset

Charset name(s): UTF-16LE

Published specification(s): This specification

Suitable for use in MIME content types under the
"text" top-level type: No

Person & email address to contact for further information:
Paul Hoffman (phoffman@imc.org)
Francois Yergeau (fyergeau@alis.com)

A.3 Registration for UTF-16

To: ietf-charsets@iana.org
Subject: Registration of new charset

Charset name(s): UTF-16

Published specification(s): This specification

Suitable for use in MIME content types under the
"text" top-level type: No

Person & email address to contact for further information:
Paul Hoffman (phoffman@imc.org)
Francois Yergeau (fyergeau@alis.com)