idnits 2.17.00 (12 Aug 2021)

/tmp/idnits26185/draft-yergeau-rfc2279bis-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == There are 4 instances of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The abstract seems to indicate that this document obsoletes RFC2279, but
     the header doesn't have an 'Obsoletes:' line to match this.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment. If not, you may need to add the pre-RFC5378
     disclaimer. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 6, 2003) is 7043 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

     Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                         F. Yergeau
Internet-Draft                                         Alis Technologies
Expires: August 7, 2003                                 February 6, 2003

              UTF-8, a transformation format of ISO 10646
                      draft-yergeau-rfc2279bis-03

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 7, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   ISO/IEC 10646-1 defines a large character set called the Universal
   Character Set (UCS) which encompasses most of the world's writing
   systems.
   The originally proposed encodings of the UCS, however, were
   not compatible with many current applications and protocols, and this
   has led to the development of UTF-8, the object of this memo. UTF-8
   has the characteristic of preserving the full US-ASCII range,
   providing compatibility with file systems, parsers and other software
   that rely on US-ASCII values but are transparent to other values.
   This memo obsoletes and replaces RFC 2279.

   Discussion of this draft should take place on the
   ietf-charsets@iana.org mailing list.

Table of Contents

   1.  Introduction
   2.  Notational conventions
   3.  UTF-8 definition
   4.  Syntax of UTF-8 Byte Sequences
   5.  Versions of the standards
   6.  Byte order mark (BOM)
   7.  Examples
   8.  MIME registration
   9.  IANA Considerations
   10. Security Considerations
   11. Acknowledgements
   12. Changes from RFC 2279
       Normative references
       Informative references
       Author's Address
       Full Copyright Statement

1. Introduction

   ISO/IEC 10646 [ISO.10646] defines a large character set called the
   Universal Character Set (UCS), which encompasses most of the world's
   writing systems.
   The same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementers. Up to the present time, changes in Unicode and
   amendments and additions to ISO/IEC 10646 have tracked each other, so
   that the character repertoires and code point assignments have
   remained in sync. The relevant standardization committees have
   committed to maintain this very useful synchronism.

   ISO/IEC 10646 and Unicode define several encoding forms of their
   common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
   encoding form, each character is represented as one or more encoding
   units. All standard UCS encoding forms except UTF-8 have an encoding
   unit larger than one octet, making them hard to use in many current
   applications and protocols that assume 8 or even 7 bit characters.

   UTF-8, the object of this memo, has a one-octet encoding unit. It
   uses all bits of an octet, but has the quality of preserving the full
   US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
   octet having the normal US-ASCII value, and any octet with such a
   value can only stand for a US-ASCII character, and nothing else.

   UTF-8 encodes UCS characters as a varying number of octets, where the
   number of octets, and the value of each, depend on the integer value
   assigned to the character in ISO/IEC 10646 (the character number,
   a.k.a. code point or Unicode scalar value). This encoding form has
   the following characteristics (all values are in hexadecimal):

   o  Character numbers from U+0000 to U+007F (US-ASCII repertoire)
      correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
      consequence is that a plain ASCII string is also a valid UTF-8
      string.

   o  US-ASCII octet values do not appear otherwise in a UTF-8 encoded
      character stream.
      This provides compatibility with file systems
      or other software (e.g. the printf() function in C libraries)
      that parse based on US-ASCII values but are transparent to other
      values.

   o  Round-trip conversion is easy between UTF-8 and other encoding
      forms.

   o  The first octet of a multi-octet sequence indicates the number of
      octets in the sequence.

   o  The octet values C0, C1, FE and FF never appear. If the range of
      character numbers is restricted to U+0000..U+10FFFF (the UTF-16
      accessible range), then the octet values F5..FD also never appear.

   o  Character boundaries are easily found from anywhere in an octet
      stream.

   o  The lexicographic sorting order of UTF-8 strings is the same as if
      ordered by character numbers. Of course this is of limited
      interest since a sort order based on character numbers is not
      culturally valid.

   o  The Boyer-Moore fast search algorithm can be used with UTF-8 data.

   o  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e. the probability that a string of
      characters in any other encoding appears as valid UTF-8 is low,
      diminishing with increasing string length.

   UTF-8 was originally a project of the X/Open Joint
   Internationalization Group XOJIG with the objective to specify a File
   System Safe UCS Transformation Format [FSS_UTF] that is compatible
   with UNIX systems, supporting multilingual text in a single encoding.
   The original authors were Gary Miller, Greger Leijonhufvud and John
   Entenmann. Later, Ken Thompson and Rob Pike did significant work for
   the formal definition of UTF-8.

2. Notational conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   UCS characters are designated by the U+HHHH notation, where HHHH is a
   string of from 4 to 6 hexadecimal digits representing the character
   number in ISO/IEC 10646.

3. UTF-8 definition

   UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and
   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets. The
   only octet of a "sequence" of one has the higher-order bit set to 0,
   the remaining 7 bits being used to encode the character number. In a
   sequence of n octets, n>1, the initial octet has the n higher-order
   bits set to 1, followed by a bit set to 0. The remaining bit(s) of
   that octet contain bits from the number of the character to be
   encoded. The following octet(s) all have the higher-order bit set to
   1 and the following bit set to 0, leaving 6 bits in each to contain
   bits from the character to be encoded.

   The table below summarizes the format of these different octet types.
   The letter x indicates bits available for encoding bits of the
   character number.

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

   Encoding a character to UTF-8 proceeds as follows:

   1.  Determine the number of octets required from the character number
       and the first column of the table above. It is important to note
       that the rows of the table are mutually exclusive, i.e. there is
       only one valid way to encode a given character.

   2.  Prepare the high-order bits of the octets as per the second
       column of the table.

   3.  Fill in the bits marked x from the bits of the character number,
       expressed in binary. Start by putting the lowest-order bit of
       the character number in the lowest-order position of the last
       octet of the sequence, then put the next higher-order bit of the
       character number in the next higher-order position of that octet,
       etc. When the x bits of the last octet are filled in, move on to
       the next to last octet, then to the preceding one, etc. until
       all x bits are filled in.

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters. When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above. This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet. CESU-8 operates similarly to UTF-8 but encodes
   the UTF-16 code values (16-bit quantities) instead of the character
   number (code point). This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8.

   Decoding a UTF-8 character proceeds as follows:

   1.  Initialize a binary number with all bits set to 0. Up to 21 bits
       may be needed.

   2.  Determine which bits encode the character number from the number
       of octets in the sequence and the second column of the table
       above (the bits marked x).

   3.  Distribute the bits from the sequence to the binary number, first
       the lower-order bits from the last octet of the sequence and
       proceeding to the left until no x bits are left. The binary
       number is now equal to the character number.

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.
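
   As a non-normative illustration, the encoding and decoding steps
   above, together with the kind of checks this requirement calls for,
   might be sketched in Python as follows. The names encode_utf8,
   decode_utf8_char and _MIN_FOR_LENGTH are hypothetical and chosen for
   this sketch only; it handles one character at a time and assumes the
   U+0000..U+10FFFF range used in this section.

```python
from typing import Tuple

# Smallest character number that legitimately needs n octets
# (first column of the table above); used to detect overlong forms.
_MIN_FOR_LENGTH = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}


def encode_utf8(cp: int) -> bytes:
    """Encode one character number as 1 to 4 octets."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800..U+DFFF must not be encoded")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_utf8_char(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the character starting at data[pos].

    Returns (character number, number of octets consumed) and raises
    ValueError for any sequence that fails the checks below.
    """
    if pos >= len(data):
        raise ValueError("no octets to decode")
    first = data[pos]
    # Step 2: the initial octet gives the sequence length and its x bits.
    if first < 0x80:
        return first, 1
    if first & 0xE0 == 0xC0:
        n, cp = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        n, cp = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:
        n, cp = 4, first & 0x07
    else:
        raise ValueError("invalid initial octet")
    if pos + n > len(data):
        raise ValueError("truncated sequence")
    # Step 3: gather 6 bits from each following octet (10xxxxxx).
    for octet in data[pos + 1:pos + n]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        cp = (cp << 6) | (octet & 0x3F)
    # Validity checks: overlong forms, surrogates, out-of-range values.
    if cp < _MIN_FOR_LENGTH[n]:
        raise ValueError("overlong sequence")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("encoded surrogate")
    if cp > 0x10FFFF:
        raise ValueError("character number above U+10FFFF")
    return cp, n
```

   With this sketch, encode_utf8(0x2262) yields the octets E2 89 A2, and
   decode_utf8_char raises an error for input that fails any of its
   checks rather than returning a character number.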
   For instance, a naive implementation may
   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
   or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
   invalid sequences may have security consequences or cause other
   problems. See Security Considerations (Section 10) below.

4. Syntax of UTF-8 Byte Sequences

   A UTF-8 string is a sequence of octets representing a sequence of UCS
   characters. An octet sequence is valid UTF-8 only if it matches the
   following syntax, which is derived from the rules for encoding UTF-8
   and is expressed in the ABNF of [RFC2234].

   UTF8-octets = *( UTF8-char )
   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1      = %x00-7F
   UTF8-2      = %xC2-DF UTF8-tail
   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
   UTF8-tail   = %x80-BF

5. Versions of the standards

   ISO/IEC 10646 is updated from time to time by publication of
   amendments and additional parts; similarly, new versions of the
   Unicode standard are published over time. Each new version obsoletes
   and replaces the previous one, but implementations, and more
   significantly data, are not updated instantly.

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data. In 1996, Amendment 5 to
   the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
   the Korean Hangul block, thereby making any previous data containing
   Hangul characters invalid under the new version. Unicode 2.0 has the
   same difference from Unicode 1.1. The justification for allowing
   such an incompatible change was that there were no major
   implementations and no significant amounts of data containing Hangul.
   The incident has been dubbed the "Korean mess", and the relevant
   committees have pledged to never, ever again make such an
   incompatible change (see Unicode Consortium Policies [1]).

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME charset labels, to be discussed in MIME
   registration (Section 8).

6. Byte order mark (BOM)

   The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
   informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character
   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
   the BOM name hints at a second possible usage of the character: to
   prepend a U+FEFF character to a stream of UCS characters as a
   "signature". A receiver of such a serialized stream may then use the
   initial character as a hint that the stream consists of UCS
   characters and also to recognize which UCS encoding is involved and,
   with encodings having a multi-octet encoding unit, as a way to
   recognize the serialization order of the octets. UTF-8 having a
   single-octet encoding unit, this last function is useless and the BOM
   will always appear as the octet sequence EF BB BF.

   It is important to understand that the character U+FEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a signature. When interpreted as a signature,
   the Unicode standard suggests that an initial U+FEFF character may be
   stripped before processing the text. Such stripping is necessary in
   some cases (e.g. when concatenating two strings, because otherwise
   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
   SPACE" at the connection point), but might affect an external process
   at a different layer (such as a digital signature or a count of the
   characters) that is relying on the presence of all characters in the
   stream. It is therefore RECOMMENDED to avoid stripping an initial
   U+FEFF interpreted as a signature without a good reason, to ignore it
   instead of stripping it when appropriate (such as for display) and to
   strip it only when really necessary.

   U+FEFF in the first position of a stream MAY be interpreted as a
   zero-width non-breaking space, and is not always a signature. In an
   attempt at diminishing this uncertainty, Unicode 3.2 adds a new
   character, U+2060 "WORD JOINER", with exactly the same semantics and
   usage as U+FEFF except for the signature function, and strongly
   recommends its exclusive use for expressing word-joining semantics.
   Eventually, following this recommendation will make it all but
   certain that any initial U+FEFF is a signature, not an intended "ZERO
   WIDTH NO-BREAK SPACE".

   In the meantime, the uncertainty unfortunately remains and may affect
   Internet protocols. Protocol specifications MAY restrict usage of
   U+FEFF as a signature in order to reduce or eliminate the potential
   ill effects of this uncertainty. In the interest of striking a
   balance between the advantages (reduction of uncertainty) and
   drawbacks (loss of the signature function) of such restrictions, it
   is useful to distinguish a few cases:

   o  A protocol SHOULD forbid use of U+FEFF as a signature for those
      textual protocol elements that the protocol mandates to be always
      UTF-8, the signature function being totally useless in those
      cases.

   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly. This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.

   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol does not
      provide character encoding identification mechanisms, when a ban
      would be unenforceable, or when it is expected that
      implementations of the protocol will not be in a position to
      always use the mechanisms properly. The latter two cases are
      likely to occur with larger protocol elements such as MIME
      entities, especially when implementations of the protocol will
      obtain such entities from file systems, from protocols that do not
      have encoding identification mechanisms for payloads (such as FTP)
      or from other protocols that do not guarantee proper
      identification of character encoding (such as HTTP).

   When a protocol forbids use of U+FEFF as a signature for a certain
   protocol element, then any initial U+FEFF in that protocol element
   MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE". When a
   protocol does NOT forbid use of U+FEFF as a signature for a certain
   protocol element, then implementations SHOULD be prepared to handle a
   signature in that element and react appropriately: using the
   signature to identify the character encoding as necessary and
   stripping or ignoring the signature as appropriate.

7. Examples

   The character sequence U+0041 U+2262 U+0391 U+002E
   "A<NOT IDENTICAL TO><ALPHA>." is encoded in UTF-8 as follows:

       --+--------+-----+--
       41 E2 89 A2 CE 91 2E
       --+--------+-----+--

   The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
   meaning "the Korean language") is encoded in UTF-8 as follows:

       --------+--------+--------
       ED 95 9C EA B5 AD EC 96 B4
       --------+--------+--------

   The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo",
   meaning "the Japanese language") is encoded in UTF-8 as follows:

       --------+--------+--------
       E6 97 A5 E6 9C AC E8 AA 9E
       --------+--------+--------

   The character U+233B4 (a Chinese character meaning 'stump of tree'),
   prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

       --------+-----------
       EF BB BF F0 A3 8E B4
       --------+-----------

8. MIME registration

   This memo serves as the basis for registration of the MIME charset
   parameter for UTF-8, according to [RFC2978]. The charset parameter
   value is "UTF-8". This string labels media types containing text
   consisting of characters from the repertoire of ISO/IEC 10646
   including all amendments at least up to amendment 5 of the 1993
   edition (Korean block), encoded to a sequence of octets using the
   encoding scheme outlined above. UTF-8 is suitable for use in MIME
   content types under the "text" top-level type.

   It is noteworthy that the label "UTF-8" does not contain a version
   identification, referring generically to ISO/IEC 10646. This is
   intentional, the rationale being as follows:

   A MIME charset label is designed to give just the information needed
   to interpret a sequence of bytes received on the wire into a sequence
   of characters, nothing more (see [RFC2045], section 2.2). As long as
   a character set standard does not change incompatibly, version
   numbers serve no purpose, because one gains nothing by learning from
   the tag that newly assigned characters may be received that one
   doesn't know about.
   The tag itself doesn't teach anything about the
   new characters, which are going to be received anyway.

   Hence, as long as the standards evolve compatibly, the apparent
   advantage of having labels that identify the versions is only that,
   apparent. But there is a disadvantage to such version-dependent
   labels: when an older application receives data accompanied by a
   newer, unknown label, it may fail to recognize the label and be
   completely unable to deal with the data, whereas a generic, known
   label would have triggered mostly correct processing of the data,
   which may well not contain any new characters.

   Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
   change, in principle contradicting the appropriateness of a version
   independent MIME charset label as described above. But the
   compatibility problem can only appear with data containing Korean
   Hangul characters encoded according to Unicode 1.1 (or equivalently
   ISO/IEC 10646 before amendment 5), and there is arguably no such data
   to worry about, this being the very reason the incompatible change
   was deemed acceptable.

   In practice, then, a version-independent label is warranted, provided
   the label is understood to refer to all versions after Amendment 5,
   and provided no incompatible change actually occurs. Should
   incompatible changes occur in a later version of ISO/IEC 10646, the
   MIME charset label defined here will stay aligned with the previous
   version until and unless the IETF specifically decides otherwise.

9. IANA Considerations

   The entry for UTF-8 in the IANA charset registry should be updated to
   point to this memo.

10. Security Considerations

   Implementers of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences.
   It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   A particularly subtle form of this attack can be carried out against
   a parser which performs security-critical validity checks against the
   UTF-8 encoded form of its input, but interprets certain illegal octet
   sequences as characters. For example, a parser might prohibit the
   NUL character when encoded as the single-octet sequence 00, but
   erroneously allow the illegal two-octet sequence C0 80 and interpret
   it as a NUL character. Another example might be a parser which
   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
   illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
   actually been used in a widespread virus attacking Web servers in
   2001; the security threat is thus very real.

   Another security issue occurs when encoding to UTF-8: the ISO/IEC
   10646 description of UTF-8 allows encoding character numbers up to
   U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
   a risk of buffer overflow if the range of character numbers is not
   explicitly limited to U+10FFFF or if buffer sizing doesn't take into
   account the possibility of 5- and 6-byte sequences.

11. Acknowledgements

   The following have participated in the drafting and discussion of
   this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
   Mark Davis, Martin J. Dürst, Patrick Fältström, Ned Freed, David
   Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
   Simon Josefsson, Kent Karlsson, Markus Kuhn, Michael Kung, Alain
   LaBonté, Ira McDonald, Alexey Melnikov, John Gardiner Myers, Dan
   Oscarsson, Murray Sargent, Markus Scherer, Keld Simonsen, Arnold
   Winkler, Kenneth Whistler and Misha Wolf.

12. Changes from RFC 2279

   o  Restricted the range of characters to 0000-10FFFF (the UTF-16
      accessible range).

   o  Made Unicode the source of the normative definition of UTF-8,
      keeping ISO/IEC 10646 as the reference for characters.

   o  Significantly shortened Introduction. No more mention of UTF-1 or
      UTF-7, of Transformation Formats.

   o  Straightened out terminology. UTF-8 now described in terms of an
      encoding form of the character number. UCS-2 and UCS-4 almost
      disappeared.

   o  Turned the note warning against decoding of invalid sequences into
      a normative MUST NOT.

   o  Added a new section about the UTF-8 BOM, with advice for
      protocols.

   o  Updated a couple of references (10646-1:2000, Unicode 3.2, RFC
      2978).

   o  Added TOC.

   o  Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.

   o  Added new "Notational conventions" section about RFC 2119 and
      U+HHHH notation.

   o  Added pointer to Unicode Consortium Policies in "Versions of the
      standards" section.

   o  Added a fourth example with a non-BMP character and a BOM.

   o  Added a paragraph about U+2060 WORD JOINER.

   o  Enumerated more byte values impossible in UTF-8, either as a
      result of forbidding overlong sequences or of restricting to the
      UTF-16 accessible range.

   o  Added "IANA Considerations" section to ask that the UTF-8 entry in
      the charset registry point to this memo.

   o  Added an ABNF syntax for valid UTF-8 octet sequences.

   o  Added some warning language about CESU-8.

   o  Split References into Normative and Informative.

Normative references

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
                Specifications: ABNF", RFC 2234, November 1997.

   [ISO.10646]  International Organization for Standardization,
                "Information Technology - Universal Multiple-octet coded
                Character Set (UCS)", ISO/IEC Standard 10646, comprised
                of ISO/IEC 10646-1:2000, "Information technology --
                Universal Multiple-Octet Coded Character Set (UCS) --
                Part 1: Architecture and Basic Multilingual Plane",
                ISO/IEC 10646-2:2001, "Information technology --
                Universal Multiple-Octet Coded Character Set (UCS) --
                Part 2: Supplementary Planes" and ISO/IEC
                10646-1:2000/Amd 1:2002, "Mathematical symbols and other
                characters".

   [UNICODE]    The Unicode Consortium, "The Unicode Standard -- Version
                3.2", defined by The Unicode Standard, Version 3.0
                (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
                as amended by the Unicode Standard Annex #27: Unicode
                3.1 (see http://www.unicode.org/reports/tr27) and by the
                Unicode Standard Annex #28: Unicode 3.2 (see
                http://www.unicode.org/reports/tr28), March 2002.

Informative references

   [CESU-8]     Phipps, T., "Compatibility Encoding Scheme for UTF-16:
                8-Bit (CESU-8)", UTR 26, April 2002.

   [FSS_UTF]    X/Open Company Ltd., "X/Open CAE Specification C501 --
                File System Safe UCS Transformation Format (FSS_UTF)",
                ISBN 1-85912-082-2, April 1995.

   [RFC2045]    Freed, N. and N. Borenstein, "Multipurpose Internet Mail
                Extensions (MIME) Part One: Format of Internet Message
                Bodies", RFC 2045, November 1996.

   [RFC2978]    Freed, N. and J. Postel, "IANA Charset Registration
                Procedures", BCP 19, RFC 2978, October 2000.

   [US-ASCII]   American National Standards Institute, "Coded Character
                Set - 7-bit American Standard Code for Information
                Interchange", ANSI X3.4, 1986.

URIs

   [1]

Author's Address

   François Yergeau
   Alis Technologies
   100, boul. Alexis-Nihon, bureau 600
   Montréal, QC H4M 2P2
   Canada

   Phone: +1 514 747 2547
   Fax:   +1 514 747 2561
   EMail: fyergeau@alis.com

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.