Network Working Group                                         P. Hoffman
Internet-Draft                                            VPN Consortium
Updates: 2223 (if approved)                                      T. Bray
Intended status: Informational                          Sun Microsystems
Expires: September 30, 2010                               March 29, 2010


                   Using non-ASCII Characters in RFCs
                       draft-hoffman-utf8-rfcs-06

Abstract

   This document specifies a change to the IETF process under which
   Internet Drafts and RFCs are allowed to contain non-ASCII characters.
   The proposed change is to encode Internet Drafts and RFCs in UTF-8
   when non-ASCII characters are needed.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 30, 2010.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

1.  Introduction

   The purpose of this document is to specify a way for the IETF to use
   non-ASCII characters in Internet Drafts and RFCs.

   Various guideline documents in the IETF, notably [RFC2223], specify
   that RFCs must use only the US-ASCII character set.  This restriction
   has historically caused problems, notably:

   o  Names and addresses of authors of IETF documents are misspelled

   o  Names and document titles in references are misspelled

   o  Protocol examples that include non-ASCII characters cannot be
      included straightforwardly

   The first two issues cause real problems for people searching for
   RFCs by particular authors, or for references, whose names contain
   non-ASCII characters.  For many languages that use Latin characters
   outside the ASCII range, there is no single agreed mapping between
   those non-ASCII characters and ASCII equivalents.  A common example
   is that "u-with-umlaut" (U+00FC) may be mapped to "u" or to "ue";
   many other mapping difficulties exist.

   The third issue reduces the effectiveness of IETF specifications;
   implementors of protocols that carry textual payloads often
   experience difficulty in achieving interoperability related to the
   use of character sets from around the world.  Specifications that
   can provide concrete examples of such protocol scenarios will be of
   significant benefit to these implementors.

   Now that UTF-8 [RFC3629] is nearly universally available in
   text-editing and display systems, the IETF can eliminate these
   problems by allowing RFCs to use UTF-8.  As a reminder, UTF-8 is
   fully upward compatible with US-ASCII.

   This document uses example characters as specified in [RFC5137].  Had
   the recommendations from this document already been implemented, this
   alternate representation would, of course, not be necessary.

   It is important to note that this document does not use RFC 2119
   language (MUST, SHOULD, and so on).  Instead, it lists practices that
   the IETF should consider.
   If the ideas in this document are adopted,
   the final list of rules for using UTF-8 in Internet Drafts and RFCs
   would be published by the IETF Secretariat and the RFC Editor.  The
   authors are open to changing this and using 2119-style language if
   the community prefers it.

2.  Use of UTF-8 in Internet Drafts and RFCs

   Upon publication of this document as an RFC, new RFCs and Internet
   Drafts will be considered to be encoded in UTF-8 if they contain any
   non-ASCII characters; otherwise, they will continue to be considered
   encoded in US-ASCII.  The IETF Secretariat and RFC Editor need to
   change their processes to publish documents that are valid UTF-8.

2.1.  Limits On the Locations In Which Non-ASCII Text May Be Used

   It is suggested that the IETF Secretariat and RFC Editor limit
   non-ASCII characters to the following:

   o  Names and addresses of authors, used at the top of RFCs and in
      Author Contact sections

   o  Names and document titles used in References sections

   o  Quotations where the original contains non-ASCII characters

   o  Protocol examples that include non-ASCII characters, for example
      in Internationalized Domain Names (IDNs), Internationalized
      Resource Identifiers (IRIs), and Internationalized Email Addresses
      (IEA).

   Using non-ASCII characters in areas other than those listed above is
   prohibited.  Specifically, using "curly quotes", em dashes, and other
   punctuation that appears in normal publishing is not allowed under
   these guidelines.  This limitation is to help those people who are
   reading Internet Drafts and RFCs on systems that do not render UTF-8
   legibly.

2.2.  Allowable Character Repertoire

   UTF-8 is an encoding of the Unicode Character Set and can be used to
   encode any of its numeric codepoints, from U+0000 to U+10FFFF
   inclusive.
   Specifications using UTF-8 must not use the following
   codepoint ranges:

   o  The "ASCII control characters" in the ranges U+0000 to U+0008,
      U+000B, and U+000D to U+001F.  Also, the "C1 control characters"
      in the range U+0080 to U+009F.  These lack either visual
      representations, interoperable semantics, or both.

   o  The Surrogate-block range U+D800 to U+DFFF.  These codepoints do
      not identify characters, but exist to support the UTF-16 encoding.

   o  The ZERO WIDTH NO-BREAK SPACE U+FEFF and its mirror image U+FFFE.

   o  The Private-Use-Area ranges, U+E000 to U+F8FF, U+F0000 to U+FFFFD,
      and U+100000 to U+10FFFD.

   Internet Drafts and RFCs should not contain Unicode codepoints that
   are "Compatibility Characters", that is, those whose properties
   include a compatibility decomposition.  Note that such characters
   occur rarely, and detecting them requires run-time access to the
   Unicode character database, which may not be practical in some
   situations.

   [[ Need to add additional types of characters that should not be
   allowed: unassigned characters, other control characters, ones that
   are really formatting characters, and maybe others.  This needs some
   wording, given that the lists of these change over time. ]]

2.3.  Normalization

   Due to the way that Unicode uses combining characters, there are
   sometimes multiple codepoint sequences that denote what, to a human,
   is the same character.  For example, the character
   "lowercase-a-with-acute" can be spelled in two ways: as a single
   character (U+00E1) or as two characters (U+0061 followed by U+0301).
   This can present problems in searching and rendering.

   The process of standardizing on one of these possibilities is
   referred to as "normalization", and several "normalization forms" are
   defined by the Unicode Consortium.
   All UTF-8 text appearing in RFCs
   (but not necessarily Internet Drafts) ought to be normalized using
   Normalization Form C [[ reference needed, should be the version of
   Unicode when this is finalized ]].

2.4.  Author and Employer Names

   Authors can choose how to spell their names and the names of their
   employers in the various parts of Internet Drafts they are writing.
   The spelling at the top of the first page of the document needs to
   match the spelling in the "Authors' Addresses" section near the end
   of the document, but the latter can have alternate spellings to help
   those searching documents by name.  Postal information listed in the
   "Authors' Addresses" section can also use non-ASCII.

   For example, assume that an author whose name is
    Fltstrm has a preferred all-ASCII spelling of
   Xiaodong Faltstrom.  One method expected to be allowed for spelling
   his name would be:

   Network Working Group                                    X. Faltstrom
   Internet-Draft                                              ExampleCo
   . . .
   Author's Address

      Xiaodong Faltstrom ( Fltstrm)
      ExampleCo

      Email: xiaodong.faltstrom@example.com

   Another method expected to be allowed for spelling his name would be:

   Network Working Group                                      X. Fltstrm
   Internet-Draft                                              ExampleCo
   . . .
   Author's Address

       Fltstrm (Xiaodong Faltstrom)
      ExampleCo

      Email: xiaodong.faltstrom@example.com

3.  Document Content

   In order to assist text display software, any Internet Draft or RFC
   that contains non-ASCII characters should start with the byte order
   mark (BOM) U+FEFF.  The UTF-8 byte order mark should not be included
   in any Internet Draft or RFC that does not contain non-ASCII
   characters.  Detecting whether an Internet Draft or RFC contains
   non-ASCII characters, and ensuring that such a document has a byte
   order mark, can be done by the IETF's Internet Draft submission tool
   and the RFC Editor's publishing process.
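
   The checks described in Sections 2.2 and 2.3, and the byte order
   mark rule above, could be automated along the following lines.  This
   is a minimal, non-normative sketch in Python and is not part of the
   proposal.  All function names are hypothetical.  The sketch assumes
   that a single leading U+FEFF is the byte order mark described in this
   section and is therefore exempt from the Section 2.2 prohibition on
   U+FEFF elsewhere in the text.  It relies on Python's standard
   unicodedata module, so its results depend on the Unicode version
   bundled with the interpreter.

```python
import unicodedata

BOM = "\ufeff"

# Inclusive codepoint ranges that Section 2.2 says must not be used.
FORBIDDEN_RANGES = [
    (0x0000, 0x0008), (0x000B, 0x000B), (0x000D, 0x001F),  # ASCII controls
    (0x0080, 0x009F),                                      # C1 controls
    (0xD800, 0xDFFF),                                      # surrogate block
    (0xFEFF, 0xFEFF), (0xFFFE, 0xFFFE),                    # U+FEFF, U+FFFE
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),  # private use
]


def has_non_ascii(text):
    """Return True if any codepoint lies outside US-ASCII."""
    return any(ord(ch) > 0x7F for ch in text)


def strip_bom(text):
    """Drop one leading U+FEFF (assumed here to be a byte order mark)."""
    return text[1:] if text.startswith(BOM) else text


def forbidden_codepoints(text):
    """List the codepoints in text that fall in a Section 2.2 range."""
    return [
        "U+%04X" % ord(ch)
        for ch in strip_bom(text)
        if any(lo <= ord(ch) <= hi for lo, hi in FORBIDDEN_RANGES)
    ]


def compatibility_characters(text):
    """List characters that have a compatibility decomposition.

    unicodedata.decomposition() prefixes these with an angle-bracket tag.
    """
    return [ch for ch in text if unicodedata.decomposition(ch).startswith("<")]


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def apply_bom_rule(text):
    """Encode text, adding a BOM only if non-ASCII characters are present."""
    body = strip_bom(text)
    if has_non_ascii(body):
        return (BOM + body).encode("utf-8")
    return body.encode("ascii")
```

   For example, apply_bom_rule("caf\u00e9") returns the octets EF BB BF
   followed by the UTF-8 encoding of the text, while
   apply_bom_rule("cafe") returns plain US-ASCII with no byte order
   mark.  is_nfc("a\u0301") is False, because Normalization Form C
   composes that pair to U+00E1.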

   RFCs are currently published with form-feed characters between pages.
   These marks work on some printers but not others.  This proposed
   change does not affect any policy on whether to use form-feed
   characters.

4.  Security Considerations

   A display program that expects only US-ASCII input may fail when it
   encounters octets outside the US-ASCII range of values.  Such a
   failure may become a security issue.  For example, the program may
   display incorrect results for the input.  More seriously, the program
   may have an internal error that causes it to fail in a security-
   compromising fashion.  Note that such a program is vulnerable to many
   attacks other than just showing IETF documents.

   Someone could insert a UTF-8 host name in an RFC that has visually
   confusing characters.  Another person could copy that host name out
   of the RFC and have it resolve to an unintended DNS name.  This
   scenario seems quite far-fetched, given that tracking the RFC back to
   the author is trivial.

5.  IAOC and IAB Considerations

   If this document is adopted by the IETF, it will be up to the IAOC
   and IAB to have the IETF Secretariat and the RFC Editor,
   respectively, implement it.  The two bodies need to consider all of
   the suggested rules in this document, both the positive ones (such as
   allowing additional characters in some parts of Internet Drafts and
   RFCs) and the negative ones (such as disallowing particular
   characters from being used).  The IAOC and IAB might want to publish
   proposed instructions to the IETF Secretariat and the RFC Editor and
   ask for community input on the specific instructions.

6.  Informative References

   [RFC2223]  Postel, J. and J. Reynolds, "Instructions to RFC Authors",
              RFC 2223, October 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
              BCP 137, RFC 5137, February 2008.

Appendix A.  Arguments Against Changing to UTF-8

   Over more than a decade, the question of changing the encoding of
   RFCs to UTF-8 has come up repeatedly.  Although many people wanted
   the change, various people had different reasons why they felt it was
   a bad idea.  This appendix is a summary of those arguments and an
   explanation of why they are no longer as critical as they were long
   ago.

A.1.  Difficulty in Displaying

   Some text display systems only know how to display US-ASCII.
   Displaying an RFC that uses non-ASCII characters encoded in UTF-8
   will cause those characters to be unreadable.

   There are, of course, still such display systems, and there always
   will be.  However, the number is dwindling as more software is
   improved to display non-ASCII characters and, in particular, to read
   UTF-8 as an encoding.  Of the systems that can only render US-ASCII,
   only a small subset drop non-ASCII characters; the others show an
   incorrect character in their place.  Thus, the person using such a
   system can often see that there is a problem, and can possibly choose
   to get better display software.

A.2.  Difficulty in Printing

   Some printers can only print a limited set of characters because
   they are character-oriented, not graphical.  Such printers
   inherently cannot print characters they do not understand.  Almost
   all such printers print the visible ASCII characters just fine, but
   many cannot correctly print the form feeds currently used.

   There are, of course, still such printers, and there always will be.
   However, the number is dwindling as older printers are replaced with
   ones that can print graphics so that now-common text features like
   boldface and italics can be printed.

A.3.  Insufficient Fonts

   Almost no display system that can display text that is encoded with
   UTF-8 can display every character in the Unicode repertoire.  Thus,
   some non-ASCII characters that are included in RFCs will not display
   properly.

   Virtually every system that can display Unicode knows how to
   substitute a replacement character for ones that cannot be displayed.
   In fact, many such systems have glyphs for rendering unknown
   characters and different glyphs for rendering known characters for
   which the system has no font.

A.4.  Inability to Search for Non-ASCII Characters

   If authors start using non-ASCII characters in their names and/or
   addresses, people who know the characters but are unfamiliar with the
   user interface on their computers may not be able to enter those
   characters in the search criteria.  For example, some people do not
   know how to enter "u-with-umlaut" in their operating system, even
   though the operating system allows such input.

   This is a valid concern, but one that is orthogonal to whether or not
   RFCs should use these characters.  The alternative (never go to
   UTF-8) simply shifts the problem to forcing the user to guess which
   ASCII-only spelling to use when searching.

Appendix B.  Changes from -05 to -06

   None significant.

Authors' Addresses

   Paul Hoffman
   VPN Consortium

   Email: paul.hoffman@vpnc.org


   Tim Bray
   Sun Microsystems

   Email: tbray@textuality.com