idnits 2.17.00 (12 Aug 2021) /tmp/idnits9501/draft-ietf-idnabis-rationale-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 2214. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2225. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2232. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2238. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 10, 2008) is 5123 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'Unicode-PropertyValueAliases' is defined on line 2109, but no explicit reference was found in the text == Unused Reference: 'Unicode-RegEx' is defined on line 2114, but no explicit reference was found in the text == Unused Reference: 'Unicode-Scripts' is defined on line 2119, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' == Outdated reference: draft-ietf-idnabis-protocol has been published as RFC 5891 == Outdated reference: draft-ietf-idnabis-tables has been published as RFC 5892 ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564) ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) == Outdated reference: draft-ietf-idnabis-protocol has been published as RFC 5891 -- Duplicate reference: draft-ietf-idnabis-protocol, mentioned in 'RulesInit', was also mentioned in 'IDNA2008-Protocol'. -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-PropertyValueAliases' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-RegEx' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-Scripts' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode51' -- Obsolete informational reference (is this intentional?): RFC 810 (Obsoleted by RFC 952) Summary: 4 errors (**), 0 flaws (~~), 7 warnings (==), 14 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. 
Klensin 3 Internet-Draft May 10, 2008 4 Intended status: Standards Track 5 Expires: November 11, 2008 7 Internationalizing Domain Names for Applications (IDNA): Definitions, 8 Background and Rationale 9 draft-ietf-idnabis-rationale-00.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on November 11, 2008. 36 Abstract 38 Several years have passed since the original protocol for 39 Internationalized Domain Names (IDNs) was completed and deployed. 40 During that time, a number of issues have arisen, including the need 41 to update the system to deal with newer versions of Unicode. Some of 42 these issues require tuning of the existing protocols and the tables 43 on which they depend. This document provides an overview of a 44 revised system and provides explanatory material for its components. 46 Table of Contents 48 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 49 1.1. Context and Overview . . . . . . . . . . . . . . . . . . . 4 50 1.2. Discussion Forum . . . . . . . . . . . . . . 
. . . . . . . 4 51 1.3. Objectives . . . . . . . . . . . . . . . . . . . . . . . . 4 52 1.4. Applicability and Function of IDNA . . . . . . . . . . . . 5 53 1.5. Terminology . . . . . . . . . . . . . . . . . . . . . . . 6 54 1.5.1. Documents and Standards . . . . . . . . . . . . . . . 6 55 1.5.2. Terminology about Characters and Character Sets . . . 6 56 1.5.3. DNS-related Terminology . . . . . . . . . . . . . . . 7 57 1.5.4. Terminology Specific to IDNA . . . . . . . . . . . . . 7 58 1.5.5. Punycode is an Algorithm, not a Name . . . . . . . . . 11 59 1.5.6. Other Terminology Issues . . . . . . . . . . . . . . . 11 60 1.5.7. Comprehensibility of IDNA Mechanisms and Processing . 12 61 2. Summary of Major Changes from IDNA2003 . . . . . . . . . . . . 13 62 3. The Revised IDNA Model . . . . . . . . . . . . . . . . . . . . 14 63 4. Processing in IDNA2008 . . . . . . . . . . . . . . . . . . . . 14 64 5. IDNA2008 Document List . . . . . . . . . . . . . . . . . . . . 15 65 6. Permitted Characters: An Inclusion List . . . . . . . . . . . 15 66 6.1. A Tiered Model of Permitted Characters and Labels . . . . 16 67 6.1.1. PROTOCOL-VALID . . . . . . . . . . . . . . . . . . . . 16 68 6.1.2. DISALLOWED . . . . . . . . . . . . . . . . . . . . . . 18 69 6.1.3. UNASSIGNED . . . . . . . . . . . . . . . . . . . . . . 19 70 6.2. Registration Policy . . . . . . . . . . . . . . . . . . . 19 71 6.3. Layered Restrictions: Tables, Context, Registration, 72 Applications . . . . . . . . . . . . . . . . . . . . . . . 19 73 7. Issues that Constrain Possible Solutions . . . . . . . . . . . 20 74 7.1. Display and Network Order . . . . . . . . . . . . . . . . 20 75 7.2. Entry and Display in Applications . . . . . . . . . . . . 21 76 7.3. Linguistic Expectations: Ligatures, Digraphs, and 77 Alternate Character Forms . . . . . . . . . . . . . . . . 22 78 7.4. Case Mapping and Related Issues . . . . . . . . . . . . . 24 79 7.5. Right to Left Text . . . . . . . . . . . . . . . . . . . . 25 80 8. 
IDNs and the Robustness Principle . . . . . . . . . . . . . . 26 81 9. Front-end and User Interface Processing . . . . . . . . . . . 26 82 10. Migration and Version Synchronization . . . . . . . . . . . . 28 83 10.1. Design Criteria . . . . . . . . . . . . . . . . . . . . . 28 84 10.1.1. General IDNA Validity Criteria . . . . . . . . . . . . 29 85 10.1.2. Labels in Registration . . . . . . . . . . . . . . . . 30 86 10.1.3. Labels in Resolution (Lookup) . . . . . . . . . . . . 31 87 10.2. More Flexibility in User Agents . . . . . . . . . . . . . 31 88 10.3. The Question of Prefix Changes . . . . . . . . . . . . . . 33 89 10.3.1. Conditions Requiring a Prefix Change . . . . . . . . . 33 90 10.3.2. Conditions Not Requiring a Prefix Change . . . . . . . 34 91 10.3.3. Implications of Prefix Changes . . . . . . . . . . . . 34 92 10.4. Stringprep Changes and Compatibility . . . . . . . . . . . 35 93 10.5. The Symbol Question . . . . . . . . . . . . . . . . . . . 35 94 10.6. Migration Between Unicode Versions: Unassigned Code 95 Points . . . . . . . . . . . . . . . . . . . . . . . . . . 37 96 10.7. Other Compatibility Issues . . . . . . . . . . . . . . . . 38 97 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 38 98 12. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 39 99 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 100 13.1. IDNA Character Registry . . . . . . . . . . . . . . . . . 39 101 13.2. IDNA Context Registry . . . . . . . . . . . . . . . . . . 39 102 13.3. IANA Repository of IDN Practices of TLDs . . . . . . . . . 40 103 14. Security Considerations . . . . . . . . . . . . . . . . . . . 40 104 15. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 41 105 15.1. Version -01 of draft-klensin-idnabis-issues . . . . . . . 42 106 15.2. Version -02 of draft-klensin-idnabis-issues . . . . . . . 42 107 15.3. Version -03 of draft-klensin-idnabis-issues . . . . . . . 42 108 15.4. 
Version -04 of draft-klensin-idnabis-issues . . . . . . . 42 109 15.5. Version -05 of draft-klensin-idnabis-issues . . . . . . . 43 110 15.6. Version -06 of draft-klensin-idnabis-issues . . . . . . . 43 111 15.7. Version -07 of draft-klensin-idnabis-issues . . . . . . . 43 112 15.8. Version -00 of draft-ietf-idnabis-rationale . . . . . . . 44 113 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 44 114 16.1. Normative References . . . . . . . . . . . . . . . . . . . 44 115 16.2. Informative References . . . . . . . . . . . . . . . . . . 46 116 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 47 117 Intellectual Property and Copyright Statements . . . . . . . . . . 49 119 1. Introduction 121 1.1. Context and Overview 123 Several years have passed since the original protocol for 124 Internationalized Domain Names (IDNs) was completed and deployed. 125 During that time, a number of issues have arisen, including a subset 126 of those described in a recent IAB report [RFC4690] and the need to 127 update the system to deal with newer versions of Unicode. Those 128 standards are known as Internationalized Domain Names in Applications 129 (IDNA), taken from the name of the highest level standard within that 130 group (see Section 1.5). Some tuning of the existing protocols and 131 the tables on which they depend is now required. Where it is 132 important to understanding of the revised protocols, this document 133 further explains the issues that have been encountered. It also 134 provides an overview of the new IDNA model and explanatory material 135 for it. Additional explanatory material for the specific components 136 of the proposals will appear with the associated documents. 138 1.2. Discussion Forum 140 [[anchor4: RFC Editor: please remove this section.]] 142 This work is being discussed in the IETF "idnabis" Working Group and 143 on the mailing list idna-update@alvestrand.no 145 1.3. 
Objectives 147 The intent of the IDNA revision effort, and hence of this document 148 and the associated ones, is to increase the usability and 149 effectiveness of internationalized domain names (IDNs) while 150 preserving or strengthening the integrity of references that use 151 them. The original "hostname" character definitions (see, e.g., 152 [RFC0810]) struck a balance between the creation of useful mnemonics 153 and the introduction of parsing problems or general confusion in the 154 contexts in which domain names are used. Our objective is to 155 preserve that balance while expanding the character repertoire to 156 include extended versions of Roman-derived scripts and scripts that 157 are not Roman in origin. No work of this sort will be able to 158 completely eliminate sources of visual or textual confusion: such 159 confusion is possible even under the original rules where only ASCII 160 characters were permitted. However, one can hope, through the 161 application of different techniques at different points (see 162 Section 6.3), to keep problems to an acceptable minimum. One 163 consequence of this general objective is that the desire of some user 164 or marketing community to use a particular string --whether the 165 reason is to try to write sentences of particular languages in the 166 DNS, to express a facsimile of the symbol for a brand, or for some 167 other purpose-- is not a primary goal within the context of 168 applications in the domain name space. 170 1.4. Applicability and Function of IDNA 172 The IDNA standard does not require any applications to conform to it, 173 nor does it retroactively change those applications. An application 174 can elect to use IDNA in order to support IDN while maintaining 175 interoperability with existing infrastructure. If an application 176 wants to use non-ASCII characters in domain names, IDNA is the only 177 currently-defined option. 
Adding IDNA support to an existing 178 application entails changes to the application only, and leaves room 179 for flexibility in front-end processing and more specifically in the 180 user interface (see Section 9). 182 A great deal of the discussion of IDN solutions has focused on 183 transition issues and how IDNs will work in a world where not all of 184 the components have been updated. Proposals that were not chosen by 185 the original IDN Working Group would depend on user applications, 186 resolvers, and DNS servers being updated in order for a user to apply 187 an internationalized domain name in any form or coding acceptable 188 under that method. While processing must be performed prior to or 189 after access to the DNS, no changes are needed to the DNS protocol or 190 any DNS servers or the resolvers on user's computers. 192 The IDNA specification solves the problem of extending the repertoire 193 of characters that can be used in domain names to include a large 194 subset of the Unicode repertoire. 196 IDNA does not extend the service offered by DNS to the applications. 197 Instead, the applications (and, by implication, the users) continue 198 to see an exact-match lookup service. Either there is a single 199 exactly-matching name or there is no match. This model has served 200 the existing applications well, but it requires, with or without 201 internationalized domain names, that users know the exact spelling of 202 the domain names that are to be typed into applications such as web 203 browsers and mail user agents. The introduction of the larger 204 repertoire of characters potentially makes the set of misspellings 205 larger, especially given that in some cases the same appearance, for 206 example on a business card, might visually match several Unicode code 207 points or several sequences of code points. 
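The exact-match behavior described above can be illustrated with a minimal sketch in Python (a language chosen here only for illustration; this document does not prescribe one). The two labels and the helper name code_points are hypothetical examples, not taken from the specifications. The sketch shows that two strings that look alike in many fonts are different code point sequences, so an exact-match service treats them as different names.

```python
import unicodedata

# Two hypothetical labels that render almost identically in many fonts.
latin_label = "paypal"              # Latin letters only
mixed_label = "p\u0430yp\u0430l"    # U+0430 CYRILLIC SMALL LETTER A replaces "a"


def code_points(label: str) -> list[str]:
    """Return each character of the label as a hexadecimal code point."""
    return [f"U+{ord(ch):04X}" for ch in label]


# An exact-match lookup compares code points, not visual appearance.
labels_match = latin_label == mixed_label

print(code_points(latin_label))
print(code_points(mixed_label))
print(unicodedata.name(mixed_label[1]))
print("exact match:", labels_match)
```

A user who copies either form from a printed business card has no way to tell which sequence was intended, which is the enlarged "misspelling" space the paragraph above refers to.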
209 IDNA allows the graceful introduction of IDNs not only by avoiding 210 upgrades to existing infrastructure (such as DNS servers and mail 211 transport agents), but also by allowing some rudimentary use of IDNs 212 in applications by using the ASCII representation of the non-ASCII 213 name labels. While such names are user-unfriendly to read and type, 214 and hence not optimal for user input, they allow (for instance) 215 replying to email and clicking on URLs even though the domain name 216 displayed is incomprehensible to the user. In order to allow user- 217 friendly input and output of the IDNs and acceptance of some 218 characters as equivalent to those to be processed according to the 219 protocol, the applications need to be modified to conform to this 220 specification. 222 IDNA uses the Unicode character repertoire, which avoids the 223 significant delays that would be inherent in waiting for a different 224 and specific character set to be defined for IDN purposes, 225 presumably by some other standards developing organization. 227 1.5. Terminology 229 1.5.1. Documents and Standards 231 This document uses the term "IDNA2003" to refer to the set of 232 standards that make up and support the version of IDNA published in 233 2003, i.e., those commonly known as the IDNA base specification 234 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep 235 [RFC3454]. In this document, those names are used to refer, 236 conceptually, to the individual documents, with the base IDNA 237 specification called just "IDNA". 239 The term "IDNA2008" is used to refer to a new version of IDNA as 240 described in this document and in the documents described in 241 Section 5. References to "these specifications" are to the entire 242 set. 244 1.5.2. Terminology about Characters and Character Sets 246 A code point is an integer value associated with a character in a 247 coded character set.
249 Unicode [Unicode51] is a coded character set containing almost 250 100,000 characters as of the current version. A single Unicode code 251 point is denoted by "U+" followed by four to six hexadecimal digits, 252 while a range of Unicode code points is denoted by two four to six 253 digit hexadecimal numbers separated by "..", with no prefixes. 255 ASCII means US-ASCII [ASCII], a coded character set containing 128 256 characters associated with code points in the range 0000..007F. 257 Unicode may be thought of as an extension of ASCII; it includes all 258 the ASCII characters and associates them with equivalent code points. 260 "Letters" are, informally, generalizations from the ASCII and common- 261 sense understanding of that term, i.e., characters that are used to 262 write text that are not digits, symbols, or punctuation. Formally, 263 they are characters with a Unicode General Category value starting in 264 "L" (see Section 4.5 of [Unicode51]). 266 1.5.3. DNS-related Terminology 268 When discussing the DNS, this document generally assumes the 269 terminology used in the DNS specifications [RFC1034] [RFC1035]. The 270 terms "lookup" and "resolution" are used interchangeably and the 271 process or application component that performs DNS resolution is 272 called a "resolver". The process of placing an entry into the DNS is 273 referred to as "registration", paralleling common contemporary usage 274 in other contexts. Consequently, any DNS zone administration is 275 described as a "registry", regardless of the actual administrative 276 arrangements or level in the tree. A note about that relationship is 277 included in the text below where it seems particularly significant. 279 The term "LDH code points" is defined in this document to mean the 280 code points associated with ASCII letters, digits, and the hyphen- 281 minus; that is, U+002D, 0030..0039, 0041..005A, and 0061..007A. "LDH" 282 is an abbreviation for "letters, digits, hyphen".
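The "U+" notation and the LDH code point ranges just defined can be expressed as a short Python sketch. The names LDH_RANGES, is_ldh_code_point, all_ldh, and u_notation are hypothetical helpers introduced for illustration. Note that the check covers only the code point repertoire: a string made up entirely of LDH code points is not necessarily a valid label, because other rules (length, hyphen placement, and so on) are not tested here.

```python
# LDH code point ranges as listed in the text:
# U+002D, 0030..0039, 0041..005A, and 0061..007A.
LDH_RANGES = [
    (0x002D, 0x002D),  # hyphen-minus
    (0x0030, 0x0039),  # digits 0-9
    (0x0041, 0x005A),  # upper-case letters A-Z
    (0x0061, 0x007A),  # lower-case letters a-z
]


def is_ldh_code_point(ch: str) -> bool:
    """Return True if the single character ch is an LDH code point."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in LDH_RANGES)


def all_ldh(label: str) -> bool:
    """Return True if the label is non-empty and uses only LDH code points."""
    return len(label) > 0 and all(is_ldh_code_point(ch) for ch in label)


def u_notation(ch: str) -> str:
    """Format a character as "U+" followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"


print(u_notation("-"), all_ldh("www"), all_ldh("b\u00fccher"))
```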
284 The base DNS specifications [RFC1034] [RFC1035] discuss "domain 285 names" and "host names", but many people and sections of these 286 specifications use the terms interchangeably. Further, because those 287 documents were not terribly clear, many people who are sure they know 288 the exact definitions of each of these terms disagree on the 289 definitions. This document generally uses the term "domain name". 290 When it refers to, e.g., host name syntax restrictions, it explicitly 291 cites the relevant defining documents. The remaining definitions in 292 this subsection are essentially a review. 294 A label is an individual component of a domain name. Labels are 295 usually shown separated by dots; for example, the domain name 296 "www.example.com" is composed of three labels: "www", "example", and 297 "com". (The zero-length root label described in [RFC1123], which can 298 be explicit as in "www.example.com." or implicit as in 299 "www.example.com", is not considered a label in this specification.) 300 IDNA extends the set of usable characters in labels that are text. 301 For the rest of this document, the term "label" is shorthand for 302 "text label", and "every label" means "every text label". 304 1.5.4. Terminology Specific to IDNA 306 This section defines some terminology to reduce dependence on terms 307 and definitions that have been problematic in the past. 309 1.5.4.1. Terms for IDN Label Codings 311 1.5.4.1.1. IDNA-valid strings, A-label, and U-label 313 To improve clarity, this document introduces three new terms in this 314 subsection. In the next, it defines a historical one to be slightly 315 more precise for IDNA contexts. 317 o A string is "IDNA-valid" if it meets all of the requirements of 318 these specifications for an IDNA label. IDNA-valid strings may 319 appear in either of two forms, defined immediately below. 
It is 320 expected that specific reference will be made to the form 321 appropriate to any context in which the distinction is important. 323 o An "A-label" is the ASCII-Compatible Encoding (ACE, see 324 Section 1.5.4.3) form of an IDNA-valid string. It must be a 325 complete label: IDNA is defined for labels, not for parts of them 326 and not for complete domain names. This means, by definition, 327 that every A-label will begin with the IDNA ACE prefix, "xn--", 328 followed by a string that is a valid output of the Punycode 329 algorithm and hence a maximum of 59 ASCII characters in length. 330 The prefix and string together must conform to all requirements 331 for a label that can be stored in the DNS including conformance to 332 the LDH ("host name") rule described in RFC 1034, RFC 1123 and 333 elsewhere. 335 o A "U-label" is an IDNA-valid string of Unicode characters, 336 expressed in a standard Unicode Encoding Form, normally UTF-8 in 337 an Internet transmission context, and subject to the constraint 338 below. Conversion between valid U-labels and valid A-labels is 339 performed according to the specification in [RFC3492], adding or 340 removing the ACE prefix (see Section 1.5.4.3) as needed. 342 To be valid, U-labels and A-labels must obey an important symmetry 343 constraint. While that constraint may be tested in any of several 344 ways, an A-label must be capable of being produced by conversion from 345 a U-label and a U-label must be capable of being produced by 346 conversion from an A-label. Among other things, this implies that 347 both U-labels and A-labels must represent strings in normalized form. 348 These strings MUST contain only characters specified elsewhere in 349 this document and its companion documents, and only in the contexts 350 indicated as appropriate.
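The conversion and symmetry constraint described above can be sketched in Python using the "punycode" codec from the standard library, which implements the algorithm of [RFC3492]. The function names to_a_label, to_u_label, and is_symmetric are hypothetical, and the sketch assumes its input already contains at least one non-ASCII character, is in normalized form, and uses only permitted characters; none of those IDNA validity rules are checked here. It illustrates only the prefix handling, the length limit, and the round trip.

```python
ACE_PREFIX = "xn--"
MAX_PUNYCODE_LENGTH = 59  # 63-octet DNS label limit minus the 4-character prefix


def to_a_label(u_label: str) -> str:
    """Punycode-encode a U-label candidate and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Remove the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def is_symmetric(u_label: str) -> bool:
    """Check the round trip and the length limit (not full IDNA validity)."""
    a_label = to_a_label(u_label)
    fits = len(a_label) - len(ACE_PREFIX) <= MAX_PUNYCODE_LENGTH
    return fits and to_u_label(a_label) == u_label


print(to_a_label("b\u00fccher"))     # -> xn--bcher-kva
print(to_u_label("xn--bcher-kva"))
print(is_symmetric("b\u00fccher"))
```

A round trip that succeeds is necessary for the symmetry constraint but not sufficient for IDNA validity; the character and context rules of the companion documents still apply.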
352 Any rules or conventions that apply to DNS labels in general, such as 353 rules about lengths of strings, apply to whichever of the U-label or 354 A-label would be more restrictive. For the U-label, constraints 355 imposed by existing protocols and their presentation forms make the 356 length restriction apply to the length in octets of the UTF-8 form of 357 those labels (which will always be greater than or equal to the 358 length in code points). The exception to this, of course, is that 359 the restriction to ASCII characters does not apply to the U-label. 361 A different way to look at these terms, which may be clearer to 362 some readers, is that U-labels, A-labels, and LDH-labels (see the 363 next subsection) are disjoint categories that, together, make up the 364 forms of legitimate strings for use in domain names that describe 365 hosts. Of the three, only A-labels and LDH-labels can actually 366 appear in DNS zone files or queries; U-labels can appear, along with 367 the other two, in presentation and user interface forms and in 368 selected protocols other than those of the DNS itself. Strings that 369 do not conform to the rules for one of these three categories and, in 370 particular, strings that contain "--" in the third and fourth character 371 positions but: 373 o are not A-labels or 375 o cannot be processed as U-labels or A-labels as described in these 376 specifications, 378 are invalid as labels in domain names that identify Internet hosts or 379 similar resources. This restriction on strings containing "--" is 380 required for three reasons: 382 o to prevent confusion with pre-IDNA coding forms; 384 o to permit future extensions that would require changing the 385 prefix, no matter how unlikely those might be (see Section 10.3); 386 and 388 o to reduce the opportunities for attacks on the encoding system. 390 1.5.4.1.2.
LDH-label and Internationalized Label 392 In the hope of further clarifying discussions about IDNs, these 393 specifications use the term "LDH-label" strictly to refer to an all- 394 ASCII label that obeys the "hostname" (LDH) conventions and that is 395 not an IDN. In other words, only "U-label" and "A-label" refer to 396 IDNs and LDH-labels are not IDNs. "Internationalized label" is used 397 when a term is needed to refer to any of the three categories. There 398 are some standardized DNS label formats, such as those for service 399 location (SRV) records [RFC2782] that do not fall into any of the 400 three categories and hence are not internationalized labels. 402 1.5.4.2. Equivalence 404 In IDNA, equivalence of labels is defined in terms of the A-labels. 405 If the A-labels are equal in a case-independent comparison, then the 406 labels are considered equivalent, no matter how they are represented. 407 Traditional LDH labels already have a notion of equivalence: within 408 that list of characters, upper case and lower case are considered 409 equivalent. The IDNA notion of equivalence is an extension of that 410 older notion. Equivalent labels in IDNA are treated as alternate 411 forms of the same label, just as "foo" and "Foo" are treated as 412 alternate forms of the same label. 414 1.5.4.3. ACE Prefix 416 The "ACE prefix" is defined in this document to be a string of ASCII 417 characters "xn--" that appears at the beginning of every A-label. 418 "ACE" stands for "ASCII-Compatible Encoding". 420 1.5.4.4. Domain Name Slot 422 A "domain name slot" is defined in this document to be a protocol 423 element or a function argument or a return value (and so on) 424 explicitly designated for carrying a domain name. 
Examples of domain 425 name slots include: the QNAME field of a DNS query; the name argument 426 of the gethostbyname() or getaddrinfo() standard C library functions; 427 the part of an email address following the at-sign (@) in the 428 parameter to the SMTP MAIL or RCPT commands or the "From:" field of 429 an email message header; and the host portion of the URI in the src 430 attribute of an HTML <IMG> tag. General text that just happens to 431 contain a domain name is not a domain name slot. For example, a 432 domain name appearing in the plain text body of an email message is 433 not occupying a domain name slot. 435 An "IDN-aware domain name slot" is defined in this document to be a 436 domain name slot explicitly designated for carrying an 437 internationalized domain name as defined in this document. The 438 designation may be static (for example, in the specification of the 439 protocol or interface) or dynamic (for example, as a result of 440 negotiation in an interactive session). 442 An "IDN-unaware domain name slot" is defined in this document to be 443 any domain name slot that is not an IDN-aware domain name slot. 444 Obviously, this includes any domain name slot whose specification 445 predates IDNA. 447 1.5.5. Punycode is an Algorithm, not a Name 449 There has been some confusion about whether a "Punycode string" does 450 or does not include the prefix and about whether it is required that 451 such strings could have been the output of ToASCII (see RFC 3490, 452 Section 4 [RFC3490]). This specification discourages the use of the 453 term "Punycode" to describe anything but the encoding method and 454 algorithm of [RFC3492]. The terms defined above are preferred as 455 much clearer than terms such as "Punycode string". 457 1.5.6. Other Terminology Issues 459 The document departs from historical DNS terminology and usage in one 460 important respect.
Over the years, the community has talked very 461 casually about "names" in the DNS, beginning with calling it "the 462 domain name system". That terminology is fine in the very precise 463 sense that the identifiers of the DNS do provide names for objects 464 and addresses. But, in the context of IDNs, the term has introduced 465 some confusion, confusion that has increased further as people have 466 begun to speak of DNS labels in terms of the words or phrases of 467 various natural languages. 469 Historically, many, perhaps most, of the "names" in the DNS have been 470 mnemonics to identify some particular concept, object, or 471 organization. They are typically derived from, or rooted in, some 472 language because most people think in language-based ways. But, 473 because they are mnemonics, they need not obey the orthographic 474 conventions of any language: it is not a requirement that it be 475 possible for them to be "words". 477 This distinction is important because the reasonable goal of an IDN 478 effort is not to be able to write the great Klingon (or language of 479 one's choice) novel in DNS labels but to be able to form a usefully 480 broad range of mnemonics in ways that are as natural as possible in a 481 very broad range of scripts. 483 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 484 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 485 document are to be interpreted as described in RFC 2119 [RFC2119]. 487 An "internationalized domain name" (IDN) is a domain name that may 488 contain any mixture of LDH-labels, A-labels, or U-labels. This 489 implies that every conventional domain name is an IDN (which implies 490 that it is possible for a domain name to be an IDN without it 491 containing any non-ASCII characters).
Just as has been the case with 492 ASCII names, some DNS zone administrators may impose restrictions, 493 beyond those imposed by DNS or IDNA, on the characters or strings 494 that may be registered as labels in their zones. Because of the 495 diversity of characters that can be used in a U-label and the 496 confusion they might cause, such restrictions are mandatory for IDN 497 registries and zones even though the particular restrictions are not 498 part of these specifications. Because these restrictions, commonly 499 known as "registry restrictions", only affect what can be registered 500 and not resolution processing, they have no effect on the syntax or 501 semantics of DNS protocol messages; a query for a name that matches 502 no records will yield the same response regardless of the reason why 503 it is not in the zone. Clients issuing queries or interpreting 504 responses cannot be assumed to have any knowledge of zone-specific 505 restrictions or conventions. See Section 6.2. 507 1.5.7. Comprehensibility of IDNA Mechanisms and Processing 509 One of the major goals of this work is to improve the general 510 understanding of how IDNA works and what characters are permitted and 511 what happens to them. Comprehensibility and predictability to users 512 and registrants are themselves important motivations and design goals 513 for this effort. The effort includes some new terminology and a 514 revised and extended model, both covered in this section, and some 515 more specific protocol, processing, and table modifications. Details 516 of the latter appear in other documents (see Section 5). 518 Several issues are inherent in the application of IDNs and, indeed, 519 almost any other system that tries to handle international characters 520 and concepts. They range from the apparently trivial --e.g., one 521 cannot display a character for which one does not have a font 522 available locally-- to the more complex and subtle. 
   Many people have observed that internationalization is just a tool
   to enable effective localization while permitting some global
   uniformity.  Issues of display, of exactly how various strings and
   characters are entered, and so on are inherently issues about
   localization and user interface design.

   A protocol such as IDNA can only assume that such operations as data
   entry and reconciliation of differences in character forms are
   possible.  It may make some recommendations about how display might
   work when characters and fonts are not available, but they can only
   be general recommendations and, because display functions are
   rarely controlled by the types of applications that would call upon
   IDNA, will rarely be very effective.

   However, shifting responsibility for character mapping and other
   adjustments from the protocol (where it was located in IDNA2003) to
   the user interface or processing before invoking IDNA raises issues
   about both what that processing should do and about compatibility
   for references prepared in an IDNA2003 context.  Those issues are
   discussed in Section 9.

   Operations for converting between local character sets and
   normalized Unicode are part of this general set of user interface
   issues.  The conversion is obviously not required at all in a
   Unicode-native system that maintains all strings in Normalization
   Form C (NFC).  It may, however, involve some complexity in a system
   that is not Unicode-native, especially if the elements of the local
   character set do not map exactly and unambiguously into Unicode
   characters or do so in a way that is not completely stable over
   time.  Perhaps more important, if a label being converted to a local
   character set contains Unicode characters that have no
   correspondence in that character set, the application may have to
   apply special, locally-appropriate, methods to avoid or reduce loss
   of information.
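
   As a minimal sketch of the conversion into normalized Unicode
   discussed above, the following Python fragment decodes bytes from a
   local character set and then applies NFC.  The function name
   "to_nfc" and the use of ISO 8859-1 as the local character set are
   assumptions made purely for illustration; nothing here is specified
   by IDNA.

```python
import unicodedata

def to_nfc(raw: bytes, charset: str) -> str:
    """Decode bytes in an already-identified local charset, then apply NFC."""
    text = raw.decode(charset)  # may raise UnicodeDecodeError for some charsets
    return unicodedata.normalize("NFC", text)

# "caf" followed by e-acute encoded as the single ISO 8859-1 byte 0xE9
print(to_nfc(b"caf\xe9", "iso-8859-1") == "caf\u00e9")

# A decomposed sequence (e + U+0301 COMBINING ACUTE ACCENT) is composed by NFC
print(unicodedata.normalize("NFC", "cafe\u0301") == "caf\u00e9")
```

   The sketch takes the local character set as already known; getting
   that identification right is a separate problem that the code does
   not address.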

   Depending on the system involved, the major difficulty may not lie
   in the mapping but in accurately identifying the incoming character
   set and then applying the correct conversion routine.  If a local
   operating system uses one of the ISO 8859 character sets or an
   extensive national or industrial system such as GB18030 [GB18030] or
   BIG5 [BIG5], one must correctly identify the character set in use
   before converting to Unicode even though those character coding
   systems are substantially or completely Unicode-compatible (i.e.,
   all of the code points in them have an exact and unique mapping to
   Unicode code points).  It may be even more difficult when the
   character coding system in local use is based on conceptually
   different assumptions than those used by Unicode (e.g., the font
   encodings used for publications in some Indic scripts).  Those
   differences may not easily yield unambiguous conversions or
   interpretations even if each coding system is internally consistent
   and adequate to represent the local language and script.

2.  Summary of Major Changes from IDNA2003

   1.   Update the base character set from Unicode 3.2 to be Unicode
        version-agnostic.

   2.   Separate the definitions for the "registration" and "lookup"
        activities.

   3.   Disallow symbol and punctuation characters except where special
        exceptions are necessary.

   4.   Remove the mapping and normalization steps from the protocol
        and have them instead done by the applications themselves,
        possibly in a local fashion, before invoking the protocol.

   5.   Change the way that the protocol specifies which characters are
        allowed in labels from "humans decide what the table of code
        points contains" to "decisions about code points are based on
        Unicode properties plus a small exclusion list created by
        humans".

   6.   Introduce the new concept of characters that can be used only
        in specific contexts.

   7.   Allow typical words and names in languages such as Dhivehi and
        Yiddish to be expressed.

   8.   Make bidirectional domain names (delimited strings of labels,
        not just labels standing on their own) display in a
        non-surprising fashion.

   9.   Make bidirectional domain names in a paragraph display in a
        non-surprising fashion.

   10.  Remove the dot separator from the mandatory part of the
        protocol.

   11.  Make some currently-valid labels that are not actually IDNA
        labels invalid.

3.  The Revised IDNA Model

   IDNA is a client-side protocol, i.e., almost all of the processing
   is performed by the client.  The strings that appear in, and are
   resolved by, the DNS conform to the traditional rules for the naming
   of hosts, and consist of ASCII letters, digits, and hyphens.  This
   approach permits IDNA to be deployed without modifications to the
   DNS itself.  That, in turn, avoids both having to upgrade the entire
   Internet to support IDNs and needing to incur the unknown risks to
   deployed systems of DNS structural or design changes, especially if
   those changes need to be deployed all at the same time.

4.  Processing in IDNA2008

   These specifications separate Domain Name Registration and
   Resolution in the protocol specification.  Doing so reflects current
   practice in which per-registry restrictions and special processing
   are applied at registration time but not on resolution.  Even more
   important in the longer term, it facilitates incremental addition of
   permitted character groups to avoid freezing on one particular
   version of Unicode.

   The actual registration and lookup protocols for IDNA2008 are
   specified in [IDNA2008-Protocol].

5.  IDNA2008 Document List

   [[anchor19: This section will need to be extensively revised or
   removed before publication.]]

   The following documents are being produced as part of the IDNA2008
   effort.

   o  A revised version of this document, containing an overview,
      rationale, and conformance conditions.

   o  A separate document, drawn from material in early versions of
      this one, that explicitly updates and replaces RFC 3490 but which
      has most rationale material from that document moved to this one
      [IDNA2008-Protocol].

   o  A document describing the "Bidi problem" with Stringprep and
      proposing a solution [IDNA2008-Bidi].

   o  A specification of the categories and rules that identify the
      code points allowed in a U-label, based on Unicode 5.0 code
      assignments.  See Section 6 and [IDNA2008-Tables].

   o  One or more documents containing guidance and suggestions for
      registries (in this context, those responsible for establishing
      policies for any zone file in the DNS, not only those at the top
      or second level).  The documents in this category may not be IETF
      products and may be prepared and completed asynchronously with
      those described above.

6.  Permitted Characters: An Inclusion List

   This section provides an overview of the model used to establish the
   algorithm and character lists of [IDNA2008-Tables] and describes the
   names and applicability of the categories used there.  Note that the
   inclusion of a character in the first category group does not imply
   that it can be used indiscriminately; some characters are associated
   with contextual rules that must be applied as well.

   The information given in this section is provided to make the rules,
   tables, and protocol easier to understand.  It is not normative.
   The normative generating rules appear in [IDNA2008-Tables] and the
   rules that actually determine what labels can be registered or
   looked up are in [IDNA2008-Protocol].

6.1.  A Tiered Model of Permitted Characters and Labels

   Moving to an inclusion model requires respecifying the list of
   characters that are permitted in IDNs.
   In IDNA2003, the role and utility of characters are independent of
   context and fixed forever (or until the standard is replaced).
   Making completely context-independent rules globally has proven
   impractical because some characters, especially those that are
   called "Join_Controls" in Unicode, are needed to make reasonable use
   of some scripts but become invisible characters in others.  Of
   necessity, IDNA2003 prohibited those types of characters entirely.
   But the restrictions were much too severe to permit an adequate
   range of mnemonics for terminology based on some languages.  The
   requirement to support those characters but limit their use to very
   specific contexts was reinforced by the observation that handling of
   particular characters across the languages that use a script, or the
   use of similar or identical-looking characters in different scripts,
   is less well understood than many people believed it was several
   years ago.

   Independently of the characters chosen (see next subsection), the
   theory is to divide the characters that appear in Unicode into three
   categories:

6.1.1.  PROTOCOL-VALID

   Characters identified as "PROTOCOL-VALID" (often abbreviated
   "PVALID") are, in general, permitted by IDNA for all uses in IDNs.
   Their use may be restricted by rules about the context in which they
   appear or by other rules that apply to the entire label in which
   they are to be embedded.  For example, any label that contains a
   character in this group that has a "right to left" property must be
   used in context with the "Bidi" rules (see [IDNA2008-Bidi]).

   The term "PROTOCOL-VALID" is used to stress the fact that the
   presence of a character in this category does not imply that a given
   registry need accept registrations containing any of the characters
   in the category.
   Registries are still expected to apply judgment about labels they
   will accept and to maintain rules consistent with those judgments
   (see [IDNA2008-Protocol] and Section 6.3).

   Characters that are placed in the "PROTOCOL-VALID" category are
   never removed from it unless the code points themselves are removed
   from Unicode (such removal would be inconsistent with the Unicode
   stability principles (see [Unicode51], Appendix F) and hence should
   never occur).

   [[anchor21: Placeholder: Does this topic or comment need additional
   discussion or explanation?]]

6.1.1.1.  Contextual Rules

   Characters in the PROTOCOL-VALID category may actually be unsuitable
   for general use in IDNs but necessary for the plausible support of
   some scripts.  The two most commonly-cited examples are the
   zero-width non-joiner and joiner characters (ZWNJ, U+200C, and ZWJ,
   U+200D), but provisions for unambiguous labels may require that
   other characters be restricted to particular contexts.  For example,
   the ASCII hyphen is not permitted to start or end a label, whether
   that label contains non-ASCII characters or not.

   These characters must not appear in IDNs without additional
   restrictions, typically because they are invisible in most scripts
   but affect format or presentation in a few others or because they
   are combining characters that are safe for use only in conjunction
   with particular characters or scripts.  In order to permit them to
   be used at all, they are specially identified as "CONTEXTUAL RULE
   REQUIRED" and, when adequately understood, associated with a rule.
   In addition, the rule will define whether it is to be applied on
   lookup as well as registration.  A distinction is made between
   characters that indicate or prohibit joining (known as
   "CONTEXT-JOINER" or "CONTEXTJ") and other characters requiring
   contextual treatment ("CONTEXT-OTHER" or "CONTEXTO").
   Only the former are fully tested at lookup time.

6.1.1.2.  Rules and Their Application

   The actual rules may be present or absent.  If present, they may
   have values of "True" (character may be used in any position in any
   label), "False" (character may not be used in any label), or may be
   an extended regular expression that specifies the context in which
   the character is permitted.

   Examples of descriptions of typical rules, stated informally and in
   English, include "Must follow a character from Script XYZ", "Must
   occur only if the entire label is in Script ABC", and "Must occur
   only if the previous and subsequent characters have the DFG
   property".

   Because it is easier to identify these characters than to know that
   they are actually needed in IDNs or how to establish exactly the
   right rules for each one, a rule may have a null value in a given
   version of the tables.  Characters associated with null rules MUST
   NOT appear in putative labels for either registration or lookup.  Of
   course, a later version of the tables might contain a non-null rule.

   [[anchor23: Definition of regular expression language to be
   supplied]]

6.1.2.  DISALLOWED

   Some characters are sufficiently problematic for use in IDNs that
   they should be excluded for both registration and lookup (i.e.,
   conforming applications performing name resolution should verify
   that these characters are absent; if they are present, the label
   strings should be rejected rather than converted to A-labels and
   looked up).

   Of course, this category would include code points that had been
   removed entirely from Unicode should such removals ever occur.

   Characters that are placed in the "DISALLOWED" category are expected
   to never be removed from it or reclassified.
   If a character is classified as "DISALLOWED" in error and the error
   is sufficiently problematic, the only recourse would be either to
   introduce a new code point into Unicode and classify it as
   "PROTOCOL-VALID" or for the IETF to accept the considerable costs of
   an incompatible change and replace the relevant RFC with one
   containing appropriate exceptions.

   [[anchor24: Note in Draft: the permanence of DISALLOWED was still
   under discussion in the WG when this draft was posted.  The text
   above reflects the editor's opinion about the emerging consensus but
   is subject to change as the discussion continues.]]

   There is provision for exception cases but, in general, characters
   are placed into "DISALLOWED" if they fall into one or more of the
   following groups:

   o  The character is a compatibility equivalent for another
      character.  In slightly more precise Unicode terms, application
      of normalization method NFKC to the character yields some other
      character.

   o  The character is an upper-case form or some other form that is
      mapped to another character by Unicode casefolding.

   o  The character is a symbol or punctuation form or, more generally,
      something that is not a letter, digit, or a mark that is used to
      form a letter or digit.

6.1.3.  UNASSIGNED

   For convenience in processing and table-building, code points that
   do not have assigned values in a given version of Unicode are
   treated as belonging to a special UNASSIGNED category.  Such code
   points MUST NOT appear in labels to be registered or looked up.  The
   category differs from DISALLOWED in that code points are moved out
   of it by the simple expedient of being assigned in a later version
   of Unicode (at which point, they are classified into one of the
   other categories as appropriate).

6.2.  Registration Policy

   While these recommendations cannot and should not define registry
   policies, registries SHOULD develop and apply additional
   restrictions to reduce confusion and other problems.  For example,
   it is generally believed that labels containing characters from more
   than one script are a bad practice although there may be some
   important exceptions to that principle.  Some registries may choose
   to restrict registrations to characters drawn from a very small
   number of scripts.  For many scripts, the use of variant techniques
   such as those described in [RFC3743] and [RFC4290], and illustrated
   for Chinese by the tables described in RFC 4713 [RFC4713], may be
   helpful in reducing problems that might be perceived by users.  It
   is worth stressing that these principles of policy development and
   application apply at all levels of the DNS, not only, e.g., TLD
   registrations.

6.3.  Layered Restrictions: Tables, Context, Registration, Applications

   The essence of the character rules in IDNA2008 is based on the
   realization that there is no magic bullet for any of the issues
   associated with a multiscript DNS.  Instead, the specifications
   define a variety of approaches that, together, constitute multiple
   lines of defense against ambiguity in identifiers and loss of
   referential integrity.  The actual character tables are the first
   mechanism, protocol rules about how those characters are applied or
   restricted in context are the second, and those two in combination
   constitute the limits of what can be done by a protocol alone.  As
   discussed in the previous section (Section 6.2), registries are
   expected to restrict what they permit to be registered, devising and
   using rules that are designed to optimize the balance between
   confusion and risk on the one hand and maximum expressiveness in
   mnemonics on the other.
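
   As a rough illustration of the first of these lines of defense, the
   three DISALLOWED groups of Section 6.1.2 and the UNASSIGNED category
   of Section 6.1.3 can be approximated from Unicode properties.  The
   Python sketch below is only that: the function name "rough_class",
   the result label "PVALID-CANDIDATE", and the set of general
   categories treated as "letter, digit, or mark" are assumptions made
   for illustration.  It ignores the exception cases and contextual
   rules entirely (for example, it rejects the ASCII hyphen), and the
   normative rules are those of [IDNA2008-Tables].

```python
import unicodedata

# General categories treated here as "letter, digit, or mark".  This set is an
# assumption for illustration, not the normative list in [IDNA2008-Tables].
LETTER_DIGIT_MARK = {"Ll", "Lu", "Lo", "Lm", "Nd", "Mn", "Mc"}

def rough_class(ch: str) -> str:
    """Approximate the category of a single code point; exceptions are ignored."""
    cat = unicodedata.category(ch)
    if cat == "Cn":                                  # unassigned (or noncharacter)
        return "UNASSIGNED"
    if unicodedata.normalize("NFKC", ch) != ch:      # compatibility equivalent
        return "DISALLOWED"
    if ch.casefold() != ch:                          # changed by case folding
        return "DISALLOWED"
    if cat not in LETTER_DIGIT_MARK:                 # symbol, punctuation, etc.
        return "DISALLOWED"
    return "PVALID-CANDIDATE"

for ch in ["a", "A", "5", "$", "\u00df", "\ufb05"]:
    print(f"U+{ord(ch):04X}", rough_class(ch))
```

   The results depend on the version of the Unicode database shipped
   with the Python runtime, which mirrors the point made in Section
   6.1.3 that UNASSIGNED is defined relative to a given version of
   Unicode.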

   In addition, there is an important role for user agents in warning
   against label forms that appear unreasonable given their knowledge
   of local contexts and conventions.  Of course, no approach based on
   naming or identifiers alone can protect against all threats.

   [[anchor25: Note in Draft: the last sentence above basically
   duplicates a comment in Security Considerations.  Is it worth having
   in both places??]]

7.  Issues that Constrain Possible Solutions

7.1.  Display and Network Order

   The correct treatment of domain names requires a clear distinction
   between Network Order (the order in which the code points are sent
   in protocols) and Display Order (the order in which the code points
   are displayed on a screen or paper).  The order of labels in a
   domain name that contains characters that are normally written right
   to left is discussed in [IDNA2008-Bidi].  In particular, there are
   questions about the order in which labels are displayed if left to
   right and right to left labels are adjacent to each other,
   especially if there are also multiple consecutive appearances of one
   of the types.  The decision about the display order is ultimately
   under the control of user agents --including web browsers, mail
   clients, and the like-- which may be highly localized.  Even when
   formats are specified by protocols, the full composition of an
   Internationalized Resource Identifier (IRI) [RFC3987] or
   Internationalized Email address contains elements other than the
   domain name.  For example, IRIs contain protocol identifiers and
   field delimiter syntax such as "http://" or "mailto:" while email
   addresses contain the "@" to separate local parts from domain names.
   User agents are not required to use those protocol-based forms
   directly but often do so.
   While display, parsing, and processing within a label are specified
   by the IDNA protocol and the associated documents, the relationship
   between fully-qualified domain names and internationalized labels is
   unchanged from the base DNS specifications.  Comments here about
   such full domain names are explanatory or examples of what might be
   done and must not be considered normative.

   Questions remain about protocol constraints implying that the
   overall direction of these strings will always be left to right (or
   right to left) for an IRI or email address, or if they even should
   conform to such rules.  These questions also have several possible
   answers.  Should a domain name abc.def, in which both labels are
   represented in scripts that are written right to left, be displayed
   as fed.cba or cba.fed?  An IRI for clear text web access would, in
   network order, begin with "http://" and the characters will appear
   as "http://abc.def" -- but what does this suggest about the display
   order?  When entering a URI to many browsers, it may be possible to
   provide only the domain name and leave the "http://" to be filled in
   by default, assuming no tail (an approach that does not work for
   other protocols).  The natural display order for the typed domain
   name on a right to left system is fed.cba.  Does this change if a
   protocol identifier, tail, and the corresponding delimiters are
   specified?

   While logic, precedent, and reality suggest that these are questions
   for user interface design, not IETF protocol specifications,
   experience in the 1980s and 1990s with mixing systems in which
   domain name labels were read in network order (left to right) and
   those in which those labels were read right to left would predict a
   great deal of confusion, and heuristics that sometimes fail, if each
   implementation of each application makes its own decisions on these
   issues.

   It should be obvious that any revision of IDNA, including the
   current one, must be clear about the network (transmission on the
   wire) order of characters in labels and for the labels in complete
   (fully-qualified) domain names.  In order to prevent user confusion
   and, in particular, to reduce the chances for inconsistent
   transcription of domain names from printed form, it is likely that
   some strong suggestions should be made about display order as well.

7.2.  Entry and Display in Applications

   Applications can accept domain names using any character set or sets
   desired by the application developer or specified by the operating
   system, and can display domain names in any charset.  That is, the
   IDNA protocol does not affect the interface between users and
   applications.

   An IDNA-aware application can accept and display internationalized
   domain names in two formats: the internationalized character set(s)
   supported by the application (i.e., an appropriate local
   representation of a U-label), and as an A-label.  Applications MAY
   allow the display and user input of A-labels, but are encouraged to
   not do so except as an interface for special purposes, possibly for
   debugging, or to cope with display limitations.  A-labels are opaque
   and ugly, and, where possible, should thus only be exposed to users
   and in contexts in which they are absolutely needed.  Because IDN
   labels can be rendered either as the A-labels or U-labels, the
   application may reasonably have an option for the user to select the
   preferred method of display; if it does, rendering the U-label
   should normally be the default.

   Domain names are often stored and transported in many places.  For
   example, they are part of documents such as mail messages and web
   pages.
   They are transported in many parts of many protocols, such as both
   the control commands and the RFC 2822 body parts of SMTP, and the
   headers and the body content in HTTP.  It is important to remember
   that domain names appear both in domain name slots and in the
   content that is passed over protocols.

   In protocols and document formats that define how to handle
   specification or negotiation of charsets, labels can be encoded in
   any charset allowed by the protocol or document format.  If a
   protocol or document format only allows one charset, the labels MUST
   be given in that charset.  Of course, not all charsets can properly
   represent all labels.  If a U-label cannot be displayed in its
   entirety, the only choice (without loss of information) may be to
   display the A-label.

   In any place where a protocol or document format allows transmission
   of the characters in internationalized labels, labels SHOULD be
   transmitted using whatever character encoding and escape mechanism
   the protocol or document format uses at that place.  This provision
   is intended to prevent situations in which, e.g., UTF-8 domain names
   appear embedded in text that is otherwise in some other character
   coding.

   All protocols that use domain name slots already have the capacity
   for handling domain names in the ASCII charset.  Thus, A-labels can
   inherently be handled by those protocols.

7.3.  Linguistic Expectations: Ligatures, Digraphs, and Alternate
      Character Forms

   Users often have expectations about character matching or
   equivalence that are based on their languages and the orthography of
   those languages.  These expectations may not be consistent with
   forms or actions that can be naturally accommodated in a character
   coding system, especially if multiple languages are written using
   the same script but using different conventions.
   A Norwegian user might expect a label with the ae-ligature to be
   treated as the same label as one using the Swedish spelling with
   a-umlaut even though applying that mapping to English would be
   astonishing to users.  A German user might expect a label with an
   o-umlaut and a label that had "oe" substituted, but was otherwise
   the same, to be treated as equivalent even though that substitution
   would be a clear error in Swedish.  A Chinese user might expect
   automatic matching of Simplified and Traditional Chinese characters,
   but applying that matching for Korean or Japanese text would create
   considerable confusion.  For that matter, an English user might
   expect "theater" and "theatre" to match.

   Related issues arise because there are a number of languages written
   with alphabetic scripts in which single phonemes are written using
   two characters, termed a "digraph", for example, the "ph" in
   "pharmacy" and "telephone".  (Note that characters paired in this
   manner can also appear consecutively without forming a digraph, as
   in "tophat".)  Certain digraphs are normally indicated
   typographically by setting the two characters closer together than
   they would be if used consecutively to represent different phonemes.
   Some digraphs are fully joined as ligatures (strictly designating
   setting totally without intervening white space, although the term
   is sometimes applied to close set pairs).  An example of this may be
   seen when the word "encyclopaedia" is set with a U+00E6 LATIN SMALL
   LIGATURE AE (and some would not consider that word correctly spelled
   unless the ligature form was used or the "a" was dropped entirely).
   When these ligature and digraph forms have the same interpretation
   across all languages that use a given script, application of Unicode
   normalization generally resolves the differences and causes them to
   match.
   When they have different interpretations, any requirements for
   matching must utilize other methods or users must be educated to
   understand that matching will not occur.

   Difficulties arise from the fact that a given ligature may be a
   completely optional typographic convenience for representing a
   digraph in one language (as in the above example with some spelling
   conventions), while in another language it is a single character
   that may not always be correctly representable by a two-letter
   sequence (as in the above example with different spelling
   conventions).  This can be illustrated by many words in the
   Norwegian language, where the "ae" ligature is the 27th letter of a
   29-letter extended Latin alphabet.  It is equivalent to the 28th
   letter of the Swedish alphabet (also containing 29 letters), U+00E4
   LATIN SMALL LETTER A WITH DIAERESIS, for which an "ae" cannot be
   substituted according to current orthographic standards.

   That character (U+00E4) is also part of the German alphabet where,
   unlike in the Nordic languages, the two-character sequence "ae" is
   usually treated as a fully acceptable alternate orthography.  The
   inverse is however not true, and those two characters cannot
   necessarily be combined into an "umlauted a".  This also applies to
   another German character, the "umlauted o" (U+00F6 LATIN SMALL
   LETTER O WITH DIAERESIS) which, for example, cannot be used for
   writing the name of the author "Goethe".  It is also a letter in the
   Swedish alphabet where, in parallel to the "umlauted a", it cannot
   be correctly represented as "oe" and in the Norwegian alphabet,
   where it is represented, not as "umlauted o", but as "slashed o",
   U+00F8.

   Some of the ligatures that have explicit code points in Unicode were
   given special handling in IDNA2003 and now pose additional problems
   as people argue that they should have been treated differently to
   preserve important information.  For example, the German character
   Eszett (Sharp S, U+00DF) is retained as itself by NFKC but
   case-folded by Stringprep to "ss", but the closely-related, but less
   frequently seen, character "Long S T" (U+FB05) is a compatibility
   character that is mapped out by NFKC.  Unless exceptions are made,
   both will be treated as DISALLOWED by IDNA2008.  But there is
   significant interest in an exception, especially for Eszett.
   Depending on what the exception was, making it would either raise
   some backward compatibility problems with IDNA2003 or create an
   unusual special case that would highlight differences in preferred
   orthography between German as written in Germany and German as
   written in some other countries, notably Switzerland.  Additional
   discussion of issues with Eszett appears in Section 10.7.

   Additional cases with alphabets written right to left are described
   in Section 7.5.

   Whether ligatures and digraphs are to be treated as a sequence of
   characters or as a single standalone one constitutes a problem that
   cannot be resolved solely by operating on scripts.  They are,
   however, a key concern in the IDN context.  Their satisfactory
   resolution will require support in policies set by registries, which
   therefore need to be particularly mindful not just of this specific
   issue, but of all other related matters that cannot be dealt with on
   an exclusively algorithmic basis.
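
   The differing treatment of Eszett and "Long S T" noted earlier in
   this section can be observed with standard Unicode operations.  The
   Python sketch below is illustrative only; it uses the language's
   built-in full Unicode case folding (str.casefold) as a stand-in for
   the Stringprep case-folding step, which is an assumption and not the
   IDNA2003 procedure itself.

```python
import unicodedata

eszett = "\u00df"      # LATIN SMALL LETTER SHARP S
long_s_t = "\ufb05"    # LATIN SMALL LIGATURE LONG S T
ae = "\u00e6"          # the "ae" ligature discussed above

# NFKC leaves Eszett alone, but case folding rewrites it as two letters.
print(unicodedata.normalize("NFKC", eszett) == eszett)
print(eszett.casefold())

# "Long S T" is a compatibility character, so NFKC alone maps it away.
print(unicodedata.normalize("NFKC", long_s_t))

# The "ae" ligature has no decomposition: normalization does not make it
# match the two-letter sequence "ae".
print(unicodedata.normalize("NFKC", ae) == "ae")
```

   The last line shows why language-dependent equivalences of this kind
   are left to registry policy rather than to normalization.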

   Just as with the examples of different-looking characters that may
   be assumed to be the same, it is in general impossible to deal with
   these situations in a system such as IDNA -- or with Unicode
   normalization generally -- since determining what to do requires
   information about the language being used, context, or both.
   Consequently, these specifications make no attempt to treat these
   combined characters in any special way.  However, their existence
   provides a prime example of a situation in which a registry that is
   aware of the language context in which labels are to be registered,
   and where that language sometimes (or always) treats the
   two-character sequences as equivalent to the combined form, should
   give serious consideration to applying a "variant" model [RFC3743]
   [RFC4290] to reduce the opportunities for user confusion and fraud
   that would result from the related strings being registered to
   different parties.

7.4.  Case Mapping and Related Issues

   Traditionally in the DNS, ASCII letters have been stored with their
   case preserved.  Matching during the query process has been
   case-independent, but none of the information that might be
   represented by choices of case has been lost.  That model has been
   accidentally helpful because, as people have created DNS labels by
   catenating words (or parts of words) to form labels, case has often
   been used to distinguish among components and make the labels more
   memorable.

   The solution of keeping the characters separate but doing matching
   independent of case is not feasible with an IDNA-like model because
   the matching would then have to be done on the server rather than
   have characters mapped on the client.  That situation was recognized
   in IDNA2003 and nothing in IDNA2008 fundamentally changes it or
   could do so.
In IDNA2003, all upper-case characters are mapped to lower- 1118 case ones and, in general, all code points that represent alternate 1119 forms of the same character are mapped to that character (including 1120 mapping Greek final form sigma to the medial form). IDNA2008 1121 permits, at the risk of some incompatibility, slightly more 1122 flexibility in this area. That additional flexibility still does not 1123 solve the problem with final form sigma and other characters that 1124 Unicode treats as completely separate characters that match only 1125 under casemapping if at all. Many people now believe these should be 1126 handled as separate characters so information about them can be 1127 preserved in the transformations to A-labels and back. However, 1128 making a change to permit that behavior would create a situation in 1129 which the same string, valid in both protocols, would be interpreted 1130 differently by IDNA2003 and IDNA2008. In principle, that would 1131 violate one of the conditions discussed in Section 10.3.1 and hence 1132 require a prefix change. Of course, if a prefix change were made (at 1133 the costs discussed in Section 10.3.3) there would be several 1134 options, including, if desired, assigning the character to the 1135 CONTEXTUAL RULE REQUIRED category and requiring that it only be used 1136 in carefully-selected contexts. 1138 7.5. Right to Left Text 1140 In order to be sure that the directionality of right to left text is 1141 unambiguous, IDNA2003 required that any label in which right to left 1142 characters appear both start and end with them and not include any 1143 characters with strong left to right properties (which excludes other 1144 alphabetic characters but permits European digits), and it rejected any 1145 other string that contains a right to left character. This is one of 1146 the few places where the IDNA algorithms (both old and new) are 1147 required to look at an entire label, not just at individual 1148 characters.
The algorithmic model used in IDNA2003 rejects the label 1149 when the final character in a right to left string requires a 1150 combining mark in order to be correctly represented. 1152 This problem manifests itself in languages written with consonantal 1153 alphabets to which diacritical vocalic systems are applied, and in 1154 languages with orthographies derived from them where the combining 1155 marks may have different functionality. In both cases the combining 1156 marks can be essential components of the orthography. Examples of 1157 this are Yiddish, written with an extended Hebrew script, and Dhivehi 1158 (the official language of Maldives) which is written in the Thaana 1159 script (which is, in turn, derived from the Arabic script). The new 1160 rules for right to left scripts are described in [IDNA2008-Bidi]. 1162 8. IDNs and the Robustness Principle 1164 The model of IDNs described in this document can be seen as a 1165 particular instance of the "Robustness Principle" that has been so 1166 important to other aspects of Internet protocol design. This 1167 principle is often stated as "Be conservative about what you send and 1168 liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2 1169 [RFC1123]). For IDNs to work well, not only must the protocol be 1170 carefully designed and implemented, but zone administrators 1171 (registries) must have and require sensible policies about what is 1172 registered -- conservative policies -- and implement and enforce 1173 them. 1175 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels 1176 that clearly violate global (protocol) rules (no one has ever 1177 seriously claimed that being liberal in what is accepted requires 1178 being stupid). 
However, once one gets past such global rules and 1179 deals with anything sensitive to script or locale, it is necessary to 1180 assume that garbage has not been placed into the DNS, i.e., one must 1181 be liberal about what one is willing to look up in the DNS rather 1182 than guessing about whether it should have been permitted to be 1183 registered. 1185 As mentioned elsewhere, if a string doesn't resolve, it makes no 1186 difference whether it simply wasn't registered or was prohibited by 1187 some rule. 1189 If resolvers, as a user interface (UI) or other local matter, decide 1190 to warn about some strings that are valid under the global rules but 1191 that they perceive as dangerous, that is their prerogative and we can 1192 only hope that the market (and maybe regulators) will reinforce the 1193 good choices and discourage the poor ones. In this context, a 1194 resolver that decides a string that is valid under the protocol is 1195 dangerous and refuses to look it up is in violation of the protocols; 1196 one that is willing to look something up, but warns against it, is 1197 exercising a local choice. 1199 9. Front-end and User Interface Processing 1201 Domain names may be identified and processed in many contexts. They 1202 may be typed in by users either by themselves or as part of URIs or 1203 IRIs. They may occur in running text or be processed by one system 1204 after being provided in another. Systems may wish to try to normalize 1205 URLs so as to determine (or guess) whether a reference is valid or 1206 whether two references point to the same object without actually looking the 1207 objects up and comparing them. Some of these goals may be more 1208 easily and reliably satisfied than others.
While there are strong 1209 arguments for any domain name that is placed "on the wire" -- 1210 transmitted between systems -- to be in the minimum-ambiguity forms 1211 of A-labels, U-labels, or LDH-labels, it is inevitable that programs 1212 that process domain names will encounter variant forms. One source 1213 of such forms will be labels created under IDNA2003. Because of the 1214 way that protocol was specified, there are a significant number of 1215 domain names in files on the Internet that use characters that cannot 1216 be represented directly in domain names but for which interpretations 1217 are provided. There are two major categories of such characters, 1218 those that are removed by NFKC normalization and those upper-case 1219 characters that are mapped to lower-case (there are also a few 1220 characters that are given special-case mapping treatment in 1221 Stringprep). 1223 Other issues in domain name identification and processing arise 1224 because IDNA2003 specified that several other characters be treated 1225 as equivalent to the ASCII period (dot, full stop) character used as 1226 a label separator. If a domain name appears in an arbitrary context 1227 (such as running text), one may be faced with the requirement to know 1228 that a string is a domain name in order to adjust for the different 1229 forms of dots but also to have traditional dots to recognize that a 1230 string is a domain name -- an obvious contradiction. 1232 As discussed elsewhere in this document, the IDNA2008 model removes 1233 all of these mappings and interpretations, including the equivalence 1234 of different forms of dots, from the protocol, leaving such mappings 1235 to local processing. This should not be taken to imply that local 1236 processing is optional or can be avoided entirely. 
Instead, unless 1237 the program context is such that it is known that any IDNs that 1238 appear will be either U-labels or A-labels, some local processing of 1239 apparent domain name strings will be required, both to maintain 1240 compatibility with IDNA2003 and to prevent user astonishment. Such 1241 local processing, while not specified in this document or the 1242 associated ones, will generally take one of two forms: 1244 o Generic Preprocessing. 1245 When the context in which the program or system that processes 1246 domain names operates is global, a reasonable balance must be 1247 found that is sensitive to the broad range of local needs and 1248 assumptions while, at the same time, not sacrificing the needs of 1249 one language, script, or user population to those of another. 1251 For this case, the best practice will usually be to apply NFKC and 1252 case-mapping (or, perhaps better yet, Stringprep itself), plus 1253 dot-mapping where appropriate, to the domain name string prior to 1254 applying IDNA. That practice will not only yield a reasonable 1255 compromise of user experience with protocol requirements but will 1256 be almost completely compatible with the various forms permitted 1257 by IDNA2003. 1259 o Highly Localized Preprocessing. 1260 Unlike the case above, there will be some situations in which 1261 software will be highly localized for a particular environment and 1262 carefully adapted to the expectations of users in that 1263 environment. The many discussions about using the Internet to 1264 preserve and support local cultures suggest that these cases may 1265 be more common in the future than they have been so far. 1267 In these cases, we should avoid trying to tell implementers what 1268 they should do, if only because they are quite likely (and for 1269 good reason) to ignore us. We would assume that they would map 1270 characters that the intuitions of their users would suggest be 1271 mapped. 
One can imagine switches about whether some sorts of 1272 mappings occur, warnings before applying them or, in a slightly 1273 more extreme version of the approach taken in Internet Explorer 1274 version 7 (IE7), an utter refusal to handle "strange" characters at 1275 all if they appear in U-label form. None of those local decisions 1276 is a threat to interoperability as long as (i) only U-labels and 1277 A-labels are used in interchange with systems outside the local 1278 environment, (ii) no character that would be valid in a U-label as 1279 itself is mapped to something else, (iii) any local mappings are 1280 applied as a preprocessing step (or, for conversions from U-labels 1281 or A-labels to presentation forms, postprocessing), not as part of 1282 IDNA processing proper, and (iv) appropriate consideration is 1283 given to labels that might have entered the environment in 1284 conformance to IDNA2003. 1286 10. Migration and Version Synchronization 1288 10.1. Design Criteria 1290 As mentioned above and in RFC 4690, two key goals of this work are to 1291 enable applications to be agnostic about whether they are being run 1292 in environments supporting any Unicode version from 3.2 onward and to 1293 permit incrementally adding permitted scripts and other character 1294 collections without disruption or, subsequent to this version, 1295 "heavy" processes such as formation of an IETF WG. The mechanisms 1296 that support this are outlined above, but this section reviews them 1297 in a context that may be more helpful to those who need to understand 1298 the approach and make plans for it. 1300 10.1.1. General IDNA Validity Criteria 1302 The general criteria for a putative label, and the collection of 1303 characters that make it up, to be considered IDNA-valid are: 1305 o The characters are "letters", marks needed to form letters, 1306 numerals, or other code points used to write words in some 1307 language.
Symbols, drawing characters, and various notational 1308 characters are permanently excluded -- some because they are 1309 actively dangerous in URI, IRI, or similar contexts and others 1310 because there is no evidence that they are important enough to 1311 Internet operations or internationalization to justify inclusion 1312 and the complexities that would come with it (additional 1313 discussion and rationale for the symbol decision appear in 1314 Section 10.5). 1316 o Other than in very exceptional cases, e.g., where they are needed 1317 to write substantially any word of a given language, punctuation 1318 characters are excluded as well. The fact that a word exists is 1319 not proof that it should be usable in a DNS label and DNS labels 1320 are not expected to be usable for multiple-word phrases (although 1321 they are certainly not prohibited if the conventions and 1322 orthography of a particular language cause that to be possible). 1324 o Characters that are unassigned (have no character assignment at 1325 all) in the version of Unicode being used by the registry or 1326 application are not permitted, even on resolution (lookup). There 1327 are at least two reasons for this. First, tests involving the context of 1328 characters (e.g., some characters being permitted only adjacent to 1329 ones of specific types but otherwise invisible or very problematic 1330 for other reasons) and integrity tests on complete labels are 1331 needed. Unassigned code points cannot be permitted because one 1332 cannot determine whether particular code points will require 1333 contextual rules (and what those rules should be) before 1334 characters are assigned to them and the properties of those 1335 characters fully understood. Second, Unicode specifies that an 1336 unassigned code point normalizes and case folds to itself.
If the 1337 code point is later assigned to a character, and particularly if 1338 the newly-assigned code point has a combining class that 1339 determines its placement relative to other combining characters, 1340 it could normalize to some other code point or sequence, creating 1341 confusion and/or violating other rules listed here. 1343 o Any character that is mapped to another character by Nameprep2003 1344 or by a current version of NFKC is prohibited as input to IDNA 1345 (for either registration or resolution). Implementers of user 1346 interfaces to applications are free to make those conversions when 1347 they consider them suitable for their operating system 1348 environments, context, or users. 1350 Tables used to identify the characters that are IDNA-valid are 1351 expected to be driven by the principles above (described in more 1352 precise form in [IDNA2008-Tables]). The principles are not just an 1353 interpretation of the tables. 1355 10.1.2. Labels in Registration 1357 Anyone entering a label into a DNS zone must properly validate that 1358 label -- i.e., be sure that the criteria for an A-label are met -- in 1359 order for Unicode version-independence to be possible. In 1360 particular: 1362 o Any label that contains hyphens as its third and fourth characters 1363 MUST be IDNA-valid. This implies that, (i) if the third and 1364 fourth characters are hyphens, the first and second ones MUST be 1365 "xn" until and unless this specification is updated to permit 1366 other prefixes and (ii) labels starting in "xn--" MUST be valid 1367 A-labels, as discussed in Section 3 above. 1369 o The Unicode tables (i.e., tables of code points, character 1370 classes, and properties) and IDNA tables (i.e., tables of 1371 contextual rules such as those described above), MUST be 1372 consistent on the systems performing or validating labels to be 1373 registered. 
Note that this does not require that tables reflect 1374 the latest version of Unicode, only that all tables used on a 1375 given system are consistent with each other. 1377 Under this model, a registry (or entity communicating with a registry 1378 to accomplish name registrations) will need to update its tables -- 1379 both the Unicode-associated tables and the tables of permitted IDN 1380 characters -- to enable a new script or other set of new characters. 1381 It will not be affected by newer versions of Unicode, or newly- 1382 authorized characters, until and unless it wishes to make those 1383 registrations. The registration side is also responsible --under the 1384 protocol and to registrants and users-- for much more careful 1385 checking than is expected of applications systems that look names up, 1386 both checking as required by the protocol and checking required by 1387 whatever policies it develops for minimizing risks due to confusable 1388 characters and sequences and preserving language or script integrity. 1390 Systems looking up or resolving DNS labels MUST be able to assume 1391 that applicable registration rules were followed for names entered 1392 into the DNS. 1394 10.1.3. Labels in Resolution (Lookup) 1396 Anyone looking up a label in a DNS zone 1398 o MUST maintain a consistent set of tables, as discussed above. As 1399 with registration, the tables need not reflect the latest version 1400 of Unicode but they MUST be consistent. 1402 o MUST validate the characters in labels to be looked up only to the 1403 extent of determining that the U-label does not contain either 1404 code points prohibited by IDNA (categorized as "DISALLOWED") or 1405 code points that are unassigned in its version of Unicode. 
1407 o MUST validate the label itself for conformance with a small number 1408 of whole-label rules, notably verifying that there are no leading 1409 combining marks, that the "bidi" conditions are met if right to 1410 left characters appear, that any required contextual rules are 1411 available, and that, if such rules are associated with Joiner 1412 Controls, they are tested. 1414 o MUST NOT validate other contextual rules about characters, 1415 including mixed-script label prohibitions, although such rules MAY 1416 be used to influence presentation decisions in the user interface. 1418 By avoiding applying its own interpretation of which labels are valid 1419 as a means of rejecting lookup attempts, the resolver application 1420 becomes less sensitive to version incompatibilities with the 1421 particular zone registry associated with the domain name. 1423 An application or client that looks names up in the DNS will be able 1424 to resolve any name that is validly registered, as long as its 1425 version of the Unicode-associated tables is sufficiently up-to-date 1426 to interpret all of the characters in the label. It SHOULD 1427 distinguish, in its messages to users, between "label contains an 1428 unallocated code point" and other types of lookup failures. A 1429 failure on the basis of an old version of Unicode may lead the user 1430 to a desire to upgrade to a newer version, but will have no other ill 1431 effects (this is consistent with behavior in the transition to the 1432 DNS when some hosts could not yet handle some forms of names or 1433 record types). 1435 10.2. More Flexibility in User Agents 1437 These specifications do not perform mappings between one character or 1438 code point and others for any reason. Instead, they prohibit the 1439 characters that would be mapped to others by normalization, case 1440 folding, or other rules.
As examples, while mathematical characters 1441 based on Latin ones are accepted as input to IDNA2003, they are 1442 prohibited in IDNA2008. Similarly, double-width characters and other 1443 variations are prohibited as IDNA input. 1445 Since the rules in [IDNA2008-Tables] provide that only strings that 1446 are stable under NFKC are valid, if it is convenient for an 1447 application to perform NFKC normalization before lookup, that 1448 operation is safe since this will never make the application unable 1449 to look up any valid string. 1451 In many cases these prohibitions should have no effect on what the 1452 user can type at resolution time. It is perfectly reasonable for 1453 systems that support user interfaces to perform some character 1454 mapping that is appropriate to the local environment. This would 1455 normally be done prior to actual invocation of IDNA. At least 1456 conceptually, the mapping would be part of the Unicode conversions 1457 discussed above and in [IDNA2008-Protocol]. However, those changes 1458 will be local ones only -- local to environments in which users will 1459 clearly understand that the character forms are equivalent. For use 1460 in interchange among systems, it appears to be much more important 1461 that U-labels and A-labels can be mapped back and forth without loss 1462 of information. 1464 One specific, and very important, instance of this strategy arises 1465 with case-folding. In the ASCII-only DNS, names are looked up and 1466 matched in a case-independent way, but no actual case-folding occurs. 1467 Names can be placed in the DNS in either upper or lower case form (or 1468 any mixture of them) and that form is preserved, returned in queries, 1469 and so on. IDNA2003 simulated that behavior by performing case- 1470 mapping at registration time (resulting in only lower-case IDNs in 1471 the DNS) and when names were looked up. 
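The kind of local mapping that a user agent might apply before invoking IDNA can be sketched as follows. This is a minimal illustration under stated assumptions, not a definition of any required behavior: the function names preprocess_for_lookup and is_nfkc_stable are hypothetical, str.lower() stands in for whatever case mapping is natural in the local environment, and the three dot characters are those that IDNA2003 (RFC 3490) treated as equivalent to the ASCII period.

```python
import unicodedata

# Characters that IDNA2003 treated as equivalent to the ASCII dot.
IDNA2003_DOTS = "\u3002\uff0e\uff61"


def is_nfkc_stable(text):
    """True if NFKC normalization leaves the string unchanged."""
    return unicodedata.normalize("NFKC", text) == text


def preprocess_for_lookup(name):
    """Hypothetical local mapping applied before IDNA processing proper."""
    for dot in IDNA2003_DOTS:
        name = name.replace(dot, ".")
    # Normalize, lower-case, then normalize again: case mapping can
    # occasionally produce a string that is no longer in NFKC form.
    name = unicodedata.normalize("NFKC", name)
    name = name.lower()
    return unicodedata.normalize("NFKC", name)


# A fullwidth "E", upper-case ASCII, and an ideographic full stop.
typed = "\uff25xample\u3002COM"
print(preprocess_for_lookup(typed))
print(is_nfkc_stable(preprocess_for_lookup(typed)))
```

Because the mapping runs entirely before IDNA is invoked, a more localized environment can replace it with different rules without affecting what is exchanged between systems.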
1473 As suggested earlier in this section, it appears to be desirable to 1474 do as little character mapping as possible consistent with having 1475 Unicode work correctly (e.g., NFC mapping to resolve different 1476 codings for the same character is still necessary although the 1477 specifications require that it be performed prior to invoking the 1478 protocol) and to make the mapping between A-labels and U-labels 1479 idempotent. Case-mapping is not an exception to this principle. If 1480 only lower case characters can be registered in the DNS (i.e., be 1481 present in a U-label), then IDNA2008 should prohibit upper-case 1482 characters as input. Some other considerations reinforce this 1483 conclusion. For example, an essential element of the ASCII case- 1484 mapping functions is that uppercase(character) must be equal to 1485 uppercase(lowercase(character)). That requirement may not be 1486 satisfied with IDNs. The relationship between upper case and lower 1487 case may even be language-dependent, with different languages (or 1488 even the same language in different areas) expecting different 1489 mappings. Of course, the expectations of users who are accustomed to 1490 a case-insensitive DNS environment will probably be well-served if 1491 user agents perform case mapping prior to IDNA processing, but the 1492 IDNA procedures themselves should neither require such mapping nor 1493 expect them when they are not natural to the localized environment. 1495 10.3. The Question of Prefix Changes 1497 The conditions that would require a change in the IDNA "prefix" 1498 ("xn--" for the version of IDNA specified in [RFC3490]) have been a 1499 great concern to the community. A prefix change would clearly be 1500 necessary if the algorithms were modified in a manner that would 1501 create serious ambiguities during subsequent transition in 1502 registrations. 
This section summarizes our conclusions about the 1503 conditions under which changes in prefix would be necessary and the 1504 implications of such a change. 1506 10.3.1. Conditions Requiring a Prefix Change 1508 An IDN prefix change is needed if a given string would resolve or 1509 otherwise be interpreted differently depending on the version of the 1510 protocol or tables being used. Consequently, work to update IDNs 1511 would require a prefix change if, and only if, one of the following 1512 four conditions were met: 1514 1. The conversion of an A-label to Unicode (i.e., a U-label) yields 1515 one string under IDNA2003 (RFC3490) and a different string under 1516 IDNA2008. 1518 2. An input string that is valid under IDNA2003 and also valid under 1519 IDNA2008 yields two different A-labels with the different 1520 versions of IDNA. This condition is believed to be essentially 1521 equivalent to the one above. 1523 Note, however, that if the input string is valid under one 1524 version and not valid under the other, this condition does not 1525 apply. See the first item in Section 10.3.2, below. 1527 3. A fundamental change is made to the semantics of the string that 1528 is inserted in the DNS, e.g., if a decision were made to try to 1529 include language or specific script information in that string, 1530 rather than having it be just a string of characters. 1532 4. A sufficiently large number of characters is added to Unicode so 1533 that the Punycode mechanism for block offsets no longer has 1534 enough capacity to reference the higher-numbered planes and 1535 blocks. This condition is unlikely even in the long term and 1536 certain not to arise in the next few years. 1538 10.3.2. Conditions Not Requiring a Prefix Change 1540 In particular, as a result of the principles described above, none of 1541 the following changes require a new prefix: 1543 1. Prohibition of some characters as input to IDNA. 
This may make 1544 names that are now registered inaccessible, but does not require 1545 a prefix change. 1547 2. Adjustments in Stringprep tables or IDNA actions, including 1548 normalization definitions, that affect characters that were 1549 already invalid under IDNA2003. 1551 3. Changes in the style of definitions of Stringprep or Nameprep 1552 that do not alter the actions performed by them. 1554 Of course, because these specifications do not involve changes to 1555 Stringprep or Nameprep, the third condition above and part of the 1556 second are moot. 1558 10.3.3. Implications of Prefix Changes 1560 While it might be possible to make a prefix change, the costs of such 1561 a change are considerable. Even if they wanted to do so, all 1562 registries could not convert all IDNA2003 ("xn--") registrations to a 1563 new form at the same time and synchronize that change with 1564 applications supporting lookup. Unless all existing registrations 1565 were simply to be declared invalid, and perhaps even then, systems 1566 that needed to support both labels with old prefixes and labels with 1567 new ones would first process a putative label under the IDNA2008 1568 rules and try to look it up and then, if it were not found, would 1569 process the label under IDNA2003 rules and look it up again. That 1570 process could significantly slow down all processing that involved 1571 IDNs in the DNS especially since, in principle, a fully-qualified 1572 name could contain a mixture of labels that were registered with the 1573 old and new prefixes, a situation that would make the use of DNS 1574 caching very difficult. In addition, looking up the same input 1575 string as two separate A-labels would create some potential for 1576 confusion and attacks, since they could, in principle, resolve to 1577 different targets. 
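The risk of one input string yielding two different A-labels can be made concrete with Eszett. The sketch below relies on Python's built-in "idna" codec, which implements IDNA2003, and on its "punycode" codec. The second encoding is purely hypothetical: it shows what would be placed in the DNS if Eszett were kept as itself under the existing prefix, which is not something this document specifies.

```python
# Python's built-in "idna" codec implements IDNA2003 (RFC 3490 with Nameprep).
label = "stra\u00dfe"

# IDNA2003: Nameprep maps Eszett to "ss", leaving a plain ASCII label
# that needs no ACE prefix at all.
a_label_2003 = label.encode("idna")

# Hypothetical alternative (an assumption for illustration): keep Eszett
# as itself and Punycode-encode the label under the same "xn--" prefix.
a_label_unmapped = b"xn--" + label.encode("punycode")

print(a_label_2003)
print(a_label_unmapped)
print(a_label_2003 == a_label_unmapped)
```

The two results are distinct DNS labels and could be registered to different parties, which is the confusion and attack potential described above.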
1579 Consequently, a prefix change is to be avoided if at all possible, 1580 even if it means accepting some IDNA2003 decisions about character 1581 distinctions as irreversible. 1583 10.4. Stringprep Changes and Compatibility 1585 Concerns have been expressed about problems for non-DNS uses of 1586 Stringprep being caused by changes to the specification intended to 1587 improve the handling of IDNs, most notably as this might affect 1588 identification and authentication protocols. Section 10.3, above, 1589 essentially also applies in this context. The proposed new inclusion 1590 tables [IDNA2008-Tables], the reduction in the number of characters 1591 permitted as input for registration or resolution (Section 6), and 1592 even the proposed changes in handling of right to left strings 1593 [IDNA2008-Bidi] either give interpretations to strings prohibited 1594 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings 1595 that are valid under both IDNA2003 and IDNA2008, and the 1596 corresponding versions of Stringprep, are not changed in 1597 interpretation. This protocol does not use either Nameprep or 1598 Stringprep as specified in IDNA2003. 1600 It is particularly important to keep IDNA processing separate from 1601 processing for various security protocols because some of the 1602 constraints that are necessary for smooth and comprehensible use of 1603 IDNs may be unwanted or undesirable in other contexts. For example, 1604 the criteria for good passwords or passphrases are very different 1605 from those for desirable IDNs. Similarly, internationalized SCSI 1606 identifiers and other protocol components are likely to have 1607 different requirements than IDNs. 
1609 Perhaps even more important in practice, since most other known uses 1610 of Stringprep encode or process characters that are already in 1611 normalized form and expect the use of only those characters that can 1612 be used in writing words of languages, the changes proposed here and 1613 in [IDNA2008-Tables] are unlikely to have any effect at all, 1614 especially not on registries and registrations that follow rules 1615 already in existence when this work started. 1617 10.5. The Symbol Question 1619 One of the major differences between this specification and the 1620 original version of IDNA is that the original version permitted non- 1621 letter symbols of various sorts, including punctuation and line- 1622 drawing symbols, in the protocol. They were always discouraged in 1623 practice. In particular, both the "IESG Statement" about IDNA and 1624 all versions of the ICANN Guidelines specify that only language 1625 characters be used in labels. This specification disallows symbols 1626 entirely. There are several reasons for this, which include: 1628 o As discussed elsewhere, the original IDNA specification assumed 1629 that as many Unicode characters as possible should be permitted, 1630 directly or via mapping to other characters, in IDNs. This 1631 specification operates on an inclusion model, extrapolating from 1632 the LDH rules --which have served the Internet very well-- to a 1633 Unicode base rather than an ASCII base. 1635 o Unicode names for letters are, in most cases, fairly 1636 intuitive, unambiguous and recognizable to users of the relevant 1637 script. Symbol names are more problematic because there may be no 1638 general agreement on whether a particular glyph matches a symbol; 1639 there are no uniform conventions for naming; variations such as 1640 outline, solid, and shaded forms may or may not exist; and so on. 1641 As just one example, consider a "heart" symbol as it might appear 1642 in a logo that might be read as "I love...".
While the user might 1643 read such a logo as "I love..." or "I heart...", considerable 1644 knowledge of the coding distinctions made in Unicode is needed to 1645 know that there is more than one "heart" character (e.g., U+2665, 1646 U+2661, and U+2765) and how to describe each. These issues are of 1647 particular importance if strings are expected to be understood or 1648 transcribed by the listener after being read out loud. 1650 o As a simplified example of this, assume one wanted to use a 1651 "heart" or "star" symbol in a label. This is problematic because 1652 those names are ambiguous in the Unicode system of naming (the 1653 actual Unicode names require far more qualification). A user or 1654 would-be registrant has no way to know --absent careful study of 1655 the code tables-- whether it is ambiguous (e.g., where there are 1656 multiple "heart" characters) or not. Conversely, the user seeing 1657 the hypothetical label doesn't know whether to read it --try to 1658 transmit it to a colleague by voice-- as "heart", as "love", as 1659 "black heart", or as any of the other examples below. 1661 o The actual situation is even worse than this. There is no 1662 possible way for a normal, casual, user to tell the difference 1663 between the hearts of U+2665 and U+2765, or between the stars of U+2606 1664 and U+2729, without somehow knowing to look for a 1665 distinction. We have a white heart (U+2661) and a few black hearts, 1666 so describing a label containing a heart symbol is hopelessly 1667 ambiguous. In cities where "Square" is a popular part of a 1668 location name, one might well want to use a square symbol in a 1669 label as well and there are far more squares of various flavors in 1670 Unicode than there are hearts or stars. 1672 o The consequence of these ambiguities of description and 1673 dependencies on distinctions that were, or were not, made in 1674 Unicode codings, is that symbols are a very poor basis for 1675 reliable communication.
Of course, these difficulties with 1676 symbols do not arise with actual pictographic languages and 1677 scripts which would be treated like any other language characters; 1678 the two should not be confused. 1680 [[anchor32: Note in Draft: Should the above section be significantly 1681 trimmed or eliminated?]] 1683 10.6. Migration Between Unicode Versions: Unassigned Code Points 1685 In IDNA2003, labels containing unassigned code points are resolved on 1686 the theory that, if they appear in labels and can be resolved, the 1687 relevant standards must have changed and the registry has properly 1688 allocated only assigned values. 1690 In this specification, strings containing unassigned code points MUST 1691 NOT be either looked up or registered. There are several reasons for 1692 this, with the most important ones being: 1694 o It cannot be known with sufficient reliability in advance that a 1695 code point that was not previously assigned will not be assigned 1696 to a compatibility character. In IDNA2003, since there is no 1697 direct dependency on NFKC (Stringprep's tables are based on NFKC, 1698 but IDNA2003 depends only on Stringprep), allocation of a 1699 compatibility character might produce some odd situations, but it 1700 would not be a problem. In IDNA2008, where compatibility 1701 characters are generally assigned to DISALLOWED, permitting 1702 strings containing unassigned characters to be looked up would 1703 permit violating the principle that characters in DISALLOWED are 1704 not looked up. 1706 o More generally, the status of an unassigned character with regard 1707 to the DISALLOWED and PROTOCOL-VALID categories, and whether 1708 contextual rules are required with the latter, cannot be evaluated 1709 until a character is actually assigned and known. 
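A lookup-side check for unassigned code points might look like the sketch below. It is an illustration only, built on Python's standard unicodedata module; the function names are hypothetical. The result depends on the version of Unicode shipped with the local Python installation, which mirrors the version dependence discussed above. Note also that the general category "Cn" covers noncharacters as well as reserved code points.

```python
import unicodedata


def unassigned_code_points(label):
    """List code points in label that the local Unicode tables leave unassigned.

    General category "Cn" is used as the test; it also matches
    noncharacters, which are equally unsuitable in a label.
    """
    return [f"U+{ord(ch):04X}" for ch in label
            if unicodedata.category(ch) == "Cn"]


def may_look_up(label):
    """Sketch of the rule above: refuse labels with unassigned code points."""
    return not unassigned_code_points(label)


print("Unicode tables in use:", unicodedata.unidata_version)
print(may_look_up("example"))
# U+0378 is assumed to be unassigned in the local tables; it has been
# unassigned in every Unicode version known at the time of writing.
print(unassigned_code_points("a\u0378"))
```

An application could use the list returned by unassigned_code_points to report "label contains an unallocated code point" separately from other lookup failures.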
   It is possible to argue that the issues above are not important and
   that, as a consequence, it is better to retain the principle of
   looking up labels even if they contain unassigned characters because
   all of the important scripts and characters have been coded as of
   Unicode 5.1 and hence unassigned code points will be assigned only to
   obscure characters or archaic scripts.  Unfortunately, that does not
   appear to be a safe assumption for at least two reasons.  First, much
   the same claim of completeness has been made for earlier versions of
   Unicode.  The reality is that a script that is obscure to much of the
   world may still be very important to those who use it.  Cultural and
   linguistic preservation principles make it inappropriate to declare
   the script of no importance in IDNs.  Second, we already have
   counterexamples in, e.g., the relationships associated with new Han
   characters being added (whether in the BMP or in Unicode Plane 2).

10.7.  Other Compatibility Issues

   The existing (2003) IDNA model includes several odd artifacts of the
   context in which it was developed.  Many, if not all, of these are
   potential avenues for exploits, especially if the registration
   process permits "source" names (names that have not been processed
   through IDNA and nameprep) to be registered.  As one example, since
   the character Eszett, used in German, is mapped by IDNA2003 into the
   sequence "ss" rather than being retained as itself or prohibited, a
   string containing that character but that is otherwise in ASCII is
   not really an IDN (in the U-label sense defined above) at all.  After
   Nameprep maps the Eszett out, the result is an ASCII string and so
   does not get an xn-- prefix, but the string that can be displayed to
   a user appears to be an IDN.  The proposed IDNA2008 eliminates this
   artifact.  A character is either permitted as itself or it is
   prohibited; special cases that make sense only in a particular
   linguistic or cultural context can be dealt with as localization
   matters where appropriate.

11.  Acknowledgments

   The editor and contributors would like to express their thanks to
   those who contributed significant early review comments, sometimes
   accompanied by text, especially Mark Davis, Paul Hoffman, Simon
   Josefsson, and Sam Weiler.  In addition, some specific ideas were
   incorporated from suggestions, text, or comments about sections that
   were unclear supplied by Frank Ellerman, Michael Everson, Asmus
   Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler,
   although, as usual, they bear little or no responsibility for the
   conclusions the editor and contributors reached after receiving their
   suggestions.  Thanks are also due to Vint Cerf, Debbie Garside, and
   Jefsey Morphin for conversations that led to considerable
   improvements in the content of this document.

   A meeting was held on 30 January 2008 to attempt to reconcile
   differences in perspective and terminology about this set of
   specifications between the design team and members of the Unicode
   Technical Consortium.  The discussions at and subsequent to that
   meeting were very helpful in focusing the issues and in refining the
   specifications.  The active participants at that meeting were (in
   alphabetic order as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
   Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
   Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
   Michel Suignard, and Ken Whistler.  We express our thanks to Google
   for support of that meeting and to the participants for their
   contributions.
   Special thanks are due to Paul Hoffman for permission to extract
   material from his Internet-Draft to form the basis for Section 2.

12.  Contributors

   While the listed editor held the pen, the core of this document and
   the initial WG version represent the joint work and conclusions of
   an ad hoc design team consisting of the editor and, in alphabetic
   order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp.
   In addition, there were many specific contributions and helpful
   comments from those listed in the Acknowledgments section and others
   who have contributed to the development and use of the IDNA
   protocols.

13.  IANA Considerations

   This section gives an overview of registries required for IDNA.  The
   actual definition of the first one appears in [IDNA2008-Tables].

13.1.  IDNA Character Registry

   The distinction among the three major categories "UNASSIGNED",
   "DISALLOWED", and "PROTOCOL-VALID" is made by special categories and
   rules that are integral elements of [IDNA2008-Tables].  Convenience
   in programming and validation requires a registry of characters and
   scripts and their categories, updated for each new version of Unicode
   and the characters it contains.  The details of this registry are
   specified in [IDNA2008-Tables].

13.2.  IDNA Context Registry

   For characters that are defined in the IDNA Character Registry list
   as PROTOCOL-VALID but requiring a contextual rule (i.e., the types of
   rule described in Section 6.1.1.1), IANA will create and maintain a
   list of approved contextual rules, using the "expert reviewer"
   model.  Unlike usual practice, we recommend that the "expert
   reviewer" be a committee that reflects expertise on the relevant
   scripts, and encourage IANA, the IESG, and IAB to establish liaisons
   and work together with other relevant standards bodies to populate
   that committee and its procedures over the long term.  [[anchor37:
   Note in Draft: This section requires careful review by the WG, since
   "expert review" may not be appropriate but other mechanisms may be
   excessively burdensome.]]

   A table from which that registry can be initialized, and some further
   discussion, appears in [RulesInit].

13.3.  IANA Repository of IDN Practices of TLDs

   This registry, historically described as the "IANA Language Character
   Set Registry" or "IANA Script Registry" (both somewhat misleading
   terms) is maintained by IANA at the request of ICANN.  It is used to
   provide a central documentation repository of the IDN policies used
   by top level domain (TLD) registries who volunteer to contribute to
   it and is used in conjunction with ICANN Guidelines for IDN use.

   It is not an IETF-managed registry and, while the protocol changes
   specified here may call for some revisions to the tables, these
   specifications have no direct effect on that registry and no IANA
   action is required as a result.

14.  Security Considerations

   Security on the Internet partly relies on the DNS.  Thus, any change
   to the characteristics of the DNS can change the security of much of
   the Internet.

   Domain names are used by users to identify and connect to Internet
   servers.  The security of the Internet is compromised if a user
   entering a single internationalized name is connected to different
   servers based on different interpretations of the internationalized
   domain name.
   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the problem of transcoding between the
   local character set and Unicode up to the application or local
   system.  If different applications (or different versions of one
   application) implement different transcoding rules, they could
   interpret the same name differently and contact different servers.
   This problem is not solved by security protocols like TLS that do not
   take local character sets into account.

   To help prevent confusion between characters that are visually
   similar, it is suggested that implementations provide visual
   indications where a domain name contains multiple scripts.  Such
   mechanisms can also be used to show when a name contains a mixture of
   simplified and traditional Chinese characters, or to distinguish zero
   and one from O and l.  DNS zone administrators may impose
   restrictions (subject to the limitations identified elsewhere in this
   document) that try to minimize characters that have similar
   appearance or similar interpretations.  It is worth noting that there
   are no comprehensive technical solutions to the problems of
   confusable characters.  One can reduce the extent of the problems in
   various ways, but probably never eliminate it.  Some specific
   suggestions about identification and handling of confusable
   characters appear in a Unicode Consortium publication
   [Unicode-UTR36].

   The registration and resolution models described above and in
   [IDNA2008-Protocol] change the mechanisms available for applications
   and resolvers to determine the validity of labels they encounter.  In
   some respects, the ability to test is strengthened.  For example,
   putative labels that contain unassigned code points will now be
   rejected, while IDNA2003 permitted them (something that is now
   recognized as a considerable source of risk).  On the other hand, the
   protocol specification no longer assumes that the application that
   looks up a name will be able to determine, and apply, information
   about the protocol version used in registration.  In theory, that may
   increase risk since the application will be able to do less
   pre-lookup validation.  In practice, the protection afforded by that
   test has been largely illusory for reasons explained in RFC 4690 and
   above.

   Any change to Stringprep or, more broadly, the IETF's model of the
   use of internationalized character strings in different protocols,
   creates some risk of inadvertent changes to those protocols,
   invalidating deployed applications or databases, and so on.  Our
   current hypothesis is that the same considerations that would require
   changing the IDN prefix (see Section 10.3.2) are the ones that would,
   e.g., invalidate certificates or hashes that depend on Stringprep,
   but those cases require careful consideration and evaluation.  More
   important, it is not necessary to change Stringprep2003 at all in
   order to make the IDNA changes contemplated here.  It is far
   preferable to create a separate document, or separate profile
   components, for IDN work, leaving the question of upgrading to other
   protocols to experts on them and eliminating any possible
   synchronization dependency between IDNA changes and possible upgrades
   to security protocols or conventions.

   No mechanism involving names or identifiers alone can protect against
   a wide variety of security threats and attacks that are largely
   independent of them, including spoofed pages, DNS query trapping and
   diversion, and so on.

15.  Change Log

   [[anchor40: RFC Editor: Please remove this section.]]

   For version 00 of draft-ietf-idnabis-rationale, this list contains a
   complete trace going back through the earlier, design team, drafts.
   That earlier material will be removed in subsequent drafts.

15.1.  Version -01 of draft-klensin-idnabis-issues

   Version -01 of this document is a considerable rewrite from -00.
   Many sections have been clarified or extended and several new
   sections have been added to reflect discussions in a number of
   contexts since -00 was issued.

15.2.  Version -02 of draft-klensin-idnabis-issues

   o  Corrected several editorial errors including an
      accidentally-introduced misstatement about NFKC.

   o  Extensively revised the document to synchronize its terminology
      with version 03 of [IDNA2008-Tables] and to provide a better
      conceptual framework for its categories and how they are used.
      Added new material to clarify terminology and relationships with
      other efforts.  More subtle changes in this version lay the
      groundwork for separating the document into a conceptual overview
      and a protocol specification for version 03.

15.3.  Version -03 of draft-klensin-idnabis-issues

   o  Removed protocol materials to a separate document and incorporated
      rationale and explanation materials from the original
      specification in RFC 3490 into this document.  Cleaned up earlier
      text to reflect a more mature specification and restructured
      several sections and added additional rationale material.

   o  Strengthened and clarified the A-label / U-label / LDH-label
      definition.

   o  Retitled the document to reflect its evolving role.

15.4.  Version -04 of draft-klensin-idnabis-issues

   o  Moved more text from "protocol" and further reorganized material.

   o  Provided new material on "Contextual Rule Required".

   o  Improved consistency of terminology, both internally and with the
      "tables" document.

   o  Improved the IANA Considerations section and discussed the
      existing IDNA-related registry.

   o  More small changes to increase consistency.

15.5.  Version -05 of draft-klensin-idnabis-issues

   Changed "YES" category back to "ALWAYS" to re-synch with the tables
   document and provide clearer terminology.

15.6.  Version -06 of draft-klensin-idnabis-issues

   o  Clarified the prohibitions on strings that look like A-labels but
      are not and on unassigned code points.

   o  Clarified length restrictions on IDN labels.

   o  Revised the terminology definitions to remove the impression of
      circularity and removed invocations of ToASCII and ToUnicode,
      which do not exist in IDNA2008.

   o  Added a new section on front-end processing.

   o  Added a new section to discuss case-mapping.

   o  Extended the discussion of prefix changes to identify the
      implications of making one.

   o  Several more editorial improvements, corrected references, and
      similar adjustments.

15.7.  Version -07 of draft-klensin-idnabis-issues

   o  Added material that specifically defines the format of contextual
      rules.

   o  Added and altered text after discussions at the 30 January meeting
      (see Section 11) and the follow-up to those discussions.  Among
      the key decisions at that meeting were to eliminate the
      distinction among the valid categories (formerly "ALWAYS", "MAYBE
      YES", and "MAYBE NO"), to adjust the terminology accordingly, and
      to change "CONTEXTUAL RULE REQUIRED" from a separate category in
      this document and the protocol one to a modifier of what is now
      called "PROTOCOL-VALID".  The consequent changes resulted in
      removal of several sections of explanation from this document.

   o  Resynchronized terminology with "protocol" and "tables" documents.

   o  More editorial and typographic corrections.

15.8.  Version -00 of draft-ietf-idnabis-rationale

   o  Rewrote the abstract and introduction, and retuned the title, to
      be more consistent with WG work and activities.  Changed the file
      name to reflect WG naming.
   o  Removed most of the material that explained, or compared this
      approach to, IDNA2003.  Some of this material may appear in the
      non-WG "IDNA-alternatives" draft if it is ever completed.

   o  Changed IDNA200X in terminology and references to IDNA2008.

   o  Added a contextual rule for hyphen to the appendix, adjusted the
      rule syntax slightly, and supplied draft regular expression rules.

   o  Responded to comments produced during the WG charter discussions
      and from several individuals.  In general, comments requesting a
      reorganization of the collection of documents have not been
      responded to pending a WG decision on that topic.

   o  Moved the contextual rule appendix out of here and into
      "Protocol".  It may not belong there either, but definitely does
      not belong here, and was holding up getting this document out.

   o  Many small editorial improvements, including reorganization of
      some material.

   Editorial note: While several sections have been removed from this
   version, the WG should discuss whether further cuts are desirable,
   e.g., whether Section 7.3, Section 7.4, or Section 10.3 provide
   enough value to be worth retaining?  Can Section 10.4 be trimmed
   without loss of useful information and, if so, how?  Section 10.7
   appears critical of IDNA2003 in undesirable ways: should it be
   dropped or do people have suggestions about how to improve it?
   Strong opinions have been expressed that Section 10.5 should be
   trimmed significantly or removed entirely.  The WG will need to
   discuss that too.  Are there other materials that should be trimmed
   out?

16.  References

16.1.  Normative References

   [ASCII]    American National Standards Institute (formerly United
              States of America Standards Institute), "USA Code for
              Information Interchange", ANSI X3.4-1968, 1968.

              ANSI X3.4-1968 has been replaced by newer versions with
              slight modifications, but the 1968 version remains
              definitive for the Internet.

   [IDNA2008-Bidi]
              Alvestrand, H. and C. Karp, "An updated IDNA criterion for
              right to left scripts", February 2008, .

              New version of this document pending as
              draft-ietf-idnabis-bidi-00.

   [IDNA2008-Protocol]
              Klensin, J., "Internationalizing Domain Names in
              Applications (IDNA): Protocol", May 2008, .

   [IDNA2008-Tables]
              Faltstrom, P., "The Unicode Code Points and IDNA",
              April 2008, .

              A version of this document is available in HTML format at
              http://stupid.domain.name/idnabis/
              draft-ietf-idnabis-tables-00.html

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
              Internationalized Strings ("stringprep")", RFC 3454,
              December 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RulesInit]
              Klensin, J., "Internationalizing Domain Names in
              Applications (IDNA): Protocol, Appendix A Contextual Rules
              Table", May 2008, .

              Forthcoming.

   [Unicode-PropertyValueAliases]
              The Unicode Consortium, "Unicode Character Database:
              PropertyValueAliases", March 2008, .

   [Unicode-RegEx]
              The Unicode Consortium, "Unicode Technical Standard #18:
              Unicode Regular Expressions", May 2005, .
   [Unicode-Scripts]
              The Unicode Consortium, "Unicode Standard Annex #24:
              Unicode Script Property", February 2008, .

   [Unicode51]
              The Unicode Consortium, "The Unicode Standard, Version
              5.1.0", 2008.

              defined by: The Unicode Standard, Version 5.0, Boston, MA,
              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
              Unicode 5.1.0
              (http://www.unicode.org/versions/Unicode5.1.0/).

16.2.  Informative References

   [BIG5]     Institute for Information Industry of Taiwan, "Computer
              Chinese Glyph and Character Code Mapping Table, Technical
              Report C-26", 1984.

              There are several forms and variations and a
              closely-related standard, CNS 11643.  See the discussion
              in Chapter 3 of Lunde, K., CJKV Information Processing,
              O'Reilly & Associates, 1999.

   [GB18030]  "Chinese National Standard GB 18030-2000: Information
              Technology -- Chinese ideograms coded character set for
              information interchange -- Extension for the basic set.",
              2000.

   [RFC0810]  Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
              Internet host table specification", RFC 810, March 1982.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC2782]  Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
              specifying the location of services (DNS SRV)", RFC 2782,
              February 2000.

   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
              Engineering Team (JET) Guidelines for Internationalized
              Domain Names (IDN) Registration and Administration for
              Chinese, Japanese, and Korean", RFC 3743, April 2004.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

   [RFC4290]  Klensin, J., "Suggested Practices for Registration of
              Internationalized Domain Names (IDN)", RFC 4290,
              December 2005.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC4713]  Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
              "Registration and Administration Recommendations for
              Chinese Domain Names", RFC 4713, October 2006.

   [Unicode-UTR36]
              The Unicode Consortium, "Unicode Technical Report #36:
              Unicode Security Considerations", August 2006, .

Author's Address

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   USA

   Phone: +1 617 245 1457
   Email: john+ietf@jck.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.