idnits 2.17.00 (12 Aug 2021)
/tmp/idnits9501/draft-ietf-idnabis-rationale-00.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
line 2214.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2225.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2232.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2238.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust Copyright Line does not match the
current year
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (May 10, 2008) is 5123 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
== Unused Reference: 'Unicode-PropertyValueAliases' is defined on line
2109, but no explicit reference was found in the text
== Unused Reference: 'Unicode-RegEx' is defined on line 2114, but no
explicit reference was found in the text
== Unused Reference: 'Unicode-Scripts' is defined on line 2119, but no
explicit reference was found in the text
-- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'
== Outdated reference: draft-ietf-idnabis-protocol has been published as
RFC 5891
== Outdated reference: draft-ietf-idnabis-tables has been published as RFC
5892
** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)
** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)
** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)
== Outdated reference: draft-ietf-idnabis-protocol has been published as
RFC 5891
-- Duplicate reference: draft-ietf-idnabis-protocol, mentioned in
'RulesInit', was also mentioned in 'IDNA2008-Protocol'.
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicode-PropertyValueAliases'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-RegEx'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-Scripts'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode51'
-- Obsolete informational reference (is this intentional?): RFC 810
(Obsoleted by RFC 952)
Summary: 4 errors (**), 0 flaws (~~), 7 warnings (==), 14 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group J. Klensin
3 Internet-Draft May 10, 2008
4 Intended status: Standards Track
5 Expires: November 11, 2008
7 Internationalizing Domain Names for Applications (IDNA): Definitions,
8 Background and Rationale
9 draft-ietf-idnabis-rationale-00.txt
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on November 11, 2008.
36 Abstract
38 Several years have passed since the original protocol for
39 Internationalized Domain Names (IDNs) was completed and deployed.
40 During that time, a number of issues have arisen, including the need
41 to update the system to deal with newer versions of Unicode. Some of
42 these issues require tuning of the existing protocols and the tables
43 on which they depend. This document provides an overview of a
44 revised system and provides explanatory material for its components.
46 Table of Contents
48 1. Introduction
49 1.1. Context and Overview
50 1.2. Discussion Forum
51 1.3. Objectives
52 1.4. Applicability and Function of IDNA
53 1.5. Terminology
54 1.5.1. Documents and Standards
55 1.5.2. Terminology about Characters and Character Sets
56 1.5.3. DNS-related Terminology
57 1.5.4. Terminology Specific to IDNA
58 1.5.5. Punycode is an Algorithm, not a Name
59 1.5.6. Other Terminology Issues
60 1.5.7. Comprehensibility of IDNA Mechanisms and Processing
61 2. Summary of Major Changes from IDNA2003
62 3. The Revised IDNA Model
63 4. Processing in IDNA2008
64 5. IDNA2008 Document List
65 6. Permitted Characters: An Inclusion List
66 6.1. A Tiered Model of Permitted Characters and Labels
67 6.1.1. PROTOCOL-VALID
68 6.1.2. DISALLOWED
69 6.1.3. UNASSIGNED
70 6.2. Registration Policy
71 6.3. Layered Restrictions: Tables, Context, Registration,
72 Applications
73 7. Issues that Constrain Possible Solutions
74 7.1. Display and Network Order
75 7.2. Entry and Display in Applications
76 7.3. Linguistic Expectations: Ligatures, Digraphs, and
77 Alternate Character Forms
78 7.4. Case Mapping and Related Issues
79 7.5. Right to Left Text
80 8. IDNs and the Robustness Principle
81 9. Front-end and User Interface Processing
82 10. Migration and Version Synchronization
83 10.1. Design Criteria
84 10.1.1. General IDNA Validity Criteria
85 10.1.2. Labels in Registration
86 10.1.3. Labels in Resolution (Lookup)
87 10.2. More Flexibility in User Agents
88 10.3. The Question of Prefix Changes
89 10.3.1. Conditions Requiring a Prefix Change
90 10.3.2. Conditions Not Requiring a Prefix Change
91 10.3.3. Implications of Prefix Changes
92 10.4. Stringprep Changes and Compatibility
93 10.5. The Symbol Question
94 10.6. Migration Between Unicode Versions: Unassigned Code
95 Points
96 10.7. Other Compatibility Issues
97 11. Acknowledgments
98 12. Contributors
99 13. IANA Considerations
100 13.1. IDNA Character Registry
101 13.2. IDNA Context Registry
102 13.3. IANA Repository of IDN Practices of TLDs
103 14. Security Considerations
104 15. Change Log
105 15.1. Version -01 of draft-klensin-idnabis-issues
106 15.2. Version -02 of draft-klensin-idnabis-issues
107 15.3. Version -03 of draft-klensin-idnabis-issues
108 15.4. Version -04 of draft-klensin-idnabis-issues
109 15.5. Version -05 of draft-klensin-idnabis-issues
110 15.6. Version -06 of draft-klensin-idnabis-issues
111 15.7. Version -07 of draft-klensin-idnabis-issues
112 15.8. Version -00 of draft-ietf-idnabis-rationale
113 16. References
114 16.1. Normative References
115 16.2. Informative References
116 Author's Address
117 Intellectual Property and Copyright Statements
119 1. Introduction
121 1.1. Context and Overview
123 Several years have passed since the original protocol for
124 Internationalized Domain Names (IDNs) was completed and deployed.
125 During that time, a number of issues have arisen, including a subset
126 of those described in a recent IAB report [RFC4690] and the need to
127 update the system to deal with newer versions of Unicode. The
128 standards defining that protocol are known as Internationalized Domain
129 Names in Applications (IDNA), taken from the name of the highest-level
130 standard within that group (see Section 1.5). Some tuning of the
131 existing protocols and the tables on which they depend is now required.
132 Where it is important to an understanding of the revised protocols, this
133 document further explains the issues that have been encountered. It also
134 provides an overview of the new IDNA model and explanatory material
135 for it. Additional explanatory material for the specific components
136 of the proposals will appear with the associated documents.
138 1.2. Discussion Forum
140 [[anchor4: RFC Editor: please remove this section.]]
142 This work is being discussed in the IETF "idnabis" Working Group and
143 on the mailing list idna-update@alvestrand.no.
145 1.3. Objectives
147 The intent of the IDNA revision effort, and hence of this document
148 and the associated ones, is to increase the usability and
149 effectiveness of internationalized domain names (IDNs) while
150 preserving or strengthening the integrity of references that use
151 them. The original "hostname" character definitions (see, e.g.,
152 [RFC0810]) struck a balance between the creation of useful mnemonics
153 and the introduction of parsing problems or general confusion in the
154 contexts in which domain names are used. Our objective is to
155 preserve that balance while expanding the character repertoire to
156 include extended versions of Roman-derived scripts and scripts that
157 are not Roman in origin. No work of this sort will be able to
158 completely eliminate sources of visual or textual confusion: such
159 confusion is possible even under the original rules where only ASCII
160 characters were permitted. However, one can hope, through the
161 application of different techniques at different points (see
162 Section 6.3), to keep problems to an acceptable minimum. One
163 consequence of this general objective is that the desire of some user
164 or marketing community to use a particular string --whether the
165 reason is to try to write sentences of particular languages in the
166 DNS, to express a facsimile of the symbol for a brand, or for some
167 other purpose-- is not a primary goal within the context of
168 applications in the domain name space.
170 1.4. Applicability and Function of IDNA
172 The IDNA standard does not require any applications to conform to it,
173 nor does it retroactively change those applications. An application
174 can elect to use IDNA in order to support IDN while maintaining
175 interoperability with existing infrastructure. If an application
176 wants to use non-ASCII characters in domain names, IDNA is the only
177 currently-defined option. Adding IDNA support to an existing
178 application entails changes to the application only, and leaves room
179 for flexibility in front-end processing and more specifically in the
180 user interface (see Section 9).
182 A great deal of the discussion of IDN solutions has focused on
183 transition issues and how IDNs will work in a world where not all of
184 the components have been updated. Proposals that were not chosen by
185 the original IDN Working Group would depend on user applications,
186 resolvers, and DNS servers being updated in order for a user to apply
187 an internationalized domain name in any form or coding acceptable
188 under that method. With IDNA, processing must be performed prior to
189 or after access to the DNS, but no changes are needed to the DNS
190 protocol, to any DNS servers, or to the resolvers on users' computers.
192 The IDNA specification solves the problem of extending the repertoire
193 of characters that can be used in domain names to include a large
194 subset of the Unicode repertoire.
196 IDNA does not extend the service offered by DNS to the applications.
197 Instead, the applications (and, by implication, the users) continue
198 to see an exact-match lookup service. Either there is a single
199 exactly-matching name or there is no match. This model has served
200 the existing applications well, but it requires, with or without
201 internationalized domain names, that users know the exact spelling of
202 the domain names that are to be typed into applications such as web
203 browsers and mail user agents. The introduction of the larger
204 repertoire of characters potentially makes the set of misspellings
205 larger, especially given that in some cases the same appearance, for
206 example on a business card, might visually match several Unicode code
207 points or several sequences of code points.
209 IDNA allows the graceful introduction of IDNs not only by avoiding
210 upgrades to existing infrastructure (such as DNS servers and mail
211 transport agents), but also by allowing some rudimentary use of IDNs
212 in applications by using the ASCII representation of the non-ASCII
213 name labels. While such names are user-unfriendly to read and type,
214 and hence not optimal for user input, they allow (for instance)
215 replying to email and clicking on URLs even though the domain name
216 displayed is incomprehensible to the user. In order to allow user-
217 friendly input and output of the IDNs and acceptance of some
218 characters as equivalent to those to be processed according to the
219 protocol, the applications need to be modified to conform to this
220 specification.
222 IDNA uses the Unicode character repertoire, which avoids the
223 significant delays that would be inherent in waiting for a different
224 and specific character set to be defined for IDN purposes,
225 presumably by some other standards developing organization.
227 1.5. Terminology
229 1.5.1. Documents and Standards
231 This document uses the term "IDNA2003" to refer to the set of
232 standards that make up and support the version of IDNA published in
233 2003, i.e., those commonly known as the IDNA base specification
234 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep
235 [RFC3454]. In this document, those names are used to refer,
236 conceptually, to the individual documents, with the base IDNA
237 specification called just "IDNA".
239 The term "IDNA2008" is used to refer to a new version of IDNA as
240 described in this document and in the documents described in
241 Section 5. References to "these specifications" are to the entire
242 set.
244 1.5.2. Terminology about Characters and Character Sets
246 A code point is an integer value associated with a character in a
247 coded character set.
249 Unicode [Unicode51] is a coded character set containing almost
250 100,000 characters as of the current version. A single Unicode code
251 point is denoted by "U+" followed by four to six hexadecimal digits,
252 while a range of Unicode code points is denoted by two four to six
253 digit hexadecimal numbers separated by "..", with no prefixes.
255 ASCII means US-ASCII [ASCII], a coded character set containing 128
256 characters associated with code points in the range 0000..007F.
257 Unicode may be thought of as an extension of ASCII; it includes all
258 the ASCII characters and associates them with equivalent code points.
260 "Letters" are, informally, generalizations from the ASCII and common-
261 sense understanding of that term, i.e., characters that are used to
262 write text that are not digits, symbols, or punctuation. Formally,
263 they are characters with a Unicode General Category value starting in
264 "L" (see Section 4.5 of [Unicode51]).
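As an illustration only (not part of the draft), the "U+" notation and the formal definition of "letter" given above can be sketched in Python. The function names are hypothetical, and Python's unicodedata module reflects whichever Unicode version ships with the interpreter, which will generally be newer than the Unicode 5.1 referenced here.

```python
import unicodedata


def code_point_notation(ch: str) -> str:
    """Format one character as "U+" followed by four to six hex digits."""
    return f"U+{ord(ch):04X}"


def is_letter(ch: str) -> bool:
    """True if the character's General Category starts with "L"
    (Lu, Ll, Lt, Lm, or Lo)."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    for ch in ["a", "\u00e9", "5", "-"]:
        print(code_point_notation(ch), unicodedata.category(ch), is_letter(ch))
```

Under this definition digits (category Nd) and the hyphen-minus (category Pd) are not letters, which matches the informal description.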
266 1.5.3. DNS-related Terminology
268 When discussing the DNS, this document generally assumes the
269 terminology used in the DNS specifications [RFC1034] [RFC1035]. The
270 terms "lookup" and "resolution" are used interchangeably and the
271 process or application component that performs DNS resolution is
272 called a "resolver". The process of placing an entry into the DNS is
273 referred to as "registration" paralleling common contemporary usage
274 in other contexts. Consequently, any DNS zone administration is
275 described as a "registry", regardless of the actual administrative
276 arrangements or level in the tree. A note about that relationship is
277 included in the text below where it seems particularly significant.
279 The term "LDH code points" is defined in this document to mean the
280 code points associated with ASCII letters, digits, and the hyphen-
281 minus; that is, U+002D, 0030..0039, 0041..005A, and 0061..007A. "LDH"
282 is an abbreviation for "letters, digits, hyphen".
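A minimal sketch of the LDH code point set defined above, written in Python for illustration. The names LDH_CODE_POINTS and all_ldh are hypothetical and do not come from the draft.

```python
# The LDH code points: U+002D, 0030..0039, 0041..005A, and 0061..007A.
LDH_CODE_POINTS = frozenset(
    [0x002D]
    + list(range(0x0030, 0x003A))   # digits 0-9
    + list(range(0x0041, 0x005B))   # upper case A-Z
    + list(range(0x0061, 0x007B))   # lower case a-z
)


def all_ldh(label: str) -> bool:
    """True if every character of the label is an LDH code point."""
    return all(ord(ch) in LDH_CODE_POINTS for ch in label)
```

This checks only the character repertoire. It says nothing about label length or hyphen placement, which are governed by other rules.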
284 The base DNS specifications [RFC1034] [RFC1035] discuss "domain
285 names" and "host names", but many people and sections of these
286 specifications use the terms interchangeably. Further, because those
287 documents were not terribly clear, many people who are sure they know
288 the exact definitions of each of these terms disagree on the
289 definitions. This document generally uses the term "domain name".
290 When it refers to, e.g., host name syntax restrictions, it explicitly
291 cites the relevant defining documents. The remaining definitions in
292 this subsection are essentially a review.
294 A label is an individual component of a domain name. Labels are
295 usually shown separated by dots; for example, the domain name
296 "www.example.com" is composed of three labels: "www", "example", and
297 "com". (The zero-length root label described in [RFC1123], which can
298 be explicit as in "www.example.com." or implicit as in
299 "www.example.com", is not considered a label in this specification.)
300 IDNA extends the set of usable characters in labels that are text.
301 For the rest of this document, the term "label" is shorthand for
302 "text label", and "every label" means "every text label".
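The relationship between a domain name and its labels can be sketched as follows. This is an illustrative Python fragment with a hypothetical function name; it assumes the presentation form uses "." as the only separator and treats an explicit trailing dot as the root label, which is not counted.

```python
def split_labels(name: str) -> list:
    """Split a domain name in presentation form into its text labels.

    An explicit root label (a trailing dot) is dropped, so
    "www.example.com." and "www.example.com" give the same result.
    """
    if name.endswith("."):
        name = name[:-1]
    return name.split(".") if name else []
```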
304 1.5.4. Terminology Specific to IDNA
306 This section defines some terminology to reduce dependence on terms
307 and definitions that have been problematic in the past.
309 1.5.4.1. Terms for IDN Label Codings
311 1.5.4.1.1. IDNA-valid strings, A-label, and U-label
313 To improve clarity, this document introduces three new terms in this
314 subsection. In the next, it defines a historical one to be slightly
315 more precise for IDNA contexts.
317 o A string is "IDNA-valid" if it meets all of the requirements of
318 these specifications for an IDNA label. IDNA-valid strings may
319 appear in either of two forms, defined immediately below. It is
320 expected that specific reference will be made to the form
321 appropriate to any context in which the distinction is important.
323 o An "A-label" is the ASCII-Compatible Encoding (ACE, see
324 Section 1.5.4.3) form of an IDNA-valid string. It must be a
325 complete label: IDNA is defined for labels, not for parts of them
326 and not for complete domain names. This means, by definition,
327 that every A-label will begin with the IDNA ACE prefix, "xn--",
328 followed by a string that is a valid output of the Punycode
329 algorithm and hence a maximum of 59 ASCII characters in length.
330 The prefix and string together must conform to all requirements
331 for a label that can be stored in the DNS including conformance to
332 the LDH ("host name") rule described in RFC 1034, RFC 1123 and
333 elsewhere.
335 o A "U-label" is an IDNA-valid string of Unicode characters,
336 expressed in a standard Unicode Encoding Form, normally UTF-8 in
337 an Internet transmission context, and subject to the constraint
338 below. Conversion between valid U-labels and valid A-labels is
339 performed according to the specification in [RFC3492], adding or
340 removing the ACE prefix (see Section 1.5.4.3) as needed.
342 To be valid, U-labels and A-labels must obey an important symmetry
343 constraint. While that constraint may be tested in any of several
344 ways, an A-label must be capable of being produced by conversion from
345 a U-label and a U-label must be capable of being produced by
346 conversion from an A-label. Among other things, this implies that
347 both U-labels and A-labels must represent strings in normalized form.
348 These strings MUST contain only characters specified elsewhere in
349 this document and its companion documents, and only in the contexts
350 indicated as appropriate.
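The conversion and the symmetry constraint can be illustrated with a short Python sketch. It relies on Python's built-in "punycode" codec, which implements the algorithm of [RFC3492]; the function names are hypothetical. The sketch deliberately omits the checks on permitted characters and contexts that these specifications require, so a True result here does not by itself establish that a string is IDNA-valid.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a Unicode label and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Remove the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def is_symmetric(u_label: str) -> bool:
    """Check two necessary conditions only: the string is already in NFC,
    and converting to the ACE form and back reproduces it exactly."""
    if unicodedata.normalize("NFC", u_label) != u_label:
        return False
    return to_u_label(to_a_label(u_label)) == u_label
```

For example, the precomposed string "b\u00fccher" round-trips through "xn--bcher-kva", while the decomposed spelling "bu\u0308cher" is rejected because it is not in NFC.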
352 Any rules or conventions that apply to DNS labels in general, such as
353 rules about lengths of strings, apply to whichever of the U-label or
354 A-label would be more restrictive. For the U-label, constraints
355 imposed by existing protocols and their presentation forms make the
356 length restriction apply to the length in octets of the UTF-8 form of
357 those labels (which will always be greater than or equal to the
358 length in code points). The exception to this, of course, is that
359 the restriction to ASCII characters does not apply to the U-label.
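One possible reading of the length rule above, sketched in Python: compute the length of a label in code points, in octets of its UTF-8 form, and in octets of its ACE form, then require that both encoded forms fit the 63-octet label limit of [RFC1035]. The function names are hypothetical, and the sketch assumes the input is already a valid U-label.

```python
MAX_LABEL_OCTETS = 63  # DNS label limit from RFC 1035


def label_lengths(u_label: str) -> dict:
    """Return the three lengths relevant to the discussion above."""
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "code_points": len(u_label),
        "utf8_octets": len(u_label.encode("utf-8")),
        "a_label_octets": len(a_label),
    }


def within_dns_limit(u_label: str) -> bool:
    """Apply the limit to whichever form is longer (more restrictive)."""
    lengths = label_lengths(u_label)
    return max(lengths["utf8_octets"], lengths["a_label_octets"]) <= MAX_LABEL_OCTETS
```

The UTF-8 length is never smaller than the code point count because every code point occupies at least one octet.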
361 A different way to look at these terms, which may be more clear to
362 some readers, is that U-labels, A-labels, and LDH-labels (see the
363 next subsection) are disjoint categories that, together, make up the
364 forms of legitimate strings for use in domain names that describe
365 hosts. Of the three, only A-labels and LDH-labels can actually
366 appear in DNS zone files or queries; U-labels can appear, along with
367 the other two, in presentation and user interface forms and in
368 selected protocols other than those of the DNS itself. Strings that
369 do not conform to the rules for one of these three categories and, in
370 particular, strings that contain "-" in the third or fourth character
371 position but are:
373 o not A-labels or
375 o cannot be processed as U-labels or A-labels as described in these
376 specifications,
378 are invalid as labels in domain names that identify Internet hosts or
379 similar resources. This restriction on strings containing "--" is
380 required for three reasons:
382 o to prevent confusion with pre-IDNA coding forms;
384 o to permit future extensions that would require changing the
385 prefix, no matter how unlikely those might be (see Section 10.3);
386 and
388 o to reduce the opportunities for attacks on the encoding system.
390 1.5.4.1.2. LDH-label and Internationalized Label
392 In the hope of further clarifying discussions about IDNs, these
393 specifications use the term "LDH-label" strictly to refer to an all-
394 ASCII label that obeys the "hostname" (LDH) conventions and that is
395 not an IDN. In other words, only "U-label" and "A-label" refer to
396 IDNs and LDH-labels are not IDNs. "Internationalized label" is used
397 when a term is needed to refer to any of the three categories. There
398 are some standardized DNS label formats, such as those for service
399 location (SRV) records [RFC2782] that do not fall into any of the
400 three categories and hence are not internationalized labels.
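The categories described in the two preceding subsections can be approximated by a rough classifier. This Python sketch makes several assumptions that are not taken from the draft: the "hostname" rule is modeled as letters, digits, and interior hyphens only; the hyphen restriction is read as "--" in the third and fourth positions; and a label with the ACE prefix is accepted as a candidate A-label if its remainder merely decodes under Python's "punycode" codec. Full validation (permitted characters, normalization, the symmetry constraint) is out of scope, hence the word "candidate" in the results.

```python
import re

# Assumed model of the LDH ("hostname") rule: no leading or trailing hyphen.
LDH_RE = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")


def classify(label: str) -> str:
    """Return a rough category name for a single label."""
    if not label.isascii():
        return "U-label candidate"
    if len(label) > 63 or not LDH_RE.match(label):
        return "invalid"
    if label[:4].lower() == "xn--":
        try:
            label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return "invalid"
        return "A-label candidate"
    if label[2:4] == "--":
        # "--" in positions three and four without the ACE prefix
        return "invalid"
    return "LDH-label"
```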
402 1.5.4.2. Equivalence
404 In IDNA, equivalence of labels is defined in terms of the A-labels.
405 If the A-labels are equal in a case-independent comparison, then the
406 labels are considered equivalent, no matter how they are represented.
407 Traditional LDH labels already have a notion of equivalence: within
408 that list of characters, upper case and lower case are considered
409 equivalent. The IDNA notion of equivalence is an extension of that
410 older notion. Equivalent labels in IDNA are treated as alternate
411 forms of the same label, just as "foo" and "Foo" are treated as
412 alternate forms of the same label.
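A minimal sketch of this equivalence rule in Python, assuming the inputs are already valid LDH-labels, A-labels, or U-labels. The helper names are hypothetical; the conversion uses Python's built-in "punycode" codec.

```python
ACE_PREFIX = "xn--"


def a_form(label: str) -> str:
    """Return the ASCII form used for comparison: the label itself if it
    is all-ASCII (LDH-label or A-label), otherwise its ACE form."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def equivalent(x: str, y: str) -> bool:
    """Compare two labels case-independently in their ASCII forms."""
    return a_form(x).lower() == a_form(y).lower()
```

With this sketch, "foo" and "Foo" compare as equivalent, as do "XN--BCHER-KVA" and the U-label "b\u00fccher".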
414 1.5.4.3. ACE Prefix
416 The "ACE prefix" is defined in this document to be a string of ASCII
417 characters "xn--" that appears at the beginning of every A-label.
418 "ACE" stands for "ASCII-Compatible Encoding".
420 1.5.4.4. Domain Name Slot
422 A "domain name slot" is defined in this document to be a protocol
423 element or a function argument or a return value (and so on)
424 explicitly designated for carrying a domain name. Examples of domain
425 name slots include: the QNAME field of a DNS query; the name argument
426 of the gethostbyname() or getaddrinfo() standard C library functions;
427 the part of an email address following the at-sign (@) in the
428 parameter to the SMTP MAIL or RCPT commands or the "From:" field of
429 an email message header; and the host portion of the URI in the src
430 attribute of an HTML <IMG> tag. General text that just happens to
431 contain a domain name is not a domain name slot. For example, a
432 domain name appearing in the plain text body of an email message is
433 not occupying a domain name slot.
435 An "IDN-aware domain name slot" is defined in this document to be a
436 domain name slot explicitly designated for carrying an
437 internationalized domain name as defined in this document. The
438 designation may be static (for example, in the specification of the
439 protocol or interface) or dynamic (for example, as a result of
440 negotiation in an interactive session).
442 An "IDN-unaware domain name slot" is defined in this document to be
443 any domain name slot that is not an IDN-aware domain name slot.
444 Obviously, this includes any domain name slot whose specification
445 predates IDNA.
447 1.5.5. Punycode is an Algorithm, not a Name
449 There has been some confusion about whether a "Punycode string" does
450 or does not include the prefix and about whether it is required that
451 such strings could have been the output of ToASCII (see RFC 3490,
452 Section 4 [RFC3490]). This specification discourages the use of the
453 term "Punycode" to describe anything but the encoding method and
454 algorithm of [RFC3492]. The terms defined above are preferred as
455 much more clear than terms such as "Punycode string".
457 1.5.6. Other Terminology Issues
459 The document departs from historical DNS terminology and usage in one
460 important respect. Over the years, the community has talked very
461 casually about "names" in the DNS, beginning with calling it "the
462 domain name system". That terminology is fine in the very precise
463 sense that the identifiers of the DNS do provide names for objects
464 and addresses. But, in the context of IDNs, the term has introduced
465 some confusion, confusion that has increased further as people have
466 begun to speak of DNS labels in terms of the words or phrases of
467 various natural languages.
469 Historically, many, perhaps most, of the "names" in the DNS have been
470 mnemonics to identify some particular concept, object, or
471 organization. They are typically derived from, or rooted in, some
472 language because most people think in language-based ways. But,
473 because they are mnemonics, they need not obey the orthographic
474 conventions of any language: it is not a requirement that it be
475 possible for them to be "words".
477 This distinction is important because the reasonable goal of an IDN
478 effort is not to be able to write the great Klingon (or language of
479 one's choice) novel in DNS labels but to be able to form a usefully
480 broad range of mnemonics in ways that are as natural as possible in a
481 very broad range of scripts.
483 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
484 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
485 document are to be interpreted as described in RFC 2119 [RFC2119].
487 An "internationalized domain name" (IDN) is a domain name that may
488 contain any mixture of LDH-labels, A-labels, or U-labels. This
489 implies that every conventional domain name is an IDN (which implies
490 that it is possible for a domain name to be an IDN without it
491 containing any non-ASCII characters). Just as has been the case with
492 ASCII names, some DNS zone administrators may impose restrictions,
493 beyond those imposed by DNS or IDNA, on the characters or strings
494 that may be registered as labels in their zones. Because of the
495 diversity of characters that can be used in a U-label and the
496 confusion they might cause, such restrictions are mandatory for IDN
497 registries and zones even though the particular restrictions are not
498 part of these specifications. Because these restrictions, commonly
499 known as "registry restrictions", only affect what can be registered
500 and not resolution processing, they have no effect on the syntax or
501 semantics of DNS protocol messages; a query for a name that matches
502 no records will yield the same response regardless of the reason why
503 it is not in the zone. Clients issuing queries or interpreting
504 responses cannot be assumed to have any knowledge of zone-specific
505 restrictions or conventions. See Section 6.2.
507 1.5.7. Comprehensibility of IDNA Mechanisms and Processing
509 One of the major goals of this work is to improve the general
510 understanding of how IDNA works and what characters are permitted and
511 what happens to them. Comprehensibility and predictability to users
512 and registrants are themselves important motivations and design goals
513 for this effort. The effort includes some new terminology and a
514 revised and extended model, both covered in this section, and some
515 more specific protocol, processing, and table modifications. Details
516 of the latter appear in other documents (see Section 5).
518 Several issues are inherent in the application of IDNs and, indeed,
519 almost any other system that tries to handle international characters
520 and concepts. They range from the apparently trivial --e.g., one
521 cannot display a character for which one does not have a font
522 available locally-- to the more complex and subtle. Many people have
523 observed that internationalization is just a tool to enable effective
524 localization while permitting some global uniformity. Issues of
525 display, of exactly how various strings and characters are entered,
526 and so on are inherently issues about localization and user interface
527 design.
529 A protocol such as IDNA can only assume that such operations as data
530 entry and reconciliation of differences in character forms are
531 possible. It may make some recommendations about how display might
532 work when characters and fonts are not available, but they can only
533 be general recommendations and, because display functions are rarely
534 controlled by the types of applications that would call upon IDNA,
535 will rarely be very effective.
537 However, shifting responsibility for character mapping and other
538 adjustments from the protocol (where it was located in IDNA2003) to
539 the user interface or processing before invoking IDNA raises issues
540 about both what that processing should do and about compatibility for
541 references prepared in an IDNA2003 context. Those issues are
542 discussed in Section 9.
544 Operations for converting between local character sets and normalized
545 Unicode are part of this general set of user interface issues. The
546 conversion is obviously not required at all in a Unicode-native
547 system that maintains all strings in Normalization Form C (NFC). It
548 may, however, involve some complexity in a system that is not
549 Unicode-native, especially if the elements of the local character set
550 do not map exactly and unambiguously into Unicode characters or do so
551 in a way that is not completely stable over time. Perhaps more
552 important, if a label being converted to a local character set
553 contains Unicode characters that have no correspondence in that
554 character set, the application may have to apply special, locally-
555 appropriate, methods to avoid or reduce loss of information.
557 Depending on the system involved, the major difficulty may not lie in
558 the mapping but in accurately identifying the incoming character set
559 and then applying the correct conversion routine. If a local
560 operating system uses one of the ISO 8859 character sets or an
561 extensive national or industrial system such as GB18030 [GB18030] or
562 BIG5 [BIG5], one must correctly identify the character set in use
563 before converting to Unicode even though those character coding
564 systems are substantially or completely Unicode-compatible (i.e., all
565 of the code points in them have an exact and unique mapping to
566 Unicode code points). It may be even more difficult when the
567 character coding system in local use is based on conceptually
568 different assumptions than those used by Unicode, e.g., about
569 font encodings used for publications in some Indic scripts. Those
570 differences may not easily yield unambiguous conversions or
571 interpretations even if each coding system is internally consistent
572 and adequate to represent the local language and script.
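The conversion step described above can be sketched in Python as follows. This is a minimal illustration, not part of the IDNA specifications: the function name is hypothetical, and the caller is assumed to already know which local character set is in use, which is exactly the hard part noted in the text.

```python
import unicodedata


def local_bytes_to_nfc(raw: bytes, charset: str) -> str:
    """Decode bytes from a known local charset and normalize to NFC."""
    # Strict decoding raises UnicodeDecodeError if the bytes are not
    # valid in the stated charset, rather than guessing.
    text = raw.decode(charset, errors="strict")
    return unicodedata.normalize("NFC", text)


# The same byte yields different characters under different charsets,
# so misidentifying the charset silently produces the wrong label.
as_latin1 = local_bytes_to_nfc(b"\xfc", "iso-8859-1")
as_cyrillic = local_bytes_to_nfc(b"\xfc", "iso-8859-5")
```

Here as_latin1 and as_cyrillic differ even though the input byte is identical, which shows why identification of the incoming character set has to precede conversion.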
574 2. Summary of Major Changes from IDNA2003
576 1. Change the base character set from Unicode 3.2 to a definition
577 that is independent of the Unicode version.
579 2. Separate the definitions for the "registration" and "lookup"
580 activities.
582 3. Disallow symbol and punctuation characters except where special
583 exceptions are necessary.
585 4. Remove the mapping and normalization steps from the protocol and
586 have them instead done by the applications themselves, possibly
587 in a local fashion, before invoking the protocol.
589 5. Change the way that the protocol specifies which characters are
590 allowed in labels from "humans decide what the table of
591 codepoints contains" to "decisions about codepoints are based on
592 Unicode properties plus a small exclusion list created by
593 humans".
595 6. Introduce the new concept of characters that can be used only in
596 specific contexts.
598 7. Allow typical words and names in languages such as Dhivehi and
599 Yiddish to be expressed.
601 8. Make bidirectional domain names (delimited strings of labels,
602 not just labels standing on their own) display in a non-
603 surprising fashion.
605 9. Make bidirectional domain names in a paragraph display in a non-
606 surprising fashion.
608 10. Remove the dot separator from the mandatory part of the
609 protocol.
611 11. Make some currently-valid labels that are not actually IDNA
612 labels invalid.
614 3. The Revised IDNA Model
616 IDNA is a client-side protocol, i.e., almost all of the processing is
617 performed by the client. The strings that appear in, and are
618 resolved by, the DNS conform to the traditional rules for the naming
619 of hosts, and consist of ASCII letters, digits, and hyphens. This
620 approach permits IDNA to be deployed without modifications to the DNS
621 itself. That, in turn, avoids both having to upgrade the entire
622 Internet to support IDNs and needing to incur the unknown risks to
623 deployed systems of DNS structural or design changes, especially if
624 those changes need to be deployed all at the same time.
626 4. Processing in IDNA2008
628 These specifications separate Domain Name Registration and Resolution
629 in the protocol specification. Doing so reflects current practice in
630 which per-registry restrictions and special processing are applied at
631 registration time but not on resolution. Even more important in the
632 longer term, it facilitates incremental addition of permitted
633 character groups to avoid freezing on one particular version of
634 Unicode.
636 The actual registration and lookup protocols for IDNA2008 are
637 specified in [IDNA2008-Protocol].
639 5. IDNA2008 Document List
641 [[anchor19: This section will need to be extensively revised or
642 removed before publication.]]
644 The following documents are being produced as part of the IDNA2008
645 effort.
647 o A revised version of this document, containing an overview,
648 rationale, and conformance conditions.
650 o A separate document, drawn from material in early versions of this
651 one, that explicitly updates and replaces RFC 3490 but which has
652 most rationale material from that document moved to this one
653 [IDNA2008-Protocol].
655 o A document describing the "Bidi problem" with Stringprep and
656 proposing a solution [IDNA2008-Bidi].
658 o A specification of the categories and rules that identify the code
659 points allowed in a U-label, based on Unicode 5.0 code
660 assignments. See Section 6 and [IDNA2008-Tables].
662 o One or more documents containing guidance and suggestions for
663 registries (in this context, those responsible for establishing
664 policies for any zone file in the DNS, not only those at the top
665 or second level). The documents in this category may not be IETF
666 products and may be prepared and completed asynchronously with
667 those described above.
669 6. Permitted Characters: An Inclusion List
671 This section provides an overview of the model used to establish the
672 algorithm and character lists of [IDNA2008-Tables] and describes the
673 names and applicability of the categories used there. Note that the
674 inclusion of a character in the first category group does not imply
675 that it can be used indiscriminately; some characters are associated
676 with contextual rules that must be applied as well.
678 The information given in this section is provided to make the rules,
679 tables, and protocol easier to understand. It is not normative. The
680 normative generating rules appear in [IDNA2008-Tables] and the rules
681 that actually determine what labels can be registered or looked up
682 are in [IDNA2008-Protocol].
684 6.1. A Tiered Model of Permitted Characters and Labels
686 Moving to an inclusion model requires respecifying the list of
687 characters that are permitted in IDNs. In IDNA2003, the role and
688 utility of characters are independent of context and fixed forever
689 (or until the standard is replaced). Making completely context-
690 independent rules globally has proven impractical because some
691 characters, especially those that are called "Join_Controls" in
692 Unicode, are needed to make reasonable use of some scripts but become
693 invisible characters in others. Of necessity, IDNA2003 prohibited
694 those types of characters entirely. But the restrictions were much
695 too severe to permit an adequate range of mnemonics for terminology
696 based on some languages. The requirement to support those characters
697 but limit their use to very specific contexts was reinforced by the
698 observation that handling of particular characters across the
699 languages that use a script, or the use of similar or identical-
700 looking characters in different scripts, is less well understood than
701 many people believed it was several years ago.
703 Independently of the characters chosen (see next subsection), the
704 theory is to divide the characters that appear in Unicode into three
705 categories:
707 6.1.1. PROTOCOL-VALID
709 Characters identified as "PROTOCOL-VALID" (often abbreviated
710 "PVALID") are, in general, permitted by IDNA for all uses in IDNs.
711 Their use may be restricted by rules about the context in which they
712 appear or by other rules that apply to the entire label in which they
713 are to be embedded. For example, any label that contains a character
714 in this group that has a "right to left" property must be used in
715 context with the "Bidi" rules (see [IDNA2008-Bidi]).
717 The term "PROTOCOL-VALID" is used to stress the fact that the
718 presence of a character in this category does not imply that a given
719 registry need accept registrations containing any of the characters
720 in the category. Registries are still expected to apply judgment
721 about labels they will accept and to maintain rules consistent with
722 those judgments (see [IDNA2008-Protocol] and Section 6.3).
724 Characters that are placed in the "PROTOCOL-VALID" category are never
725 removed from it unless the code points themselves are removed from
726 Unicode (such removal would be inconsistent with the Unicode
727 stability principles (see [Unicode51], Appendix F) and hence should
728 never occur).
730 [[anchor21: Placeholder: Does this topic or comment need additional
731 discussion or explanation?]]
733 6.1.1.1. Contextual Rules
735 Characters in the PROTOCOL-VALID category may actually be unsuitable
736 for general use in IDNs but necessary for the plausible support of
737 some scripts. The two most commonly-cited examples are the zero-
738 width non-joiner and joiner characters (ZWNJ, U+200C, and ZWJ,
739 U+200D), but provisions for unambiguous labels may require that other
740 characters be restricted to particular contexts. For example, the
741 ASCII hyphen is not permitted to start or end a label, whether that
742 label contains non-ASCII characters or not.
744 These characters must not appear in IDNs without additional
745 restrictions, typically because they are invisible in most scripts
746 but affect format or presentation in a few others or because they are
747 combining characters that are safe for use only in conjunction with
748 particular characters or scripts. In order to permit them to be used
749 at all, they are specially identified as "CONTEXTUAL RULE REQUIRED"
750 and, when adequately understood, associated with a rule. In
751 addition, the rule will define whether it is to be applied on lookup
752 as well as registration. A distinction is made between characters
753 that indicate or prohibit joining (known as "CONTEXT-JOINER" or
754 "CONTEXTJ") and other characters requiring contextual treatment
755 ("CONTEXT-OTHER" or "CONTEXTO"). Only the former are fully tested at
756 lookup time.
758 6.1.1.2. Rules and Their Application
760 The actual rules may be present or absent. If present, they may have
761 values of "True" (character may be used in any position in any
762 label), "False" (character may not be used in any label), or may be
763 an extended regular expression that specifies the context in which
764 the character is permitted.
766 Examples of descriptions of typical rules, stated informally and in
767 English, include "MUST follow a character from Script XYZ", "MUST
768 occur only if the entire label is in Script ABC", "MUST occur only if
769 the previous and subsequent characters have the DFG property".
771 Because it is easier to identify these characters than to know that
772 they are actually needed in IDNs or how to establish exactly the
773 right rules for each one, a rule may have a null value in a given
774 version of the tables. Characters associated with null rules MUST
775 NOT appear in putative labels for either registration or lookup. Of
776 course, a later version of the tables might contain a non-null rule.
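The four possible rule states described in this subsection (null, "True", "False", or a regular expression) might be evaluated roughly as in the Python sketch below. Everything here is an assumption made for illustration: the regular expression language has not yet been defined, so Python's re module stands in for it, and the example rule (ZWNJ permitted only immediately after U+094D) is hypothetical rather than taken from the tables.

```python
import re
from typing import Optional, Union


def rule_permits(rule: Optional[Union[bool, "re.Pattern"]],
                 label: str, char: str) -> bool:
    """Decide whether every occurrence of char in label is permitted."""
    if rule is None:
        # Null rule: the character MUST NOT appear at all.
        return False
    if isinstance(rule, bool):
        # True: any position in any label; False: no label.
        return rule
    # Pattern: each occurrence of char must lie inside some match.
    covered = set()
    for match in rule.finditer(label):
        covered.update(range(match.start(), match.end()))
    return all(i in covered for i, c in enumerate(label) if c == char)


ZWNJ = "\u200c"
# Hypothetical rule: ZWNJ only directly after U+094D.
example_rule = re.compile("\u094d\u200c")
```

Checking every occurrence, rather than stopping at the first match, keeps a label with one permitted and one unpermitted use of the character from slipping through.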
778 [[anchor23: Definition of regular expression language to be
779 supplied]]
781 6.1.2. DISALLOWED
783 Some characters are sufficiently problematic for use in IDNs that
784 they should be excluded for both registration and lookup (i.e.,
785 conforming applications performing name resolution should verify that
786 these characters are absent; if they are present, the label strings
787 should be rejected rather than converted to A-labels and looked up).
789 Of course, this category would include code points that had been
790 removed entirely from Unicode should such removals ever occur.
792 Characters that are placed in the "DISALLOWED" category are expected
793 to never be removed from it or reclassified. If a character is
794 classified as "DISALLOWED" in error and the error is sufficiently
795 problematic, the only recourse would be either to introduce a new
796 code point into Unicode and classify it as "PROTOCOL-VALID" or for
797 the IETF to accept the considerable costs of an incompatible change
798 and replace the relevant RFC with one containing appropriate
799 exceptions.
801 [[anchor24: Note in Draft: the permanence of DISALLOWED was still
802 under discussion in the WG when this draft was posted. The text
803 above reflects the editor's opinion about the emerging consensus but
804 is subject to change as the discussion continues.]]
806 There is provision for exception cases but, in general, characters
807 are placed into "DISALLOWED" if they fall into one or more of the
808 following groups:
810 o The character is a compatibility equivalent for another character.
811 In slightly more precise Unicode terms, application of
812 normalization method NFKC to the character yields some other
813 character.
815 o The character is an upper-case form or some other form that is
816 mapped to another character by Unicode casefolding.
818 o The character is a symbol or punctuation form or, more generally,
819 something that is not a letter, digit, or a mark that is used to
820 form a letter or digit.
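The three groups listed above can be approximated with the Unicode data shipped in Python's standard library, as in the following sketch. It is illustrative only: the function name and reason strings are hypothetical, the exception cases mentioned above (for example, the ASCII hyphen) are not modeled, and the normative rules are those of [IDNA2008-Tables], not this code.

```python
import unicodedata


def disallowed_reasons(ch: str) -> list:
    """Return the groups from the list above that a character falls into."""
    reasons = []
    # Group 1: NFKC maps the character to something else.
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("compatibility equivalent")
    # Group 2: Unicode casefolding maps it to something else.
    if ch.casefold() != ch:
        reasons.append("changed by casefolding")
    # Group 3: not a letter (L*), mark (M*), or decimal digit (Nd).
    category = unicodedata.category(ch)
    if not (category[0] in "LM" or category == "Nd"):
        reasons.append("not a letter, digit, or mark")
    return reasons
```

An empty result means only that none of the three groups applies; it does not by itself make a character PROTOCOL-VALID.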
822 6.1.3. UNASSIGNED
824 For convenience in processing and table-building, code points that do
825 not have assigned values in a given version of Unicode are treated as
826 belonging to a special UNASSIGNED category. Such code points MUST
827 NOT appear in labels to be registered or looked up. The category
828 differs from DISALLOWED in that code points are moved out of it by
829 the simple expedient of being assigned in a later version of Unicode
830 (at which point, they are classified into one of the other categories
831 as appropriate).
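Because the UNASSIGNED category depends on the Unicode version in use, a check for it gives different answers as the underlying data is updated. The Python sketch below assumes that general category "Cn" in the standard unicodedata module is an adequate stand-in; that category also covers noncharacters, so this is an approximation, and the function name is hypothetical.

```python
import unicodedata


def is_unassigned(ch: str) -> bool:
    """True if ch has no assignment in this build's Unicode data."""
    # "Cn" is the general category reported for unassigned code points.
    return unicodedata.category(ch) == "Cn"


# The answer is relative to this version of the Unicode data.
DATA_VERSION = unicodedata.unidata_version
```

A code point for which is_unassigned() returns True today may be classified into one of the other categories once a later version of Unicode assigns it.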
833 6.2. Registration Policy
835 While these recommendations cannot and should not define registry
836 policies, registries SHOULD develop and apply additional restrictions
837 to reduce confusion and other problems. For example, it is generally
838 believed that labels containing characters from more than one script
839 are a bad practice although there may be some important exceptions to
840 that principle. Some registries may choose to restrict registrations
841 to characters drawn from a very small number of scripts. For many
842 scripts, the use of variant techniques such as those described in
843 [RFC3743] and [RFC4290], and illustrated for Chinese by the tables
844 described in RFC 4713 [RFC4713] may be helpful in reducing problems
845 that might be perceived by users. It is worth stressing that these
846 principles of policy development and application apply at all levels
847 of the DNS, not only, e.g., TLD registrations.
849 6.3. Layered Restrictions: Tables, Context, Registration, Applications
851 The essence of the character rules in IDNA2008 is based on the
852 realization that there is no magic bullet for any of the issues
853 associated with a multiscript DNS. Instead, the specifications
854 define a variety of approaches that, together, constitute multiple
855 lines of defense against ambiguity in identifiers and loss of
856 referential integrity. The actual character tables are the first
857 mechanism, protocol rules about how those characters are applied or
858 restricted in context are the second, and those two in combination
859 constitute the limits of what can be done by a protocol alone. As
860 discussed in the previous section (Section 6.2), registries are
861 expected to restrict what they permit to be registered, devising and
862 using rules that are designed to optimize the balance between
863 confusion and risk on the one hand and maximum expressiveness in
864 mnemonics on the other.
866 In addition, there is an important role for user agents in warning
867 against label forms that appear unreasonable given their knowledge of
868 local contexts and conventions. Of course, no approach based on
869 naming or identifiers alone can protect against all threats.
871 [[anchor25: Note in Draft: the last sentence above basically
872 duplicates a comment in Security Considerations. Is it worth having
873 in both places??]]
875 7. Issues that Constrain Possible Solutions
877 7.1. Display and Network Order
879 The correct treatment of domain names requires a clear distinction
880 between Network Order (the order in which the code points are sent in
881 protocols) and Display Order (the order in which the code points are
882 displayed on a screen or paper). The order of labels in a domain
883 name that contains characters that are normally written right to left
884 is discussed in [IDNA2008-Bidi]. In particular, there are questions
885 about the order in which labels are displayed if left to right and
886 right to left labels are adjacent to each other, especially if there
887 are also multiple consecutive appearances of one of the types. The
888 decision about the display order is ultimately under the control of
889 user agents --including web browsers, mail clients, and the like--
890 which may be highly localized. Even when formats are specified by
891 protocols, the full composition of an Internationalized Resource
892 Identifier (IRI) [RFC3987] or Internationalized Email address
893 contains elements other than the domain name. For example, IRIs
894 contain protocol identifiers and field delimiter syntax such as
895 "http://" or "mailto:" while email addresses contain the "@" to
896 separate local parts from domain names. User agents are not required
897 to use those protocol-based forms directly but often do so. While
898 display, parsing, and processing within a label is specified by the
899 IDNA protocol and the associated documents, the relationship between
900 fully-qualified domain names and internationalized labels is
901 unchanged from the base DNS specifications. Comments here about such
902 full domain names are explanatory or examples of what might be done
903 and must not be considered normative.
905 Questions remain about protocol constraints implying that the overall
906 direction of these strings will always be left to right (or right to
907 left) for an IRI or email address, or if they even should conform to
908 such rules. These questions also have several possible answers.
909 Should a domain name abc.def, in which both labels are represented in
910 scripts that are written right to left, be displayed as fed.cba or
911 cba.fed? An IRI for clear text web access would, in network order,
912 begin with "http://" and the characters will appear as
913 "http://abc.def" -- but what does this suggest about the display
914 order? When entering a URI to many browsers, it may be possible to
915 provide only the domain name and leave the "http://" to be filled in
916 by default, assuming no tail (an approach that does not work for
917 other protocols). The natural display order for the typed domain
918 name on a right to left system is fed.cba. Does this change if a
919 protocol identifier, tail, and the corresponding delimiters are
920 specified?
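The two candidate display orders in the abc.def example can be made concrete with a small Python sketch. It uses ASCII stand-ins for right-to-left characters and plain string reversal; real display is governed by the Unicode Bidirectional Algorithm and by user agents, so this is only a model of the question being asked, and the function name and dictionary keys are hypothetical.

```python
def display_candidates(name: str) -> dict:
    """Show two ways a right-to-left name in network order might display."""
    labels = name.split(".")
    return {
        # Whole name read right to left: label order is reversed too.
        "whole name reversed": name[::-1],
        # Each label reversed, but labels keep their network order.
        "labels reversed in place": ".".join(
            label[::-1] for label in labels
        ),
    }


candidates = display_candidates("abc.def")
```

For the network-order string "abc.def", the first candidate is "fed.cba" and the second is "cba.fed", the two forms discussed above.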
922 While logic, precedent, and reality suggest that these are questions
923 for user interface design, not IETF protocol specifications,
924 experience in the 1980s and 1990s with mixing systems in which domain
925 name labels were read in network order (left to right) and those in
926 which those labels were read right to left would predict a great deal
927 of confusion, and heuristics that sometimes fail, if each
928 implementation of each application makes its own decisions on these
929 issues.
931 It should be obvious that any revision of IDNA, including the current
932 one, must be clear about the network (transmission on the wire) order
933 of characters in labels and for the labels in complete (fully-
934 qualified) domain names. In order to prevent user confusion and, in
935 particular, to reduce the chances for inconsistent transcription of
936 domain names from printed form, it is likely that some strong
937 suggestions should be made about display order as well.
939 7.2. Entry and Display in Applications
941 Applications can accept domain names using any character set or sets
942 desired by the application developer or specified by the operating
943 system, and can display domain names in any charset. That is, the
944 IDNA protocol does not affect the interface between users and
945 applications.
947 An IDNA-aware application can accept and display internationalized
948 domain names in two formats: the internationalized character set(s)
949 supported by the application (i.e., an appropriate local
950 representation of a U-label), and as an A-label. Applications MAY
951 allow the display and user input of A-labels, but are encouraged to
952 not do so except as an interface for special purposes, possibly for
953 debugging, or to cope with display limitations. A-labels are opaque
954 and ugly, and, where possible, should thus only be exposed to users
955 and in contexts in which they are absolutely needed. Because IDN
956 labels can be rendered either as the A-labels or U-labels, the
957 application may reasonably have an option for the user to select the
958 preferred method of display; if it does, rendering the U-label should
959 normally be the default.
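The relationship between the two display formats can be sketched with the Punycode codec in Python's standard library. The function names are hypothetical, and the sketch assumes its input is a single label that contains at least one non-ASCII character and is already in the form required of a U-label; it performs no IDNA validation.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Convert a U-label to its ASCII-compatible form (no validation)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Convert an A-label back to Unicode (no validation)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("label does not carry the IDNA prefix")
    encoded = a_label[len(ACE_PREFIX):]
    return encoded.encode("ascii").decode("punycode")
```

An application offering the user a choice of display method would call to_u_label() for the default rendering and show the A-label unchanged for the special-purpose one.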
961 Domain names are often stored and transported in many places. For
962 example, they are part of documents such as mail messages and web
963 pages. They are transported in many parts of many protocols, such as
964 both the control commands and the RFC 2822 body parts of SMTP, and
965 the headers and the body content in HTTP. It is important to
966 remember that domain names appear both in domain name slots and in
967 the content that is passed over protocols.
969 In protocols and document formats that define how to handle
970 specification or negotiation of charsets, labels can be encoded in
971 any charset allowed by the protocol or document format. If a
972 protocol or document format only allows one charset, the labels MUST
973 be given in that charset. Of course, not all charsets can properly
974 represent all labels. If a U-label cannot be displayed in its
975 entirety, the only choice (without loss of information) may be to
976 display the A-label.
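The fallback described in the last sentence above might look like the following Python sketch. The function name is hypothetical, the target charset is assumed to be known to the application, and no IDNA validation of the label is attempted.

```python
def displayable_form(u_label: str, charset: str) -> str:
    """Return the U-label if charset can represent it, else the A-label."""
    try:
        # Probe only: the encoded bytes are discarded.
        u_label.encode(charset)
        return u_label
    except UnicodeEncodeError:
        # Some character has no representation in charset, so fall
        # back to the ASCII form rather than lose information.
        return "xn--" + u_label.encode("punycode").decode("ascii")
```

Testing the whole label at once, rather than substituting individual characters, matches the "in its entirety" condition in the text.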
978 In any place where a protocol or document format allows transmission
979 of the characters in internationalized labels, labels SHOULD be
980 transmitted using whatever character encoding and escape mechanism
981 the protocol or document format uses at that place. This provision
982 is intended to prevent situations in which, e.g., UTF-8 domain names
983 appear embedded in text that is otherwise in some other character
984 coding.
986 All protocols that use domain name slots already have the capacity
987 for handling domain names in the ASCII charset. Thus, A-labels can
988 inherently be handled by those protocols.
990 7.3. Linguistic Expectations: Ligatures, Digraphs, and Alternate
991 Character Forms
993 Users often have expectations about character matching or equivalence
994 that are based on their languages and the orthography of those
995 languages. These expectations may not be consistent with forms or
996 actions that can be naturally accommodated in a character coding
997 system, especially if multiple languages are written using the same
998 script but using different conventions. A Norwegian user might
999 expect a label with the ae-ligature to be treated as the same label
1000 as one using the Swedish spelling with a-umlaut even though applying
1001 that mapping to English would be astonishing to users. A German
1002 user might expect a label with an o-umlaut and a label with "oe"
1003 substituted, but otherwise the same, to be treated as equivalent
1004 even though that substitution would be a clear error in Swedish. A
1005 Chinese user might expect automatic matching of Simplified and
1006 Traditional Chinese characters, but applying that matching for Korean
1007 or Japanese text would create considerable confusion. For that
1008 matter, an English user might expect "theater" and "theatre" to
1009 match.
1011 Related issues arise because there are a number of languages written
1012 with alphabetic scripts in which single phonemes are written using
1013 two characters, termed a "digraph", for example, the "ph" in
1014 "pharmacy" and "telephone". (Note that characters paired in this
1015 manner can also appear consecutively without forming a digraph, as in
1016 "tophat".) Certain digraphs are normally indicated typographically
1017 by setting the two characters closer together than they would be if
1018 used consecutively to represent different phonemes. Some digraphs
1019 are fully joined as ligatures (strictly designating setting totally
1020 without intervening white space, although the term is sometimes
1021 applied to close set pairs). An example of this may be seen when the
1022 word "encyclopaedia" is set with a U+00E6 LATIN SMALL LETTER AE
1023 (and some would not consider that word correctly spelled unless the
1024 ligature form was used or the "a" was dropped entirely). When these
1025 ligature and digraph forms have the same interpretation across all
1026 languages that use a given script, application of Unicode
1027 normalization generally resolves the differences and causes them to
1028 match. When they have different interpretations, any requirements
1029 for matching must utilize other methods or users must be educated to
1030 understand that matching will not occur.
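The difference between forms that normalization resolves and forms that it does not can be checked directly with Python's unicodedata module. The sketch below is illustrative; the choice of U+FB01 (the "fi" ligature) as the example of a form with a uniform interpretation is an assumption made here, not an example from this document.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


# A typographic ligature with one interpretation everywhere:
# normalization maps it to the two-letter sequence.
fi_matches = nfkc("\ufb01le") == "file"

# U+00E6 is a letter in its own right in some languages:
# normalization leaves it alone, so it never matches "ae".
ae_matches = nfkc("\u00e6") == "ae"
```

Here fi_matches is true and ae_matches is false, so any matching of U+00E6 with "ae" has to come from some other mechanism, as the text notes.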
1032 Difficulties arise from the fact that a given ligature may be a
1033 completely optional typographic convenience for representing a
1034 digraph in one language (as in the above example with some spelling
1035 conventions), while in another language it is a single character that
1036 may not always be correctly representable by a two-letter sequence
1037 (as in the above example with different spelling conventions). This
1038 can be illustrated by many words in the Norwegian language, where the
1039 "ae" ligature is the 27th letter of a 29-letter extended Latin
1040 alphabet. It is equivalent to the 28th letter of the Swedish
1041 alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A
1042 WITH DIAERESIS, for which an "ae" cannot be substituted according to
1043 current orthographic standards.
1045 That character (U+00E4) is also part of the German alphabet where,
1046 unlike in the Nordic languages, the two-character sequence "ae" is
1047 usually treated as a fully acceptable alternate orthography. The
1048 inverse is however not true, and those two characters cannot
1049 necessarily be combined into an "umlauted a". This also applies to
1050 another German character, the "umlauted o" (U+00F6 LATIN SMALL LETTER
1051 O WITH DIAERESIS) which, for example, cannot be used for writing the
1052 name of the author "Goethe". It is also a letter in the Swedish
1053 alphabet where, in parallel to the "umlauted a", it cannot be
1054 correctly represented as "oe" and in the Norwegian alphabet, where it
1055 is represented, not as "umlauted o", but as "slashed o", U+00F8.
1057 Some of the ligatures that have explicit code points in Unicode were
1058 given special handling in IDNA2003 and now pose additional problems
1059 as people argue that they should have been treated differently to
1060 preserve important information. For example, the German character
1061 Eszett (Sharp S, U+00DF) is retained as itself by NFKC but case-
1062 folded by Stringprep to "ss", but the closely-related, but less
1063 frequently seen, character "Long S T" (U+FB05) is a compatibility
1064 character that is mapped out by NFKC. Unless exceptions are made,
1065 both will be treated as DISALLOWED by IDNA2008. But there is
1066 significant interest in an exception, especially for Eszett.
1067 Depending on what the exception was, making it would either raise
1068 some backward compatibility problems with IDNA2003 or create an
1069 unusual special case that would highlight differences in preferred
1070 orthography between German as written in Germany and German as
1071 written in some other countries, notably Switzerland. Additional
1072 discussion of issues with Eszett appears in Section 10.7.
1074 Additional cases with alphabets written right to left are described
1075 in Section 7.5.
1077 Whether ligatures and digraphs are to be treated as a sequence of
1078 characters or as a single standalone one constitutes a problem that
1079 cannot be resolved solely by operating on scripts. It is,
1080 however, a key concern in the IDN context. Its satisfactory
1081 resolution will require support in policies set by registries, which
1082 therefore need to be particularly mindful not just of this specific
1083 issue, but of all other related matters that cannot be dealt with on
1084 an exclusively algorithmic basis.
1086 Just as with the examples of different-looking characters that may be
1087 assumed to be the same, it is in general impossible to deal with
1088 these situations in a system such as IDNA -- or with Unicode
1089 normalization generally -- since determining what to do requires
1090 information about the language being used, context, or both.
1091 Consequently, these specifications make no attempt to treat these
1092 combined characters in any special way. However, their existence
1093 provides a prime example of a situation in which a registry that is
1094 aware of the language context in which labels are to be registered,
1095 and where that language sometimes (or always) treats the two-
1096 character sequences as equivalent to the combined form, should give
1097 serious consideration to applying a "variant" model [RFC3743]
1098 [RFC4290] to reduce the opportunities for user confusion and fraud
1099 that would result from the related strings being registered to
1100 different parties.
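One very simplified way a registry might apply such a model is sketched below in Python. This is not the procedure of [RFC3743] or [RFC4290]; the table contents, class name, and method names are all hypothetical, and the sketch assumes a registry whose language context treats the listed two-character sequences as equivalent to the combined forms.

```python
# Hypothetical table for a zone with a German language context.
GERMAN_TABLE = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}


def variant_key(label: str, table: dict) -> str:
    """Fold each combined form to its two-character equivalent."""
    return "".join(table.get(ch, ch) for ch in label)


class VariantRegistry:
    """Refuse a label whose variant is held by a different party."""

    def __init__(self, table: dict) -> None:
        self.table = table
        self.owner_by_key = {}

    def register(self, label: str, registrant: str) -> bool:
        key = variant_key(label, self.table)
        # The first registrant of any variant owns the whole set.
        owner = self.owner_by_key.setdefault(key, registrant)
        return owner == registrant
```

Keying registrations on the folded form means related strings can still be registered, but only to the same party, which addresses the confusion and fraud concern raised above.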
1102 7.4. Case Mapping and Related Issues
1104 Traditionally in the DNS, ASCII letters have been stored with their
1105 case preserved. Matching during the query process has been case-
1106 independent, but none of the information that might be represented by
1107 choices of case has been lost. That model has been accidentally
1108 helpful because, as people have created DNS labels by catenating
1109 words (or parts of words) to form labels, case has often been used to
1110 distinguish among components and make the labels more memorable.
1112 The solution of keeping the characters separate but doing matching
1113 independent of case is not feasible with an IDNA-like model because
1114 the matching would then have to be done on the server rather than
1115 have characters mapped on the client. That situation was recognized
1116 in IDNA2003 and nothing in IDNA2008 fundamentally changes it or could
1117 do so. In IDNA2003, all upper-case characters are mapped to lower-
1118 case ones and, in general, all code points that represent alternate
1119 forms of the same character are mapped to that character (including
1120 mapping Greek final form sigma to the medial form). IDNA2008
1121 permits, at the risk of some incompatibility, slightly more
1122 flexibility in this area. That additional flexibility still does not
1123 solve the problem with final form sigma and other characters that
1124 Unicode treats as completely separate characters that match only
1125 under casemapping if at all. Many people now believe these should be
1126 handled as separate characters so information about them can be
1127 preserved in the transformations to A-labels and back. However,
1128 making a change to permit that behavior would create a situation in
1129 which the same string, valid in both protocols, would be interpreted
1130 differently by IDNA2003 and IDNA2008. In principle, that would
1131 violate one of the conditions discussed in Section 10.3.1 and hence
1132 require a prefix change. Of course, if a prefix change were made (at
1133 the costs discussed in Section 10.3.3) there would be several
1134 options, including, if desired, assigning the character to the
1135 CONTEXTUAL RULE REQUIRED category and requiring that it only be used
1136 in carefully-selected contexts.
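The information loss described above can be illustrated with a minimal sketch. It assumes that Python's built-in full case folding followed by NFKC is an acceptable stand-in for the IDNA2003 mapping step; it is not Nameprep itself, and the function names are illustrative only.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA
ESZETT = "\u00df"        # LATIN SMALL LETTER SHARP S


def fold(label: str) -> str:
    """Approximate an IDNA2003-style mapping: case folding, then NFKC.

    Assumption: this stands in for Nameprep and is not a full
    implementation of it.
    """
    return unicodedata.normalize("NFKC", label.casefold())


def survives_folding(label: str) -> bool:
    """True if the label is unchanged by the mapping, i.e. nothing is lost."""
    return fold(label) == label
```

Once a label has been folded in this way, the original choice between final and medial sigma cannot be recovered from the stored form, which is the loss that the text above describes.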
1138 7.5. Right to Left Text
1140 In order to be sure that the directionality of right to left text is
1141 unambiguous, IDNA2003 required that any label in which right to left
1142 characters appear both start and end with them and not include any
1143 characters with strong left to right properties (which excludes other
1144 alphabetic characters but permits European digits); it rejects any
1145 other string that contains a right to left character. This is one of
1146 the few places where the IDNA algorithms (both old and new) are
1147 required to look at an entire label, not just at individual
1148 characters. The algorithmic model used in IDNA2003 rejects the label
1149 when the final character in a right to left string requires a
1150 combining mark in order to be correctly represented.
1152 This problem manifests itself in languages written with consonantal
1153 alphabets to which diacritical vocalic systems are applied, and in
1154 languages with orthographies derived from them where the combining
1155 marks may have different functionality. In both cases the combining
1156 marks can be essential components of the orthography. Examples of
1157 this are Yiddish, written with an extended Hebrew script, and Dhivehi
1158 (the official language of Maldives) which is written in the Thaana
1159 script (which is, in turn, derived from the Arabic script). The new
1160 rules for right to left scripts are described in [IDNA2008-Bidi].
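The IDNA2003 whole-label test summarized at the start of this section can be sketched as follows. This is a simplified reading of the rule as stated here, using the bidirectional classes reported by Python's unicodedata module; the function name is hypothetical and the prohibited-character checks of Stringprep are not included.

```python
import unicodedata

RTL_CLASSES = ("R", "AL")  # strong right to left classes


def idna2003_bidi_ok(label: str) -> bool:
    """Simplified IDNA2003-style bidi test for a single label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True  # no right to left characters: the rule does not apply
    if "L" in classes:
        return False  # strong left to right characters may not be mixed in
    # The first and last characters must themselves be right to left.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
```

A label that ends in a combining mark fails the last test because such marks have the bidirectional class NSM rather than R or AL. This is the failure mode that affects Yiddish and Dhivehi.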
1162 8. IDNs and the Robustness Principle
1164 The model of IDNs described in this document can be seen as a
1165 particular instance of the "Robustness Principle" that has been so
1166 important to other aspects of Internet protocol design. This
1167 principle is often stated as "Be conservative about what you send and
1168 liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2
1169 [RFC1123]). For IDNs to work well, not only must the protocol be
1170 carefully designed and implemented, but zone administrators
1171 (registries) must have and require sensible policies about what is
1172 registered -- conservative policies -- and implement and enforce
1173 them.
1175 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels
1176 that clearly violate global (protocol) rules (no one has ever
1177 seriously claimed that being liberal in what is accepted requires
1178 being stupid). However, once one gets past such global rules and
1179 deals with anything sensitive to script or locale, it is necessary to
1180 assume that garbage has not been placed into the DNS, i.e., one must
1181 be liberal about what one is willing to look up in the DNS rather
1182 than guessing about whether it should have been permitted to be
1183 registered.
1185 As mentioned elsewhere, if a string doesn't resolve, it makes no
1186 difference whether it simply wasn't registered or was prohibited by
1187 some rule.
1189 If resolvers, as a user interface (UI) or other local matter, decide
1190 to warn about some strings that are valid under the global rules but
1191 that they perceive as dangerous, that is their prerogative and we can
1192 only hope that the market (and maybe regulators) will reinforce the
1193 good choices and discourage the poor ones. In this context, a
1194 resolver that decides a string that is valid under the protocol is
1195 dangerous and refuses to look it up is in violation of the protocols;
1196 one that is willing to look something up, but warns against it, is
1197 exercising a local choice.
1199 9. Front-end and User Interface Processing
1201 Domain names may be identified and processed in many contexts. They
1202 may be typed in by users either by themselves or as part of URIs or
1203 IRIs. They may occur in running text or be processed by one system
1204 after being provided in another. Systems may wish to try to normalize
1205 URLs so as to determine (or guess) whether a reference is valid or
1206 two references point to the same object without actually looking the
1207 objects up and comparing them. Some of these goals may be more
1208 easily and reliably satisfied than others. While there are strong
1209 arguments for any domain name that is placed "on the wire" --
1210 transmitted between systems -- to be in the minimum-ambiguity forms
1211 of A-labels, U-labels, or LDH-labels, it is inevitable that programs
1212 that process domain names will encounter variant forms. One source
1213 of such forms will be labels created under IDNA2003. Because of the
1214 way that protocol was specified, there are a significant number of
1215 domain names in files on the Internet that use characters that cannot
1216 be represented directly in domain names but for which interpretations
1217 are provided. There are two major categories of such characters,
1218 those that are removed by NFKC normalization and those upper-case
1219 characters that are mapped to lower-case (there are also a few
1220 characters that are given special-case mapping treatment in
1221 Stringprep).
1223 Other issues in domain name identification and processing arise
1224 because IDNA2003 specified that several other characters be treated
1225 as equivalent to the ASCII period (dot, full stop) character used as
1226 a label separator. If a domain name appears in an arbitrary context
1227 (such as running text), one may be faced with the requirement to know
1228 that a string is a domain name in order to adjust for the different
1229 forms of dots but also to have traditional dots to recognize that a
1230 string is a domain name -- an obvious contradiction.
1232 As discussed elsewhere in this document, the IDNA2008 model removes
1233 all of these mappings and interpretations, including the equivalence
1234 of different forms of dots, from the protocol, leaving such mappings
1235 to local processing. This should not be taken to imply that local
1236 processing is optional or can be avoided entirely. Instead, unless
1237 the program context is such that it is known that any IDNs that
1238 appear will be either U-labels or A-labels, some local processing of
1239 apparent domain name strings will be required, both to maintain
1240 compatibility with IDNA2003 and to prevent user astonishment. Such
1241 local processing, while not specified in this document or the
1242 associated ones, will generally take one of two forms:
1244 o Generic Preprocessing.
1245 When the context in which the program or system that processes
1246 domain names operates is global, a reasonable balance must be
1247 found that is sensitive to the broad range of local needs and
1248 assumptions while, at the same time, not sacrificing the needs of
1249 one language, script, or user population to those of another.
1251 For this case, the best practice will usually be to apply NFKC and
1252 case-mapping (or, perhaps better yet, Stringprep itself), plus
1253 dot-mapping where appropriate, to the domain name string prior to
1254 applying IDNA. That practice will not only yield a reasonable
1255 compromise of user experience with protocol requirements but will
1256 be almost completely compatible with the various forms permitted
1257 by IDNA2003.
1259 o Highly Localized Preprocessing.
1260 Unlike the case above, there will be some situations in which
1261 software will be highly localized for a particular environment and
1262 carefully adapted to the expectations of users in that
1263 environment. The many discussions about using the Internet to
1264 preserve and support local cultures suggest that these cases may
1265 be more common in the future than they have been so far.
1267 In these cases, we should avoid trying to tell implementers what
1268 they should do, if only because they are quite likely (and for
1269 good reason) to ignore us. We would assume that they would map
1270 characters that the intuitions of their users would suggest be
1271 mapped. One can imagine switches about whether some sorts of
1272 mappings occur, warnings before applying them or, in a slightly
1273 more extreme version of the approach taken in Internet Explorer
1274 version 7 (IE7), utter refusal to handle "strange" characters at
1275 all if they appear in U-label form. None of those local decisions
1276 are a threat to interoperability as long as (i) only U-labels and
1277 A-labels are used in interchange with systems outside the local
1278 environment, (ii) no character that would be valid in a U-label as
1279 itself is mapped to something else, (iii) any local mappings are
1280 applied as a preprocessing step (or, for conversions from U-labels
1281 or A-labels to presentation forms, postprocessing), not as part of
1282 IDNA processing proper, and (iv) appropriate consideration is
1283 given to labels that might have entered the environment in
1284 conformance to IDNA2003.
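As an illustration of the Generic Preprocessing case above, the following minimal sketch applies dot-mapping, NFKC, and case folding before any IDNA processing. It is an approximation under stated assumptions, not Stringprep: the function name is hypothetical, Python's full case folding stands in for the Stringprep case-mapping tables, and the list of alternate dots is the set that IDNA2003 treated as label separators.

```python
import unicodedata

# Characters that IDNA2003 treated as equivalent to the ASCII period.
IDNA2003_DOTS = ("\u3002", "\uff0e", "\uff61")


def generic_preprocess(name: str) -> str:
    """Local preprocessing applied before IDNA: dots, NFKC, case folding."""
    for dot in IDNA2003_DOTS:
        name = name.replace(dot, ".")
    name = unicodedata.normalize("NFKC", name)
    name = name.casefold()
    # Case folding can denormalize some strings, so normalize once more.
    return unicodedata.normalize("NFKC", name)
```

The result would then be split into labels and handed to IDNA. A highly localized implementation would replace or extend this step rather than alter IDNA processing itself.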
1286 10. Migration and Version Synchronization
1288 10.1. Design Criteria
1290 As mentioned above and in RFC 4690, two key goals of this work are to
1291 enable applications to be agnostic about whether they are being run
1292 in environments supporting any Unicode version from 3.2 onward and to
1293 permit incrementally adding permitted scripts and other character
1294 collections without disruption or, subsequent to this version,
1295 "heavy" processes such as formation of an IETF WG. The mechanisms
1296 that support this are outlined above, but this section reviews them
1297 in a context that may be more helpful to those who need to understand
1298 the approach and make plans for it.
1300 10.1.1. General IDNA Validity Criteria
1302 The general criteria for a putative label, and the collection of
1303 characters that make it up, to be considered IDNA-valid are:
1305 o The characters are "letters", marks needed to form letters,
1306 numerals, or other code points used to write words in some
1307 language. Symbols, drawing characters, and various notational
1308 characters are permanently excluded -- some because they are
1309 actively dangerous in URI, IRI, or similar contexts and others
1310 because there is no evidence that they are important enough to
1311 Internet operations or internationalization to justify inclusion
1312 and the complexities that would come with it (additional
1313 discussion and rationale for the symbol decision appears in
1314 Section 10.5).
1316 o Other than in very exceptional cases, e.g., where they are needed
1317 to write substantially any word of a given language, punctuation
1318 characters are excluded as well. The fact that a word exists is
1319 not proof that it should be usable in a DNS label and DNS labels
1320 are not expected to be usable for multiple-word phrases (although
1321 they are certainly not prohibited if the conventions and
1322 orthography of a particular language cause that to be possible).
1324 o Characters that are unassigned (have no character assignment at
1325 all) in the version of Unicode being used by the registry or
1326 application are not permitted, even on resolution (lookup). There
1327 are at least two reasons for this. First, tests involving the context of
1328 characters (e.g., some characters being permitted only adjacent to
1329 ones of specific types but otherwise invisible or very problematic
1330 for other reasons) and integrity tests on complete labels are
1331 needed. Unassigned code points cannot be permitted because one
1332 cannot determine whether particular code points will require
1333 contextual rules (and what those rules should be) before
1334 characters are assigned to them and the properties of those
1335 characters fully understood. Second, Unicode specifies that an
1336 unassigned code point normalizes and case folds to itself. If the
1337 code point is later assigned to a character, and particularly if
1338 the newly-assigned code point has a combining class that
1339 determines its placement relative to other combining characters,
1340 it could normalize to some other code point or sequence, creating
1341 confusion and/or violating other rules listed here.
1343 o Any character that is mapped to another character by Nameprep2003
1344 or by a current version of NFKC is prohibited as input to IDNA
1345 (for either registration or resolution). Implementers of user
1346 interfaces to applications are free to make those conversions when
1347 they consider them suitable for their operating system
1348 environments, context, or users.
1350 Tables used to identify the characters that are IDNA-valid are
1351 expected to be driven by the principles above (described in more
1352 precise form in [IDNA2008-Tables]). The principles are not just an
1353 interpretation of the tables.
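A rough first pass over the criteria above can be sketched per code point. This sketch is not a substitute for the tables in [IDNA2008-Tables]: the category names it returns are invented for illustration, case folding plus NFKC is assumed to approximate "mapped by Nameprep2003 or NFKC", and exceptions (such as the ASCII hyphen) are not handled.

```python
import unicodedata


def preliminary_status(ch: str) -> str:
    """Classify one code point against the general criteria (sketch only)."""
    category = unicodedata.category(ch)
    if category == "Cn":
        # Unassigned in the Unicode version that this Python build carries.
        return "UNASSIGNED"
    if unicodedata.normalize("NFKC", ch.casefold()) != ch:
        # The character would be mapped to something else, so it is
        # prohibited as input to IDNA.
        return "MAPPED"
    if category[0] in "SP":
        # Symbols and punctuation; the real tables carry a few exceptions.
        return "SYMBOL_OR_PUNCTUATION"
    return "CANDIDATE"
```

The answer for a given code point depends on the Unicode version behind the unicodedata module, which is the version sensitivity that the third criterion addresses.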
1355 10.1.2. Labels in Registration
1357 Anyone entering a label into a DNS zone must properly validate that
1358 label -- i.e., be sure that the criteria for an A-label are met -- in
1359 order for Unicode version-independence to be possible. In
1360 particular:
1362 o Any label that contains hyphens as its third and fourth characters
1363 MUST be IDNA-valid. This implies that, (i) if the third and
1364 fourth characters are hyphens, the first and second ones MUST be
1365 "xn" until and unless this specification is updated to permit
1366 other prefixes and (ii) labels starting in "xn--" MUST be valid
1367 A-labels, as discussed in Section 3 above.
1369 o The Unicode tables (i.e., tables of code points, character
1370 classes, and properties) and IDNA tables (i.e., tables of
1371 contextual rules such as those described above), MUST be
1372 consistent on the systems performing or validating labels to be
1373 registered. Note that this does not require that tables reflect
1374 the latest version of Unicode, only that all tables used on a
1375 given system are consistent with each other.
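The hyphen rule in the first item above can be sketched as a registration-side check. The sketch uses Python's built-in "punycode" codec and tests only that the part after the prefix decodes and re-encodes to itself; a real check would also have to confirm that the decoded string is a valid U-label. The function name is hypothetical.

```python
def hyphen_rule_ok(label: str) -> bool:
    """Check the rule for labels with hyphens in positions three and four."""
    label = label.lower()
    if label[2:4] != "--":
        return True  # the rule does not apply to this label
    if not label.startswith("xn--"):
        return False  # no other prefix is currently permitted
    encoded = label[4:]
    try:
        decoded = encoded.encode("ascii").decode("punycode")
        # A well-formed A-label must survive the round trip unchanged.
        return decoded.encode("punycode").decode("ascii") == encoded
    except UnicodeError:
        return False
```

For example, "xn--bcher-kva" passes, while a label such as "ab--cd" is rejected because its first two characters are not "xn".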
1377 Under this model, a registry (or entity communicating with a registry
1378 to accomplish name registrations) will need to update its tables --
1379 both the Unicode-associated tables and the tables of permitted IDN
1380 characters -- to enable a new script or other set of new characters.
1381 It will not be affected by newer versions of Unicode, or newly-
1382 authorized characters, until and unless it wishes to make those
1383 registrations. The registration side is also responsible --under the
1384 protocol and to registrants and users-- for much more careful
1385 checking than is expected of applications systems that look names up,
1386 both checking as required by the protocol and checking required by
1387 whatever policies it develops for minimizing risks due to confusable
1388 characters and sequences and preserving language or script integrity.
1390 Systems looking up or resolving DNS labels MUST be able to assume
1391 that applicable registration rules were followed for names entered
1392 into the DNS.
1394 10.1.3. Labels in Resolution (Lookup)
1396 Anyone looking up a label in a DNS zone
1398 o MUST maintain a consistent set of tables, as discussed above. As
1399 with registration, the tables need not reflect the latest version
1400 of Unicode but they MUST be consistent.
1402 o MUST validate the characters in labels to be looked up only to the
1403 extent of determining that the U-label does not contain either
1404 code points prohibited by IDNA (categorized as "DISALLOWED") or
1405 code points that are unassigned in its version of Unicode.
1407 o MUST validate the label itself for conformance with a small number
1408 of whole-label rules, notably verifying that there are no leading
1409 combining marks, that the "bidi" conditions are met if right to
1410 left characters appear, that any required contextual rules are
1411 available and that, if such rules are associated with Joiner
1412 Controls, they are tested.
1414 o MUST NOT validate other contextual rules about characters,
1415 including mixed-script label prohibitions, although such rules MAY
1416 be used to influence presentation decisions in the user interface.
1418 By avoiding applying its own interpretation of which labels are valid
1419 as a means of rejecting lookup attempts, the resolver application
1420 becomes less sensitive to version incompatibilities with the
1421 particular zone registry associated with the domain name.
1423 An application or client that looks names up in the DNS will be able
1424 to resolve any name that is validly registered, as long as its
1425 version of the Unicode-associated tables is sufficiently up-to-date
1426 to interpret all of the characters in the label. It SHOULD
1427 distinguish, in its messages to users, between "label contains an
1428 unallocated code point" and other types of lookup failures. A
1429 failure on the basis of an old version of Unicode may lead the user
1430 to a desire to upgrade to a newer version, but will have no other ill
1431 effects (this is consistent with behavior in the transition to the
1432 DNS when some hosts could not yet handle some forms of names or
1433 record types).
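The lookup-side checks above are deliberately narrow, and a sketch may make that narrowness visible. In the following, the DISALLOWED set is a tiny hypothetical stand-in for the real tables, the bidi and contextual-rule tests are omitted, and the returned strings are invented labels whose only purpose is to let a caller distinguish an unallocated code point from other failures.

```python
import unicodedata
from typing import Optional

# Hypothetical stand-in for the DISALLOWED category of the real tables.
HYPOTHETICAL_DISALLOWED = frozenset({"\u2665"})


def lookup_problem(u_label: str) -> Optional[str]:
    """Return None if the label may be looked up, else a reason string."""
    for ch in u_label:
        if unicodedata.category(ch) == "Cn":
            # Reported separately so that the user can be told that a
            # newer version of Unicode might help.
            return "unassigned"
        if ch in HYPOTHETICAL_DISALLOWED:
            return "disallowed"
    if u_label and unicodedata.category(u_label[0]).startswith("M"):
        return "leading-combining-mark"
    # Deliberately no mixed-script or registry-policy checks here.
    return None
```

Nothing in the sketch tries to second-guess what a registry should have permitted; anything that passes these tests is simply looked up.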
1435 10.2. More Flexibility in User Agents
1437 These specifications do not perform mappings between one character or
1438 code point and others for any reason. Instead, they prohibit the
1439 characters that would be mapped to others by normalization, case
1440 folding, or other rules. As examples, while mathematical characters
1441 based on Latin ones are accepted as input to IDNA2003, they are
1442 prohibited in IDNA2008. Similarly, double-width characters and other
1443 variations are prohibited as IDNA input.
1445 Since the rules in [IDNA2008-Tables] provide that only strings that
1446 are stable under NFKC are valid, if it is convenient for an
1447 application to perform NFKC normalization before lookup, that
1448 operation is safe since this will never make the application unable
1449 to look up any valid string.
1451 In many cases these prohibitions should have no effect on what the
1452 user can type at resolution time. It is perfectly reasonable for
1453 systems that support user interfaces to perform some character
1454 mapping that is appropriate to the local environment. This would
1455 normally be done prior to actual invocation of IDNA. At least
1456 conceptually, the mapping would be part of the Unicode conversions
1457 discussed above and in [IDNA2008-Protocol]. However, those changes
1458 will be local ones only -- local to environments in which users will
1459 clearly understand that the character forms are equivalent. For use
1460 in interchange among systems, it appears to be much more important
1461 that U-labels and A-labels can be mapped back and forth without loss
1462 of information.
1464 One specific, and very important, instance of this strategy arises
1465 with case-folding. In the ASCII-only DNS, names are looked up and
1466 matched in a case-independent way, but no actual case-folding occurs.
1467 Names can be placed in the DNS in either upper or lower case form (or
1468 any mixture of them) and that form is preserved, returned in queries,
1469 and so on. IDNA2003 simulated that behavior by performing case-
1470 mapping at registration time (resulting in only lower-case IDNs in
1471 the DNS) and when names were looked up.
1473 As suggested earlier in this section, it appears to be desirable to
1474 do as little character mapping as possible consistent with having
1475 Unicode work correctly (e.g., NFC mapping to resolve different
1476 codings for the same character is still necessary although the
1477 specifications require that it be performed prior to invoking the
1478 protocol) and to make the mapping between A-labels and U-labels
1479 idempotent. Case-mapping is not an exception to this principle. If
1480 only lower case characters can be registered in the DNS (i.e., be
1481 present in a U-label), then IDNA2008 should prohibit upper-case
1482 characters as input. Some other considerations reinforce this
1483 conclusion. For example, an essential element of the ASCII case-
1484 mapping functions is that uppercase(character) must be equal to
1485 uppercase(lowercase(character)). That requirement may not be
1486 satisfied with IDNs. The relationship between upper case and lower
1487 case may even be language-dependent, with different languages (or
1488 even the same language in different areas) expecting different
1489 mappings. Of course, the expectations of users who are accustomed to
1490 a case-insensitive DNS environment will probably be well-served if
1491 user agents perform case mapping prior to IDNA processing, but the
1492 IDNA procedures themselves should neither require such mapping nor
1493 expect it when it is not natural to the localized environment.
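The round-trip property mentioned above is easy to test. The sketch below uses Python's locale-independent upper() and lower() methods, so it shows only the failures that arise from the default Unicode case mappings and not the language-dependent ones; the function name is illustrative.

```python
def case_round_trips(ch: str) -> bool:
    """True if uppercase(ch) equals uppercase(lowercase(ch))."""
    return ch.upper() == ch.lower().upper()


# Characters for which the ASCII-style assumption fails:
#   U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
#   U+212A KELVIN SIGN
#   U+03F4 GREEK CAPITAL THETA SYMBOL
COUNTEREXAMPLES = ("\u0130", "\u212a", "\u03f4")
```

Every ASCII letter satisfies the property, while each of the listed characters lower-cases to something whose upper-case form is a different code point or sequence.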
1495 10.3. The Question of Prefix Changes
1497 The conditions that would require a change in the IDNA "prefix"
1498 ("xn--" for the version of IDNA specified in [RFC3490]) have been a
1499 great concern to the community. A prefix change would clearly be
1500 necessary if the algorithms were modified in a manner that would
1501 create serious ambiguities during subsequent transition in
1502 registrations. This section summarizes our conclusions about the
1503 conditions under which changes in prefix would be necessary and the
1504 implications of such a change.
1506 10.3.1. Conditions Requiring a Prefix Change
1508 An IDN prefix change is needed if a given string would resolve or
1509 otherwise be interpreted differently depending on the version of the
1510 protocol or tables being used. Consequently, work to update IDNs
1511 would require a prefix change if, and only if, one of the following
1512 four conditions were met:
1514 1. The conversion of an A-label to Unicode (i.e., a U-label) yields
1515 one string under IDNA2003 (RFC3490) and a different string under
1516 IDNA2008.
1518 2. An input string that is valid under IDNA2003 and also valid under
1519 IDNA2008 yields two different A-labels with the different
1520 versions of IDNA. This condition is believed to be essentially
1521 equivalent to the one above.
1523 Note, however, that if the input string is valid under one
1524 version and not valid under the other, this condition does not
1525 apply. See the first item in Section 10.3.2, below.
1527 3. A fundamental change is made to the semantics of the string that
1528 is inserted in the DNS, e.g., if a decision were made to try to
1529 include language or specific script information in that string,
1530 rather than having it be just a string of characters.
1532 4. A sufficiently large number of characters is added to Unicode so
1533 that the Punycode mechanism for block offsets no longer has
1534 enough capacity to reference the higher-numbered planes and
1535 blocks. This condition is unlikely even in the long term and
1536 certain not to arise in the next few years.
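The second condition above can be sketched as a comparison of two conversion functions. Python's built-in "idna" codec implements the IDNA2003 conversion; the second function is purely hypothetical and is not IDNA2008. It merely encodes the label without any mapping, as a revision that preserved final form sigma would have to do. The sample strings are assumed to be valid under both versions.

```python
def to_ascii_idna2003(label: str) -> bytes:
    """IDNA2003 ToASCII for one label, via Python's built-in codec."""
    return label.encode("idna")


def to_ascii_preserving(label: str) -> bytes:
    """Hypothetical conversion that applies no character mapping."""
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


def forces_prefix_change(label: str) -> bool:
    """True if the two conversions yield different A-labels."""
    return to_ascii_idna2003(label) != to_ascii_preserving(label)
```

A label ending in final form sigma yields two different A-labels under this comparison, which is why Section 7.4 treats that change as one that would, in principle, require a new prefix.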
1538 10.3.2. Conditions Not Requiring a Prefix Change
1540 In particular, as a result of the principles described above, none of
1541 the following changes require a new prefix:
1543 1. Prohibition of some characters as input to IDNA. This may make
1544 names that are now registered inaccessible, but does not require
1545 a prefix change.
1547 2. Adjustments in Stringprep tables or IDNA actions, including
1548 normalization definitions, that affect characters that were
1549 already invalid under IDNA2003.
1551 3. Changes in the style of definitions of Stringprep or Nameprep
1552 that do not alter the actions performed by them.
1554 Of course, because these specifications do not involve changes to
1555 Stringprep or Nameprep, the third condition above and part of the
1556 second are moot.
1558 10.3.3. Implications of Prefix Changes
1560 While it might be possible to make a prefix change, the costs of such
1561 a change are considerable. Even if they wanted to do so, all
1562 registries could not convert all IDNA2003 ("xn--") registrations to a
1563 new form at the same time and synchronize that change with
1564 applications supporting lookup. Unless all existing registrations
1565 were simply to be declared invalid, and perhaps even then, systems
1566 that needed to support both labels with old prefixes and labels with
1567 new ones would first process a putative label under the IDNA2008
1568 rules and try to look it up and then, if it were not found, would
1569 process the label under IDNA2003 rules and look it up again. That
1570 process could significantly slow down all processing that involved
1571 IDNs in the DNS especially since, in principle, a fully-qualified
1572 name could contain a mixture of labels that were registered with the
1573 old and new prefixes, a situation that would make the use of DNS
1574 caching very difficult. In addition, looking up the same input
1575 string as two separate A-labels would create some potential for
1576 confusion and attacks, since they could, in principle, resolve to
1577 different targets.
1579 Consequently, a prefix change is to be avoided if at all possible,
1580 even if it means accepting some IDNA2003 decisions about character
1581 distinctions as irreversible.
1583 10.4. Stringprep Changes and Compatibility
1585 Concerns have been expressed about problems for non-DNS uses of
1586 Stringprep being caused by changes to the specification intended to
1587 improve the handling of IDNs, most notably as this might affect
1588 identification and authentication protocols. Section 10.3, above,
1589 essentially also applies in this context. The proposed new inclusion
1590 tables [IDNA2008-Tables], the reduction in the number of characters
1591 permitted as input for registration or resolution (Section 6), and
1592 even the proposed changes in handling of right to left strings
1593 [IDNA2008-Bidi] either give interpretations to strings prohibited
1594 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings
1595 that are valid under both IDNA2003 and IDNA2008, and the
1596 corresponding versions of Stringprep, are not changed in
1597 interpretation. This protocol does not use either Nameprep or
1598 Stringprep as specified in IDNA2003.
1600 It is particularly important to keep IDNA processing separate from
1601 processing for various security protocols because some of the
1602 constraints that are necessary for smooth and comprehensible use of
1603 IDNs may be unwanted or undesirable in other contexts. For example,
1604 the criteria for good passwords or passphrases are very different
1605 from those for desirable IDNs. Similarly, internationalized SCSI
1606 identifiers and other protocol components are likely to have
1607 different requirements than IDNs.
1609 Perhaps even more important in practice, since most other known uses
1610 of Stringprep encode or process characters that are already in
1611 normalized form and expect the use of only those characters that can
1612 be used in writing words of languages, the changes proposed here and
1613 in [IDNA2008-Tables] are unlikely to have any effect at all,
1614 especially not on registries and registrations that follow rules
1615 already in existence when this work started.
1617 10.5. The Symbol Question
1619 One of the major differences between this specification and the
1620 original version of IDNA is that the original version permitted non-
1621 letter symbols of various sorts, including punctuation and line-
1622 drawing symbols, in the protocol. They were always discouraged in
1623 practice. In particular, both the "IESG Statement" about IDNA and
1624 all versions of the ICANN Guidelines specify that only language
1625 characters be used in labels. This specification disallows symbols
1626 entirely. There are several reasons for this, which include:
1628 o As discussed elsewhere, the original IDNA specification assumed
1629 that as many Unicode characters as possible should be permitted,
1630 directly or via mapping to other characters, in IDNs. This
1631 specification operates on an inclusion model, extrapolating from
1632 the LDH rules --which have served the Internet very well-- to a
1633 Unicode base rather than an ASCII base.
1635 o Unicode names for letters are, in most cases, fairly
1636 intuitive, unambiguous and recognizable to users of the relevant
1637 script. Symbol names are more problematic because there may be no
1638 general agreement on whether a particular glyph matches a symbol;
1639 there are no uniform conventions for naming; variations such as
1640 outline, solid, and shaded forms may or may not exist; and so on.
1641 As just one example, consider a "heart" symbol as it might appear
1642 in a logo that might be read as "I love...". While the user might
1643 read such a logo as "I love..." or "I heart...", considerable
1644 knowledge of the coding distinctions made in Unicode is needed to
1645 know that there is more than one "heart" character (e.g., U+2665,
1646 U+2661, and U+2765) and how to describe it. These issues are of
1647 particular importance if strings are expected to be understood or
1648 transcribed by the listener after being read out loud.
1650 o As a simplified example of this, assume one wanted to use a
1651 "heart" or "star" symbol in a label. This is problematic because
1652 those names are ambiguous in the Unicode system of naming (the
1653 actual Unicode names require far more qualification). A user or
1654 would-be registrant has no way to know --absent careful study of
1655 the code tables-- whether it is ambiguous (e.g., where there are
1656 multiple "heart" characters) or not. Conversely, the user seeing
1657 the hypothetical label doesn't know whether to read it --try to
1658 transmit it to a colleague by voice-- as "heart", as "love", as
1659 "black heart", or as any of the other examples below.
1661 o The actual situation is even worse than this. There is no
1662 possible way for a normal, casual, user to tell the difference
1663 between the hearts of U+2665 and U+2765 and the stars of U+2606
1664 and U+2729 without somehow knowing to look for a
1665 distinction. We have a white heart (U+2661) and a few black hearts,
1666 and describing a label containing a heart symbol is hopelessly
1667 ambiguous. In cities where "Square" is a popular part of a
1668 location name, one might well want to use a square symbol in a
1669 label as well and there are far more squares of various flavors in
1670 Unicode than there are hearts or stars.
1672 o The consequence of these ambiguities of description and
1673 dependencies on distinctions that were, or were not, made in
1674 Unicode codings, is that symbols are a very poor basis for
1675 reliable communication. Of course, these difficulties with
1676 symbols do not arise with actual pictographic languages and
1677 scripts which would be treated like any other language characters;
1678 the two should not be confused.
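The naming problem described in the list above can be seen directly by
asking a Unicode character database for the formal names of the code
points cited. The following is a minimal sketch, assuming Python 3 and
its standard unicodedata module; the helper name describe is
hypothetical and used only for illustration.

```python
import unicodedata

# Code points cited in the text above: three "hearts" and two "stars".
CODE_POINTS = [0x2665, 0x2661, 0x2765, 0x2606, 0x2729]


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' using the local Unicode character database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


for cp in CODE_POINTS:
    # Each line shows a different formal name, none of which is simply
    # "heart" or "star".
    print(describe(cp))
```

None of the formal names is the single word a user would say aloud,
which is the ambiguity the text describes.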
1680 [[anchor32: Note in Draft: Should the above section be significantly
1681 trimmed or eliminated?]]
1683 10.6. Migration Between Unicode Versions: Unassigned Code Points
1685 In IDNA2003, labels containing unassigned code points are resolved on
1686 the theory that, if they appear in labels and can be resolved, the
1687 relevant standards must have changed and the registry has properly
1688 allocated only assigned values.
1690 In this specification, strings containing unassigned code points MUST
1691 NOT be either looked up or registered. There are several reasons for
1692 this, with the most important ones being:
1694 o It cannot be known with sufficient reliability in advance that a
1695 code point that was not previously assigned will not be assigned
1696 to a compatibility character. In IDNA2003, since there is no
1697 direct dependency on NFKC (Stringprep's tables are based on NFKC,
1698 but IDNA2003 depends only on Stringprep), allocation of a
1699 compatibility character might produce some odd situations, but it
1700 would not be a problem. In IDNA2008, where compatibility
1701 characters are generally assigned to DISALLOWED, permitting
1702 strings containing unassigned characters to be looked up would
1703 permit violating the principle that characters in DISALLOWED are
1704 not looked up.
1706 o More generally, the status of an unassigned character with regard
1707 to the DISALLOWED and PROTOCOL-VALID categories, and whether
1708 contextual rules are required with the latter, cannot be evaluated
1709 until a character is actually assigned and known.
1711 It is possible to argue that the issues above are not important and
1712 that, as a consequence, it is better to retain the principle of
1713 looking up labels even if they contain unassigned characters because
1714 all of the important scripts and characters have been coded as of
1715 Unicode 5.1 and hence unassigned code points will be assigned only to
1716 obscure characters or archaic scripts. Unfortunately, that does not
1717 appear to be a safe assumption for at least two reasons. First, much
1718 the same claim of completeness has been made for earlier versions of
1719 Unicode. The reality is that a script that is obscure to much of the
1720 world may still be very important to those who use it. Cultural and
1721 linguistic preservation principles make it inappropriate to declare
1722 the script of no importance in IDNs. Second, we already have
1723 counterexamples in, e.g., the relationships associated with new Han
1724 characters being added (whether in the BMP or in Unicode Plane 2).
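The prohibition on unassigned code points can be sketched as a simple
pre-lookup check. This is an illustration only, assuming Python 3 and
its standard unicodedata module; the function names has_unassigned and
may_look_up are hypothetical, and the check covers only the unassigned
rule, not the full DISALLOWED / PROTOCOL-VALID evaluation defined in
[IDNA2008-Tables].

```python
import unicodedata


def has_unassigned(label: str) -> bool:
    """True if any code point has General_Category Cn (not assigned)
    in the Unicode version shipped with this Python interpreter."""
    return any(unicodedata.category(ch) == "Cn" for ch in label)


def may_look_up(label: str) -> bool:
    # Hypothetical gate: reject labels containing unassigned code points.
    # A real implementation would also apply the remaining IDNA2008 rules.
    return not has_unassigned(label)


print(may_look_up("example"))          # all code points assigned
print(may_look_up("ex\u0378mple"))     # U+0378 is unassigned
```

Note that the answer depends on the Unicode version known to the local
system, which is exactly the migration issue this section discusses.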
1726 10.7. Other Compatibility Issues
1728 The existing (2003) IDNA model includes several odd artifacts of the
1729 context in which it was developed. Many, if not all, of these are
1730 potential avenues for exploits, especially if the registration
1731 process permits "source" names (names that have not been processed
1732 through IDNA and nameprep) to be registered. As one example, since
1733 the character Eszett, used in German, is mapped by IDNA2003 into the
1734 sequence "ss" rather than being retained as itself or prohibited, a
1735 string containing that character but that is otherwise in ASCII is
1736 not really an IDN (in the U-label sense defined above) at all. After
1737 Nameprep maps the Eszett out, the result is an ASCII string and so
1738 does not get an xn-- prefix, but the string that can be displayed to
1739 a user appears to be an IDN. The proposed IDNA2008 eliminates this
1740 artifact. A character is either permitted as itself or it is
1741 prohibited; special cases that make sense only in a particular
1742 linguistic or cultural context can be dealt with as localization
1743 matters where appropriate.
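The Eszett artifact can be observed with any IDNA2003 implementation.
The sketch below assumes Python 3, whose standard "idna" codec is
documented as implementing RFC 3490; it is shown only to illustrate the
IDNA2003 behavior described above, not the IDNA2008 behavior.

```python
# Python's built-in "idna" codec applies the IDNA2003 rules (RFC 3490),
# including the Nameprep mapping step.
label = "stra\u00dfe"                 # "strasse" spelled with Eszett (U+00DF)
encoded = label.encode("idna")
print(encoded)                        # b'strasse'

# For comparison, a label whose non-ASCII character survives Nameprep
# is converted to an ACE form carrying the xn-- prefix.
other = "b\u00fccher".encode("idna")  # contains U+00FC
print(other.startswith(b"xn--"))      # True
```

The first result is a plain ASCII label with no xn-- prefix, even though
the string shown to the user appeared to be an IDN.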
1745 11. Acknowledgments
1747 The editor and contributors would like to express their thanks to
1748 those who contributed significant early review comments, sometimes
1749 accompanied by text, especially Mark Davis, Paul Hoffman, Simon
1750 Josefsson, and Sam Weiler. In addition, some specific ideas were
1751 incorporated from suggestions, text, or comments about sections that
1752 were unclear supplied by Frank Ellerman, Michael Everson, Asmus
1753 Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler,
1754 although, as usual, they bear little or no responsibility for the
1755 conclusions the editor and contributors reached after receiving their
1756 suggestions. Thanks are also due to Vint Cerf, Debbie Garside, and
1757 Jefsey Morphin for conversations that led to considerable
1758 improvements in the content of this document.
1760 A meeting was held on 30 January 2008 to attempt to reconcile
1761 differences in perspective and terminology about this set of
1762 specifications between the design team and members of the Unicode
1763 Technical Consortium. The discussions at and subsequent to that
1764 meeting were very helpful in focusing the issues and in refining the
1765 specifications. The active participants at that meeting were (in
1766 alphabetic order as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
1767 Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
1768 Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
1769 Michel Suignard, and Ken Whistler. We express our thanks to Google
1770 for support of that meeting and to the participants for their
1771 contributions.
1773 Special thanks are due to Paul Hoffman for permission to extract
1774 material from his Internet-Draft to form the basis for Section 2.
1776 12. Contributors
1778 While the listed editor held the pen, the core of this document and
1779 the initial WG version represent the joint work and conclusions of
1780 an ad hoc design team consisting of the editor and, in alphabetic
1781 order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp.
1782 In addition, there were many specific contributions and helpful
1783 comments from those listed in the Acknowledgments section and others
1784 who have contributed to the development and use of the IDNA
1785 protocols.
1787 13. IANA Considerations
1789 This section gives an overview of registries required for IDNA. The
1790 actual definition of the first one appears in [IDNA2008-Tables].
1792 13.1. IDNA Character Registry
1794 The distinction among the three major categories "UNASSIGNED",
1795 "DISALLOWED", and "PROTOCOL-VALID" is made by special categories and
1796 rules that are integral elements of [IDNA2008-Tables]. Convenience
1797 in programming and validation requires a registry of characters and
1798 scripts and their categories, updated for each new version of Unicode
1799 and the characters it contains. The details of this registry are
1800 specified in [IDNA2008-Tables].
1802 13.2. IDNA Context Registry
1804 For characters that are defined in the IDNA Character Registry list
1805 as PROTOCOL-VALID but requiring a contextual rule (i.e., the types of
1806 rule described in Section 6.1.1.1), IANA will create and maintain a
1807 list of approved contextual rules, using the "expert reviewer"
1808 model. Unlike usual practice, we recommend that the "expert
1809 reviewer" be a committee that reflects expertise on the relevant
1810 scripts, and encourage IANA, the IESG, and IAB to establish liaisons
1811 and work together with other relevant standards bodies to populate
1812 that committee and its procedures over the long term. [[anchor37:
1813 Note in Draft: This section requires careful review by the WG, since
1814 "expert review" may not be appropriate but other mechanisms may be
1815 excessively burdensome.]]
1817 A table from which that registry can be initialized, and some further
1818 discussion, appears in [RulesInit].
1820 13.3. IANA Repository of IDN Practices of TLDs
1822 This registry, historically described as the "IANA Language Character
1823 Set Registry" or "IANA Script Registry" (both somewhat misleading
1824 terms) is maintained by IANA at the request of ICANN. It is used to
1825 provide a central documentation repository of the IDN policies used
1826 by top level domain (TLD) registries who volunteer to contribute to
1827 it and is used in conjunction with ICANN Guidelines for IDN use.
1829 It is not an IETF-managed registry and, while the protocol changes
1830 specified here may call for some revisions to the tables, these
1831 specifications have no direct effect on that registry and no IANA
1832 action is required as a result.
1834 14. Security Considerations
1836 Security on the Internet partly relies on the DNS. Thus, any change
1837 to the characteristics of the DNS can change the security of much of
1838 the Internet.
1840 Domain names are used by users to identify and connect to Internet
1841 servers. The security of the Internet is compromised if a user
1842 entering a single internationalized name is connected to different
1843 servers based on different interpretations of the internationalized
1844 domain name.
1846 When systems use local character sets other than ASCII and Unicode,
1847 this specification leaves the problem of transcoding between the
1848 local character set and Unicode up to the application or local
1849 system. If different applications (or different versions of one
1850 application) implement different transcoding rules, they could
1851 interpret the same name differently and contact different servers.
1852 This problem is not solved by security protocols like TLS that do not
1853 take local character sets into account.
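The transcoding risk can be illustrated with a single byte string
interpreted under two different local character sets. This is a minimal
sketch, assuming Python 3 and two of its standard codecs; the byte
string is made up for illustration.

```python
# Bytes as they might arrive from a legacy, non-Unicode system.
raw = b"caf\xe9"

# The same bytes transcoded under two different assumptions about the
# local character set.
as_latin1 = raw.decode("latin-1")   # 0xE9 -> U+00E9
as_cp1251 = raw.decode("cp1251")    # 0xE9 -> U+0439

print(as_latin1 == as_cp1251)       # False
print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1251])
```

Two applications that disagree about the local character set would
therefore hand different Unicode strings to IDNA for the same input.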
1855 To help prevent confusion between characters that are visually
1856 similar, it is suggested that implementations provide visual
1857 indications where a domain name contains multiple scripts. Such
1858 mechanisms can also be used to show when a name contains a mixture of
1859 simplified and traditional Chinese characters, or to distinguish zero
1860 and one from O and l. DNS zone administrators may impose
1861 restrictions (subject to the limitations identified elsewhere in this
1862 document) that try to minimize characters that have similar
1863 appearance or similar interpretations. It is worth noting that there
1864 are no comprehensive technical solutions to the problems of
1865 confusable characters. One can reduce the extent of the problems in
1866 various ways, but probably never eliminate them. Some specific
1867 suggestions about identification and handling of confusable
1868 characters appear in a Unicode Consortium publication
1869 [Unicode-UTR36].
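A mixed-script indication of the kind suggested above might be
approximated as follows. This is a rough heuristic sketch only: Python's
standard library does not expose the Unicode Script property
[Unicode-Scripts], so the first word of each character name is used as
a stand-in. The function names rough_scripts and is_mixed_script are
hypothetical, and a real implementation should use actual Script
property data.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the scripts used in a label. ASSUMPTION: the first
    word of the character name (e.g. LATIN, CYRILLIC) is treated as the
    script, which is only a crude proxy for the Script property."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # ignore digits, hyphens, and other non-letters
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    return len(rough_scripts(label)) > 1


print(is_mixed_script("paypal"))        # Latin letters only
print(is_mixed_script("p\u0430ypal"))   # U+0430 is a Cyrillic letter
```

Such a check can flag a label for visual indication, but, as noted
above, it reduces the confusability problem rather than eliminating it.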
1871 The registration and resolution models described above and in
1872 [IDNA2008-Protocol] change the mechanisms available for applications
1873 and resolvers to determine the validity of labels they encounter. In
1874 some respects, the ability to test is strengthened. For example,
1875 putative labels that contain unassigned code points will now be
1876 rejected, while IDNA2003 permitted them (something that is now
1877 recognized as a considerable source of risk). On the other hand, the
1878 protocol specification no longer assumes that the application that
1879 looks up a name will be able to determine, and apply, information
1880 about the protocol version used in registration. In theory, that may
1881 increase risk since the application will be able to do less pre-
1882 lookup validation. In practice, the protection afforded by that test
1883 has been largely illusory for reasons explained in RFC 4690 and
1884 above.
1886 Any change to Stringprep or, more broadly, the IETF's model of the
1887 use of internationalized character strings in different protocols,
1888 creates some risk of inadvertent changes to those protocols,
1889 invalidating deployed applications or databases, and so on. Our
1890 current hypothesis is that the same considerations that would require
1891 changing the IDN prefix (see Section 10.3.2) are the ones that would,
1892 e.g., invalidate certificates or hashes that depend on Stringprep,
1893 but those cases require careful consideration and evaluation. More
1894 important, it is not necessary to change Stringprep2003 at all in
1895 order to make the IDNA changes contemplated here. It is far
1896 preferable to create a separate document, or separate profile
1897 components, for IDN work, leaving the question of upgrading to other
1898 protocols to experts on them and eliminating any possible
1899 synchronization dependency between IDNA changes and possible upgrades
1900 to security protocols or conventions.
1902 No mechanism involving names or identifiers alone can protect against a
1903 wide variety of security threats and attacks that are largely independent
1904 of them, including spoofed pages, DNS query trapping and diversion,
1905 and so on.
1907 15. Change Log
1909 [[anchor40: RFC Editor: Please remove this section.]]
1911 For version 00 of draft-ietf-idnabis-rationale, this list contains a
1912 complete trace going back through the earlier, design team, drafts.
1913 That earlier material will be removed in subsequent drafts.
1915 15.1. Version -01 of draft-klensin-idnabis-issues
1917 Version -01 of this document is a considerable rewrite from -00.
1918 Many sections have been clarified or extended and several new
1919 sections have been added to reflect discussions in a number of
1920 contexts since -00 was issued.
1922 15.2. Version -02 of draft-klensin-idnabis-issues
1924 o Corrected several editorial errors including an accidentally-
1925 introduced misstatement about NFKC.
1927 o Extensively revised the document to synchronize its terminology
1928 with version 03 of [IDNA2008-Tables] and to provide a better
1929 conceptual framework for its categories and how they are used.
1930 Added new material to clarify terminology and relationships with
1931 other efforts. More subtle changes in this version lay the
1932 groundwork for separating the document into a conceptual overview
1933 and a protocol specification for version 03.
1935 15.3. Version -03 of draft-klensin-idnabis-issues
1937 o Removed protocol materials to a separate document and incorporated
1938 rationale and explanation materials from the original
1939 specification in RFC 3490 into this document. Cleaned up earlier
1940 text to reflect a more mature specification and restructured
1941 several sections and added additional rationale material.
1943 o Strengthened and clarified the A-label / U-label / LDH-label
1944 definition.
1946 o Retitled the document to reflect its evolving role.
1948 15.4. Version -04 of draft-klensin-idnabis-issues
1950 o Moved more text from "protocol" and further reorganized material.
1952 o Provided new material on "Contextual Rule Required".
1954 o Improved consistency of terminology, both internally and with the
1955 "tables" document.
1957 o Improved the IANA Considerations section and discussed the
1958 existing IDNA-related registry.
1960 o More small changes to increase consistency.
1962 15.5. Version -05 of draft-klensin-idnabis-issues
1964 Changed "YES" category back to "ALWAYS" to re-synch with the tables
1965 document and provide clearer terminology.
1967 15.6. Version -06 of draft-klensin-idnabis-issues
1969 o Clarified the prohibitions on strings that look like A-labels but
1970 are not and on unassigned code points.
1972 o Clarified length restrictions on IDN labels.
1974 o Revised the terminology definitions to remove the impression of
1975 circularity and removed invocations of ToASCII and ToUnicode,
1976 which do not exist in IDNA2008.
1978 o Added a new section on front-end processing.
1980 o Added a new section to discuss case-mapping.
1982 o Extended the discussion of prefix changes to identify the
1983 implications of making one.
1985 o Several more editorial improvements, corrected references, and
1986 similar adjustments.
1988 15.7. Version -07 of draft-klensin-idnabis-issues
1990 o Added material that specifically defines the format of contextual
1991 rules.
1993 o Added and altered text after discussions at the 30 January meeting
1994 (see Section 11) and the follow-up to those discussions. Among
1995 the key decisions at that meeting were to eliminate the
1996 distinction among the valid categories (formerly "ALWAYS", "MAYBE
1997 YES", and "MAYBE NO"), to adjust the terminology accordingly, and
1998 to change "CONTEXTUAL RULE REQUIRED" from a separate category in
1999 this document and the protocol one to a modifier of what is now
2000 called "PROTOCOL-VALID". The consequent changes resulted in
2001 removal of several sections of explanation from this document.
2003 o Resynchronized terminology with "protocol" and "tables" documents.
2005 o More editorial and typographic corrections.
2007 15.8. Version -00 of draft-ietf-idnabis-rationale
2009 o Rewrote the abstract and introduction, and retuned the title, to
2010 be more consistent with WG work and activities. Changed the file
2011 name to reflect WG naming.
2013 o Removed most of the material that explained, or compared this
2014 approach to, IDNA2003. Some of this material may appear in the
2015 non-WG "IDNA-alternatives" draft if it is ever completed.
2017 o Changed IDNA200X in terminology and references to IDNA2008.
2019 o Added a contextual rule for hyphen to the appendix, adjusted the
2020 rule syntax slightly, and supplied draft regular expression rules.
2022 o Responded to comments produced during the WG charter discussions
2023 and from several individuals. In general, comments requesting a
2024 reorganization of the collection of documents have not been
2025 responded to pending a WG decision on that topic.
2027 o Moved the contextual rule appendix out of here and into
2028 "Protocol". It may not belong there either, but definitely does
2029 not belong here, and was holding up getting this document out.
2031 o Many small editorial improvements, including reorganization of
2032 some material.
2034 Editorial note: While several sections have been removed from this
2035 version, the WG should discuss whether further cuts are desirable,
2036 e.g., do Section 7.3, Section 7.4, or Section 10.3 provide
2037 enough value to be worth retaining? Can Section 10.4 be trimmed
2038 without loss of useful information and, if so, how? Section 10.7
2039 appears critical of IDNA2003 in undesirable ways: should it be
2040 dropped or do people have suggestions about how to improve it?
2041 Strong opinions have been expressed that Section 10.5 should be
2042 trimmed significantly or removed entirely. The WG will need to
2043 discuss that too. Are there other materials that should be trimmed
2044 out?
2046 16. References
2048 16.1. Normative References
2050 [ASCII] American National Standards Institute (formerly United
2051 States of America Standards Institute), "USA Code for
2052 Information Interchange", ANSI X3.4-1968, 1968.
2054 ANSI X3.4-1968 has been replaced by newer versions with
2055 slight modifications, but the 1968 version remains
2056 definitive for the Internet.
2058 [IDNA2008-Bidi]
2059 Alvestrand, H. and C. Karp, "An updated IDNA criterion for
2060 right to left scripts", February 2008, .
2064 New version of this document pending as
2065 draft-ietf-idnabis-bidi-00.
2067 [IDNA2008-Protocol]
2068 Klensin, J., "Internationalizing Domain Names in
2069 Applications (IDNA): Protocol", May 2008, .
2073 [IDNA2008-Tables]
2074 Faltstrom, P., "The Unicode Code Points and IDNA",
2075 April 2008, .
2078 A version of this document is available in HTML format at
2079 http://stupid.domain.name/idnabis/
2080 draft-ietf-idnabis-tables-00.html
2082 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2083 Requirement Levels", BCP 14, RFC 2119, March 1997.
2085 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
2086 Internationalized Strings ("stringprep")", RFC 3454,
2087 December 2002.
2089 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
2090 "Internationalizing Domain Names in Applications (IDNA)",
2091 RFC 3490, March 2003.
2093 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2094 Profile for Internationalized Domain Names (IDN)",
2095 RFC 3491, March 2003.
2097 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
2098 for Internationalized Domain Names in Applications
2099 (IDNA)", RFC 3492, March 2003.
2101 [RulesInit]
2102 Klensin, J., "Internationalizing Domain Names in
2103 Applications (IDNA): Protocol, Appendix A Contextual Rules
2104 Table", May 2008, .
2107 Forthcoming.
2109 [Unicode-PropertyValueAliases]
2110 The Unicode Consortium, "Unicode Character Database:
2111 PropertyValueAliases", March 2008, .
2114 [Unicode-RegEx]
2115 The Unicode Consortium, "Unicode Technical Standard #18:
2116 Unicode Regular Expressions", May 2005,
2117 .
2119 [Unicode-Scripts]
2120 The Unicode Consortium, "Unicode Standard Annex #24:
2121 Unicode Script Property", February 2008,
2122 .
2124 [Unicode51]
2125 The Unicode Consortium, "The Unicode Standard, Version
2126 5.1.0", 2008.
2128 defined by: The Unicode Standard, Version 5.0, Boston, MA,
2129 Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
2130 Unicode 5.1.0
2131 (http://www.unicode.org/versions/Unicode5.1.0/).
2133 16.2. Informative References
2135 [BIG5] Institute for Information Industry of Taiwan, "Computer
2136 Chinese Glyph and Character Code Mapping Table, Technical
2137 Report C-26", 1984.
2139 There are several forms and variations and a closely-
2140 related standard, CNS 11643. See the discussion in
2141 Chapter 3 of Lunde, K., CJKV Information Processing,
2142 O'Reilly & Associates, 1999
2144 [GB18030] "Chinese National Standard GB 18030-2000: Information
2145 Technology -- Chinese ideograms coded character set for
2146 information interchange -- Extension for the basic set.",
2147 2000.
2149 [RFC0810] Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
2150 Internet host table specification", RFC 810, March 1982.
2152 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
2153 STD 13, RFC 1034, November 1987.
2155 [RFC1035] Mockapetris, P., "Domain names - implementation and
2156 specification", STD 13, RFC 1035, November 1987.
2158 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
2159 and Support", STD 3, RFC 1123, October 1989.
2161 [RFC2782] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
2162 specifying the location of services (DNS SRV)", RFC 2782,
2163 February 2000.
2165 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
2166 Engineering Team (JET) Guidelines for Internationalized
2167 Domain Names (IDN) Registration and Administration for
2168 Chinese, Japanese, and Korean", RFC 3743, April 2004.
2170 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
2171 Identifiers (IRIs)", RFC 3987, January 2005.
2173 [RFC4290] Klensin, J., "Suggested Practices for Registration of
2174 Internationalized Domain Names (IDN)", RFC 4290,
2175 December 2005.
2177 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
2178 Recommendations for Internationalized Domain Names
2179 (IDNs)", RFC 4690, September 2006.
2181 [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
2182 "Registration and Administration Recommendations for
2183 Chinese Domain Names", RFC 4713, October 2006.
2185 [Unicode-UTR36]
2186 The Unicode Consortium, "Unicode Technical Report #36:
2187 Unicode Security Considerations", August 2006,
2188 .
2190 Author's Address
2192 John C Klensin
2193 1770 Massachusetts Ave, Ste 322
2194 Cambridge, MA 02140
2195 USA
2197 Phone: +1 617 245 1457
2198 Email: john+ietf@jck.com
2200 Full Copyright Statement
2202 Copyright (C) The IETF Trust (2008).
2204 This document is subject to the rights, licenses and restrictions
2205 contained in BCP 78, and except as set forth therein, the authors
2206 retain all their rights.
2208 This document and the information contained herein are provided on an
2209 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2210 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
2211 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
2212 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2213 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2214 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
2216 Intellectual Property
2218 The IETF takes no position regarding the validity or scope of any
2219 Intellectual Property Rights or other rights that might be claimed to
2220 pertain to the implementation or use of the technology described in
2221 this document or the extent to which any license under such rights
2222 might or might not be available; nor does it represent that it has
2223 made any independent effort to identify any such rights. Information
2224 on the procedures with respect to rights in RFC documents can be
2225 found in BCP 78 and BCP 79.
2227 Copies of IPR disclosures made to the IETF Secretariat and any
2228 assurances of licenses to be made available, or the result of an
2229 attempt made to obtain a general license or permission for the use of
2230 such proprietary rights by implementers or users of this
2231 specification can be obtained from the IETF on-line IPR repository at
2232 http://www.ietf.org/ipr.
2234 The IETF invites any interested party to bring to its attention any
2235 copyrights, patents or patent applications, or other proprietary
2236 rights that may cover technology that may be required to implement
2237 this standard. Please address the information to the IETF at
2238 ietf-ipr@ietf.org.