idnits 2.17.00 (12 Aug 2021) /tmp/idnits6013/draft-klensin-idna-5892upd-unicode70-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 523: '...ated to True for the label, it MUST be...' -- The draft header indicates that this document updates RFC5892, but the abstract doesn't seem to mention this, which it should. -- The abstract seems to indicate that this document updates RFC5982, but the header doesn't have an 'Updates:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year (Using the creation date from RFC5892, updated by this document, for RFC5378 checks: 2008-04-26) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (December 7, 2014) is 2721 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also mentioned in 'RFC5892'. ** Downref: Normative reference to an Informational RFC: RFC 5894 ** Downref: Normative reference to an Informational RFC: RFC 6943 -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Versioning' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Arabic' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Hamza' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7' -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) Summary: 3 errors (**), 0 flaws (~~), 1 warning (==), 14 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Klensin 3 Internet-Draft 4 Updates: 5892, 5894 (if approved) P. Faltstrom 5 Intended status: Standards Track Netnod 6 Expires: June 10, 2015 December 7, 2014 8 IDNA Update for Unicode 7.0.0 9 draft-klensin-idna-5892upd-unicode70-02.txt 11 Abstract 13 The current version of the IDNA specifications anticipated that each 14 new version of Unicode would be reviewed to verify that no changes 15 had been introduced that required adjustments to the set of rules 16 and, in particular, whether new exceptions or backward compatibility 17 adjustments were needed. 
That review was conducted for Unicode 7.0.0 18 and identified a potentially problematic new code point. This 19 specification discusses that code point and associated issues and 20 updates RFC 5892 accordingly. It also applies an editorial 21 clarification that was the subject of an earlier erratum. In 22 addition, the discussion of the specific issue updates RFC 5894. 24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on June 10, 2015. 41 Copyright Notice 43 Copyright (c) 2014 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 59 2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5 60 2.1. IDNA assumptions about Unicode normalization . . . . . 
. 5 61 2.2. New code point U+08A1, decomposition, and language 62 dependency . . . . . . . . . . . . . . . . . . . . . . . 6 63 2.3. Other examples of the same behavior . . . . . . . . . . . 7 64 2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8 65 3. Proposed/ Alternative Changes to RFC 5892 for new character 66 U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 67 3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9 68 3.2. Disallow the combining sequences for these characters . . 10 69 3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11 70 3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11 71 4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11 72 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 73 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 74 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 75 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 76 8.1. Normative References . . . . . . . . . . . . . . . . . . 13 77 8.2. Informative References . . . . . . . . . . . . . . . . . 15 78 Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15 79 A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15 80 A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 15 81 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 83 1. Introduction 85 The current version of the IDNA specifications, known as "IDNA2008" 86 [RFC5890], anticipated that each new version of Unicode would be 87 reviewed to verify that no changes had been introduced that required 88 adjustments to IDNA's rules and, in particular, whether new 89 exceptions or backward compatibility adjustments were needed. 
When 90 that review was carefully conducted for Unicode 7.0.0 [Unicode7], 91 comparing it to prior versions including the text in Unicode 6.2 92 [Unicode62], it identified a problematic new code point (U+08A1, 93 ARABIC LETTER BEH WITH HAMZA ABOVE). The specific problem is 94 discussed in detail in Section 2. The behavior of that code point, 95 while non-optimal for IDNA, follows that of a few code points that 96 predate Unicode 7.x and even the IDNA 2008 specifications and Unicode 97 6.0. Those existing code points make the question of what, if 98 anything, to do about this new one exceedingly problematic because 99 different reasonable criteria yield different decisions, 100 specifically: 102 o To disallow it as an IDNA exception case creates inconsistencies 103 with how those earlier code points were handled. 105 o To disallow it and the similar code points as well would 106 necessitate invalidating some potential labels that would have 107 been valid under IDNA2008 until this time. However, there is 108 reason to believe that no such labels exist. 110 o To permit the new code point to be treated as PVALID creates a 111 situation in which it is possible, within the same script, to 112 compose the same character symbol (glyph) in two different ways 113 that do not compare equal even after normalization. That 114 condition would then apply to it and the earlier code points with 115 the same behavior. That situation contradicts a fundamental 116 assumption of IDNA that is discussed in more detail below. 118 NOTE IN DRAFT: 120 This working draft discusses four alternatives, including, for 121 illustration, a radical idea that seems too drastic to be 122 considered now although it would have been appropriate to discuss 123 when the IDNA2008 specifications were being developed. 
The 124 authors suggest that the community discuss the relevant tradeoffs 125 and make a decision and that the document then be revised to 126 reflect that decision, with the other alternatives discussed as 127 options not chosen. Because there is no ideal choice, the 128 discussion of the issues in Section 2 is probably as important as, or more 129 important than, the particular choice of how to handle this code 130 point. In addition to providing information for this document, 131 that section should be considered as an updating addendum to RFC 132 5894 [RFC5894] and should be incorporated into any future revision 133 of that document. 135 As the result of this version of the document containing several 136 alternate proposals, some of the text is also a little bit 137 redundant. That will be corrected in future versions. 139 As anticipated when IDNA2008, and RFC 5892 in particular, were 140 written, exceptions and explicit updates are likely to be needed only 141 if there is disagreement between the Unicode Consortium's view about 142 what is best for the Standard and the IETF's view of what is best for 143 IDNs, the DNS, and IDNA. It was hoped that a situation would never 144 arise in which the two perspectives would disagree, but the 145 possibility was anticipated and considerable mechanism added to RFC 146 5890 and 5892 as a result. It is probably important to note that a 147 disagreement in this context does not imply that anyone is "wrong", 148 only that the two different groups have different needs and therefore 149 different criteria about what is acceptable. For that reason, the IETF has, in 150 the past, allowed some characters for IDNA that active Unicode 151 Technical Committee members suggested be disallowed to avoid a change 152 in derived tables [RFC6452]. This document describes a case where 153 the IETF should disallow a character or characters that the various 154 properties would otherwise treat as PVALID. 
156 This document provides the "flagging for the IESG" specified by 157 Section 5.1 of RFC 5892. As specified there, the change itself 158 requires IETF review because it alters the rules of Section 2 of that 159 document. 161 Readers of this document are expected to be familiar with Unicode 162 terminology [Unicode62] and the IETF conventions for representing 163 Unicode code points [RFC5137]. 165 As a convenience to readers of RFC 5892 and to reduce the risks of 166 confusion, this document also formally applies the content of an 167 erratum to the text of the RFC (see Section 4) and so brings that RFC 168 up to date with all agreed changes. 170 [[RFC Editor: please remove the following comment and note if they 171 get to you.]] 173 [[IESG: It might not be a bad idea to incorporate some version of 174 the following into the Last Call announcement.]] 176 NOTE IN DRAFT to IETF Reviewers: The issues in this document, and 177 particularly the choices among options for either adding exception 178 cases to RFC 5892 or ignoring the issue, warning people, and 179 hoping the results do not include serious problems, are fairly 180 esoteric. Understanding them requires that one have at least some 181 understanding of how the Arabic Script works and the reasons the 182 Unicode Standard gives various Arabic Script characters a fairly 183 extended discussion [Unicode62-Arabic]. It also requires 184 understanding of a number of Unicode principles, including the 185 Normalization Stability rules [UAX15-Versioning] as applied to new 186 precomposed characters and guidelines for adding new characters. 
187 There is considerable discussion of the issues in Section 2 and 188 references are provided for those who want to pursue them, but 189 potential reviewers should assume that the background needed to 190 understand the reasons for this change is no less deep in the 191 subject matter than would be expected of someone reviewing a 192 proposed change in, e.g., the fundamentals of BGP, TCP congestion 193 control, or some cryptographic algorithm. Put more bluntly, one's 194 ability to read or speak languages other than English, or even one 195 or more languages that use the Arabic script, does not make one an 196 expert in these matters. 198 2. Problem Description 200 2.1. IDNA assumptions about Unicode normalization 202 IDNA makes several assumptions about Unicode, Unicode "characters", 203 and the effects of normalization. Those assumptions were based on 204 careful reading of the Unicode Standard at the time [Unicode5], 205 guided by advice and commitments by members of the Unicode Technical 206 Committee. Those assumptions, and the associated requirements, are 207 necessitated by three properties of DNS labels that do not apply to 208 blocks of running text: 210 1. There is no language context for a label. While particular DNS 211 zones may impose restrictions, including language or script 212 restrictions, on what labels can be registered, neither the DNS 213 nor IDNA impose either type of restriction or give the user of a 214 label any indication about the registration or other restrictions 215 that may have been imposed. 217 2. Labels are often mnemonics rather than words in any language. 218 They may be abbreviations or acronyms or contain embedded digits 219 and have other characteristics that are not typical of words. 221 3. Labels are, in practice, usually short. Even when they are the 222 maximum length allowed by the DNS and IDNA, they are typically 223 too short to provide significant context. 
Statements that 224 suggest that languages can almost always be determined from 225 relatively short paragraphs or equivalent bodies of text do not 226 apply to DNS labels because of their typically short length and 227 because, as noted above, they are not required to be formed 228 according to language-based rules. 230 At the same time, because the DNS is an exact-match system, there 231 must be no ambiguity about whether two labels are equal. Although 232 there have been extensive discussions about "confusingly similar" 233 characters, labels, and strings, such tests between scripts are 234 always somewhat subjective: they are affected by choices of type 235 styles and by what the user expects to see. In spite of the fact 236 that the glyphs that represent many characters in different scripts 237 are identical in appearance (e.g., basic Latin "a" (U+0061) and the 238 identical-appearing Cyrillic character (U+0430)), the most important 239 test is that, if two glyphs are the same within a given script, they 240 must represent the same character no matter how they are formed. 242 Unicode normalization, as explained in [UAX15], is expected to 243 resolve those "same script, same glyph, different formation methods" 244 issues. Within the Latin script, the code point sequence for lower 245 case "o" (U+006F) and combining diaeresis (U+0308) will, when 246 normalized using the "NFC" method required by IDNA, produce the 247 precombined small letter o with diaeresis (U+00F6) and hence the two 248 ways of forming the character will compare equal (and the combining 249 sequence is effectively prohibited from U-labels). 
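The Latin-script case just described can be checked with a short sketch. It is illustrative only and not part of the specification: it assumes Python 3 and its standard unicodedata module, and the names combining_sequence, precomposed, and nfc_equal are hypothetical, chosen for this example.

```python
import unicodedata

# Combining sequence: LATIN SMALL LETTER O (U+006F) followed by
# COMBINING DIAERESIS (U+0308).
combining_sequence = "\u006F\u0308"

# Precomposed form: LATIN SMALL LETTER O WITH DIAERESIS (U+00F6).
precomposed = "\u00F6"


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw code point sequences differ ...
print(combining_sequence == precomposed)
# ... but the two forms compare equal once both are put into NFC.
print(nfc_equal(combining_sequence, precomposed))
```

Comparing after normalization, rather than comparing raw code points, is the behavior IDNA relies on when it requires U-labels to be in NFC.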
 251 NFC was preferred over other normalization methods for IDNA because 252 it is more compact, more likely to be produced on keyboards on which 253 the relevant characters actually appeared, and because it does not 254 lose substantive information (e.g., some types of compatibility 255 equivalence involve judgment calls as to whether two characters are 256 actually the same -- they may be "the same" in some contexts but not 257 others -- while canonical equivalence is about different ways to 258 produce the glyph for the same abstract character). 260 IDNA also assumed that the extensive Unicode stability rules would be 261 applied and work as specified when new code points were added. Those 262 rules, as described in The Unicode Standard and the normative annexes 263 identified below, provide that: 265 1. New code points representing precombined characters that can be 266 formed from combining sequences will not be added to Unicode 267 unless neither the relevant base character nor the required combining 268 character is part of the Standard within the relevant script 269 [UAX15-Versioning]. 271 2. If circumstances require that principle be violated, 272 normalization stability requires that the newly-added character 273 decompose (even under NFC) to the previously-available combining 274 sequence [UAX15-Exclusion]. 276 There is no explicit provision in the Standard's discussion of 277 conditions for adding new code points, nor of normalization 278 stability, for an exception based on different languages using the 279 same script. 281 2.2. New code point U+08A1, decomposition, and language dependency 283 Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH 284 WITH HAMZA ABOVE. As can be deduced from the name, it is visually 285 identical to the glyph that can be formed from a combining sequence 286 consisting of the code point for ARABIC LETTER BEH (U+0628) and the 287 code point for Combining Hamza Above (U+0654). 
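The mismatch between the precomposed code point and the combining sequence can be observed with a minimal sketch. It assumes Python 3 and its standard unicodedata module (U+08A1 itself is assigned only when the bundled character database is Unicode 7.0.0 or later). The constant and helper names are hypothetical, and ARABIC LETTER ALEF WITH HAMZA ABOVE (U+0623) is included purely as a contrasting case that does carry a canonical decomposition.

```python
import unicodedata

BEH_WITH_HAMZA_ABOVE = "\u08A1"        # precomposed, added in Unicode 7.0.0
BEH_THEN_HAMZA_ABOVE = "\u0628\u0654"  # ARABIC LETTER BEH + HAMZA ABOVE

ALEF_WITH_HAMZA_ABOVE = "\u0623"       # contrast: canonically decomposable
ALEF_THEN_HAMZA_ABOVE = "\u0627\u0654"


def has_canonical_decomposition(ch: str) -> bool:
    """True if the character database lists an untagged (canonical) mapping."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; canonical ones do not.
    return bool(mapping) and not mapping.startswith("<")


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+08A1 has no decomposition, so NFC leaves the two spellings distinct.
print(has_canonical_decomposition(BEH_WITH_HAMZA_ABOVE))
print(nfc_equal(BEH_WITH_HAMZA_ABOVE, BEH_THEN_HAMZA_ABOVE))

# U+0623 decomposes canonically, so NFC makes its two spellings equal.
print(has_canonical_decomposition(ALEF_WITH_HAMZA_ABOVE))
print(nfc_equal(ALEF_WITH_HAMZA_ABOVE, ALEF_THEN_HAMZA_ABOVE))
```

Two labels that differ only in which BEH WITH HAMZA ABOVE spelling they use would therefore remain distinct after the NFC step, while the corresponding ALEF pair would not.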
The two rules 288 summarized above suggest that either the new code point should not be 289 allocated at all or that it should have a decomposition to 290 \u'0628'\u'0654'. 292 Had the issues outlined in this document been better understood at 293 the time, it probably would have been wise for RFC 5892 to disallow 294 either the precomposed character or the combining sequence of each 295 pair in those cases in which Unicode normalization rules do not cause 296 the right thing to happen, i.e., the combining sequence and 297 precomposed character to be treated as equivalent. Failure to do so 298 at the time places an extra burden on registries to be sure that 299 conflicts (and the potential for confusion and attacks) do not exist. 300 Oddly, had the exclusion been made part of the specification at that 301 time, the preference for precombined forms noted above would probably 302 have dictated excluding the combining sequence, something not 303 otherwise done in IDNA2008 because the NFC requirement serves the 304 same purpose. Today, the only thing that can be excluded without the 305 potential disruption of disallowing a previously-PVALID combining 306 sequence is the newly-added code point, so whatever is 307 done, or might have been contemplated with hindsight, will be 308 somewhat inconsistent. 310 2.3. Other examples of the same behavior 312 One of the things that complicates the issue with the new U+08A1 code 313 point is that there are several other Arabic-script code points that 314 behave in the same way for similar language-specific reasons. 316 In particular, at least three other grapheme clusters that have been 317 present for many versions of Unicode can be seen as involving issues 318 similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA 319 ABOVE. 
ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER 320 REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are 321 preferred over combining sequences using HAMZA ABOVE (U+0654) 322 [Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE 323 (U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH 324 HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the 325 precomposed character and combining sequences compare equal when both 326 are normalized, as this specification prefers. 328 There are other variations in which a precomposed character involving 329 HAMZA ABOVE has a decomposition to a combining sequence that can form 330 it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a 331 compatibility decomposition into the combining sequence 332 \u'06C7'\u'0674'. 334 2.4. Hamza and Combining Sequences 336 As the Unicode Standard points out at some length [Unicode62-Arabic], 337 Hamza is a problematic abstract character and the "Hamza Above" 338 construction even more so [Unicode62-Hamza]. Those sections explain 339 a distinction made by Unicode between the use of a Hamza mark to 340 denote a glottal stop and one used as a diacritic mark to denote a 341 separate letter. In the first case, the combining sequence is used. 342 In the second, a precombined character is assigned. 344 Unlike Unicode generally and because of concerns about identifier 345 spoofing and attacks based on similarities, character distinctions in 346 IDNA are based much more strictly on the appearance of characters; 347 language and pronunciation distinctions within a script are not 348 considered. 
So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- 349 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them 350 is written as U+08A1 (new to Unicode 7.0.0) and the other as the 351 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also 352 available in versions of Unicode going back at least to the version 353 [Unicode32] used in the original version of IDNA [RFC3490]). Because 354 the precomposed form and combining sequence are, for IDNA purposes, 355 the same, IDNA expects that normalization (specifically the 356 requirement that all U-labels be in NFC form) will cause them to 357 compare equal. 359 If Unicode also considered them the same, then the principle would 360 apply that new precomposed ("composition") forms are not added unless 361 one of the code points that could be used to construct it did not 362 exist in an earlier version (and even then is 363 discouraged) [UAX15-Versioning]. When exceptions are made, they are 364 expected to conform to the rules and classes in the "Composition 365 Exclusion Table", with class 2 being relevant to this case 366 [UAX15-Exclusion]. That rule essentially requires that the 367 normalization for the old combining sequence to itself be retained 368 (for stability) but that the newly-added character be treated as 369 canonically decomposable and decompose back to the older sequence 370 even under NFC. That was not done for this particular case, 371 presumably because of the distinction about pronunciation modifiers 372 versus separate letters noted above. Because, for IDNA and the DNS, 373 there is a possibility that the composing sequence \u'0628'\u'0654' 374 already appears in labels, the only choice other than allowing an 375 otherwise-identical, and identically-appearing, label with U+08A1 376 substituted to identify a different DNS entry is to DISALLOW the new 377 character. 379 3. 
Proposed/ Alternative Changes to RFC 5892 for new character U+08A1 381 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and 382 the first paragraph of each Subsection below for the status of the 383 Subsections that follow. Each one, in combination with the material 384 in Section 2 above, also provides information about the reasons why 385 that particular strategy is appropriate. 387 3.1. Disallow This New Code Point 389 If chosen by the community, this subsection would update the portion 390 of the IDNA2008 specification that identifies rules for what 391 characters are permitted [RFC5892] to disallow that code point. 393 With the publication of this document, Section 2.6 ("Exceptions (F)") 394 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in 395 Category F so that the rule itself reads: 397 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 398 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 399 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 400 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, 401 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 402 303B, 30FB} 404 and then add to the subtable designated 405 "DISALLOWED -- Would otherwise have been PVALID" 406 after the line that begins "07FA", the additional line: 408 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 410 This has the effect of making the cited code point DISALLOWED 411 independent of application of the rest of the IDNA rule set to the 412 current version of Unicode. Those wishing to create domain name 413 labels containing Beh with Hamza Above may continue to use the 414 sequence 416 U+0628, ARABIC LETTER BEH 417 followed by 419 U+0654, ARABIC HAMZA ABOVE 421 which was valid for IDNA purposes in Unicode 5.0 and earlier and 422 which continues to be valid. 424 In principle, much the same thing could be accomplished by using the 425 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892 426 Section 5.3). 
However, that category is described as applying only 427 when "property values in versions of Unicode after 5.2 have changed 428 in such a way that the derived property value would no longer be 429 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in 430 Unicode 7.0.0 and no property values of code points in prior versions 431 have changed, category G does not apply. If that section of RFC 5892 432 were to be replaced in the future, perhaps consideration should be 433 given to adding Normalization Stability and other issues to that 434 description but, at present, it is not relevant. 436 3.2. Disallow the combining sequences for these characters 438 If chosen by the community, this subsection would update the portion 439 of the IDNA2008 specification that identifies contextual rules 440 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction 441 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that 442 the choice of this option is consistent with the general preference 443 for precomposed characters discussed above but would ban some labels 444 that are valid today and that might, in principle, be in use. 446 The required prohibition could be imposed by creating a new 447 contextual rule in RFC 5892 to constrain combining sequences 448 containing Hamza Above. 450 As the Unicode Standard points out at some length [Unicode62-Arabic], 451 Hamza is a problematic abstract character and the "Hamza Above" 452 construction even more so. IDNA has historically associated 453 characters whose use is reasonable in some contexts but not others 454 with the special derived property "CONTEXTO" and then specified 455 specific, context-dependent, rules about where they may be used. 456 Because Hamza Above is problematic (and spawns edge cases, as 457 discussed in the Unicode Standard section cited above), it was 458 suggested that a contextual rule might be appropriate. 
There are at 459 least two reasons why a contextual rule would not be suitable for the 460 present situation. 462 1. As discussed above, the present situation is a normalization 463 stability and predictability problem, not a contextual one. Had 464 the same issues arisen with a newly-added precomposed character 465 that could previously be constructed from non-problematic base 466 and combining characters, it would be even more clearly a 467 normalization issue and, following the principles discussed there 468 and particularly in UAX 15 [UAX15-Exclusion], might not have been 469 assigned at all. 471 2. The contextual rule sets are designed around restricting the use 472 of code points to a particular script or adjacent to particular 473 characters within that script. Neither of these cases applies to 474 the newly-added character even if one could imagine rules for the 475 use of Hamza Above (U+0654) that would reflect the considerations 476 of Chapter 8 of Unicode 6.2. Even had the latter been desired, 477 it would be somewhat late now -- Hamza Above has been present as 478 a combining character (U+0654) in many versions of Unicode. 479 While that section of the Unicode Standard describes the issues, 480 it does not provide actionable guidance about what to do about it 481 for cases going forward or when visual identity is important. 483 3.3. Do Nothing Other Than Warn 485 The recommendation from UTC is to simply warn registries, at all 486 levels of the tree, to be careful with this set of characters, making 487 language distinctions within zones. Because the DNS cannot make or 488 enforce language distinctions, this suggestion is problematic but it 489 would avoid having the IETF either invalidating label strings that 490 are potentially now in use or creating inconsistencies among the 491 characters that combine with Hamza Above but that also have 492 precomposed forms that do not have decompositions. 
The potential
493     would still exist for registries to respect the warning and deprecate
494     such labels if they existed.

496  3.4. Normalization Form IETF (or DNS)

498     The most radical possibility would be to decide that none of the
499     Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate
500     for use with the DNS because, contrary to their apparent
501     descriptions, normalization tables are actually determined using
502     language information. However, use of language information is
503     unacceptable for IDNA for reasons described elsewhere in this
504     document. The remedy would be to define an IETF-specific (or DNS-
505     specific) normalization form, building on NFC but adhering strictly
506     to the rule that normalization causes two different forms of the same
507     character (glyph image) within the same script to be treated as
508     equal. In practice such a form would be implemented for IDNA
509     purposes as an additional rule within RFC 5892 (and its successors)
510     that constituted an exception list for the NFC tables. For this set
511     of characters, the special IETF normalization form would be
512     equivalent to the exclusion discussed in Section 3.2 above.

514  4. Editorial clarification to RFC 5892

516     Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
517     clarification to Appendix A and Section A.1 of RFC 5892. This
518     section of this document updates the RFC to apply that clarification.

520     1. In Appendix A, add a new paragraph after the paragraph that
521        begins "The code point...". The new paragraph should read:

523        "For the rule to be evaluated to True for the label, it MUST be
524        evaluated separately for every occurrence of the Code point in
525        the label; each of those evaluations must result in True."

527     2. In Appendix A, Section A.1, replace the "Rule Set" by

529           Rule Set:
530             False;
531             If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
532             If cp .eq. \u200C And
533                RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
534                (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

536  5. Acknowledgements

538     The Unicode 7.0.0 changes were extensively discussed within the IAB's
539     Internationalization Program. The authors are grateful for the
540     discussions and feedback there, especially from Andrew Sullivan and
541     David Thaler. Additional information was requested and received from
542     Mark Davis and Ken Whistler; while they probably do not agree with
543     the necessity of excluding this code point or taking even more
544     drastic action, as their responsibility is to look at the Unicode
545     Consortium requirements for stability, the decision would not have
546     been possible without their input. Thanks to Bill McQuillan for
547     reading the document carefully enough to identify and report a
548     confusing typographical error. Several experts and reviewers who
549     prefer to remain anonymous also provided helpful input and comments
550     on preliminary versions of this document.

552  6. IANA Considerations

554     When the IANA registry and tables are updated to reflect Unicode
555     7.0.0, changes should be made according to the decisions the IETF
556     makes about Section 3.

558  7. Security Considerations

560     [[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of
561     this document relative to what appeared in -00. It will need to be
562     rewritten once decisions are made about what path to follow. In
563     particular, if "just warn" is chosen, it will need to contain very
564     strong warnings.]]

566     This specification excludes a code point for which the Unicode-
567     specified normalization behavior could result in two ways to form a
568     visually-identical character within the same script not comparing
569     equal. That behavior could create a dream case for someone intending
570     to confuse the user by use of a domain name that looked identical to
571     another one, was entirely in the same script, but was still
572     considered different (see, for example, the discussion of false
573     negatives in identifier comparison in Section 2.1 of RFC 6943
574     [RFC6943]). This exclusion therefore should improve Internet
575     security.

577  8. References

579  8.1. Normative References

581     [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
582                137, RFC 5137, February 2008.

584     [RFC5890]  Klensin, J., "Internationalized Domain Names for
585                Applications (IDNA): Definitions and Document Framework",
586                RFC 5890, August 2010.

588     [RFC5892]  Faltstrom, P., "The Unicode Code Points and
589                Internationalized Domain Names for Applications (IDNA)",
590                RFC 5892, August 2010.

592     [RFC5892Erratum]
593                "RFC5892, "The Unicode Code Points and Internationalized
594                Domain Names for Applications (IDNA)", August 2010, Errata
595                ID: 3312", Errata ID 3312, August 2012,
596                .

598     [RFC5894]  Klensin, J., "Internationalized Domain Names for
599                Applications (IDNA): Background, Explanation, and
600                Rationale", RFC 5894, August 2010.

602     [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
603                Purposes", RFC 6943, May 2013.

605     [UAX15]    Davis, M., Ed., "Unicode Standard Annex #15: Unicode
606                Normalization Forms", June 2014,
607                .

609     [UAX15-Exclusion]
610                "Unicode Standard Annex #15: op. cit., Section 5",
611                .

614     [UAX15-Versioning]
615                "Unicode Standard Annex #15, op. cit., Section 3",
616                .

618     [Unicode5]
619                The Unicode Consortium, "The Unicode Standard, Version
620                5.0", ISBN 0-321-48091-0, 2007.

622                Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
623                This printed reference has now been updated online to
624                reflect additional code points. For code points, the
625                reference at the time RFC 5890-5894 were published is to
626                Unicode 5.2.

628     [Unicode62]
629                The Unicode Consortium, "The Unicode Standard, Version
630                6.2.0", ISBN 978-1-936213-07-8, 2012,
631                .

633                Preferred citation: The Unicode Consortium. The Unicode
634                Standard, Version 6.2.0, (Mountain View, CA: The Unicode
635                Consortium, 2012. ISBN 978-1-936213-07-8)

637     [Unicode62-Arabic]
638                "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
639                Chapter 8, 2012,
640                .

642                Subsection titled "Encoding Principles", paragraph
643                numbered 4, starting on page 251.

645     [Unicode62-Hamza]
646                "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
647                Chapter 8, 2012,
648                .

650                Subsection titled "Combining Hamza Above" starting on page
651                263.

653     [Unicode7]
654                The Unicode Consortium, "The Unicode Standard, Version
655                7.0.0", ISBN 978-1-936213-09-2, 2014,
656                .

658                Preferred Citation: The Unicode Consortium. The Unicode
659                Standard, Version 7.0.0, (Mountain View, CA: The Unicode
660                Consortium, 2014. ISBN 978-1-936213-09-2)

662  8.2. Informative References

664     [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
665                "Internationalizing Domain Names in Applications (IDNA)",
666                RFC 3490, March 2003.

668     [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
669                Internationalized Domain Names for Applications (IDNA) -
670                Unicode 6.0", RFC 6452, November 2011.

672     [Unicode32]
673                The Unicode Consortium, "The Unicode Standard, Version
674                3.2.0", .

676                The Unicode Standard, Version 3.2.0 is defined by The
677                Unicode Standard, Version 3.0 (Reading, MA, Addison-
678                Wesley, 2000. ISBN 0-201-61633-5), as amended by the
679                Unicode Standard Annex #27: Unicode 3.1
680                (http://www.unicode.org/reports/tr27/) and by the Unicode
681                Standard Annex #28: Unicode 3.2
682                (http://www.unicode.org/reports/tr28/).

684  Appendix A. Change Log

686     RFC Editor: Please remove this appendix before publication.

688  A.1. Changes from version -00 to -01

690     o  Version 01 of this document is an extensive rewrite and
691        reorganization, reflecting discussions with UTC members and adding
692        three more options for discussion to the original proposal to
693        simply disallow the new code point.

695  A.2. Changes from version -01 to -02

697     Corrected a typographical error in which Hamza Above was incorrectly
698     listed with the wrong code point.

700  Authors' Addresses

702     John C Klensin
703     1770 Massachusetts Ave, Ste 322
704     Cambridge, MA 02140
705     USA

707     Phone: +1 617 245 1457
708     Email: john-ietf@jck.com

709     Patrik Faltstrom
710     Netnod
711     Franzengatan 5
712     Stockholm 112 51
713     Sweden

715     Phone: +46 70 6059051
716     Email: paf@netnod.se
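
Illustrative note (not part of the specification): the replacement
Rule Set given in Section 4, together with the requirement that the
rule be evaluated separately for every occurrence of the code point,
might be sketched in Python roughly as follows. This is a minimal
sketch under stated assumptions. The function names and the
JOINING_TYPE table are hypothetical and exist only for this example;
the table is a hand-written subset covering four Arabic code points,
since the Python standard library does not expose Joining_Type.
Canonical_Combining_Class is taken from unicodedata.combining(), in
which the Virama class has the numeric value 9.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # numeric Canonical_Combining_Class value named Virama

# Assumption: a tiny hand-written Joining_Type table for illustration only.
# A real implementation would load the full property from the Unicode data
# files. Anything not listed is treated as "U" (Non_Joining).
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u0628": "D",  # ARABIC LETTER BEH
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u064e": "T",  # ARABIC FATHA
}


def joining_type(ch):
    """Return the Joining_Type of ch according to the sketch table."""
    return JOINING_TYPE.get(ch, "U")


def zwnj_rule_at(label, i):
    """Evaluate the Rule Set for the single ZWNJ occurrence at index i."""
    if label[i] != ZWNJ:
        raise ValueError("index does not point at U+200C")

    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True

    # (Joining_Type:{L,D})(Joining_Type:T)* immediately before cp
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False

    # (Joining_Type:T)*(Joining_Type:{R,D}) immediately after cp
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def zwnj_rule_for_label(label):
    """True only if the rule holds at every occurrence of ZWNJ in label.

    A label without ZWNJ yields True vacuously; the rule is simply not
    consulted for such labels.
    """
    return all(
        zwnj_rule_at(label, i) for i, ch in enumerate(label) if ch == ZWNJ
    )
```

Under this sketch, a label such as BEH, ZWNJ, ZWNJ, BEH evaluates to
False: the first ZWNJ is followed by another ZWNJ, which the sketch
table treats as non-joining, so that occurrence fails even though the
label as a whole contains joining characters on both sides. This is
the kind of case that the per-occurrence requirement makes explicit.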