idnits 2.17.00 (12 Aug 2021) /tmp/idnits3804/draft-klensin-idna-5892upd-unicode70-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 524: '...ated to True for the label, it MUST be...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year (Using the creation date from RFC5892, updated by this document, for RFC5378 checks: 2008-04-26) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (January 6, 2015) is 2691 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also mentioned in 'RFC5892'. ** Downref: Normative reference to an Informational RFC: RFC 5894 ** Downref: Normative reference to an Informational RFC: RFC 6943 -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Versioning' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Arabic' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Hamza' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7' -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) Summary: 3 errors (**), 0 flaws (~~), 1 warning (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Klensin 3 Internet-Draft 4 Updates: 5892, 5894 (if approved) P. Faltstrom 5 Intended status: Standards Track Netnod 6 Expires: July 10, 2015 January 6, 2015 8 IDNA Update for Unicode 7.0.0 9 draft-klensin-idna-5892upd-unicode70-03.txt 11 Abstract 13 The current version of the IDNA specifications anticipated that each 14 new version of Unicode would be reviewed to verify that no changes 15 had been introduced that required adjustments to the set of rules 16 and, in particular, whether new exceptions or backward compatibility 17 adjustments were needed. 
That review was conducted for Unicode 7.0.0 18 and identified a potentially problematic new code point. This 19 specification discusses that code point and associated issues and 20 updates RFC 5892 accordingly. It also applies an editorial 21 clarification that was the subject of an earlier erratum. In 22 addition, the discussion of the specific issue updates RFC 5894. 24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on July 10, 2015. 41 Copyright Notice 43 Copyright (c) 2015 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 59 2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5 60 2.1. IDNA assumptions about Unicode normalization . . . . . 
. 5 61 2.2. New code point U+08A1, decomposition, and language 62 dependency . . . . . . . . . . . . . . . . . . . . . . . 6 63 2.3. Other examples of the same behavior . . . . . . . . . . . 7 64 2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8 65 3. Proposed/ Alternative Changes to RFC 5892 for new character 66 U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 67 3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9 68 3.2. Disallow the combining sequences for these characters . . 10 69 3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11 70 3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11 71 4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11 72 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 73 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 74 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 75 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 76 8.1. Normative References . . . . . . . . . . . . . . . . . . 13 77 8.2. Informative References . . . . . . . . . . . . . . . . . 15 78 Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15 79 A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15 80 A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 15 81 A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 15 82 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 84 1. Introduction 86 The current version of the IDNA specifications, known as "IDNA2008" 87 [RFC5890], anticipated that each new version of Unicode would be 88 reviewed to verify that no changes had been introduced that required 89 adjustments to IDNA's rules and, in particular, whether new 90 exceptions or backward compatibility adjustments were needed. 
When 91 that review was carefully conducted for Unicode 7.0.0 [Unicode7], 92 comparing it to prior versions including the text in Unicode 6.2 93 [Unicode62], it identified a problematic new code point (U+08A1, 94 ARABIC LETTER BEH WITH HAMZA ABOVE). The specific problem is 95 discussed in detail in Section 2. The behavior of that code point, 96 while non-optimal for IDNA, follows that of a few code points that 97 predate Unicode 7.x and even the IDNA 2008 specifications and Unicode 98 6.0. Those existing code points make the question of what, if 99 anything, to do about this new one exceedingly problematic because 100 different reasonable criteria yield different decisions, 101 specifically: 103 o To disallow it as an IDNA exception case creates inconsistencies 104 with how those earlier code points were handled. 106 o To disallow it and the similar code points as well would 107 necessitate invalidating some potential labels that would have 108 been valid under IDNA2008 until this time. However, there is 109 reason to believe that no such labels exist. 111 o To permit the new code point to be treated as PVALID creates a 112 situation in which it is possible, within the same script, to 113 compose the same character symbol (glyph) in two different ways 114 that do not compare equal even after normalization. That 115 condition would then apply to it and the earlier code points with 116 the same behavior. That situation contradicts a fundamental 117 assumption of IDNA that is discussed in more detail below. 119 NOTE IN DRAFT: 121 This working draft discusses four alternatives, including, for 122 illustration, a radical idea that seems too drastic to be 123 considered now although it would have been appropriate to discuss 124 when the IDNA2008 specifications were being developed. 
The 125 authors suggest that the community discuss the relevant tradeoffs 126 and make a decision and that the document then be revised to 127 reflect that decision, with the other alternatives discussed as 128 options not chosen. Because there is no ideal choice, the 129 discussion of the issues in Section 2 is probably as important as, or more 130 important than, the particular choice of how to handle this code 131 point. In addition to providing information for this document, 132 that section should be considered as an updating addendum to RFC 133 5894 [RFC5894] and should be incorporated into any future revision 134 of that document. 136 Because this version of the document contains several 137 alternative proposals, some of the text is also somewhat 138 redundant. That will be corrected in future versions. 140 As anticipated when IDNA2008, and RFC 5892 in particular, were 141 written, exceptions and explicit updates are likely to be needed only 142 if there is disagreement between the Unicode Consortium's view about 143 what is best for the Standard and the IETF's view of what is best for 144 IDNs, the DNS, and IDNA. It was hoped that a situation would never 145 arise in which the two perspectives would disagree, but the 146 possibility was anticipated and considerable mechanism added to RFC 147 5890 and 5892 as a result. It is probably important to note that a 148 disagreement in this context does not imply that anyone is "wrong", 149 only that the two different groups have different needs and therefore 150 different criteria about what is acceptable. For that reason, the IETF has, in 151 the past, allowed some characters for IDNA that active Unicode 152 Technical Committee members suggested be disallowed to avoid a change 153 in derived tables [RFC6452]. This document describes a case where 154 the IETF should disallow a character or characters that the various 155 properties would otherwise treat as PVALID. 
157 This document provides the "flagging for the IESG" specified by 158 Section 5.1 of RFC 5892. As specified there, the change itself 159 requires IETF review because it alters the rules of Section 2 of that 160 document. 162 Readers of this document are expected to be familiar with Unicode 163 terminology [Unicode62] and the IETF conventions for representing 164 Unicode code points [RFC5137]. 166 As a convenience to readers of RFC 5892 and to reduce the risks of 167 confusion, this document also formally applies the content of an 168 erratum to the text of the RFC (see Section 4) and so brings that RFC 169 up to date with all agreed changes. 171 [[RFC Editor: please remove the following comment and note if they 172 get to you.]] 174 [[IESG: It might not be a bad idea to incorporate some version of 175 the following into the Last Call announcement.]] 177 NOTE IN DRAFT to IETF Reviewers: The issues in this document, and 178 particularly the choices among options for either adding exception 179 cases to RFC 5892 or ignoring the issue, warning people, and 180 hoping the results do not include serious problems, are fairly 181 esoteric. Understanding them requires that one have at least some 182 understanding of how the Arabic Script works and the reasons the 183 Unicode Standard gives various Arabic Script characters a fairly 184 extended discussion [Unicode62-Arabic]. It also requires 185 understanding of a number of Unicode principles, including the 186 Normalization Stability rules [UAX15-Versioning] as applied to new 187 precomposed characters and guidelines for adding new characters. 
188 There is considerable discussion of the issues in Section 2 and 189 references are provided for those who want to pursue them, but 190 potential reviewers should assume that the background needed to 191 understand the reasons for this change is no less deep in the 192 subject matter than would be expected of someone reviewing a 193 proposed change in, e.g., the fundamentals of BGP, TCP congestion 194 control, or some cryptographic algorithm. Put more bluntly, one's 195 ability to read or speak languages other than English, or even one 196 or more languages that use the Arabic script, does not make one an 197 expert in these matters. 199 2. Problem Description 201 2.1. IDNA assumptions about Unicode normalization 203 IDNA makes several assumptions about Unicode, Unicode "characters", 204 and the effects of normalization. Those assumptions were based on 205 careful reading of the Unicode Standard at the time [Unicode5], 206 guided by advice and commitments by members of the Unicode Technical 207 Committee. Those assumptions, and the associated requirements, are 208 necessitated by three properties of DNS labels that do not apply to 209 blocks of running text: 211 1. There is no language context for a label. While particular DNS 212 zones may impose restrictions, including language or script 213 restrictions, on what labels can be registered, neither the DNS 214 nor IDNA impose either type of restriction or give the user of a 215 label any indication about the registration or other restrictions 216 that may have been imposed. 218 2. Labels are often mnemonics rather than words in any language. 219 They may be abbreviations or acronyms or contain embedded digits 220 and have other characteristics that are not typical of words. 222 3. Labels are, in practice, usually short. Even when they are the 223 maximum length allowed by the DNS and IDNA, they are typically 224 too short to provide significant context. 
Statements that 225 suggest that languages can almost always be determined from 226 relatively short paragraphs or equivalent bodies of text do not 227 apply to DNS labels because of their typically short length and 228 because, as noted above, they are not required to be formed 229 according to language-based rules. 231 At the same time, because the DNS is an exact-match system, there 232 must be no ambiguity about whether two labels are equal. Although 233 there have been extensive discussions about "confusingly similar" 234 characters, labels, and strings, such tests between scripts are 235 always somewhat subjective: they are affected by choices of type 236 styles and by what the user expects to see. In spite of the fact 237 that the glyphs that represent many characters in different scripts 238 are identical in appearance (e.g., basic Latin "a" (U+0061) and the 239 identical-appearing Cyrillic character (U+0430)), the most important 240 test is that, if two glyphs are the same within a given script, they 241 must represent the same character no matter how they are formed. 243 Unicode normalization, as explained in [UAX15], is expected to 244 resolve those "same script, same glyph, different formation methods" 245 issues. Within the Latin script, the code point sequence for lower 246 case "o" (U+006F) and combining diaeresis (U+0308) will, when 247 normalized using the "NFC" method required by IDNA, produce the 248 precombined small letter o with diaeresis (U+00F6) and hence the two 249 ways of forming the character will compare equal (and the combining 250 sequence is effectively prohibited from U-labels). 
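The equality expectation described above can be checked mechanically. The following is a minimal sketch in Python, not part of this specification, that uses the standard-library unicodedata module. The helper name nfc_equal and the variable names are hypothetical and introduced only for illustration. The Arabic pair is included for contrast with the Latin pair; U+08A1 is assigned only when the interpreter's Unicode data is at version 7.0.0 or later, which is assumed here.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin: "o" (U+006F) followed by COMBINING DIAERESIS (U+0308),
# compared with the precomposed letter U+00F6.
latin_sequence = "\u006F\u0308"
latin_precomposed = "\u00F6"

# Arabic: BEH (U+0628) followed by HAMZA ABOVE (U+0654), compared with
# U+08A1, which has no canonical decomposition (see Section 2.2).
arabic_sequence = "\u0628\u0654"
arabic_precomposed = "\u08A1"

if __name__ == "__main__":
    print("Latin pair equal after NFC: ",
          nfc_equal(latin_sequence, latin_precomposed))
    print("Arabic pair equal after NFC:",
          nfc_equal(arabic_sequence, arabic_precomposed))
```

Under the behavior described in this document, the Latin pair is expected to compare equal after NFC and the Arabic pair is not, which is the condition that Section 2.2 examines.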
252 NFC was preferred over other normalization methods for IDNA because 253 it is more compact, is more likely to be produced on keyboards on which 254 the relevant characters actually appeared, and does not 255 lose substantive information (e.g., some types of compatibility 256 equivalence involve judgment calls as to whether two characters are 257 actually the same -- they may be "the same" in some contexts but not 258 others -- while canonical equivalence is about different ways to 259 produce the glyph for the same abstract character). 261 IDNA also assumed that the extensive Unicode stability rules would be 262 applied and work as specified when new code points were added. Those 263 rules, as described in The Unicode Standard and the normative annexes 264 identified below, provide that: 266 1. New code points representing precombined characters that can be 267 formed from combining sequences will not be added to Unicode 268 unless neither the relevant base character nor the required combining 269 character is part of the Standard within the relevant script 270 [UAX15-Versioning]. 272 2. If circumstances require that principle be violated, 273 normalization stability requires that the newly-added character 274 decompose (even under NFC) to the previously-available combining 275 sequence [UAX15-Exclusion]. 277 There is no explicit provision in the Standard's discussion of 278 conditions for adding new code points, nor of normalization 279 stability, for an exception based on different languages using the 280 same script. 282 2.2. New code point U+08A1, decomposition, and language dependency 284 Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH 285 WITH HAMZA ABOVE. As can be deduced from the name, it is visually 286 identical to the glyph that can be formed from a combining sequence 287 consisting of the code point for ARABIC LETTER BEH (U+0628) and the 288 code point for Combining Hamza Above (U+0654). 
The two rules 289 summarized above suggest either that the new code point should not be 290 allocated at all or that it should have a decomposition to 291 \u'0628'\u'0654'. 293 Had the issues outlined in this document been better understood at 294 the time, it probably would have been wise for RFC 5892 to disallow 295 either the precomposed character or the combining sequence of each 296 pair in those cases in which Unicode normalization rules do not cause 297 the right thing to happen, i.e., the combining sequence and 298 precomposed character to be treated as equivalent. Failure to do so 299 at the time places an extra burden on registries to be sure that 300 conflicts (and the potential for confusion and attacks) do not exist. 301 Oddly, had the exclusion been made part of the specification at that 302 time, the preference for precombined forms noted above would probably 303 have dictated excluding the combining sequence, something not 304 otherwise done in IDNA2008 because the NFC requirement serves the 305 same purpose. Today, the only thing that can be excluded without the 306 potential disruption of disallowing a previously-PVALID combining 307 sequence is the newly-added code point, so whatever is 308 done, or might have been contemplated with hindsight, will be 309 somewhat inconsistent. 311 2.3. Other examples of the same behavior 313 One of the things that complicates the issue with the new U+08A1 code 314 point is that there are several other Arabic-script code points that 315 behave in the same way for similar language-specific reasons. 317 In particular, at least three other grapheme clusters that have been 318 present for many versions of Unicode can be seen as involving issues 319 similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA 320 ABOVE. 
ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER 321 REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are 322 preferred over combining sequences using HAMZA ABOVE (U+0654) 323 [Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE 324 (U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH 325 HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the 326 precomposed character and combining sequences compare equal when both 327 are normalized, as this specification prefers. 329 There are other variations in which a precomposed character involving 330 HAMZA ABOVE has a decomposition to a combining sequence that can form 331 it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a 332 compatibility (???) decomposition into the combining sequence 333 \u'06C7'\u'0674'. 335 2.4. Hamza and Combining Sequences 337 As the Unicode Standard points out at some length [Unicode62-Arabic], 338 Hamza is a problematic abstract character and the "Hamza Above" 339 construction even more so [Unicode62-Hamza]. Those sections explain 340 a distinction made by Unicode between the use of a Hamza mark to 341 denote a glottal stop and one used as a diacritic mark to denote a 342 separate letter. In the first case, the combining sequence is used. 343 In the second, a precombined character is assigned. 345 Unlike Unicode generally and because of concerns about identifier 346 spoofing and attacks based on similarities, character distinctions in 347 IDNA are based much more strictly on the appearance of characters; 348 language and pronunciation distinctions within a script are not 349 considered. 
So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- 350 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them 351 is written as U+08A1 (new to Unicode 7.0.0) and the other as the 352 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also 353 available in versions of Unicode going back at least to the version 354 [Unicode32] used in the original version of IDNA [RFC3490]). Because 355 the precomposed form and combining sequence are, for IDNA purposes, 356 the same, IDNA expects that normalization (specifically the 357 requirement that all U-labels be in NFC form) will cause them to 358 compare equal. 360 If Unicode also considered them the same, then the principle would 361 apply that a new precomposed ("composition") form is not added unless 362 one of the code points that could be used to construct it did not 363 exist in an earlier version (and even then its addition is 364 discouraged) [UAX15-Versioning]. When exceptions are made, they are 365 expected to conform to the rules and classes in the "Composition 366 Exclusion Table", with class 2 being relevant to this case 367 [UAX15-Exclusion]. That rule essentially requires that the 368 normalization for the old combining sequence to itself be retained 369 (for stability) but that the newly-added character be treated as 370 canonically decomposable and decompose back to the older sequence 371 even under NFC. That was not done for this particular case, 372 presumably because of the distinction about pronunciation modifiers 373 versus separate letters noted above. Because, for IDNA and the DNS, 374 there is a possibility that the composing sequence \u'0628'\u'0654' 375 already appears in labels, the only choice other than allowing an 376 otherwise-identical, and identically-appearing, label with U+08A1 377 substituted to identify a different DNS entry is to DISALLOW the new 378 character. 380 3. 
Proposed/ Alternative Changes to RFC 5892 for new character U+08A1 382 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and 383 the first paragraph of each Subsection below for the status of the 384 Subsections that follow. Each one, in combination with the material 385 in Section 2 above, also provides information about the reasons why 386 that particular strategy is appropriate. 388 3.1. Disallow This New Code Point 390 If chosen by the community, this subsection would update the portion 391 of the IDNA2008 specification that identifies rules for what 392 characters are permitted [RFC5892] to disallow that code point. 394 With the publication of this document, Section 2.6 ("Exceptions (F)") 395 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in 396 Category F so that the rule itself reads: 398 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 399 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 400 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 401 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, 402 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 403 303B, 30FB} 405 and then add to the subtable designated 406 "DISALLOWED -- Would otherwise have been PVALID" 407 after the line that begins "07FA", the additional line: 409 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 411 This has the effect of making the cited code point DISALLOWED 412 independent of application of the rest of the IDNA rule set to the 413 current version of Unicode. Those wishing to create domain name 414 labels containing Beh with Hamza Above may continue to use the 415 sequence 417 U+0628, ARABIC LETTER BEH 418 followed by 420 U+0654, ARABIC HAMZA ABOVE 422 which was valid for IDNA purposes in Unicode 5.0 and earlier and 423 which continues to be valid. 425 In principle, much the same thing could be accomplished by using the 426 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892 427 Section 5.3). 
However, that category is described as applying only 428 when "property values in versions of Unicode after 5.2 have changed 429 in such a way that the derived property value would no longer be 430 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in 431 Unicode 7.0.0 and no property values of code points in prior versions 432 have changed, category G does not apply. If that section of RFC 5892 433 were to be replaced in the future, perhaps consideration should be 434 given to adding Normalization Stability and other issues to that 435 description but, at present, it is not relevant. 437 3.2. Disallow the combining sequences for these characters 439 If chosen by the community, this subsection would update the portion 440 of the IDNA2008 specification that identifies contextual rules 441 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction 442 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that 443 the choice of this option is consistent with the general preference 444 for precomposed characters discussed above but would ban some labels 445 that are valid today and that might, in principle, be in use. 447 The required prohibition could be imposed by creating a new 448 contextual rule in RFC 5892 to constrain combining sequences 449 containing Hamza Above. 451 As the Unicode Standard points out at some length [Unicode62-Arabic], 452 Hamza is a problematic abstract character and the "Hamza Above" 453 construction even more so. IDNA has historically associated 454 characters whose use is reasonable in some contexts but not others 455 with the special derived property "CONTEXTO" and then specified 456 specific, context-dependent, rules about where they may be used. 457 Because Hamza Above is problematic (and spawns edge cases, as 458 discussed in the Unicode Standard section cited above), it was 459 suggested that a contextual rule might be appropriate. 
There are at 460 least two reasons why a contextual rule would not be suitable for the 461 present situation. 463 1. As discussed above, the present situation is a normalization 464 stability and predictability problem, not a contextual one. Had 465 the same issues arisen with a newly-added precomposed character 466 that could previously be constructed from non-problematic base 467 and combining characters, it would be even more clearly a 468 normalization issue and, following the principles discussed there 469 and particularly in UAX 15 [UAX15-Exclusion], might not have been 470 assigned at all. 472 2. The contextual rule sets are designed around restricting the use 473 of code points to a particular script or adjacent to particular 474 characters within that script. Neither of these cases applies to 475 the newly-added character even if one could imagine rules for the 476 use of Hamza Above (U+0654) that would reflect the considerations 477 of Chapter 8 of Unicode 6.2. Even had the latter been desired, 478 it would be somewhat late now -- Hamza Above has been present as 479 a combining character (U+0654) in many versions of Unicode. 480 While that section of the Unicode Standard describes the issues, 481 it does not provide actionable guidance about what to do about it 482 for cases going forward or when visual identity is important. 484 3.3. Do Nothing Other Than Warn 486 The recommendation from UTC is to simply warn registries, at all 487 levels of the tree, to be careful with this set of characters, making 488 language distinctions within zones. Because the DNS cannot make or 489 enforce language distinctions, this suggestion is problematic but it 490 would avoid having the IETF either invalidating label strings that 491 are potentially now in use or creating inconsistencies among the 492 characters that combine with Hamza Above but that also have 493 precomposed forms that do not have decompositions. 
The potential
494     would still exist for registries to respect the warning and deprecate
495     such labels if they existed.

497  3.4.  Normalization Form IETF (or DNS)

499     The most radical possibility would be to decide that none of the
500     Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate
501     for use with the DNS because, contrary to their apparent
502     descriptions, normalization tables are actually determined using
503     language information.  However, use of language information is
504     unacceptable for IDNA for reasons described elsewhere in this
505     document.  The remedy would be to define an IETF-specific (or DNS-
506     specific) normalization form, building on NFC but adhering strictly
507     to the rule that normalization causes two different forms of the same
508     character (glyph image) within the same script to be treated as
509     equal.  In practice such a form would be implemented for IDNA
510     purposes as an additional rule within RFC 5892 (and its successors)
511     that constituted an exception list for the NFC tables.  For this set
512     of characters, the special IETF normalization form would be
513     equivalent to the exclusion discussed in Section 3.2 above.

515  4.  Editorial clarification to RFC 5892

517     Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
518     clarification to Appendix A and Section A.1 of RFC 5892.  This
519     section of this document updates the RFC to apply that clarification.

521     1.  In Appendix A, add a new paragraph after the paragraph that
522         begins "The code point...".  The new paragraph should read:

524         "For the rule to be evaluated to True for the label, it MUST be
525         evaluated separately for every occurrence of the Code point in
526         the label; each of those evaluations must result in True."

528     2.  In Appendix A, Section A.1, replace the "Rule Set" by

530         Rule Set:
531            False;
532            If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
533            If cp .eq. \u200C And
534                   RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
535               (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

537  5.  Acknowledgements

539     The Unicode 7.0.0 changes were extensively discussed within the IAB's
540     Internationalization Program.  The authors are grateful for the
541     discussions and feedback there, especially from Andrew Sullivan and
542     David Thaler.  Additional information was requested and received from
543     Mark Davis and Ken Whistler.  While they probably do not agree with
544     the necessity of excluding this code point or taking even more
545     drastic action, as their responsibility is to look at the Unicode
546     Consortium requirements for stability, the decision would not have
547     been possible without their input.  Thanks to Bill McQuillan and Ted
548     Hardie for reading versions of the document carefully enough to
549     identify and report some confusing typographical errors.  Several
550     experts and reviewers who prefer to remain anonymous also provided
551     helpful input and comments on preliminary versions of this document.

553  6.  IANA Considerations

555     When the IANA registry and tables are updated to reflect Unicode
556     7.0.0, changes should be made according to the decisions the IETF
557     makes about Section 3.

559  7.  Security Considerations

561     [[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of
562     this document relative to what appeared in -00.  It will need to be
563     rewritten once decisions are made about what path to follow.  In
564     particular, if "just warn" is chosen, it will need to contain very
565     strong warnings.]]

567     This specification excludes a code point for which the Unicode-
568     specified normalization behavior could result in two ways to form a
569     visually-identical character within the same script not comparing
570     equal.  That behavior could create a dream case for someone intending
571     to confuse the user by use of a domain name that looked identical to
572     another one, was entirely in the same script, but was still
573     considered different (see, for example, the discussion of false
574     negatives in identifier comparison in Section 2.1 of RFC 6943
575     [RFC6943]).  This exclusion therefore should improve Internet
576     security.

578  8.  References

580  8.1.  Normative References

582     [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
583                137, RFC 5137, February 2008.

585     [RFC5890]  Klensin, J., "Internationalized Domain Names for
586                Applications (IDNA): Definitions and Document Framework",
587                RFC 5890, August 2010.

589     [RFC5892]  Faltstrom, P., "The Unicode Code Points and
590                Internationalized Domain Names for Applications (IDNA)",
591                RFC 5892, August 2010.

593     [RFC5892Erratum]
594                "RFC5892, "The Unicode Code Points and Internationalized
595                Domain Names for Applications (IDNA)", August 2010, Errata
596                ID: 3312", Errata ID 3312, August 2012,
597                .

599     [RFC5894]  Klensin, J., "Internationalized Domain Names for
600                Applications (IDNA): Background, Explanation, and
601                Rationale", RFC 5894, August 2010.

603     [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
604                Purposes", RFC 6943, May 2013.

606     [UAX15]    Davis, M., Ed., "Unicode Standard Annex #15: Unicode
607                Normalization Forms", June 2014,
608                .

610     [UAX15-Exclusion]
611                "Unicode Standard Annex #15: op. cit., Section 5",
612                .

615     [UAX15-Versioning]
616                "Unicode Standard Annex #15, op. cit., Section 3",
617                .

619     [Unicode5]
620                The Unicode Consortium, "The Unicode Standard, Version
621                5.0", ISBN 0-321-48091-0, 2007.

623                Boston, MA, USA: Addison-Wesley.  ISBN 0-321-48091-0.
624                This printed reference has now been updated online to
625                reflect additional code points.  For code points, the
626                reference at the time RFC 5890-5894 were published is to
627                Unicode 5.2.

629     [Unicode62]
630                The Unicode Consortium, "The Unicode Standard, Version
631                6.2.0", ISBN 978-1-936213-07-8, 2012,
632                .

634                Preferred citation: The Unicode Consortium.  The Unicode
635                Standard, Version 6.2.0, (Mountain View, CA: The Unicode
636                Consortium, 2012.  ISBN 978-1-936213-07-8)

638     [Unicode62-Arabic]
639                "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
640                Chapter 8, 2012,
641                .

643                Subsection titled "Encoding Principles", paragraph
644                numbered 4, starting on page 251.

646     [Unicode62-Hamza]
647                "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
648                Chapter 8, 2012,
649                .

651                Subsection titled "Combining Hamza Above" starting on page
652                263.

654     [Unicode7]
655                The Unicode Consortium, "The Unicode Standard, Version
656                7.0.0", ISBN 978-1-936213-09-2, 2014,
657                .

659                Preferred Citation: The Unicode Consortium.  The Unicode
660                Standard, Version 7.0.0, (Mountain View, CA: The Unicode
661                Consortium, 2014.  ISBN 978-1-936213-09-2)

663  8.2.  Informative References

665     [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
666                "Internationalizing Domain Names in Applications (IDNA)",
667                RFC 3490, March 2003.

669     [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
670                Internationalized Domain Names for Applications (IDNA) -
671                Unicode 6.0", RFC 6452, November 2011.

673     [Unicode32]
674                The Unicode Consortium, "The Unicode Standard, Version
675                3.2.0", .

677                The Unicode Standard, Version 3.2.0 is defined by The
678                Unicode Standard, Version 3.0 (Reading, MA, Addison-
679                Wesley, 2000.  ISBN 0-201-61633-5), as amended by the
680                Unicode Standard Annex #27: Unicode 3.1
681                (http://www.unicode.org/reports/tr27/) and by the Unicode
682                Standard Annex #28: Unicode 3.2
683                (http://www.unicode.org/reports/tr28/).

685  Appendix A.  Change Log

687     RFC Editor: Please remove this appendix before publication.

689  A.1.  Changes from version -00 to -01

691     o  Version 01 of this document is an extensive rewrite and
692        reorganization, reflecting discussions with UTC members and adding
693        three more options for discussion to the original proposal to
694        simply disallow the new code point.

696  A.2.  Changes from version -01 to -02

698     Corrected a typographical error in which Hamza Above was incorrectly
699     listed with the wrong code point.

701  A.3.  Changes from version -02 to -03

703     Corrected a typographical error in the Abstract in which RFC 5892 was
704     incorrectly shown as 5982.

706  Authors' Addresses
707     John C Klensin
708     1770 Massachusetts Ave, Ste 322
709     Cambridge, MA  02140
710     USA

712     Phone: +1 617 245 1457
713     Email: john-ietf@jck.com

715     Patrik Faltstrom
716     Netnod
717     Franzengatan 5
718     Stockholm  112 51
719     Sweden

721     Phone: +46 70 6059051
722     Email: paf@netnod.se