idnits 2.17.00 (12 Aug 2021) /tmp/idnits2967/draft-klensin-idna-5892upd-unicode70-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 522: '...ated to True for the label, it MUST be...' -- The draft header indicates that this document updates RFC5892, but the abstract doesn't seem to mention this, which it should. -- The abstract seems to indicate that this document updates RFC5982, but the header doesn't have an 'Updates:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year (Using the creation date from RFC5892, updated by this document, for RFC5378 checks: 2008-04-26) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (December 7, 2014) is 2721 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also mentioned in 'RFC5892'. ** Downref: Normative reference to an Informational RFC: RFC 5894 ** Downref: Normative reference to an Informational RFC: RFC 6943 -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion' -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Versioning' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Arabic' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Hamza' -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7' -- Obsolete informational reference (is this intentional?): RFC 3490 (Obsoleted by RFC 5890, RFC 5891) Summary: 3 errors (**), 0 flaws (~~), 1 warning (==), 14 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Klensin 3 Internet-Draft 4 Updates: 5892, 5894 (if approved) P. Faltstrom 5 Intended status: Standards Track Netnod 6 Expires: June 10, 2015 December 7, 2014 8 IDNA Update for Unicode 7.0.0 9 draft-klensin-idna-5892upd-unicode70-01.txt 11 Abstract 13 The current version of the IDNA specifications anticipated that each 14 new version of Unicode would be reviewed to verify that no changes 15 had been introduced that required adjustments to the set of rules 16 and, in particular, whether new exceptions or backward compatibility 17 adjustments were needed. 
That review was conducted for Unicode 7.0.0 18 and identified a potentially problematic new code point. This 19 specification discusses that code point and associated issues and 20 updates RFC 5982 accordingly. It also applies an editorial 21 clarification that was the subject of an earlier erratum. In 22 addition, the discussion of the specific issue updates RFC 5894. 24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at http://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on June 10, 2015. 41 Copyright Notice 43 Copyright (c) 2014 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (http://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 59 2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5 60 2.1. IDNA assumptions about Unicode normalization . . . . . 
. 5 61 2.2. New code point U+08A1, decomposition, and language 62 dependency . . . . . . . . . . . . . . . . . . . . . . . 6 63 2.3. Other examples of the same behavior . . . . . . . . . . . 7 64 2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8 65 3. Proposed/ Alternative Changes to RFC 5892 for new character 66 U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 67 3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9 68 3.2. Disallow the combining sequences for these characters . . 10 69 3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11 70 3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11 71 4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11 72 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 73 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 74 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 75 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 76 8.1. Normative References . . . . . . . . . . . . . . . . . . 13 77 8.2. Informative References . . . . . . . . . . . . . . . . . 14 78 Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15 79 A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15 80 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 82 1. Introduction 84 The current version of the IDNA specifications, known as "IDNA2008" 85 [RFC5890], anticipated that each new version of Unicode would be 86 reviewed to verify that no changes had been introduced that required 87 adjustments to IDNA's rules and, in particular, whether new 88 exceptions or backward compatibility adjustments were needed. When 89 that review was carefully conducted for Unicode 7.0.0 [Unicode7], 90 comparing it to prior versions including the text in Unicode 6.2 91 [Unicode62], it identified a problematic new code point (U+08A1, 92 ARABIC LETTER BEH WITH HAMZA ABOVE). 
The specific problem is 93 discussed in detail in Section 2. The behavior of that code point, 94 while non-optimal for IDNA, follows that of a few code points that 95 predate Unicode 7.x and even the IDNA2008 specifications and Unicode 96 6.0. Those existing code points make the question of what, if 97 anything, to do about this new one exceedingly problematic because 98 different reasonable criteria yield different decisions, 99 specifically: 101 o To disallow it as an IDNA exception case creates inconsistencies 102 with how those earlier code points were handled. 104 o To disallow it and the similar code points as well would 105 necessitate invalidating some potential labels that would have 106 been valid under IDNA2008 until this time. However, there is 107 reason to believe that no such labels exist. 109 o To permit the new code point to be treated as PVALID creates a 110 situation in which it is possible, within the same script, to 111 compose the same character symbol (glyph) in two different ways 112 that do not compare equal even after normalization. That 113 condition would then apply to it and the earlier code points with 114 the same behavior. That situation contradicts a fundamental 115 assumption of IDNA that is discussed in more detail below. 117 NOTE IN DRAFT: 119 This working draft discusses four alternatives, including, for 120 illustration, a radical idea that seems too drastic to be 121 considered now although it would have been appropriate to discuss 122 when the IDNA2008 specifications were being developed. The 123 authors suggest that the community discuss the relevant tradeoffs 124 and make a decision and that the document then be revised to 125 reflect that decision, with the other alternatives discussed as 126 options not chosen. Because there is no ideal choice, the 127 discussion of the issues in Section 2 is probably as important as, or more 128 important than, the particular choice of how to handle this code 129 point.
In addition to providing information for this document, 130 that section should be considered as an updating addendum to RFC 131 5894 [RFC5894] and should be incorporated into any future revision 132 of that document. 134 Because this version of the document contains several 135 alternative proposals, some of the text is somewhat 136 redundant. That will be corrected in future versions. 138 As anticipated when IDNA2008, and RFC 5892 in particular, were 139 written, exceptions and explicit updates are likely to be needed only 140 if there is disagreement between the Unicode Consortium's view about 141 what is best for the Standard and the IETF's view of what is best for 142 IDNs, the DNS, and IDNA. It was hoped that a situation would never 143 arise in which the two perspectives would disagree, but the 144 possibility was anticipated and considerable mechanism added to RFC 145 5890 and 5892 as a result. It is probably important to note that a 146 disagreement in this context does not imply that anyone is "wrong", 147 only that the two different groups have different needs and therefore 148 different criteria about what is acceptable. For that reason, the IETF has, in 149 the past, allowed some characters for IDNA that active Unicode 150 Technical Committee members suggested be disallowed to avoid a change 151 in derived tables [RFC6452]. This document describes a case where 152 the IETF should disallow a character or characters that the various 153 properties would otherwise treat as PVALID. 155 This document provides the "flagging for the IESG" specified by 156 Section 5.1 of RFC 5892. As specified there, the change itself 157 requires IETF review because it alters the rules of Section 2 of that 158 document. 160 Readers of this document are expected to be familiar with Unicode 161 terminology [Unicode62] and the IETF conventions for representing 162 Unicode code points [RFC5137].
164 As a convenience to readers of RFC 5892 and to reduce the risks of 165 confusion, this document also formally applies the content of an 166 erratum to the text of the RFC (see Section 4) and so brings that RFC 167 up to date with all agreed changes. 169 [[RFC Editor: please remove the following comment and note if they 170 get to you.]] 172 [[IESG: It might not be a bad idea to incorporate some version of 173 the following into the Last Call announcement.]] 175 NOTE IN DRAFT to IETF Reviewers: The issues in this document, and 176 particularly the choices among options for either adding exception 177 cases to RFC 5892 or ignoring the issue, warning people, and 178 hoping the results do not include serious problems, are fairly 179 esoteric. Understanding them requires that one have at least some 180 understanding of how the Arabic Script works and the reasons the 181 Unicode Standard gives various Arabic Script characters a fairly 182 extended discussion [Unicode62-Arabic]. It also requires 183 understanding of a number of Unicode principles, including the 184 Normalization Stability rules [UAX15-Versioning] as applied to new 185 precomposed characters and guidelines for adding new characters. 186 There is considerable discussion of the issues in Section 2 and 187 references are provided for those who want to pursue them, but 188 potential reviewers should assume that the background needed to 189 understand the reasons for this change is no less deep in the 190 subject matter than would be expected of someone reviewing a 191 proposed change in, e.g., the fundamentals of BGP, TCP congestion 192 control, or some cryptographic algorithm. Put more bluntly, one's 193 ability to read or speak languages other than English, or even one 194 or more languages that use the Arabic script, does not make one an 195 expert in these matters. 197 2. Problem Description 199 2.1. 
IDNA assumptions about Unicode normalization 201 IDNA makes several assumptions about Unicode, Unicode "characters", 202 and the effects of normalization. Those assumptions were based on 203 careful reading of the Unicode Standard at the time [Unicode5], 204 guided by advice and commitments by members of the Unicode Technical 205 Committee. Those assumptions, and the associated requirements, are 206 necessitated by three properties of DNS labels that do not apply to 207 blocks of running text: 209 1. There is no language context for a label. While particular DNS 210 zones may impose restrictions, including language or script 211 restrictions, on what labels can be registered, neither the DNS 212 nor IDNA impose either type of restriction or give the user of a 213 label any indication about the registration or other restrictions 214 that may have been imposed. 216 2. Labels are often mnemonics rather than words in any language. 217 They may be abbreviations or acronyms or contain embedded digits 218 and have other characteristics that are not typical of words. 220 3. Labels are, in practice, usually short. Even when they are the 221 maximum length allowed by the DNS and IDNA, they are typically 222 too short to provide significant context. Statements that 223 suggest that languages can almost always be determined from 224 relatively short paragraphs or equivalent bodies of text do not 225 apply to DNS labels because of their typical short length and 226 because, as noted above, they are not required to be formed 227 according to language-based rules. 229 At the same time, because the DNS is an exact-match system, there 230 must be no ambiguity about whether two labels are equal. Although 231 there have been extensive discussions about "confusingly similar" 232 characters, labels, and strings, such tests between scripts are 233 always somewhat subjective: they are affected by choices of type 234 styles and by what the user expects to see. 
In spite of the fact 235 that the glyphs that represent many characters in different scripts 236 are identical in appearance (e.g., basic Latin "a" (U+0061) and the 237 identical-appearing Cyrillic character (U+0430)), the most important 238 test is that, if two glyphs are the same within a given script, they 239 must represent the same character no matter how they are formed. 241 Unicode normalization, as explained in [UAX15], is expected to 242 resolve those "same script, same glyph, different formation methods" 243 issues. Within the Latin script, the code point sequence for lower 244 case "o" (U+006F) and combining diaeresis (U+0308) will, when 245 normalized using the "NFC" method required by IDNA, produce the 246 precombined small letter o with diaeresis (U+00F6) and hence the two 247 ways of forming the character will compare equal (and the combining 248 sequence is effectively prohibited from U-labels). 250 NFC was preferred over other normalization methods for IDNA because 251 it is more compact, more likely to be produced on keyboards on which 252 the relevant characters actually appeared, and because it does not 253 lose substantive information (e.g., some types of compatibility 254 equivalence involve judgment calls as to whether two characters are 255 actually the same -- they may be "the same" in some contexts but not 256 others -- while canonical equivalence is about different ways to 257 produce the glyph for the same abstract character). 259 IDNA also assumed that the extensive Unicode stability rules would be 260 applied and work as specified when new code points were added. Those 261 rules, as described in The Unicode Standard and the normative annexes 262 identified below, provide that: 264 1. 
New code points representing precombined characters that can be 265 formed from combining sequences will not be added to Unicode 266 unless neither the relevant base character nor required combining 267 character are part of the Standard within the relevant script 268 [UAX15-Versioning]. 270 2. If circumstances require that principle be violated, 271 normalization stability requires that the newly-added character 272 decompose (even under NFC) to the previously-available combining 273 sequence [UAX15-Exclusion]. 275 There is no explicit provision in the Standard's discussion of 276 conditions for adding new code points, nor of normalization 277 stability, for an exception based on different languages using the 278 same script. 280 2.2. New code point U+08A1, decomposition, and language dependency 282 Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH 283 WITH HAMZA ABOVE. As can be deduced from the name, it is visually 284 identical to the glyph that can be formed from a combining sequence 285 consisting of the code point for ARABIC LETTER BEH (U+0628) and the 286 code point for Combining Hamza Above (U+0654). The two rules 287 summarized above suggest that either the new code point should not be 288 allocated at all or that it should have a decomposition to 289 \u'0628'\u'0654'. 291 Had the issues outlined in this document been better understood at 292 the time, it probably would have been wise for RFC 5892 to disallow 293 either the precomposed character or the combining sequence of each 294 pair in those cases in which Unicode normalization rules do not cause 295 the right thing to happen, i.e., the combining sequence and 296 precomposed character to be treated as equivalent. Failure to do so 297 at the time places an extra burden on registries to be sure that 298 conflicts (and the potential for confusion and attacks) do not exist. 
299 Oddly, had the exclusion been made part of the specification at that 300 time, the preference for precombined forms noted above would probably 301 have dictated excluding the combining sequence, something not 302 otherwise done in IDNA2008 because the NFC requirement serves the 303 same purpose. Today, the only thing that can be excluded without the 304 potential disruption of disallowing a previously-PVALID combining 305 sequence is the newly-added code point, so whatever is 306 done, or might have been contemplated with hindsight, will be 307 somewhat inconsistent. 309 2.3. Other examples of the same behavior 311 One of the things that complicates the issue with the new U+08A1 code 312 point is that there are several other Arabic-script code points that 313 behave in the same way for similar language-specific reasons. 315 In particular, at least three other grapheme clusters that have been 316 present for many versions of Unicode can be seen as involving issues 317 similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA 318 ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER 319 REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are 320 preferred over combining sequences using HAMZA ABOVE (U+0654) 321 [Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE 322 (U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH 323 HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the 324 precomposed character and combining sequences compare equal when both 325 are normalized, as this specification prefers. 327 There are other variations in which a precomposed character involving 328 HAMZA ABOVE has a decomposition to a combining sequence that can form 329 it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a 330 compatibility decomposition into the combining sequence 331 \u'06C7'\u'0674'. 333 2.4. 
Hamza and Combining Sequences 335 As the Unicode Standard points out at some length [Unicode62-Arabic], 336 Hamza is a problematic abstract character and the "Hamza Above" 337 construction even more so [Unicode62-Hamza]. Those sections explain 338 a distinction made by Unicode between the use of a Hamza mark to 339 denote a glottal stop and one used as a diacritic mark to denote a 340 separate letter. In the first case, the combining sequence is used. 341 In the second, a precombined character is assigned. 343 Unlike Unicode generally and because of concerns about identifier 344 spoofing and attacks based on similarities, character distinctions in 345 IDNA are based much more strictly on the appearance of characters; 346 language and pronunciation distinctions within a script are not 347 considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- 348 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them 349 is written as U+08A1 (new to Unicode 7.0.0) and the other as the 350 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also 351 available in versions of Unicode going back at least to the version 352 [Unicode32] used in the original version of IDNA [RFC3490]). Because 353 the precomposed form and combining sequence are, for IDNA purposes, 354 the same, IDNA expects that normalization (specifically the 355 requirement that all U-labels be in NFC form) will cause them to 356 compare equal. 358 If Unicode also considered them the same, then the principle would 359 apply that new precomposed ("composition") forms are not added unless 360 one of the code points that could be used to construct them did not 361 exist in an earlier version (and even then it is 362 discouraged) [UAX15-Versioning]. When exceptions are made, they are 363 expected to conform to the rules and classes in the "Composition 364 Exclusion Table", with class 2 being relevant to this case 365 [UAX15-Exclusion]. 
That rule essentially requires that the 366 normalization for the old combining sequence to itself be retained 367 (for stability) but that the newly-added character be treated as 368 canonically decomposable and decompose back to the older sequence 369 even under NFC. That was not done for this particular case, 370 presumably because of the distinction about pronunciation modifiers 371 versus separate letters noted above. Because, for IDNA and the DNS, 372 there is a possibility that the composing sequence \u'0628'\u'0654' 373 already appears in labels, the only choice other than allowing an 374 otherwise-identical, and identically-appearing, label with U+08A1 375 substituted to identify a different DNS entry is to DISALLOW the new 376 character. 378 3. Proposed/ Alternative Changes to RFC 5892 for new character U+08A1 380 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and 381 the first paragraph of each Subsection below for the status of the 382 Subsections that follow. Each one, in combination with the material 383 in Section 2 above, also provides information about the reasons why 384 that particular strategy is appropriate. 386 3.1. Disallow This New Code Point 388 If chosen by the community, this subsection would update the portion 389 of the IDNA2008 specification that identifies rules for what 390 characters are permitted [RFC5892] to disallow that code point. 
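The normalization behavior that motivates this option can be checked directly. The following is a minimal sketch, not part of any IDNA implementation, that assumes Python's standard unicodedata module is an acceptable stand-in for the NFC tables; the helper name nfc_equal is hypothetical and introduced only for illustration.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings compare equal after NFC normalization.

    Hypothetical helper for illustration; IDNA itself requires U-labels to
    already be in NFC form rather than normalizing at comparison time.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin: "o" (U+006F) + COMBINING DIAERESIS (U+0308) composes to U+00F6,
# so the two ways of forming the character compare equal.
latin_equal = nfc_equal("\u006F\u0308", "\u00F6")

# Arabic YEH WITH HAMZA ABOVE (U+0626) has a canonical decomposition to
# U+064A U+0654, so the precomposed form and the sequence compare equal.
yeh_equal = nfc_equal("\u064A\u0654", "\u0626")

# Arabic BEH WITH HAMZA ABOVE (U+08A1) has no decomposition, so the
# precomposed form and the sequence U+0628 U+0654 remain distinct.
beh_equal = nfc_equal("\u0628\u0654", "\u08A1")

print(latin_equal, yeh_equal, beh_equal)
```

Under these assumptions the first two comparisons succeed and the third does not, which is the "same script, same glyph, different formation methods" case that Section 2 describes and that disallowing U+08A1 is intended to remove.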
392 With the publication of this document, Section 2.6 ("Exceptions (F)") 393 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in 394 Category F so that the rule itself reads: 396 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 397 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 398 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 399 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, 400 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 401 303B, 30FB} 403 and then add to the subtable designated 404 "DISALLOWED -- Would otherwise have been PVALID" 405 after the line that begins "07FA", the additional line: 407 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 409 This has the effect of making the cited code point DISALLOWED 410 independent of application of the rest of the IDNA rule set to the 411 current version of Unicode. Those wishing to create domain name 412 labels containing Beh with Hamza Above may continue to use the 413 sequence 415 U+0628, ARABIC LETTER BEH 416 followed by 418 U+0654, ARABIC HAMZA ABOVE 420 which was valid for IDNA purposes in Unicode 5.0 and earlier and 421 which continues to be valid. 423 In principle, much the same thing could be accomplished by using the 424 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892 425 Section 5.3). However, that category is described as applying only 426 when "property values in versions of Unicode after 5.2 have changed 427 in such a way that the derived property value would no longer be 428 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in 429 Unicode 7.0.0 and no property values of code points in prior versions 430 have changed, category G does not apply. If that section of RFC 5892 431 were to be replaced in the future, perhaps consideration should be 432 given to adding Normalization Stability and other issues to that 433 description but, at present, it is not relevant. 435 3.2. 
Disallow the combining sequences for these characters 437 If chosen by the community, this subsection would update the portion 438 of the IDNA2008 specification that identifies contextual rules 439 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction 440 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that 441 the choice of this option is consistent with the general preference 442 for precomposed characters discussed above but would ban some labels 443 that are valid today and that might, in principle, be in use. 445 The required prohibition could be imposed by creating a new 446 contextual rule in RFC 5892 to constrain combining sequences 447 containing Hamza Above. 449 As the Unicode Standard points out at some length [Unicode62-Arabic], 450 Hamza is a problematic abstract character and the "Hamza Above" 451 construction even more so. IDNA has historically associated 452 characters whose use is reasonable in some contexts but not others 453 with the special derived property "CONTEXTO" and then specified 454 specific, context-dependent, rules about where they may be used. 455 Because Hamza Above is problematic (and spawns edge cases, as 456 discussed in the Unicode Standard section cited above), it was 457 suggested that a contextual rule might be appropriate. There are at 458 least two reasons why a contextual rule would not be suitable for the 459 present situation. 461 1. As discussed above, the present situation is a normalization 462 stability and predictability problem, not a contextual one. Had 463 the same issues arisen with a newly-added precomposed character 464 that could previously be constructed from non-problematic base 465 and combining characters, it would be even more clearly a 466 normalization issue and, following the principles discussed there 467 and particularly in UAX 15 [UAX15-Exclusion], might not have been 468 assigned at all. 470 2. 
The contextual rule sets are designed around restricting the use 471 of code points to a particular script or adjacent to particular 472 characters within that script. Neither of these cases applies to 473 the newly-added character even if one could imagine rules for the 474 use of Hamza Above (U+0654) that would reflect the considerations 475 of Chapter 8 of Unicode 6.2. Even had the latter been desired, 476 it would be somewhat late now -- Hamza Above has been present as 477 a combining character (U+0654) in many versions of Unicode. 478 While that section of the Unicode Standard describes the issues, 479 it does not provide actionable guidance about what to do about it 480 for cases going forward or when visual identity is important. 482 3.3. Do Nothing Other Than Warn 484 The recommendation from UTC is to simply warn registries, at all 485 levels of the tree, to be careful with this set of characters, making 486 language distinctions within zones. Because the DNS cannot make or 487 enforce language distinctions, this suggestion is problematic but it 488 would avoid having the IETF either invalidating label strings that 489 are potentially now in use or creating inconsistencies among the 490 characters that combine with Hamza Above but that also have 491 precomposed forms that do not have decompositions. The potential 492 would still exist for registries to respect the warning and deprecate 493 such labels if they existed. 495 3.4. Normalization Form IETF (or DNS) 497 The most radical possibility would be to decide that none of the 498 Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate 499 for use with the DNS because, contrary to their apparent 500 descriptions, normalization tables are actually determined using 501 language information. However, use of language information is 502 unacceptable for IDNA for reasons described elsewhere in this 503 document. 
The remedy would be to define an IETF-specific (or DNS- 504 specific) normalization form, building on NFC but adhering strictly 505 to the rule that normalization causes two different forms of the same 506 character (glyph image) within the same script to be treated as 507 equal. In practice such a form would be implemented for IDNA 508 purposes as an additional rule within RFC 5892 (and its successors) 509 that constituted an exception list for the NFC tables. For this set 510 of characters, the special IETF normalization form would be 511 equivalent to the exclusion discussed in Section 3.2 above. 513 4. Editorial clarification to RFC 5892 515 Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a 516 clarification to Appendix A and Section A.1 of RFC 5892. This 517 section of this document updates the RFC to apply that clarification. 519 1. In Appendix A, add a new paragraph after the paragraph that 520 begins "The code point...". The new paragraph should read: 522 "For the rule to be evaluated to True for the label, it MUST be 523 evaluated separately for every occurrence of the Code point in 524 the label; each of those evaluations must result in True." 526 2. In Appendix A, Section A.1, replace the "Rule Set" by 528 Rule Set: 529 False; 530 If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; 531 If cp .eq. \u200C And 532 RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp 533 (Joining_Type:T)*(Joining_Type:{R,D})) Then True; 535 5. Acknowledgements 537 The Unicode 7.0.0 changes were extensively discussed within the IAB's 538 Internationalization Program. The authors are grateful for the 539 discussions and feedback there, especially from Andrew Sullivan and 540 David Thaler. 
   Additional information was requested and received from
   Mark Davis and Ken Whistler. While they probably do not agree with
   the necessity of excluding this code point or taking even more
   drastic action, as their responsibility is to look at the Unicode
   Consortium requirements for stability, the decision would not have
   been possible without their input. Several experts and reviewers who
   prefer to remain anonymous also provided helpful input and comments
   on preliminary versions of this document.

6.  IANA Considerations

   When the IANA registry and tables are updated to reflect Unicode
   7.0.0, changes should be made according to the decisions the IETF
   makes about Section 3.

7.  Security Considerations

   [[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of
   this document relative to what appeared in -00. It will need to be
   rewritten once decisions are made about what path to follow. In
   particular, if "just warn" is chosen, it will need to contain very
   strong warnings.]]

   This specification excludes a code point for which the Unicode-
   specified normalization behavior could result in two ways to form a
   visually-identical character within the same script not comparing
   equal. That behavior could create a dream case for someone intending
   to confuse the user by use of a domain name that looked identical to
   another one, was entirely in the same script, but was still
   considered different (see, for example, the discussion of false
   negatives in identifier comparison in Section 2.1 of RFC 6943
   [RFC6943]). This exclusion therefore should improve Internet
   security.

8.  References

8.1.  Normative References

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
              137, RFC 5137, February 2008.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC5892Erratum]
              "RFC5892, "The Unicode Code Points and Internationalized
              Domain Names for Applications (IDNA)", August 2010, Errata
              ID: 3312", Errata ID 3312, August 2012.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, August 2010.

   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
              Purposes", RFC 6943, May 2013.

   [UAX15]    Davis, M., Ed., "Unicode Standard Annex #15: Unicode
              Normalization Forms", June 2014.

   [UAX15-Exclusion]
              "Unicode Standard Annex #15, op. cit., Section 5".

   [UAX15-Versioning]
              "Unicode Standard Annex #15, op. cit., Section 3".

   [Unicode5]
              The Unicode Consortium, "The Unicode Standard, Version
              5.0", ISBN 0-321-48091-0, 2007.

              Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
              This printed reference has now been updated online to
              reflect additional code points. For code points, the
              reference at the time RFC 5890-5894 were published is to
              Unicode 5.2.

   [Unicode62]
              The Unicode Consortium, "The Unicode Standard, Version
              6.2.0", ISBN 978-1-936213-07-8, 2012.

              Preferred citation: The Unicode Consortium. The Unicode
              Standard, Version 6.2.0, (Mountain View, CA: The Unicode
              Consortium, 2012. ISBN 978-1-936213-07-8)

   [Unicode62-Arabic]
              "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
              Chapter 8, 2012.

              Subsection titled "Encoding Principles", paragraph
              numbered 4, starting on page 251.

   [Unicode62-Hamza]
              "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
              Chapter 8, 2012.

              Subsection titled "Combining Hamza Above" starting on page
              263.

   [Unicode7]
              The Unicode Consortium, "The Unicode Standard, Version
              7.0.0", ISBN 978-1-936213-09-2, 2014.

              Preferred Citation: The Unicode Consortium. The Unicode
              Standard, Version 7.0.0, (Mountain View, CA: The Unicode
              Consortium, 2014. ISBN 978-1-936213-09-2)

8.2.  Informative References

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA) -
              Unicode 6.0", RFC 6452, November 2011.

   [Unicode32]
              The Unicode Consortium, "The Unicode Standard, Version
              3.2.0".

              The Unicode Standard, Version 3.2.0 is defined by The
              Unicode Standard, Version 3.0 (Reading, MA, Addison-
              Wesley, 2000. ISBN 0-201-61633-5), as amended by the
              Unicode Standard Annex #27: Unicode 3.1
              (http://www.unicode.org/reports/tr27/) and by the Unicode
              Standard Annex #28: Unicode 3.2
              (http://www.unicode.org/reports/tr28/).

Appendix A.  Change Log

   RFC Editor: Please remove this appendix before publication.

A.1.  Changes from version -00 to -01

   o  Version 01 of this document is an extensive rewrite and
      reorganization, reflecting discussions with UTC members and adding
      three more options for discussion to the original proposal to
      simply disallow the new code point.

Authors' Addresses

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA 02140
   USA

   Phone: +1 617 245 1457
   Email: john-ietf@jck.com


   Patrik Faltstrom
   Netnod
   Franzengatan 5
   Stockholm 112 51
   Sweden

   Phone: +46 70 6059051
   Email: paf@netnod.se