Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.
Request for Comments: 5893                                        Google
Category: Standards Track                                        C. Karp
ISSN: 2070-1721                        Swedish Museum of Natural History
                                                             August 2010


                      Right-to-Left Scripts for
        Internationalized Domain Names for Applications (IDNA)

Abstract

   The use of right-to-left scripts in Internationalized Domain Names
   (IDNs) has presented several challenges. This memo provides a new
   Bidi rule for Internationalized Domain Names for Applications (IDNA)
   labels, based on the encountered problems with some scripts and some
   shortcomings in the 2003 IDNA Bidi criterion.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF). It represents the consensus of the IETF community. It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG). Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5893.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Alvestrand & Karp            Standards Track                    [Page 1]

RFC 5893                  IDNA Right to Left                 August 2010

Table of Contents

   1. Introduction
      1.1. Purpose and Applicability
      1.2. Background and History
      1.3. Structure of the Rest of This Document
      1.4. Terminology
   2. The Bidi Rule
   3. The Requirement Set for the Bidi Rule
   4. Examples of Issues Found with RFC 3454
      4.1. Dhivehi
      4.2. Yiddish
      4.3. Strings with Numbers
   5. Troublesome Situations and Guidelines
   6. Other Issues in Need of Resolution
   7. Compatibility Considerations
      7.1. Backwards Compatibility Considerations
      7.2. Forward Compatibility Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

1.1. Purpose and Applicability

   The purpose of this document is to establish a rule that can be
   applied to Internationalized Domain Name (IDN) labels in Unicode form
   (U-labels) containing characters from scripts that are written from
   right to left. It is part of the revised IDNA protocol [RFC5891].

   When labels satisfy the rule, and when certain other conditions are
   satisfied, there is only a minimal chance of these labels being
   displayed in a confusing way by the Unicode bidirectional display
   algorithm.

   The other normative documents in the IDNA2008 document set establish
   criteria for valid labels, including listing the permitted
   characters. This document establishes additional validity criteria
   for labels in scripts normally written from right to left.

   This specification is not intended to place any requirements on
   domain names that do not contain characters from such scripts.

1.2. Background and History

   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
   following statement in its Section 6 on the Bidi algorithm:

      3) If a string contains any RandALCat character, a RandALCat
         character MUST be the first character of the string, and a
         RandALCat character MUST be the last character of the string.

   (A RandALCat character is a character with unambiguously
   right-to-left directionality.)

   The reasoning behind this prohibition was to ensure that every
   component of a displayed domain name has an unambiguously preferred
   direction. However, this made certain words in languages written
   with right-to-left scripts invalid as IDN labels, and in at least one
   case (Dhivehi) meant that all the words of an entire language were
   forbidden as IDN labels.

   This is illustrated below with examples taken from the Dhivehi and
   Yiddish languages, as written with the Thaana and Hebrew scripts,
   respectively.

   RFC 3454 did not explicitly state the requirement to be fulfilled.
   Therefore, it is impossible to determine whether a simple relaxation
   of the rule would continue to fulfill the requirement.

   While this document specifies rules quite different from RFC 3454,
   most reasonable labels that were allowed under RFC 3454 will also be
   allowed under this specification (the most important example of
   non-permitted labels being labels that mix Arabic and European digits
   (AN and EN) inside an RTL label, and labels that use AN in an LTR
   label -- see Section 1.4 for terminology), so the operational impact
   of using the new rule in the updated IDNA specification is limited.

1.3. Structure of the Rest of This Document

   Section 2 defines a rule, the "Bidi rule", which can be used on a
   domain name label to check how safe it is to use in a domain name of
   possibly mixed directionality. The primary initial use of this rule
   is as part of the IDNA2008 protocol [RFC5891].

   Section 3 sets out the requirements for defining the Bidi rule.

   Section 4 gives detailed examples that serve as justification for the
   new rule.

   Section 5 to Section 8 describe various situations that can occur
   when dealing with domain names with characters of different
   directionality.

   Only Section 1.4 and Section 2 are normative.

1.4. Terminology

   The terminology used to describe IDNA concepts is defined in the
   Definitions document [RFC5890].

   The terminology used for the Bidi properties of Unicode characters is
   taken from the Unicode Standard [Unicode52].

   The Unicode Standard specifies a Bidi property for each character.
   That property controls the character's behavior in the Unicode
   bidirectional algorithm [Unicode-UAX9]. For reference, here are the
   values that the Unicode Bidi property can have:

   o  L - Left to right - most letters in LTR scripts

   o  R - Right to left - most letters in non-Arabic RTL scripts

   o  AL - Arabic letters - most letters in the Arabic script

   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)

   o  ES - European Number Separator (+ and -)

   o  ET - European Number Terminator (currency symbols, the hash sign,
      the percent sign and so on)

   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
      not the Extended Arabic-Indic numbers

   o  CS - Common Number Separator (. , / : et al)

   o  NSM - Nonspacing Mark - most combining accents

   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)

   o  B - Paragraph Separator

   o  S - Segment Separator

   o  WS - Whitespace, including the SPACE character

   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
      characters" and are not used in IDNA labels.
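   The class values listed above can be inspected programmatically.
   The following is a minimal sketch, assuming Python and its standard
   "unicodedata" module, whose bidirectional() function returns the Bidi
   class of a character as a string. The sample characters and the
   helper name bidi_classes are illustrative choices, not part of this
   specification, and the classes reported depend on the Unicode version
   bundled with the interpreter.

```python
import unicodedata

# Illustrative sample characters paired with the Bidi class each one is
# expected to carry according to the list above.
SAMPLES = {
    "a": "L",         # LATIN SMALL LETTER A
    "\u05d0": "R",    # HEBREW LETTER ALEF
    "\u0627": "AL",   # ARABIC LETTER ALEF
    "5": "EN",        # DIGIT FIVE
    "\u06f5": "EN",   # EXTENDED ARABIC-INDIC DIGIT FIVE
    "\u0665": "AN",   # ARABIC-INDIC DIGIT FIVE
    "-": "ES",        # HYPHEN-MINUS
    "%": "ET",        # PERCENT SIGN
    ".": "CS",        # FULL STOP
    "\u05b8": "NSM",  # HEBREW POINT QAMATS
    "\u200c": "BN",   # ZERO WIDTH NON-JOINER
    " ": "WS",        # SPACE
    "@": "ON",        # COMMERCIAL AT
}


def bidi_classes(text):
    """Return the Bidi class of every character of text, in network order."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")
```

   A helper of this kind is enough to turn a label into the class
   sequences (such as "R CS L EN ET") used in the examples of this
   document.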

   In this memo, we use "network order" to describe the sequence of
   characters as transmitted on the wire or stored in a file; the terms
   "first", "next", "previous", "beginning", "end", "before", and
   "after" are used to refer to the relationship of characters and
   labels in network order.

   We use "display order" to talk about the sequence of characters as
   imaged on a display medium; the terms "left" and "right" are used to
   refer to the relationship of characters and labels in display order.

   Most of the time, the examples use the abbreviations for the Unicode
   Bidi classes to denote the directionality of the characters; the
   example string CS L consists of one character of class CS and one
   character of class L. In some examples, the convention that
   uppercase characters are of class R or AL, and lowercase characters
   are of class L is used -- thus, the example string ABC.abc would
   consist of three right-to-left characters and three left-to-right
   characters.

   The directionality of such examples is determined by context -- for
   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
   first example string is in network order, the second example string
   is in display order.

   The term "paragraph" is used in the sense of the Unicode Bidi
   specification [Unicode-UAX9]. It means "a block of text that has an
   overall direction, either left to right or right to left",
   approximately; see the "Unicode Bidirectional Algorithm"
   [Unicode-UAX9] for details.

   "RTL" and "LTR" are abbreviations for "right to left" and "left to
   right", respectively.

   An RTL label is a label that contains at least one character of type
   R, AL, or AN.

   An LTR label is any label that is not an RTL label.

   A "Bidi domain name" is a domain name that contains at least one RTL
   label. (Note: This definition includes domain names containing only
   dots and right-to-left characters. Providing a separate category of
   "RTL domain names" would not make this specification simpler, so it
   has not been done.)
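   The definitions of "RTL label" and "Bidi domain name" translate
   directly into code. The sketch below assumes Python's standard
   "unicodedata" module; the function names are hypothetical, and
   splitting the name on FULL STOP (U+002E) only is a simplifying
   assumption of the example.

```python
import unicodedata

# Bidi classes whose presence makes a label an RTL label (Section 1.4).
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """True if the label contains at least one character of class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_bidi_domain_name(name):
    """True if at least one label of the dot-separated name is an RTL label."""
    return any(is_rtl_label(label) for label in name.split("."))
```

   Note that under this definition a label consisting of Latin letters
   followed by an Arabic-Indic digit (class AN) counts as an RTL label,
   even though it contains no right-to-left letter.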

2. The Bidi Rule

   The following rule, consisting of six conditions, applies to labels
   in Bidi domain names. The requirements that this rule satisfies are
   described in Section 3. All of the conditions must be satisfied for
   the rule to be satisfied.

   1. The first character must be a character with Bidi property L, R,
      or AL. If it has the R or AL property, it is an RTL label; if it
      has the L property, it is an LTR label.

   2. In an RTL label, only characters with the Bidi properties R, AL,
      AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.

   3. In an RTL label, the end of the label must be a character with
      Bidi property R, AL, EN, or AN, followed by zero or more
      characters with Bidi property NSM.

   4. In an RTL label, if an EN is present, no AN may be present, and
      vice versa.

   5. In an LTR label, only characters with the Bidi properties L, EN,
      ES, CS, ET, ON, BN, or NSM are allowed.

   6. In an LTR label, the end of the label must be a character with
      Bidi property L or EN, followed by zero or more characters with
      Bidi property NSM.
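   The six conditions above can be checked mechanically on a single
   label. The following is a non-normative sketch, assuming Python's
   standard "unicodedata" module as the source of Bidi property values;
   the function name satisfies_bidi_rule and the treatment of an empty
   label as failing are assumptions of the example. It tests only the
   Bidi rule and none of the other IDNA2008 validity criteria.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
RTL_END = {"R", "AL", "EN", "AN"}
LTR_END = {"L", "EN"}


def satisfies_bidi_rule(label):
    """Check one label (a U-label string) against conditions 1 to 6."""
    if not label:
        return False  # assumption: an empty label is treated as failing
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character must be L, R, or AL; it also
    # decides whether the label is handled as an RTL or an LTR label.
    first = classes[0]
    if first not in ("L", "R", "AL"):
        return False
    rtl = first in ("R", "AL")

    # Conditions 2 and 5: only the permitted classes may occur.
    allowed = RTL_ALLOWED if rtl else LTR_ALLOWED
    if any(cls not in allowed for cls in classes):
        return False

    # Conditions 3 and 6: skip trailing NSM characters, then test the
    # class of the character they follow. The first character is never
    # NSM here, so the index cannot run past the start of the label.
    end = len(classes) - 1
    while classes[end] == "NSM":
        end -= 1
    if classes[end] not in (RTL_END if rtl else LTR_END):
        return False

    # Condition 4: an RTL label must not contain both EN and AN.
    if rtl and "EN" in classes and "AN" in classes:
        return False
    return True
```

   For example, the label formed by HEBREW LETTER ALEF followed by DIGIT
   FIVE passes, while DIGIT FIVE followed by HEBREW LETTER ALEF fails
   condition 1, and an ASCII label ending in HYPHEN-MINUS fails
   condition 6.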

   The following guarantees can be made based on the above:

   o  In a domain name consisting of only labels that satisfy the rule,
      the requirements of Section 3 are satisfied. Note that even LTR
      labels and pure ASCII labels have to be tested.

   o  In a domain name consisting of only LDH labels (as defined in the
      Definitions document [RFC5890]) and labels that satisfy the rule,
      the requirements of Section 3 are satisfied as long as a label
      that starts with an ASCII digit does not come after a
      right-to-left label.

   No guarantee is given for other combinations.

3. The Requirement Set for the Bidi Rule

   This document, unlike RFC 3454 [RFC3454], provides an explicit
   justification for the Bidi rule, and states a set of requirements for
   which it is possible to test whether or not the modified rule
   fulfills the requirement.

   All the text in this document assumes that text containing the labels
   under consideration will be displayed using the Unicode bidirectional
   algorithm [Unicode-UAX9].

   The requirements proposed are these:

   o  Label Uniqueness: No two labels, when presented in display order
      in the same paragraph, should have the same sequence of characters
      without also having the same sequence of characters in network
      order, both when the paragraph has LTR direction and when the
      paragraph has RTL direction. (This is the criterion that is
      explicit in RFC 3454.) (Note that a label displayed in an RTL
      paragraph may display the same as a different label displayed in
      an LTR paragraph and still satisfy this criterion.)

   o  Character Grouping: When displaying a string of labels, using the
      Unicode Bidi algorithm to reorder the characters for display, the
      characters of each label should remain grouped between the
      characters delimiting the labels, both when the string is embedded
      in a paragraph with LTR direction and when it is embedded in a
      paragraph with RTL direction.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfill within the constraints of the
   Unicode bidirectional algorithm. These include:

   o  The appearance of a label should be unaffected by its embedding
      context. This proved impossible even for ASCII labels; the label
      "123-A" will have a different display order in an RTL context than
      in an LTR context. (This particular example is, however,
      disallowed anyway.)

   o  The sequence of labels should be consistent with network order.
      This proved impossible -- a domain name consisting of the labels
      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
      an LTR context. (In an RTL context, it will be displayed as
      L4.R3.R2.L1).

   o  No two domain names should be displayed the same, even under
      differing directionality. This was shown to be unsound, since the
      domain name (in network order) ABC.abc will have display order
      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
      domain name (network) abc.ABC will have display order abc.CBA in
      an LTR context and CBA.abc in an RTL context.

   One possible requirement was thought to be problematic, but turned
   out to be satisfied by a string that obeys the proposed rules:

   o  The Character Grouping requirement should be satisfied when
      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
      same paragraph (outside of the labels). Because these controls
      affect presentation order in non-obvious ways, by affecting the
      "sor" and "eor" properties of the Unicode Bidi algorithm, the
      conditions above require extra testing in order to figure out
      whether or not they influence the display of the domain name.
      Testing found that for the strings allowed under the rule
      presented in this document, directional controls do not influence
      the display of the domain name.

   This is still not stated as a requirement, since it did not seem as
   important as the stated requirements, but it is useful to know that
   Bidi domain names where the labels satisfy the rule have this
   property.

   In the following descriptions, first-level bullets are used to
   indicate rules or normative statements; second-level bullets are
   commentary.

   The Character Grouping requirement can be more formally stated as:

   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi
      properties CS, WS, ON. (These are commonly used to delimit labels
      -- both the FULL STOP and the space are included. They are not
      allowed in domain labels.)

      *  ET, though it commonly occurs next to domain names in practice,
         is problematic: the context R CS L EN ET (for instance A.a1%)
         makes the label L EN not satisfy the character grouping
         requirement.

      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
         used as a delimiter (for instance, the plus sign). It is left
         out here.

   o  Let "unproblematic label" be a label that either satisfies the
      requirements or does not contain any character with the Bidi
      properties R, AL, or AN and does not begin with a character with
      the Bidi property EN. (Informally, "it does not start with a
      number".)

   A label X satisfies the Character Grouping requirement when, for any
   delimiter characters D1 and D2 (members of Delimiterchars), and for
   any S1 and S2 that are each either an unproblematic label or an
   empty string, the following holds true:

   If the string formed by concatenating S1, D1, X, D2, and S2 is
   reordered according to the Bidi algorithm, then all the characters of
   X in the reordered string are between D1 and D2, and no other
   characters are between D1 and D2, both if the overall paragraph
   direction is LTR and if the overall paragraph direction is RTL.

   Note that the definition is self-referential, since S1 and S2 are
   constrained to be "legal" by this definition. This makes testing
   changes to proposed rules a little complex, but does not create
   problems for testing whether or not a given proposed rule satisfies
   the criterion.

   The "zero-length" case, in which S1 or S2 is an empty string,
   represents the case where a domain name is next to something that
   isn't a domain name, separated by a delimiter character.
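   The formal statement above can be phrased as a test over one
   S1, D1, X, D2, S2 combination. The sketch below is illustrative
   only: it does not implement the Unicode bidirectional algorithm.
   Instead it assumes a caller-supplied function, here called "reorder"
   (a hypothetical interface), that takes the text and a paragraph
   direction and returns the network-order indices of the characters in
   display order. All names are assumptions of the example.

```python
from typing import Callable, List

# Hypothetical interface: reorder(text, direction) returns the list of
# network-order indices of `text`, arranged in display order.
Reorder = Callable[[str, str], List[int]]


def satisfies_grouping(reorder: Reorder, s1: str, d1: str, x: str,
                       d2: str, s2: str) -> bool:
    """Test the Character Grouping statement for one combination."""
    if len(d1) != 1 or len(d2) != 1:
        raise ValueError("D1 and D2 must be single delimiter characters")
    text = s1 + d1 + x + d2 + s2
    d1_index = len(s1)
    d2_index = d1_index + 1 + len(x)
    x_indices = set(range(d1_index + 1, d2_index))
    for direction in ("LTR", "RTL"):
        order = reorder(text, direction)
        lo, hi = sorted((order.index(d1_index), order.index(d2_index)))
        # Exactly the characters of X must lie between D1 and D2.
        if set(order[lo + 1:hi]) != x_indices:
            return False
    return True


def identity(text: str, direction: str) -> List[int]:
    """Trivial stand-in for a Bidi implementation: no reordering at all."""
    return list(range(len(text)))
```

   With a real implementation of the bidirectional algorithm plugged in
   as "reorder", a proposed rule could be exercised by enumerating class
   sequences for X, S1, and S2 and reporting any combination for which
   the function returns False.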

   Note about the position of BN: The Unicode bidirectional algorithm
   specifies that a BN has an effect on the adjoining characters in
   network order, not in display order, and is therefore treated as if
   removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
   X9 and Section 5.3). Therefore, the question of "what position does
   a BN have after reordering" is not meaningful. It has been ignored
   while developing the rules here.

   The Label Uniqueness requirement can be formally stated as:

   If two non-identical labels X and Y, embedded as for the test above,
   displayed in paragraphs with the same directionality, are reordered
   by the Bidi algorithm into the same sequence of code points, the
   labels X and Y cannot both be legal.

4. Examples of Issues Found with RFC 3454

4.1. Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script. This script displays some of the characteristics of
   the Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters. This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks. Every Dhivehi word therefore ends with a combining
   mark.

   The word for "computer", which is romanized as "konpeetaru", is
   written with the following sequence of Unicode code points:

      U+0786 THAANA LETTER KAAFU (AL)

      U+07AE THAANA OBOFILI (NSM)

      U+0782 THAANA LETTER NOONU (AL)

      U+07B0 THAANA SUKUN (NSM)

      U+0795 THAANA LETTER PAVIYANI (AL)

      U+07A9 THAANA EEBEEFILI (NSM)

      U+0793 THAANA LETTER TAVIYANI (AL)

      U+07A6 THAANA ABAFILI (NSM)

      U+0783 THAANA LETTER RAA (AL)

      U+07AA THAANA UBUFILI (NSM)

   The directionality class of U+07AA in the Unicode database
   [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
   conformant implementation of the IDNA2003 algorithm will say that
   "this is not in RandALCat" and refuse to encode the string.
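   The rejection described above can be reproduced with a short sketch.
   It assumes Python, whose standard "stringprep" module exposes the
   RFC 3454 tables; in_table_d1() tests membership in Table D.1, the
   RandALCat characters. The function name rfc3454_first_last_ok is
   hypothetical, and the sketch covers only requirement 3 of Section 6
   of RFC 3454, not the rest of Stringprep.

```python
import stringprep
import unicodedata

# "konpeetaru" in network order, from the code point list above.
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"


def rfc3454_first_last_ok(s):
    """Requirement 3 of RFC 3454, Section 6: if any RandALCat character
    is present, both the first and the last character must be RandALCat."""
    if not any(stringprep.in_table_d1(ch) for ch in s):
        return True
    return stringprep.in_table_d1(s[0]) and stringprep.in_table_d1(s[-1])


if __name__ == "__main__":
    # The final character is a combining mark, so the old check fails.
    print(unicodedata.bidirectional(KONPEETARU[-1]))
    print(rfc3454_first_last_ok(KONPEETARU))
```

   Under condition 3 of the Bidi rule in Section 2, by contrast, the
   trailing NSM is skipped and the AL character before it is tested
   instead.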

4.2. Yiddish

   Yiddish is one of several languages written with the Hebrew script
   (others include Hebrew and Ladino). This is basically a consonantal
   alphabet (also termed an "abjad"), but Yiddish is written using an
   extended form that is fully vocalic. The vowels are indicated in
   several ways, one of which is by repurposing letters that are
   consonants in Hebrew. Other letters are used both as vowels and
   consonants, with combining marks, called "points", used to
   differentiate between them. Finally, some base characters can
   indicate several different vowels, which are also disambiguated by
   combining marks. Pointed characters can appear in word-final
   position and may therefore also be needed at the end of labels. This
   is not an invariable attribute of a Yiddish string and there is thus
   greater latitude here than there is with Dhivehi.

   The organization now known as the "YIVO Institute for Jewish
   Research" developed orthographic rules for modern Standard Yiddish
   during the 1930s on the basis of work conducted in several venues
   since earlier in that century. These are given in "The Standardized
   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
   as normatively descriptive of modern Standard Yiddish in any context
   where that notion is deemed relevant. They have been applied
   exclusively in all formal Yiddish dictionaries published since their
   establishment, and are similarly dominant in academic and
   bibliographic regards.

   It therefore appears appropriate for this repertoire also to be
   supported fully by IDNA. This presents no difficulty with characters
   in initial and medial positions, but pointed characters are regularly
   used in final position as well. All of the characters in the SYO
   repertoire appear in both marked and unmarked form with one
   exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
   to the Latin letter "f". There is, however, a separate unpointed
   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
   character when it appears in final position. The constraint on the
   use of the SYO repertoire resulting from the proscription of
   combining marks at the end of RTL strings thus reduces to nothing
   more, or less, than the equivalent of saying that a string of Latin
   characters cannot end with the letter "p". It must also be noted
   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
   characteristic of almost all traditional Yiddish orthographies that
   predate (or remain in use in parallel to) the SYO, being the first
   pointed character to appear in any of them.

   A more general instantiation of the basic problem can be seen in the
   representation of the YIVO acronym. This acronym is written with the
   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
   QAMATS are combining points. The Unicode code points are:

      U+05D9 HEBREW LETTER YOD (R)

      U+05B4 HEBREW POINT HIRIQ (NSM)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D0 HEBREW LETTER ALEF (R)

      U+05B8 HEBREW POINT QAMATS (NSM)

   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
   database is NSM, which again causes the IDNA2003 algorithm to reject
   the string.

   It may also be noted that all of the combined characters mentioned
   above exist in precomposed form at separate positions in the Unicode
   chart. However, by invoking Stringprep, the IDNA2003 algorithm also
   rejects those code points, for reasons not discussed here.

4.3. Strings with Numbers

   By requiring that both the first and the last character of a string
   be members of category R or AL, the Stringprep specification
   [RFC3454] prohibited a string containing right-to-left characters
   from ending with a number.

   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
   ALEF. Displayed in an LTR context, the first one will be displayed
   from left to right as 5 ALEF (with the 5 being considered right to
   left because of the leading ALEF), while 5 ALEF will be displayed in
   exactly the same order (5 taking the direction from context).
   Clearly, only one of those should be permitted as a registered label,
   but barring them both seems unnecessary.
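   How the Bidi rule of Section 2 separates these two strings can be
   shown with a small sketch, assuming Python's standard "unicodedata"
   module. The function below (a hypothetical name) looks only at the
   first and last characters, that is, at conditions 1, 3, and 6; it
   ignores trailing NSM characters and the remaining conditions, which
   do not matter for these two-character examples.

```python
import unicodedata

ALEF = "\u05d0"  # HEBREW LETTER ALEF, Bidi class R


def passes_first_and_last_checks(label):
    """Simplified check of conditions 1, 3, and 6 for labels without NSM."""
    first = unicodedata.bidirectional(label[0])
    last = unicodedata.bidirectional(label[-1])
    if first in ("R", "AL"):           # RTL label: condition 3
        return last in ("R", "AL", "EN", "AN")
    if first == "L":                   # LTR label: condition 6
        return last in ("L", "EN")
    return False                       # condition 1 not met


if __name__ == "__main__":
    for label in (ALEF + "5", "5" + ALEF):
        print([f"U+{ord(ch):04X}" for ch in label],
              passes_first_and_last_checks(label))
```

   ALEF 5 is accepted because it starts with an R character and ends
   with an EN character; 5 ALEF is rejected because its first character
   has class EN. Exactly one of the two look-alike strings is therefore
   permitted.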

5. Troublesome Situations and Guidelines

   There are situations in which labels that satisfy the rule above will
   be displayed in a surprising fashion. The most important of these is
   the case where a label ending in a character with Bidi property AL,
   AN, or R occurs before a label beginning with a character of Bidi
   property EN. In that case, the number will appear to move into the
   label containing the right-to-left character, violating the Character
   Grouping requirement.

   If the label that occurs after the right-to-left label itself
   satisfies the Bidi criterion, the requirements will be satisfied in
   all cases (this is the reason why the criterion talks about strings
   containing L in some cases). However, the IDNABIS WG concluded that
   this could not be required for several reasons:

   o  There is a large current deployment of ASCII domain names starting
      with digits. These cannot possibly be invalidated.

   o  Domain names are often constructed piecemeal, for instance, by
      combining a string with the content of a search list. This may
      occur after IDNA processing, and thus in part of the code that is
      not IDNA-aware, making detection of the undesirable combination
      impossible.

   o  Even if a label is registered under a "safe" label, there may be a
      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
      label, thus creating seemingly valid names that would not satisfy
      the criterion.

   o  Wildcards create the odd situation where a label is "valid" (can
      be looked up successfully) without the zone owner knowing that
      this label exists. So an owner of a zone whose name starts with a
      digit and contains a wildcard has no way of controlling whether or
      not names with RTL labels in them are looked up in his zone.

   Rather than trying to suggest rules that disallow all such
   undesirable situations, this document merely warns about the
   possibility, and leaves it to application developers to take whatever
   measures they deem appropriate to avoid problematic situations.
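   One measure an application might take is to flag the troublesome
   combination described at the start of this section when it has a
   complete name in hand. The sketch below assumes Python's standard
   "unicodedata" module; the function name and the choice to split on
   FULL STOP only are assumptions of the example, and the check follows
   the wording above literally (last character of one label, first
   character of the next).

```python
import unicodedata


def risky_label_pairs(name):
    """Return (before, after) pairs of adjacent labels, in network order,
    where `before` ends in class R, AL, or AN and `after` starts with EN."""
    labels = name.split(".")
    risky = []
    for before, after in zip(labels, labels[1:]):
        if not before or not after:
            continue  # skip empty labels, e.g. from a trailing dot
        ends_rtl = unicodedata.bidirectional(before[-1]) in ("R", "AL", "AN")
        starts_en = unicodedata.bidirectional(after[0]) == "EN"
        if ends_rtl and starts_en:
            risky.append((before, after))
    return risky
```

   An application could use a non-empty result as a reason to fall back
   to a safer presentation; what that fallback should be is outside the
   scope of this sketch.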

6. Other Issues in Need of Resolution

   This document concerns itself only with the rules that are needed
   when dealing with domain names with characters that have differing
   Bidi properties, and considers characters only in terms of their Bidi
   properties. All other issues with scripts that are written from
   right to left must be considered in other contexts.

   One such issue is the need to keep numbers separate. Several scripts
   are used with multiple sets of numbers -- most commonly they use
   Latin numbers and a script-specific set of numbers, but in the case
   of Arabic, there are two sets of "Arabic-Indic" digits involved.

   The algorithm in this document disallows occurrences of AN-class
   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
   EN-class characters (which includes "European" digits, U+0030 to
   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
   does not help in preventing the mixing of, for instance, Bengali
   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
   both of which have Bidi class L. A registry or script community that
   wishes to create rules restricting the mixing of digits in a label
   will be able to specify these restrictions at the registry level.
   Some rules are also specified at the protocol level.
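   As an illustration of what a registry-level restriction could look
   like, the sketch below detects labels that mix decimal digits from
   more than one digit set. It assumes Python's standard "unicodedata"
   module and identifies a digit set by the code point of its zero,
   computed as the code point minus the digit's decimal value. The
   function names and the policy itself are hypothetical; this document
   does not define such a rule.

```python
import unicodedata


def digit_sets(label):
    """Return the zero code points of the decimal-digit sets used in label."""
    zeros = set()
    for ch in label:
        value = unicodedata.decimal(ch, None)  # None for non-decimal characters
        if value is not None:
            zeros.add(ord(ch) - value)
    return zeros


def mixes_digit_sets(label):
    """True if the label uses decimal digits from more than one set."""
    return len(digit_sets(label)) > 1


if __name__ == "__main__":
    # BENGALI DIGIT ONE followed by GUJARATI DIGIT TWO: both have class L,
    # so the Bidi rule does not object, but the digit sets differ.
    label = "\u09e7\u0ae8"
    print([unicodedata.bidirectional(ch) for ch in label])
    print(mixes_digit_sets(label))
```

   The same helper also reports the Arabic-Indic and extended
   Arabic-Indic digits as two distinct sets, a combination that the Bidi
   rule already excludes through the AN and EN classes.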

   Another set of issues concerns the proper display of IDNs with a
   mixture of LTR and RTL labels, or only RTL labels.

   It is unrealistic to expect that applications will display domain
   names using embedded formatting codes between their labels (for one
   thing, no reliable algorithms for identifying domain names in running
   text exist); thus, the display order will be determined by the Bidi
   algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be
735 displayed in the order 2R.1R.ltr in an LTR context, which might
736 surprise someone expecting to see labels displayed in hierarchical
737 order. People used to working with text that mixes LTR and RTL
738 strings might not be so surprised by this. Again, this memo does not
739 attempt to suggest a solution to this problem.
740
7417. Compatibility Considerations
742
7437.1. Backwards Compatibility Considerations
744
745 As with any change to an existing standard, it is important to
746 consider what happens with existing implementations when the change
747 is introduced. Some troublesome cases include:
748
749 o An old program used to input the newly allowed label. If the old
750 program checks the input against RFC 3454, some labels will not be
751 allowed, and domain names containing those labels will remain
752 inaccessible.
753
754 o An old program is asked to display the newly allowed label, and
755 checks it against RFC 3454 before displaying. The program will
756 perform some kind of fallback, most likely displaying the label in
757 A-label form.
758
759 o An old program tries to display the newly allowed label. If the
760 old program has code for displaying the last character of a label
761 that is different from the code used to display the characters in
762 the middle of the label, the display may be inconsistent and cause
763 confusion.
764
765 One particular example of the last case is if a program chooses to
766 examine the last character (in network order) of a string in order to
767 determine its directionality, rather than its first. If it finds an
768 NSM character and tries to display the string as if it was a
769 left-to-right string, the resulting display may be interesting, but
770 not useful.
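
   The difference between the two checks can be sketched as follows.
   This is an illustration only; the function names are hypothetical,
   and the mapping from Bidi class to direction is deliberately
   reduced to the strong classes L, R, and AL.

```python
import unicodedata

# Strong Bidi classes and the direction each one implies.
STRONG = {"L": "LTR", "R": "RTL", "AL": "RTL"}


def direction_from_first(label):
    """Direction implied by the first character in network order."""
    return STRONG.get(unicodedata.bidirectional(label[0]), "unknown")


def direction_from_last(label):
    """Direction implied by the last character only, as the fragile
    program described above might compute it."""
    return STRONG.get(unicodedata.bidirectional(label[-1]), "unknown")


# THAANA LETTER HAA (U+0780, class AL) followed by
# THAANA ABAFILI (U+07A6, class NSM).
label = "\u0780\u07a6"
print(direction_from_first(label), direction_from_last(label))
```

   A program that relies on the last character alone obtains no strong
   direction for such a label; if it then falls back to left-to-right
   display, it produces the unhelpful result described above.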
771
772 The editors believe that these cases will have a less harmful impact
773 in practice than continuing to deny the use of words from the
774 languages for which these strings are necessary as IDN labels.
775
776 This specification does not forbid using leading European digits in
777 ASCII-only labels, since this would conflict with a large installed
778 base of such labels, and would increase the scope of the
779 specification from RTL labels to all labels. The harm resulting from
780 this limitation of scope is described in Section 5. Registries and
781 private zone managers can check for this particular condition before
782 they allow registration of any RTL label. Generally, it is best to
791 disallow registration of any right-to-left strings in a zone where
792 the label at the level above begins with a digit.
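
   A registry-side check of this kind might look like the following
   sketch.  It assumes that an RTL label is one containing at least
   one character of Bidi class R, AL, or AN, as defined elsewhere in
   this memo, and that the "label at the level above" is the leftmost
   label of the zone name.  The function names and the representation
   of the zone as a dot-separated string of U-labels are hypothetical.

```python
import unicodedata


def is_rtl_label(label):
    """True if the label contains a character of class R, AL, or AN."""
    return any(
        unicodedata.bidirectional(c) in {"R", "AL", "AN"} for c in label
    )


def begins_with_european_digit(label):
    """True if the first character has Bidi class EN (European number)."""
    return bool(label) and unicodedata.bidirectional(label[0]) == "EN"


def registration_allowed(new_label, zone_name):
    """Refuse an RTL label in a zone whose own label begins with a digit.

    `zone_name` is assumed to be a dot-separated sequence of U-labels,
    for example "4u.example"; its leftmost label is the one that would
    be adjacent to `new_label` in the resulting domain name.
    """
    parent_label = zone_name.split(".")[0]
    if is_rtl_label(new_label) and begins_with_european_digit(parent_label):
        return False
    return True


print(registration_allowed("\u05d0\u05d1", "4u.example"))
print(registration_allowed("\u05d0\u05d1", "example.com"))
```

   The check only covers the adjacency condition discussed here; it
   does not replace testing the new label against the Bidi rule itself.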
793
7.2.  Forward Compatibility Considerations
795
796 This text is intentionally specified strictly in terms of the Unicode
797 Bidi properties. The determination that the condition is sufficient
798 to fulfill the criteria depends on the Unicode Bidi algorithm; it is
799 unlikely that drastic changes will be made to this algorithm.
800
801 However, the determination of validity for any string depends on the
802 Unicode Bidi property values, which are not declared immutable by the
803 Unicode Consortium. Furthermore, the behavior of the algorithm for
804 any given character is likely to be linguistically and culturally
805 sensitive, so while it should occur rarely, it is possible that later
806 versions of the Unicode Standard may change the Bidi properties
807 assigned to certain Unicode characters.
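
   The dependency on the property data can be made visible with a
   short sketch.  It is not a solution to the problem and is not part
   of this specification; it merely records which version of the
   Unicode Character Database supplied the Bidi classes used when a
   label was examined.  The function names are hypothetical.

```python
import unicodedata


def bidi_snapshot(label):
    """Return the Unicode data version in use and the Bidi class of
    each character of the label."""
    classes = tuple(unicodedata.bidirectional(c) for c in label)
    return unicodedata.unidata_version, classes


def needs_recheck(label, stored_snapshot):
    """True if the stored Bidi classes differ from those reported by
    the property data currently in use."""
    return bidi_snapshot(label)[1] != stored_snapshot[1]


version, classes = bidi_snapshot("\u05d0b1")
print(version, classes)
```

   A snapshot stored alongside a validity decision would allow that
   decision to be revisited if a later version of the property data
   assigned different classes to the same characters.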
808
809 This memo does not propose a solution for this problem.
810
8.  Security Considerations
812
813 The display behavior of mixed-direction text can be extremely
814 surprising to users who are not used to it; for instance, cut and
815 paste of a piece of text can cause the text to display differently at
816 the destination, if the destination is in another directionality
817 context, and adding a character in one place of a text can cause
818 characters some distance from the point of insertion to change their
819 display position. This is, however, not a phenomenon unique to the
820 display of domain names.
821
822 The new IDNA protocol, and particularly these new Bidi rules, will
823 allow some strings to be used in IDNA contexts that are not allowed
824 today. It is possible that differences in the interpretation of
825 labels between implementations of IDNA2003 and IDNA2008 could pose a
826 security risk, but it is difficult to envision any specific
827 instantiation of this.
828
829 Any rational attempt to compute, for instance, a hash over an
830 identifier processed by IDNA would use network order for its
831 computation, and thus be unaffected by the new rules proposed here.
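
   A minimal sketch of such a computation follows.  The choice of
   SHA-256 and of the UTF-8 encoding is an assumption made here for
   illustration and is not taken from this memo; the function name
   identifier_digest is hypothetical.

```python
import hashlib


def identifier_digest(name):
    """SHA-256 digest over the UTF-8 bytes of a name held in network
    (logical) order."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# The same name in network order and as it might appear on screen in
# an LTR context.  Only the first form should ever be hashed.
network_order = "\u05d0\u05d1.\u05d2\u05d3.com"
display_order = "\u05d3\u05d2.\u05d1\u05d0.com"

print(identifier_digest(network_order))
print(identifier_digest(network_order) == identifier_digest(display_order))
```

   Because the digest is taken over the stored character sequence and
   not over anything derived from its rendering, it does not depend on
   how a given display routine orders the characters.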
832
833 While it is not believed to pose a problem, if display routines had
834 been written with specific knowledge of the RFC 3454 IDNA
835 prohibitions, it is possible that the potential problems noted under
836 "Backwards Compatibility Considerations" could cause new kinds of
837 confusion.
838
9.  Acknowledgements
848
849 While the listed editors held the pen, this document represents the
850 joint work and conclusions of an ad hoc design team. In addition to
851 the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
852 Faltstrom, and John Klensin. Many further specific contributions and
853 helpful comments were received from the people listed below, and
854 others who have contributed to the development and use of the IDNA
855 protocols.
856
857 The particular formulation of the Bidi rule in Section 2 was
858 suggested by Matitiahu Allouche.
859
860 The team wishes, in particular, to thank Roozbeh Pournader for
861 calling its attention to the issue with the Thaana script, Paul
862 Hoffman for pointing out the need to be explicit about backwards
863 compatibility considerations, Ken Whistler for suggesting the basis
864 of the formalized "Character Grouping" requirement, Mark Davis for
865 commentary, Erik van der Poel for careful review, comments, and
866 verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
867 Resnick for reviews, and Vint Cerf for chairing the working group and
868 contributing massively to getting the documents finished.
869
10.  References
871
10.1.  Normative References
873
874 [RFC5890] Klensin, J., "Internationalized Domain Names for
875 Applications (IDNA): Definitions and Document
876 Framework", RFC 5890, August 2010.
877
878 [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
879 Unicode Bidirectional Algorithm", September 2009,
880 <http://www.unicode.org/reports/tr9/>.
881
   [Unicode52]     The Unicode Consortium, "The Unicode Standard,
                   Version 5.2.0", Mountain View, CA: The Unicode
                   Consortium, 2009, ISBN 978-1-936213-00-9,
                   <http://www.unicode.org/versions/Unicode5.2.0/>.
887
10.2.  Informative References
904
905 [RFC2672] Crawford, M., "Non-Terminal DNS Name Redirection",
906 RFC 2672, August 1999.
907
908 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
909 Internationalized Strings ("stringprep")", RFC 3454,
910 December 2002.
911
912 [RFC5891] Klensin, J., "Internationalized Domain Names in
913 Applications (IDNA): Protocol", RFC 5891, August 2010.
914
   [SYO]           "The Standardized Yiddish Orthography: Rules of
                   Yiddish Spelling", 6th ed., New York, ISBN
                   0-914512-25-0, 1999.
918
Authors' Addresses
920
921 Harald Tveit Alvestrand (editor)
922 Google
923 Beddingen 10
924 Trondheim, 7014
925 Norway
926
927 EMail: harald@alvestrand.no
928
929
930 Cary Karp
931 Swedish Museum of Natural History
932 Frescativ. 40
933 Stockholm, 10405
934 Sweden
935
936 Phone: +46 8 5195 4055
938 EMail: ck@nic.museum