Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.
Request for Comments: 5893                                        Google
Category: Standards Track                                        C. Karp
ISSN: 2070-1721                        Swedish Museum of Natural History
                                                             August 2010


                       Right-to-Left Scripts for
         Internationalized Domain Names for Applications (IDNA)

Abstract

   The use of right-to-left scripts in Internationalized Domain Names
   (IDNs) has presented several challenges. This memo provides a new
   Bidi rule for Internationalized Domain Names for Applications (IDNA)
   labels, based on the encountered problems with some scripts and some
   shortcomings in the 2003 IDNA Bidi criterion.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF). It represents the consensus of the IETF community. It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG). Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5893.

Copyright Notice

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Alvestrand & Karp            Standards Track                    [Page 1]

RFC 5893                   IDNA Right to Left                August 2010

Table of Contents

   1. Introduction
      1.1. Purpose and Applicability
      1.2. Background and History
      1.3. Structure of the Rest of This Document
      1.4. Terminology
   2. The Bidi Rule
   3. The Requirement Set for the Bidi Rule
   4. Examples of Issues Found with RFC 3454
      4.1. Dhivehi
      4.2. Yiddish
      4.3. Strings with Numbers
   5. Troublesome Situations and Guidelines
   6. Other Issues in Need of Resolution
   7. Compatibility Considerations
      7.1. Backwards Compatibility Considerations
      7.2. Forward Compatibility Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

1.1. Purpose and Applicability

   The purpose of this document is to establish a rule that can be
   applied to Internationalized Domain Name (IDN) labels in Unicode form
   (U-labels) containing characters from scripts that are written from
   right to left. It is part of the revised IDNA protocol [RFC5891].

   When labels satisfy the rule, and when certain other conditions are
   satisfied, there is only a minimal chance of these labels being
   displayed in a confusing way by the Unicode bidirectional display
   algorithm.

   The other normative documents in the IDNA2008 document set establish
   criteria for valid labels, including listing the permitted
   characters. This document establishes additional validity criteria
   for labels in scripts normally written from right to left.

   This specification is not intended to place any requirements on
   domain names that do not contain characters from such scripts.

1.2. Background and History

   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
   following statement in its Section 6 on the Bidi algorithm:

      3) If a string contains any RandALCat character, a RandALCat
      character MUST be the first character of the string, and a
      RandALCat character MUST be the last character of the string.

   (A RandALCat character is a character with unambiguously
   right-to-left directionality.)

   The reasoning behind this prohibition was to ensure that every
   component of a displayed domain name has an unambiguously preferred
   direction. However, this made certain words in languages written
   with right-to-left scripts invalid as IDN labels, and in at least one
   case (Dhivehi) meant that all the words of an entire language were
   forbidden as IDN labels.

   This is illustrated below with examples taken from the Dhivehi and
   Yiddish languages, as written with the Thaana and Hebrew scripts,
   respectively.

   RFC 3454 did not explicitly state the requirement to be fulfilled.
   Therefore, it is impossible to determine whether a simple relaxation
   of the rule would continue to fulfill the requirement.

   While this document specifies rules quite different from RFC 3454,
   most reasonable labels that were allowed under RFC 3454 will also be
   allowed under this specification (the most important example of
   non-permitted labels being labels that mix Arabic and European digits
   (AN and EN) inside an RTL label, and labels that use AN in an LTR
   label -- see Section 1.4 for terminology), so the operational impact
   of using the new rule in the updated IDNA specification is limited.

1.3. Structure of the Rest of This Document

   Section 2 defines a rule, the "Bidi rule", which can be used on a
   domain name label to check how safe it is to use in a domain name of
   possibly mixed directionality. The primary initial use of this rule
   is as part of the IDNA2008 protocol [RFC5891].

   Section 3 sets out the requirements for defining the Bidi rule.

   Section 4 gives detailed examples that serve as justification for the
   new rule.

   Section 5 to Section 8 describe various situations that can occur
   when dealing with domain names with characters of different
   directionality.

   Only Section 1.4 and Section 2 are normative.

1.4. Terminology

   The terminology used to describe IDNA concepts is defined in the
   Definitions document [RFC5890].

   The terminology used for the Bidi properties of Unicode characters is
   taken from the Unicode Standard [Unicode52].

   The Unicode Standard specifies a Bidi property for each character.
   That property controls the character's behavior in the Unicode
   bidirectional algorithm [Unicode-UAX9]. For reference, here are the
   values that the Unicode Bidi property can have:

   o  L - Left to right - most letters in LTR scripts

   o  R - Right to left - most letters in non-Arabic RTL scripts

   o  AL - Arabic letters - most letters in the Arabic script

   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)

   o  ES - European Number Separator (+ and -)

   o  ET - European Number Terminator (currency symbols, the hash sign,
      the percent sign and so on)

   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
      not the Extended Arabic-Indic numbers

   o  CS - Common Number Separator (. , / : et al)

   o  NSM - Nonspacing Mark - most combining accents

   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)

   o  B - Paragraph Separator

   o  S - Segment Separator

   o  WS - Whitespace, including the SPACE character

   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
      characters" and are not used in IDNA labels.
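
   As an informal illustration (not part of this specification), the
   Bidi property of individual characters can be inspected with
   Python's standard unicodedata module. The sample characters and the
   helper name bidi_class are chosen here purely for illustration, and
   the values reported come from the Unicode version bundled with the
   Python runtime, which may be newer than [Unicode52].

```python
import unicodedata

# Illustrative sample characters, keyed by character, with their names.
SAMPLES = {
    "a": "LATIN SMALL LETTER A",
    "\u05d0": "HEBREW LETTER ALEF",
    "\u0627": "ARABIC LETTER ALEF",
    "5": "DIGIT FIVE",
    "\u0665": "ARABIC-INDIC DIGIT FIVE",
    "\u06f5": "EXTENDED ARABIC-INDIC DIGIT FIVE",
    "-": "HYPHEN-MINUS",
    ".": "FULL STOP",
    "\u05b8": "HEBREW POINT QAMATS",
}


def bidi_class(ch: str) -> str:
    """Return the Bidi property abbreviation (L, R, AL, EN, ...) of ch."""
    return unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for ch, name in SAMPLES.items():
        print(f"U+{ord(ch):04X} {name}: {bidi_class(ch)}")
```

   The abbreviations printed are the same ones used in the list above,
   so the output can be read directly against it.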

   In this memo, we use "network order" to describe the sequence of
   characters as transmitted on the wire or stored in a file; the terms
   "first", "next", "previous", "beginning", "end", "before", and
   "after" are used to refer to the relationship of characters and
   labels in network order.

   We use "display order" to talk about the sequence of characters as
   imaged on a display medium; the terms "left" and "right" are used to
   refer to the relationship of characters and labels in display order.

   Most of the time, the examples use the abbreviations for the Unicode
   Bidi classes to denote the directionality of the characters; the
   example string CS L consists of one character of class CS and one
   character of class L. In some examples, the convention that
   uppercase characters are of class R or AL, and lowercase characters
   are of class L is used -- thus, the example string ABC.abc would
   consist of three right-to-left characters and three left-to-right
   characters.

   The directionality of such examples is determined by context -- for
   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
   first example string is in network order, the second example string
   is in display order.

   The term "paragraph" is used in the sense of the Unicode Bidi
   specification [Unicode-UAX9]. It means "a block of text that has an
   overall direction, either left to right or right to left",
   approximately; see the "Unicode Bidirectional Algorithm"
   [Unicode-UAX9] for details.

   "RTL" and "LTR" are abbreviations for "right to left" and "left to
   right", respectively.

   An RTL label is a label that contains at least one character of type
   R, AL, or AN.

   An LTR label is any label that is not an RTL label.

   A "Bidi domain name" is a domain name that contains at least one RTL
   label. (Note: This definition includes domain names containing only
   dots and right-to-left characters. Providing a separate category of
   "RTL domain names" would not make this specification simpler, so it
   has not been done.)
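
   The three definitions above can be sketched in a few lines of
   Python. This is a minimal sketch under stated assumptions: the
   function names are hypothetical, the domain name is assumed to be
   given as U-labels separated by FULL STOP (U+002E) only, and the Bidi
   properties come from the Unicode version bundled with the Python
   runtime.

```python
import unicodedata

# Bidi properties whose presence makes a label an RTL label.
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label: str) -> bool:
    """True if the label contains at least one R, AL, or AN character."""
    return any(unicodedata.bidirectional(c) in RTL_CLASSES for c in label)


def is_ltr_label(label: str) -> bool:
    """An LTR label is any label that is not an RTL label."""
    return not is_rtl_label(label)


def is_bidi_domain_name(name: str) -> bool:
    """True if at least one label of the domain name is an RTL label.

    Assumption: labels are separated by U+002E only.
    """
    return any(is_rtl_label(label) for label in name.split("."))
```

   For example, is_bidi_domain_name("example.com") is False, while a
   name whose first label is HEBREW LETTER ALEF followed by HEBREW
   LETTER BET is a Bidi domain name even if every other label is ASCII.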

2. The Bidi Rule

   The following rule, consisting of six conditions, applies to labels
   in Bidi domain names. The requirements that this rule satisfies are
   described in Section 3. All of the conditions must be satisfied for
   the rule to be satisfied.

   1.  The first character must be a character with Bidi property L, R,
       or AL. If it has the R or AL property, it is an RTL label; if it
       has the L property, it is an LTR label.

   2.  In an RTL label, only characters with the Bidi properties R, AL,
       AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.

   3.  In an RTL label, the end of the label must be a character with
       Bidi property R, AL, EN, or AN, followed by zero or more
       characters with Bidi property NSM.

   4.  In an RTL label, if an EN is present, no AN may be present, and
       vice versa.

   5.  In an LTR label, only characters with the Bidi properties L, EN,
       ES, CS, ET, ON, BN, or NSM are allowed.

   6.  In an LTR label, the end of the label must be a character with
       Bidi property L or EN, followed by zero or more characters with
       Bidi property NSM.

   The following guarantees can be made based on the above:

   o  In a domain name consisting of only labels that satisfy the rule,
      the requirements of Section 3 are satisfied. Note that even LTR
      labels and pure ASCII labels have to be tested.

   o  In a domain name consisting of only LDH labels (as defined in the
      Definitions document [RFC5890]) and labels that satisfy the rule,
      the requirements of Section 3 are satisfied as long as a label
      that starts with an ASCII digit does not come after a
      right-to-left label.

   No guarantee is given for other combinations.
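
   The six conditions can be sketched as a single check on one label.
   The following is an illustrative, non-normative sketch: the function
   name satisfies_bidi_rule is hypothetical, the label is assumed to be
   a U-label that has already passed the other IDNA2008 validity
   checks, and the Bidi properties come from the Unicode version
   bundled with the Python runtime rather than from [Unicode52].

```python
import unicodedata

# Bidi properties allowed anywhere in an RTL label (condition 2)
# and in an LTR label (condition 5).
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Check one label against conditions 1 to 6 of the Bidi rule."""
    if not label:
        return False
    classes = [unicodedata.bidirectional(c) for c in label]

    # Condition 1: the first character decides RTL versus LTR.
    first = classes[0]
    if first in ("R", "AL"):
        rtl = True
    elif first == "L":
        rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character that is not a
    # trailing NSM. The first character is never NSM here, so at
    # least one character remains after skipping trailing NSMs.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if rtl:
        if not set(classes) <= RTL_ALLOWED:          # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):      # condition 3
            return False
        if "EN" in classes and "AN" in classes:      # condition 4
            return False
    else:
        if not set(classes) <= LTR_ALLOWED:          # condition 5
            return False
        if last not in ("L", "EN"):                  # condition 6
            return False
    return True
```

   Under these assumptions, a Hebrew label ending in a combining point
   passes, while a label starting with a digit fails condition 1 and a
   label mixing AN and EN digits after an R character fails
   condition 4.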

3. The Requirement Set for the Bidi Rule

   This document, unlike RFC 3454 [RFC3454], provides an explicit
   justification for the Bidi rule, and states a set of requirements
   against which it is possible to test whether or not a modified rule
   fulfills them.

   All the text in this document assumes that text containing the labels
   under consideration will be displayed using the Unicode bidirectional
   algorithm [Unicode-UAX9].

   The requirements proposed are these:

   o  Label Uniqueness: No two labels, when presented in display order
      in the same paragraph, should have the same sequence of characters
      without also having the same sequence of characters in network
      order, both when the paragraph has LTR direction and when the
      paragraph has RTL direction. (This is the criterion that is
      explicit in RFC 3454). (Note that a label displayed in an RTL
      paragraph may display the same as a different label displayed in
      an LTR paragraph and still satisfy this criterion.)

   o  Character Grouping: When displaying a string of labels, using the
      Unicode Bidi algorithm to reorder the characters for display, the
      characters of each label should remain grouped between the
      characters delimiting the labels, both when the string is embedded
      in a paragraph with LTR direction and when it is embedded in a
      paragraph with RTL direction.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfill within the constraints of the
   Unicode bidirectional algorithm. These include:

   o  The appearance of a label should be unaffected by its embedding
      context. This proved impossible even for ASCII labels; the label
      "123-A" will have a different display order in an RTL context than
      in an LTR context. (This particular example is, however,
      disallowed anyway.)

   o  The sequence of labels should be consistent with network order.
      This proved impossible -- a domain name consisting of the labels
      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
      an LTR context. (In an RTL context, it will be displayed as
      L4.R3.R2.L1).

   o  No two domain names should be displayed the same, even under
      differing directionality. This was shown to be unsound, since the
      domain name (in network order) ABC.abc will have display order
      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
      domain name (network) abc.ABC will have display order abc.CBA in
      an LTR context and CBA.abc in an RTL context.

   One possible requirement was thought to be problematic, but turned
   out to be satisfied by a string that obeys the proposed rules:

   o  The Character Grouping requirement should be satisfied when
      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
      same paragraph (outside of the labels). Because these controls
      affect presentation order in non-obvious ways, by affecting the
      "sor" and "eor" properties of the Unicode Bidi algorithm, the
      conditions above require extra testing in order to figure out
      whether or not they influence the display of the domain name.
      Testing found that for the strings allowed under the rule
      presented in this document, directional controls do not influence
      the display of the domain name.

   This is still not stated as a requirement, since it did not seem as
   important as the stated requirements, but it is useful to know that
   Bidi domain names where the labels satisfy the rule have this
   property.

   In the following descriptions, first-level bullets are used to
   indicate rules or normative statements; second-level bullets are
   commentary.

   The Character Grouping requirement can be more formally stated as:

   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi
      properties CS, WS, ON. (These are commonly used to delimit labels
      -- both the FULL STOP and the space are included. They are not
      allowed in domain labels.)

      *  ET, though it commonly occurs next to domain names in practice,
         is problematic: the context R CS L EN ET (for instance A.a1%)
         makes the label L EN not satisfy the character grouping
         requirement.

      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
         used as a delimiter (for instance, the plus sign). It is left
         out here.

   o  Let "unproblematic label" be a label that either satisfies the
      requirements or does not contain any character with the Bidi
      properties R, AL, or AN and does not begin with a character with
      the Bidi property EN. (Informally, "it does not start with a
      number".)

   A label X satisfies the Character Grouping requirement when, for any
   two characters D1 and D2 taken from Delimiterchars, and for any S1
   and S2 that are each either an unproblematic label or an empty
   string, the following holds true:

   If the string formed by concatenating S1, D1, X, D2, and S2 is
   reordered according to the Bidi algorithm, then all the characters of
   X in the reordered string are between D1 and D2, and no other
   characters are between D1 and D2, both if the overall paragraph
   direction is LTR and if the overall paragraph direction is RTL.

   Note that the definition is self-referential, since S1 and S2 are
   constrained to be "legal" by this definition. This makes testing
   changes to proposed rules a little complex, but does not create
   problems for testing whether or not a given proposed rule satisfies
   the criterion.

   The empty-string ("zero-length") case for S1 or S2 represents the
   case where a domain name is next to something that isn't a domain
   name, separated by a delimiter character.

   Note about the position of BN: The Unicode bidirectional algorithm
   specifies that a BN has an effect on the adjoining characters in
   network order, not in display order, and is therefore treated as if
   removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
   X9 and Section 5.3). Therefore, the question of "what position does
   a BN have after reordering" is not meaningful. It has been ignored
   while developing the rules here.

   The Label Uniqueness requirement can be formally stated as:

   If two non-identical labels X and Y, embedded as for the test above,
   displayed in paragraphs with the same directionality, are reordered
   by the Bidi algorithm into the same sequence of code points, the
   labels X and Y cannot both be legal.


4. Examples of Issues Found with RFC 3454

4.1. Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script. This script displays some of the characteristics of
   the Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters. This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks. Every Dhivehi word therefore ends with a combining
   mark.

   The word for "computer", which is romanized as "konpeetaru", is
   written with the following sequence of Unicode code points:

      U+0786 THAANA LETTER KAAFU (AL)

      U+07AE THAANA OBOFILI (NSM)

      U+0782 THAANA LETTER NOONU (AL)

      U+07B0 THAANA SUKUN (NSM)

      U+0795 THAANA LETTER PAVIYANI (AL)

      U+07A9 THAANA EEBEEFILI (NSM)

      U+0793 THAANA LETTER TAVIYANI (AL)

      U+07A6 THAANA ABAFILI (NSM)

      U+0783 THAANA LETTER RAA (AL)

      U+07AA THAANA UBUFILI (NSM)

   The directionality class of the final code point, U+07AA, in the
   Unicode database [Unicode52] is NSM (Nonspacing Mark), which is not
   R or AL; a conformant implementation of the IDNA2003 algorithm will
   say that "this is not in RandALCat" and refuse to encode the string.
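
   The rejection can be reproduced with a small sketch of requirement 3
   from Section 6 of RFC 3454, as quoted in Section 1.2. This is an
   approximation offered for illustration only: the function names are
   hypothetical, and RandALCat membership is approximated by the Bidi
   properties R and AL from the Unicode version bundled with the Python
   runtime rather than by the fixed tables of RFC 3454.

```python
import unicodedata


def is_randalcat(ch: str) -> bool:
    """Approximate RandALCat membership by Bidi property R or AL."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def passes_rfc3454_requirement_3(s: str) -> bool:
    """If any RandALCat character is present, the first and the last
    character of the string must both be RandALCat characters."""
    if not any(is_randalcat(c) for c in s):
        return True
    return is_randalcat(s[0]) and is_randalcat(s[-1])


# "konpeetaru", using the code points listed above.
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"

if __name__ == "__main__":
    print(unicodedata.bidirectional(KONPEETARU[-1]))  # NSM
    print(passes_rfc3454_requirement_3(KONPEETARU))   # False
```

   Because every Dhivehi word ends with a combining mark, the same
   failure occurs for any Dhivehi word used as a label.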

4.2. Yiddish

   Yiddish is one of several languages written with the Hebrew script
   (others include Hebrew and Ladino). This is basically a consonantal
   alphabet (also termed an "abjad"), but Yiddish is written using an
   extended form that is fully vocalic. The vowels are indicated in
   several ways, one of which is by repurposing letters that are
   consonants in Hebrew. Other letters are used both as vowels and
   consonants, with combining marks, called "points", used to
   differentiate between them. Finally, some base characters can
   indicate several different vowels, which are also disambiguated by
   combining marks. Pointed characters can appear in word-final
   position and may therefore also be needed at the end of labels. This
   is not an invariable attribute of a Yiddish string and there is thus
   greater latitude here than there is with Dhivehi.

   The organization now known as the "YIVO Institute for Jewish
   Research" developed orthographic rules for modern Standard Yiddish
   during the 1930s on the basis of work conducted in several venues
   since earlier in that century. These are given in "The Standardized
   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
   as normatively descriptive of modern Standard Yiddish in any context
   where that notion is deemed relevant. They have been applied
   exclusively in all formal Yiddish dictionaries published since their
   establishment, and are similarly dominant in academic and
   bibliographic regards.

   It therefore appears appropriate for this repertoire also to be
   supported fully by IDNA. This presents no difficulty with characters
   in initial and medial positions, but pointed characters are regularly
   used in final position as well. All of the characters in the SYO
   repertoire appear in both marked and unmarked form with one
   exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
   to the Latin letter "f". There is, however, a separate unpointed
   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
   character when it appears in final position. The constraint on the
   use of the SYO repertoire resulting from the proscription of
   combining marks at the end of RTL strings thus reduces to nothing
   more, or less, than the equivalent of saying that a string of Latin
   characters cannot end with the letter "p". It must also be noted
   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
   characteristic of almost all traditional Yiddish orthographies that
   predate (or remain in use in parallel to) the SYO, being the first
   pointed character to appear in any of them.

   A more general instantiation of the basic problem can be seen in the
   representation of the YIVO acronym. This acronym is written with the
   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
   QAMATS are combining points. The Unicode code points are:

      U+05D9 HEBREW LETTER YOD (R)

      U+05D9 HEBREW LETTER YOD (R)

      U+05B4 HEBREW POINT HIRIQ (NSM)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D0 HEBREW LETTER ALEF (R)

      U+05B8 HEBREW POINT QAMATS (NSM)

   The directionality class of the final code point, U+05B8 HEBREW POINT
   QAMATS, in the Unicode database is NSM, which again causes the
   IDNA2003 algorithm to reject the string.

   It may also be noted that all of the combined characters mentioned
   above exist in precomposed form at separate positions in the Unicode
   chart. However, by invoking Stringprep, the IDNA2003 algorithm also
   rejects those code points, for reasons not discussed here.

4.3. Strings with Numbers

   By requiring that both the first and the last character of a string
   be members of category R or AL, the Stringprep specification
   [RFC3454] prohibited a string containing right-to-left characters
   from ending with a number.

   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
   ALEF. Displayed in an LTR context, the first one will be displayed
   from left to right as 5 ALEF (with the 5 being considered right to
   left because of the leading ALEF), while the second, 5 ALEF, will be
   displayed in exactly the same order (5 taking the direction from
   context). Clearly, only one of those should be permitted as a
   registered label, but barring them both seems unnecessary.
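
   As an illustration of how the rule in Section 2 separates these two
   strings, the sketch below applies only conditions 1 and 3 for RTL
   labels to each of them. The helper names are hypothetical, a
   non-empty label is assumed, and the Bidi properties come from the
   Unicode version bundled with the Python runtime.

```python
import unicodedata


def first_and_last_classes(label: str) -> tuple[str, str]:
    """Bidi properties of the first character and of the last
    character that is not a trailing NSM (label must be non-empty)."""
    classes = [unicodedata.bidirectional(c) for c in label]
    while len(classes) > 1 and classes[-1] == "NSM":
        classes.pop()
    return classes[0], classes[-1]


def rtl_label_passes_conditions_1_and_3(label: str) -> bool:
    """Condition 1 (RTL branch): first character is R or AL.
    Condition 3: label ends in R, AL, EN, or AN, then optional NSMs."""
    first, last = first_and_last_classes(label)
    return first in ("R", "AL") and last in ("R", "AL", "EN", "AN")


ALEF_5 = "\u05d05"   # HEBREW LETTER ALEF + DIGIT FIVE
FIVE_ALEF = "5\u05d0"  # DIGIT FIVE + HEBREW LETTER ALEF

if __name__ == "__main__":
    print(rtl_label_passes_conditions_1_and_3(ALEF_5))     # True
    print(rtl_label_passes_conditions_1_and_3(FIVE_ALEF))  # False
```

   ALEF 5 is accepted because it starts with an R character and ends
   with an EN character, while 5 ALEF is rejected by condition 1
   because its first character has Bidi property EN.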

5. Troublesome Situations and Guidelines

   There are situations in which labels that satisfy the rule above will
   be displayed in a surprising fashion. The most important of these is
   the case where a label ending in a character with Bidi property AL,
   AN, or R occurs before a label beginning with a character of Bidi
   property EN. In that case, the number will appear to move into the
   label containing the right-to-left character, violating the Character
   Grouping requirement.
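
   An application that wants to warn about this situation could scan
   adjacent labels for it. The following is a minimal sketch, not a
   recommended algorithm: the function names are hypothetical, labels
   are assumed to be separated by FULL STOP (U+002E) only, and skipping
   trailing NSM characters when looking for the end of a label is an
   assumption made here by analogy with conditions 3 and 6 of the Bidi
   rule.

```python
import unicodedata


def last_non_nsm_class(label: str) -> str:
    """Bidi property of the last character that is not an NSM.

    Assumption: trailing NSMs are skipped, as in conditions 3 and 6.
    Returns "" if the label has no such character.
    """
    for ch in reversed(label):
        cls = unicodedata.bidirectional(ch)
        if cls != "NSM":
            return cls
    return ""


def risky_label_pairs(domain: str) -> list[tuple[str, str]]:
    """Return adjacent (before, after) label pairs in network order
    where 'before' ends in AL, AN, or R and 'after' begins with EN."""
    labels = domain.split(".")
    risky = []
    for before, after in zip(labels, labels[1:]):
        if not before or not after:
            continue  # ignore empty labels such as a trailing root dot
        if (last_non_nsm_class(before) in ("AL", "AN", "R")
                and unicodedata.bidirectional(after[0]) == "EN"):
            risky.append((before, after))
    return risky
```

   A domain name such as ALEF BET followed by the label "1abc" is
   reported, while the same RTL label followed by "abc" is not.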

   If the label that occurs after the right-to-left label itself
   satisfies the Bidi criterion, the requirements will be satisfied in
   all cases (this is the reason why the criterion talks about strings
   containing L in some cases). However, the IDNABIS WG concluded that
   this could not be required for several reasons:

   o  There is a large current deployment of ASCII domain names starting
      with digits. These cannot possibly be invalidated.

   o  Domain names are often constructed piecemeal, for instance, by
      combining a string with the content of a search list. This may
      occur after IDNA processing, and thus in part of the code that is
      not IDNA-aware, making detection of the undesirable combination
      impossible.

   o  Even if a label is registered under a "safe" label, there may be a
      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
      label, thus creating seemingly valid names that would not satisfy
      the criterion.

   o  Wildcards create the odd situation where a label is "valid" (can
      be looked up successfully) without the zone owner knowing that
      this label exists. So an owner of a zone whose name starts with a
      digit and contains a wildcard has no way of controlling whether or
      not names with RTL labels in them are looked up in that zone.

   Rather than trying to suggest rules that disallow all such
   undesirable situations, this document merely warns about the
   possibility, and leaves it to application developers to take whatever
   measures they deem appropriate to avoid problematic situations.

6. Other Issues in Need of Resolution

   This document concerns itself only with the rules that are needed
   when dealing with domain names with characters that have differing
   Bidi properties, and considers characters only in terms of their Bidi
   properties. All other issues with scripts that are written from
   right to left must be considered in other contexts.

   One such issue is the need to keep numbers separate. Several scripts
   are used with multiple sets of numbers -- most commonly they use
   Latin numbers and a script-specific set of numbers, but in the case
   of Arabic, there are two sets of "Arabic-Indic" digits involved.

   The algorithm in this document disallows occurrences of AN-class
   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
   EN-class characters (which includes "European" digits, U+0030 to
   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
   does not help in preventing the mixing of, for instance, Bengali
   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
   both of which have Bidi class L. A registry or script community that
   wishes to create rules restricting the mixing of digits in a label
   will be able to specify these restrictions at the registry level.
   Some rules are also specified at the protocol level.

   Another set of issues concerns the proper display of IDNs with a
   mixture of LTR and RTL labels, or only RTL labels.

   It is unrealistic to expect that applications will display domain
   names using embedded formatting codes between their labels (for one
   thing, no reliable algorithms for identifying domain names in running
   text exist); thus, the display order will be determined by the Bidi
   algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be
   displayed in the order 2R.1R.ltr in an LTR context, which might
   surprise someone expecting to see labels displayed in hierarchical
   order. People used to working with text that mixes LTR and RTL
   strings might not be so surprised by this. Again, this memo does not
   attempt to suggest a solution to this problem.

740

7417. Compatibility Considerations

742

7437.1. Backwards Compatibility Considerations

744

   As with any change to an existing standard, it is important to
   consider what happens with existing implementations when the change
   is introduced. Some troublesome cases include:

   o  An old program is used to input a newly allowed label. If the old
      program checks the input against RFC 3454, some labels will not
      be allowed, and domain names containing those labels will remain
      inaccessible.

   o  An old program is asked to display a newly allowed label and
      checks it against RFC 3454 before displaying. The program will
      perform some kind of fallback, most likely displaying the label
      in A-label form.

   o  An old program tries to display a newly allowed label. If the old
      program has code for displaying the last character of a label
      that is different from the code used to display the characters in
      the middle of the label, the display may be inconsistent and
      cause confusion.

   One particular example of the last case is a program that examines
   the last character (in network order) of a string, rather than the
   first, in order to determine its directionality. If it finds an NSM
   character and tries to display the string as if it were a
   left-to-right string, the resulting display may be interesting, but
   not useful.
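
   The following sketch illustrates that last-character pitfall,
   assuming a Python environment whose unicodedata module supplies the
   Bidi classes. The function names are hypothetical, and
   direction_from_last stands in for the kind of legacy heuristic
   described above rather than for any particular program. The sample
   string is written in Thaana, where letters have Bidi class AL and
   vowel signs have class NSM.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def direction_from_first(label):
    """Classify a label by the Bidi class of its first character."""
    first = unicodedata.bidirectional(label[0])
    return "RTL" if first in RTL_CLASSES else "LTR"


def direction_from_last(label):
    """Hypothetical legacy heuristic: look only at the last character."""
    last = unicodedata.bidirectional(label[-1])
    return "RTL" if last in RTL_CLASSES else "LTR"


# Thaana letters, each followed by a vowel sign (a nonspacing mark).
label = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"

print(unicodedata.bidirectional(label[0]))   # class of the first character
print(unicodedata.bidirectional(label[-1]))  # class of the last character
print(direction_from_first(label), direction_from_last(label))
```

   Because the final character is a nonspacing mark rather than a
   strong right-to-left character, the last-character heuristic
   misjudges the label as left-to-right, while the first-character
   check does not.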

   The editors believe that these cases will have a less harmful impact
   in practice than continuing to deny the use of words from the
   languages for which these strings are necessary as IDN labels.

   This specification does not forbid using leading European digits in
   ASCII-only labels, since this would conflict with a large installed
   base of such labels, and would increase the scope of the
   specification from RTL labels to all labels. The harm resulting from
   this limitation of scope is described in Section 5. Registries and
   private zone managers can check for this particular condition before
   they allow registration of any RTL label. Generally, it is best to
   disallow registration of any right-to-left strings in a zone where
   the label at the level above begins with a digit.
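
   A registry-side check of this kind might look like the sketch
   below. It is an illustration only: the function names are
   hypothetical, an RTL label is taken to be one containing at least
   one character of Bidi class R, AL, or AN, "begins with a digit" is
   approximated as "first character has Bidi class EN", and the zone
   name is assumed to be given with the label at the level above
   leftmost.

```python
import unicodedata

RTL_LABEL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """True if the label contains any character of class R, AL, or AN."""
    return any(unicodedata.bidirectional(c) in RTL_LABEL_CLASSES
               for c in label)


def parent_begins_with_digit(zone):
    """True if the leftmost label of the zone name starts with a digit."""
    parent = zone.split(".")[0]
    # Assumption: a "digit" here means a character of Bidi class EN.
    return bool(parent) and unicodedata.bidirectional(parent[0]) == "EN"


def may_register(label, zone):
    """Refuse RTL labels directly below a label that begins with a digit."""
    return not (is_rtl_label(label) and parent_begins_with_digit(zone))


hebrew = "\u05d0\u05d1"  # two Hebrew letters, Bidi class R
print(may_register(hebrew, "1abc.example"))
print(may_register(hebrew, "abc.example"))
print(may_register("xyz", "1abc.example"))
```

   Only the first call is refused: the candidate label is
   right-to-left and the label above it begins with a digit. This
   sketch covers that one condition and nothing else; it is not a
   check of the Bidi rule itself.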

7.2. Forward Compatibility Considerations

   This text is intentionally specified strictly in terms of the Unicode
   Bidi properties. The determination that the condition is sufficient
   to fulfill the criteria depends on the Unicode Bidi algorithm; it is
   unlikely that drastic changes will be made to this algorithm.

   However, the determination of validity for any string depends on the
   Unicode Bidi property values, which are not declared immutable by the
   Unicode Consortium. Furthermore, the behavior of the algorithm for
   any given character is likely to be linguistically and culturally
   sensitive, so while it should occur rarely, it is possible that later
   versions of the Unicode Standard may change the Bidi properties
   assigned to certain Unicode characters.

   This memo does not propose a solution for this problem.
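
   Although no solution is proposed, an implementation can at least
   make the dependency visible. The sketch below, which assumes
   Python's unicodedata module and uses the hypothetical name
   bidi_snapshot, records the Unicode version of the property tables
   alongside the Bidi classes used for a label, so that a result
   computed under one version can be recognized as such later.

```python
import unicodedata


def bidi_snapshot(label):
    """Return the Unicode data version and the Bidi class of each character."""
    classes = [unicodedata.bidirectional(c) for c in label]
    return unicodedata.unidata_version, classes


version, classes = bidi_snapshot("\u05d0b1")
print(version)  # depends on the Python build in use
print(classes)
```

   Storing the version with any cached validity decision is one way to
   notice when a later Unicode release might call for re-evaluation.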

8. Security Considerations

   The display behavior of mixed-direction text can be extremely
   surprising to users who are not used to it; for instance, cut and
   paste of a piece of text can cause the text to display differently at
   the destination, if the destination is in another directionality
   context, and adding a character at one place in a text can cause
   characters some distance from the point of insertion to change their
   display position. This is, however, not a phenomenon unique to the
   display of domain names.

   The new IDNA protocol, and particularly these new Bidi rules, will
   allow some strings to be used in IDNA contexts that are not allowed
   today. It is possible that differences in the interpretation of
   labels between implementations of IDNA2003 and IDNA2008 could pose a
   security risk, but it is difficult to envision any specific
   instantiation of this.

   Any rational attempt to compute, for instance, a hash over an
   identifier processed by IDNA would use network order for its
   computation, and thus be unaffected by the new rules proposed here.
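
   As a minimal illustration of that point, the sketch below hashes a
   name over its characters in network order. The choice of SHA-256
   over the UTF-8 encoding of the Unicode form is an assumption made
   for the example (an application might equally hash the A-label
   form), and identifier_digest is a hypothetical name.

```python
import hashlib


def identifier_digest(name):
    """SHA-256 over the UTF-8 bytes of `name`, taken in network order."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# Two right-to-left labels followed by a left-to-right one, stored in
# network order. How the name is rendered never enters the computation.
name = "\u05d0\u05d1.\u05d2\u05d3.ltr"
print(identifier_digest(name))
```

   The digest depends only on the stored character sequence, so it is
   the same regardless of the directionality context in which the name
   happens to be displayed.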

   While it is not believed to pose a problem, if display routines had
   been written with specific knowledge of the RFC 3454 IDNA
   prohibitions, it is possible that the potential problems noted under
   "Backwards Compatibility Considerations" could cause new kinds of
   confusion.

9. Acknowledgements

   While the listed editors held the pen, this document represents the
   joint work and conclusions of an ad hoc design team. In addition to
   the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
   Faltstrom, and John Klensin. Many further specific contributions and
   helpful comments were received from the people listed below, and
   others who have contributed to the development and use of the IDNA
   protocols.

   The particular formulation of the Bidi rule in Section 2 was
   suggested by Matitiahu Allouche.

   The team wishes, in particular, to thank Roozbeh Pournader for
   calling its attention to the issue with the Thaana script, Paul
   Hoffman for pointing out the need to be explicit about backwards
   compatibility considerations, Ken Whistler for suggesting the basis
   of the formalized "Character Grouping" requirement, Mark Davis for
   commentary, Erik van der Poel for careful review, comments, and
   verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
   Resnick for reviews, and Vint Cerf for chairing the working group and
   contributing massively to getting the documents finished.

10. References

10.1. Normative References

   [RFC5890]       Klensin, J., "Internationalized Domain Names for
                   Applications (IDNA): Definitions and Document
                   Framework", RFC 5890, August 2010.

   [Unicode-UAX9]  The Unicode Consortium, "Unicode Standard Annex #9:
                   Unicode Bidirectional Algorithm", September 2009,
                   <http://www.unicode.org/reports/tr9/>.

   [Unicode52]     The Unicode Consortium. The Unicode Standard, Version
                   5.2.0, defined by: "The Unicode Standard, Version
                   5.2.0", (Mountain View, CA: The Unicode Consortium,
                   2009. ISBN 978-1-936213-00-9).
                   <http://www.unicode.org/versions/Unicode5.2.0/>.

10.2. Informative References

   [RFC2672]       Crawford, M., "Non-Terminal DNS Name Redirection",
                   RFC 2672, August 1999.

   [RFC3454]       Hoffman, P. and M. Blanchet, "Preparation of
                   Internationalized Strings ("stringprep")", RFC 3454,
                   December 2002.

   [RFC5891]       Klensin, J., "Internationalized Domain Names in
                   Applications (IDNA): Protocol", RFC 5891, August 2010.

   [SYO]           "The Standardized Yiddish Orthography: Rules of
                   Yiddish Spelling, 6th ed., New York, ISBN
                   0-914512-25-0", 1999.

Authors' Addresses

   Harald Tveit Alvestrand (editor)
   Google
   Beddingen 10
   Trondheim, 7014
   Norway

   EMail: harald@alvestrand.no


   Cary Karp
   Swedish Museum of Natural History
   Frescativ. 40
   Stockholm, 10405
   Sweden

   Phone: +46 8 5195 4055
   Fax:
   EMail: ck@nic.museum
