Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track

                 Unicode Format for Network Interchange

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

Abstract

   The Internet today is in need of a standardized form for the
   transmission of internationalized "text" information, paralleling the
   specifications for the use of ASCII that date from the early days of
   the ARPANET. This document specifies that format, using UTF-8 with
   normalization and specific line-ending sequences.

Table of Contents

   1. Introduction
      1.1. Requirement for a Standardized Text Stream Format
      1.2. Terminology
   2. Net-Unicode Definition
   3. Normalization
   4. Versions of Unicode
   5. Applicability and Stability of this Specification
      5.1. Use in IETF Applications Specifications
      5.2. Unicode Versions and Applicability
   6. Security Considerations
   7. Acknowledgments
   Appendix A. History and Context
   Appendix B. The ASCII NVT Definition
   Appendix C. The Line-Ending Problem
   Appendix D. A Note about Related Future Work
   References
      Normative References
      Informative References

Klensin & Padlipsky         Standards Track                     [Page 1]

RFC 5198                    Network Unicode                   March 2008

1. Introduction

1.1. Requirement for a Standardized Text Stream Format

   Historically, Internet protocols have been largely ASCII-based and
   references to "text" in protocols have assumed ASCII text and
   specifically text in Network Virtual Terminal ("NVT") or "Network
   ASCII" form (see Appendix A and Appendix B). Protocols and formats
   that have moved beyond ASCII have included arrangements to
   specifically identify the character set and often the language being
   used.

   In our more internationalized world, "text" clearly no longer equates
   unambiguously to "network ASCII". Fortunately, however, we are
   converging on Unicode [Unicode] [ISO10646] as a single international
   interchange character coding and no longer need to deal with
   per-script standards for character sets (e.g., one standard for each
   of Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
   languages that are usually considered to share a script, such as
   French, German, or Swedish). Unfortunately, though, while it is
   certainly time to define a Unicode-based text type for use as a
   common text interchange format, "use Unicode" involves even more
   ambiguity than "use ASCII" did decades ago.

   Unicode identifies each character by an integer, called its "code
   point", in the range 0-0x10ffff. These integers can be encoded into
   byte sequences for transmission in at least three standard and
   generally-recognized encoding forms, all of which are completely
   defined in The Unicode Standard and the documents cited below:

   o  UTF-8 [RFC3629] defines a variable-length encoding that may be
      applied uniformly to all code points.

   o  UTF-16 [RFC2781] encodes the range of Unicode characters whose
      code points are less than 65536 straightforwardly as 16-bit
      integers, and provides a "surrogate" mechanism for encoding larger
      code points in 32 bits.

   o  UTF-32 (also known as UCS-4) simply encodes each code point as a
      32-bit integer.
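
   As an informal illustration of the three encoding forms listed
   above (not part of the specification), the following Python sketch
   reports how many octets each form uses for a single character. The
   function name encoded_lengths is hypothetical, and the big-endian
   codecs are chosen only so that no byte order mark is counted.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of octets each encoding form uses for one character."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian codec adds no BOM
        "utf-32": len(ch.encode("utf-32-be")),  # big-endian codec adds no BOM
    }


# One ASCII letter, one Latin-1 letter, one BMP symbol, and one code
# point above 65535 that needs a surrogate pair in UTF-16.
for ch in ("A", "\u00e0", "\u20ac", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

   Only UTF-8 represents the ASCII letter in a single octet, which is
   the ASCII compatibility property that motivates its choice in this
   document.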

   Older forms and nomenclature, such as the 16-bit UCS-2, are now
   strongly discouraged.

   As with ASCII, any of these forms may be used with different
   line-ending conventions. That flexibility can be an additional source
   of confusion with, e.g., index (offset) references into documents
   based on character counts.


   This document proposes to establish "Net-Unicode" as a new
   standardized text transmission form for the Internet, to serve as an
   internationalized alternative for NVT ASCII when specified in new --
   and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is
   chosen for the coding because it has good compatibility properties
   with ASCII and for other reasons discussed in the existing IETF
   character set policy [RFC2277]. "Net-Unicode" is specified in
   Section 2; the subsequent sections of the document provide background
   and explanation.

   Whenever there is a choice, Unicode SHOULD be used with the text
   encoding specified here. This combination is preferred to the
   double-byte encoding of "extended ASCII" [RFC0698] or the assorted
   per-language or per-country character coding systems.

1.2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Net-Unicode Definition

   The Network Unicode format (Net-Unicode) is defined as follows.
   Parts of this definition are deliberately informal, providing
   guidance for specific profiles or rules in the protocols that
   reference this one rather than firm rules that apply globally.

   1. Characters MUST be encoded in UTF-8 as defined in [RFC3629].

   2. If the protocol has the concept of "lines", line-endings MUST be
      indicated by the sequence Carriage-Return (CR, U+000D) followed
      by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD
      NOT appear except when followed by LF. The only other context in
      which CR is permitted is the combination CR NUL, which is not
      recommended (see the note at the end of this section).

   3. The control characters in the ASCII range (U+0000 to U+001F and
      U+007F to U+009F) SHOULD generally be avoided. Space (SP,
      U+0020), CR, LF, and Form Feed (FF, U+000C) are exceptions to
      this principle, but use of all but the first requires care as
      discussed elsewhere in this document. The so-called "C1
      Controls" (U+0080 through U+009F), which did not appear in ASCII,
      MUST NOT appear.

      FF should be used only with caution: it does not have a standard
      and universal interpretation and, in particular, if its use
      assumes a page length, such assumptions may not be appropriate in
      international contexts (e.g., considering 8.5x11 inch paper
      versus A4). Other control characters are used to affect display
      format, control devices, or to structure files. None of those
      uses is appropriate for streams of plain text.

   4. Before transmission, all character sequences SHOULD be normalized
      according to Unicode normalization form "NFC" (see Section 3).

   5. As suggested in Section 6 of RFC 3629, the Byte Order Mark
      ("BOM") signature MUST NOT appear at the beginning of these text
      strings.

   6. Systems conforming to this specification MUST NOT transmit any
      string containing any code point that is unassigned in the
      version of Unicode on which they are dependent. The version of
      NFC and the version of Unicode used by that system MUST be
      consistent.
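
   The six rules above can be approximated in code. The following
   Python sketch is an informal illustration, not a conformance test:
   the function name net_unicode_problems is hypothetical, "unassigned"
   is approximated by the general category "Cn" in whatever Unicode
   tables the local Python library ships (which also covers
   noncharacters), and the SHOULD-level advice about C0 controls and
   bare LF is deliberately not checked.

```python
import unicodedata

BOM = "\ufeff"


def net_unicode_problems(data: bytes) -> list:
    """Return a list of problems found; an empty list means none were found."""
    try:
        # Strict decoding rejects malformed UTF-8 (rule 1).
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason}"]

    problems = []
    if text.startswith(BOM):
        problems.append("starts with a BOM (rule 5)")
    for i, ch in enumerate(text):
        cp = ord(ch)
        # CR is only acceptable when immediately followed by LF or NUL.
        if ch == "\r" and text[i + 1:i + 2] not in ("\n", "\x00"):
            problems.append(f"CR at index {i} not followed by LF or NUL (rule 2)")
        if 0x80 <= cp <= 0x9F:
            problems.append(f"C1 control U+{cp:04X} at index {i} (rule 3)")
        # Assumption: category "Cn" stands in for "unassigned".
        if unicodedata.category(ch) == "Cn":
            problems.append(f"unassigned U+{cp:04X} at index {i} (rule 6)")
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not in NFC (rule 4)")
    return problems


print(net_unicode_problems("caf\u00e9\r\n".encode("utf-8")))
print(net_unicode_problems("cafe\u0301\rmore".encode("utf-8")))
```

   A sender could run such a check before transmission. A receiver
   would more likely log the findings than reject the text outright.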

   The use of LF without CR is questionable; see Appendix B for more
   discussion. The newer control characters IND (U+0084) and NEL ("Next
   Line", U+0085) might have been used to disambiguate the various
   line-ending situations, but, because their use has not been
   established on the Internet, because many protocols require CRLF, and
   because IND and NEL fall within the "C1 Controls" group (see above),
   they MUST NOT be used. Similar observations apply to the yet newer
   line and paragraph separators at U+2028 and U+2029 and any future
   characters that might be defined to serve these functions. For this
   specification and protocols that depend on it, lines end in CRLF and
   only in CRLF. Anything that does not end in CRLF is either not a
   line or is severely malformed.
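
   The "CRLF and only CRLF" rule has a practical consequence for
   implementers: general-purpose line splitters often treat LF, NEL,
   U+2028, and other characters as line boundaries. The Python sketch
   below is an informal illustration under that assumption. The
   function name split_net_unicode_lines is hypothetical.

```python
def split_net_unicode_lines(text: str) -> tuple:
    """Split on CRLF only; return (complete_lines, trailing_fragment)."""
    parts = text.split("\r\n")
    # Everything before the last CRLF is a complete line. Whatever follows
    # the last CRLF (possibly the empty string) did not end in CRLF and is
    # therefore not a line under this specification.
    return parts[:-1], parts[-1]


sample = "a\r\nb\u2028c\nd\r\ne"
lines, fragment = split_net_unicode_lines(sample)
print(lines, repr(fragment))
# For contrast, str.splitlines() also breaks at the bare LF and at U+2028.
print(sample.splitlines())
```

   How a receiver treats the trailing fragment (buffer it, reject it,
   or pass it along) is a protocol decision that this sketch leaves
   open.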

   The NVT specification contained a number of additional provisions,
   e.g., for the optional use of backspacing and "bare CR" (sent as CR
   NUL) to generate overstruck character sequences. The much greater
   number of precomposed characters in Unicode, the availability of
   combining characters, and the growing use of markup conventions of
   various types to show, e.g., emphasis (rather than attempting to do
   that via the use of special characters), should make such sequences
   largely unnecessary. These sequences SHOULD be avoided if at all
   possible. However, because they were optional in NVT applications
   and this specification is an NVT superset, they cannot be prohibited
   entirely. The most important of these rules is that CR MUST NOT
   appear unless it is immediately followed by LF (indicating end of
   line) or NUL. Because NUL (an octet whose value is all zeros, i.e.,
   %x00 in the notation of [RFC5234]) is hostile to programming
   languages that use that character as a string delimiter, the CR NUL
   sequence SHOULD be avoided for that reason as well.


3. Normalization

   There are cases where strings of Unicode are fundamentally
   equivalent, essentially representing the same text. These are called
   "canonical equivalents" in the Unicode Standard. For example, the
   following pairs of strings are canonically equivalent:

      U+2126 OHM SIGN
      U+03A9 GREEK CAPITAL LETTER OMEGA

      U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
      U+00E0 LATIN SMALL LETTER A WITH GRAVE

   Comparison of strings becomes much easier if any such cases are
   always represented by a single unique form. The Unicode Consortium
   specifies a normalization form, known as NFC [NFC], which provides
   the necessary mappings and mechanisms to convert all canonically
   equivalent sequences to a single unique form. Typically, this form
   produces precomposed characters for any sequences that can be
   represented in that fashion. It also reorders other combining marks
   so that they have a unique and unambiguous order.
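
   The two example pairs above can be checked with a normalization
   library. The following Python sketch is an informal illustration
   that uses the standard unicodedata module. The helper name nfc_equal
   is hypothetical, and the results reflect whatever Unicode version
   the local library implements.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


pairs = [
    ("\u2126", "\u03a9"),    # OHM SIGN vs. GREEK CAPITAL LETTER OMEGA
    ("a\u0300", "\u00e0"),   # a + COMBINING GRAVE ACCENT vs. precomposed form
    ("a\u0301\u0323", "a\u0323\u0301"),  # same marks in a different order
]
for left, right in pairs:
    # Raw comparison sees different code point sequences; NFC does not.
    print(left == right, nfc_equal(left, right))
```

   The third pair illustrates the reordering of combining marks: the
   two sequences differ only in the order of an accent above and a dot
   below, and NFC maps both to the same string.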

   Of the various normalization forms defined as part of Unicode, NFC is
   closest to actual use in practice, minimizes side-effects due to
   considering characters equivalent that may not be equivalent in all
   situations, and typically requires the least work when converting
   from non-Unicode encodings.

   Section 2 requires that, except in very unusual circumstances, all
   Net-Unicode strings be transmitted in normalized form. Recognition
   of the fact that some implementations of applications may rely on
   operating system libraries over which they have little control, and
   adherence to the robustness principle, suggest that receivers of
   such strings should be prepared to receive unnormalized ones and to
   not react to that in excessive ways.

4. Versions of Unicode

   Unicode changes and expands over time. Large blocks of space are
   reserved for future expansion. New versions, which appear at regular
   intervals, add new scripts and characters. Occasionally they also
   change some property definitions. In retrospect, one of the
   advantages of ASCII [ASCII] when it was chosen was that the code
   space was full when the Standard was first published. There was no
   practical way to add characters or change code point assignments
   without being obviously incompatible.


   While there are some security issues if people deliberately try to
   trick the system (see Section 6), Unicode version changes should not
   have a significant impact on the text stream specification of this
   document for the following reasons:

   o  The transformation between Unicode code table positions and the
      corresponding UTF-8 code is algorithmic; it does not depend on
      whether a code point has been assigned or not.

   o  The normalization recommended here, NFC (see Section 3), performs
      a very limited set of mappings, much more limited than those of
      the more extensive NFKC used in, e.g., Nameprep [RFC3491].
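
   The first point above can be made concrete. The following Python
   sketch is an informal illustration of the bit layout described in
   [RFC3629]. The function name utf8_encode is hypothetical, and
   production code would normally use a library encoder instead. Note
   that the function consults no character tables, so it behaves the
   same for assigned and unassigned code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using only arithmetic."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:           # 1 octet:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:          # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:        # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([          # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE0, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X}", utf8_encode(cp).hex())
```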

   The NFC tables may be updated over time as new characters are added,
   but the Unicode Consortium has guaranteed the stability of all NFC
   strings. That is, if a string does not contain any unassigned
   characters, and it is normalized according to NFC, it will always be
   normalized according to all future versions of the Unicode Standard.
   The stability of the Net-Unicode format is thus guaranteed when any
   implementation that converts text into Net-Unicode format does not
   permit unassigned characters.

   Because Unicode code points that are reserved for private use do not
   have standard definitions or normalization interpretations, they
   SHOULD be avoided in strings intended for Internet interchange.

   Were Unicode to be changed in a way that violated these assumptions,
   i.e., that either invalidated the byte string order specified in RFC
   3629 or that changed the stability of NFC as stated above, this
   specification would not apply. Put differently, this specification
   applies only to versions of Unicode starting with version 5.0 and
   extending to, but not including, any version for which changes are
   made in either the UTF-8 definition or to NFC stability. Such
   changes would violate established Unicode policies and are hence
   unlikely, but, should they occur, it would be necessary to evaluate
   them for compatibility with this specification and other Internet
   uses of NFC.

   If the specification of a protocol references this one, strings that
   are received by that protocol and that appear to be UTF-8 and are not
   otherwise identified (e.g., by charset labeling) SHOULD be treated as
   using UTF-8 in conformance with this specification.


5. Applicability and Stability of this Specification

5.1. Use in IETF Applications Specifications

   During the development of this specification, there was some
   confusion about where it would be useful given that, e.g., the
   individual MIME media types used in email and with HTTP have their
   own rules about UTF-8 character types and normalization, and the
   application transport protocols impose their own conventions about
   line endings. There are three answers. The first is that, in
   retrospect, it would have been better to have those protocols and
   content types standardized in the way specified here, even though it
   is certainly too late to change them at this time. The second is
   that we have several protocols that are dependent on either the
   original Telnet design or other arrangements requiring a standard,
   interoperable, string definition without specific content-labels of
   one sort or another. Whois [RFC3912] is an example member of this
   group. As consideration is given to upgrading them for non-ASCII
   use, this specification provides a normative reference that provides
   the same stability that NVT has provided the ASCII forms. This
   specification is intended for use by other specifications that have
   not yet defined how to use Unicode. Having a preferred standard
   Internet definition for Unicode text streams -- rather than just one
   for transmission codings -- may help improve the specification and
   interoperability of protocols to be developed in the future. This
   specification is not intended for use with specifications that
   already allow the use of UTF-8 and precisely define that use.

5.2. Unicode Versions and Applicability

   The IETF faces a practical dilemma with regard to versions of
   Unicode. Each new version brings with it new characters and
   sometimes new combining characters. Version 5.0 introduces the new
   concept of sequences of characters named as if they were individual
   characters (see [NamedSequences]). The normalization represented by
   NFC is stable if all strings are transmitted and stored in normalized
   form, if corrections are never made to character definitions or
   normalization tables, and if unassigned code points are never used.
   The last condition is important because an unassigned code point
   always normalizes to itself. However, if the same code point is
   assigned to a character in a future version, it may participate in
   some other normalization mapping (some specific difficulties in this
   regard are discussed in [RFC4690]). It is worth noting that
   transmission in normalized form is not required by either the IETF's
   UTF-8 Standard [RFC3629] or by standards dependent on the current
   version of Stringprep [RFC3454].


   All would be well with this as described in Section 4 except for one
   problem: Applications typically do not perform their own conversions
   to Unicode and may not perform their own normalizations but instead
   rely on operating system or language library functions -- functions
   that may be upgraded or otherwise changed without changes to the
   application code itself. Consequently, there may be no plausible way
   for an application to know which version of Unicode, or which version
   of the normalization procedures, it is utilizing, nor is there any
   way by which it can guarantee that the two will be consistent.
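
   Some libraries do expose the version of their Unicode tables, which
   partly eases the problem described above for code that uses only
   that one library. As an informal illustration, Python's standard
   unicodedata module publishes the version of the Unicode Character
   Database it was built with. The function name unicode_environment
   is hypothetical, and the value describes only this module's tables,
   not those of the operating system or of any other library in the
   same process.

```python
import sys
import unicodedata


def unicode_environment() -> dict:
    """Report the interpreter version and the bundled Unicode data version."""
    return {
        "python": sys.version.split()[0],
        # Version of the Unicode Character Database compiled into the
        # unicodedata module; it changes when the interpreter is
        # upgraded, without any change to application code.
        "unicode_data": unicodedata.unidata_version,
    }


print(unicode_environment())
```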

   Because of per-version changes in definitions and tables, Stringprep
   and documents depending on it are now tied to Unicode Version 3.2
   [Unicode32], and full interoperability of Internet Standard UTF-8
   [RFC3629], when used with normalization as specified here, is
   dependent on normalization definitions and the definition of UTF-8
   itself not changing after Unicode Version 5.0. These assumptions
   seem fairly safe, but they are still assumptions. Rather than being
   linked to the latest available version of Unicode, version 5.0
   [Unicode], or broader concepts of version independence based on
   specific assumptions and conditions, this specification could
   reasonably have been tied, like Stringprep and Nameprep, to Unicode
   3.2 [Unicode32] or some more recent intermediate version, but, in
   addition to the obvious disadvantages of having different IETF
   standards tied to different versions of Unicode, the library-based
   application implementation behavior described above makes these
   version linkages nearly meaningless in practice.

   In theory, one can get around this problem in four ways:

   1. Freeze on a particular version of Unicode and try to insist that
      applications enforce that version by, e.g., containing lists of
      unassigned characters and prohibiting their use. Of course, this
      would prohibit evolution to include newly-added scripts and the
      tables of unassigned code points would be cumbersome.

   2. Require that every Unicode "text" string or file start with a
      version indication, somewhat akin to the "byte order mark"
      indicator. It is unlikely that this provision would be
      practical. More important, it would require that each
      application implementation be prepared to either support multiple
      normalization tables and versions or that it reject text from
      Unicode versions with which it was not prepared to deal.

   3. Devise a different set of normalization rules that would, e.g.,
      guarantee that no character assigned to a previously-unassigned
      code point in Unicode was ever normalized to anything but itself
      and use those rules instead of NFC. It is not clear whether or
      not such a set of rules is possible or whether some other
      completely stable set of rules could be devised, perhaps in
      combination with restrictions on the ways in which characters
      were added in future versions of Unicode.

   4. Devise a normalization process that is otherwise equivalent to
      NFC but that rejects code points that are unassigned in the
      current version of Unicode, rather than mapping those code points
      to themselves. This would still leave some risk of incompatible
      corrections in Unicode and possibly a few edge cases, but it is
      probably stable enough for Internet use in the overwhelming
      number of cases. This process has been discussed in the Unicode
      Consortium under the name "Stable NFC".
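
   The fourth approach can be sketched briefly. The Python code below
   is an informal illustration of the idea only, not the Unicode
   Consortium's definition of "Stable NFC". The function name
   stable_nfc is hypothetical, and "unassigned" is approximated by the
   general category "Cn" in the local library's tables, which also
   covers noncharacters.

```python
import unicodedata


def stable_nfc(text: str) -> str:
    """Normalize to NFC, but refuse input containing unassigned code points."""
    for index, ch in enumerate(text):
        # Assumption: category "Cn" stands in for "unassigned in the
        # version of Unicode that this library implements".
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                f"unassigned code point U+{ord(ch):04X} at index {index}"
            )
    return unicodedata.normalize("NFC", text)


print(stable_nfc("a\u0300") == "\u00e0")
```

   Rejecting rather than passing such code points through means that a
   string accepted today cannot change its normalized form when a
   later Unicode version assigns those code points.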

   None of these approaches seems ideal: the ideal procedure would be as
   stable and predictable as ASCII has been. But that level is simply
   not feasible as long as Unicode continues to evolve by the addition
   of new code points and scripts. The fourth option listed above
   appears to be a reasonable compromise.

6. Security Considerations

   This specification provides a standard form for the use of Unicode as
   "network text". Most of the same security issues that apply to
   UTF-8, as discussed in [RFC3629], apply to it, although it should be
   slightly less subject to some risks by virtue of requiring NFC
   normalization and generally being somewhat more restrictive.
   However, shifts in Unicode versions, as discussed in Section 5.2, may
   introduce other security issues.

   Programs that receive these streams should use extreme caution about
   assuming that incoming data are normalized, since it might be
   possible to use unnormalized forms, as well as invalid UTF-8, as part
   of an attack. In particular, firewalls and other systems that
   interpret UTF-8 streams should be developed with the clear knowledge
   that an attacker may deliberately send unnormalized text, for
   instance, to avoid detection by naive text-matching systems.
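
   The evasion described above is easy to reproduce. The following
   Python sketch is an informal illustration with hypothetical function
   names: a filter that searches for a precomposed keyword misses the
   same word sent with a combining accent, while a filter that
   normalizes both sides first does not.

```python
import unicodedata


def naive_contains(haystack: str, needle: str) -> bool:
    """Substring match on raw code points, with no normalization."""
    return needle in haystack


def normalized_contains(haystack: str, needle: str) -> bool:
    """Substring match after converting both strings to NFC."""
    nfc_haystack = unicodedata.normalize("NFC", haystack)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_needle in nfc_haystack


keyword = "caf\u00e9"                  # precomposed e with acute
message = "visit the cafe\u0301 now"   # e followed by a combining acute

print(naive_contains(message, keyword))
print(normalized_contains(message, keyword))
```

   Normalizing on receipt addresses only this one technique. Invalid
   UTF-8 and visually similar characters need separate handling.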

   NVT contains a requirement, of necessity repeated here (see
   Section 2), that the CR character be immediately followed by either
   LF or ASCII NUL (an octet with all bits zero). NUL may be
   problematic for some programming languages that use it as a string
   terminator, and hence a trap for the unwary, unless caution is used.
   This may be an additional reason to avoid the use of CR entirely,
   except in sequence with LF, as suggested above.

   The discussion about Unicode versions above (see Section 4 and
   Section 5.2) makes several assumptions about future versions of
   Unicode, about NFC normalization being applied properly, and about
   UTF-8 being processed and transmitted exactly as specified in RFC
   3629. If any of those assumptions are not correct, then there are
   cases in which strings that would be considered equivalent do not
   compare equal. Robust code should be prepared for those
   possibilities.

7. Acknowledgments

   Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
   suggestions about Unicode normalization that led to the format
   described here, and especially to Mark for providing the paragraphs
   that describe the role of NFC. Thanks also to Mark, Doug Ewell, and
   Asmus Freytag for corrected text describing Unicode transmission
   forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin
   Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern
   Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George
   Michaelson, Chris Newman, and Marcos Sanz for a number of helpful
   comments and clarification requests.


Appendix A. History and Context

   This appendix contains a review of prior work in the ARPANET and
   Internet to establish a standard text type, work that establishes the
   context and motivation for the approach taken in this document. The
   text is explanatory rather than normative: nothing in this section is
   intended to change or update any current specification. Those who
   are uninterested in this review and analysis can safely skip this
   section.

   One of the earlier application design decisions made in the
   development of ARPANET, a decision that was carried forward into the
   Internet, was the decision to standardize on a single and very
   specific coding for "text" to be passed across the network [RFC0020].
   Hosts on the network were then responsible for translating or mapping
   from whatever character coding conventions were used locally to that
   common intermediate representation, with sending hosts mapping to it
   and receiving ones mapping from it to their local forms as needed.
   It is interesting to note that at the time the ARPANET was being
   developed, participating host operating systems used at least three
   different character coding standards: the antiquated BCD (Binary
   Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
   (Extended BCD Interchange Code), and the then-still emerging ASCII
   (American Standard Code for Information Interchange). Since the
   ARPANET was an "open" project and EBCDIC was intimately linked to a
   particular hardware vendor, the original Network Working Group agreed
   that its standard should be ASCII. That ASCII form was precisely
   "7-bit ASCII in an 8-bit field", which was in effect a compromise
   between hosts that were natively 7-bit oriented (e.g., with five
   seven-bit characters in a 36-bit word), those that were 8-bit
   oriented (using eight-bit characters) and those that placed the
   seven-bit ASCII characters in 9-bit fields with two leading zero bits
   (four characters in a 36-bit word).

   More standardization was suggested in the first preliminary
   description of the Telnet protocol [RFC0097]. With the iterations of
   that protocol [RFC0137] [RFC0139] and the drawing together of an
   essentially formal definition somewhat later [RFC0318], a standard
   abstraction, the Network Virtual Terminal (NVT), was established. NVT
   character-coding conventions (initially called "Telnet ASCII" and
   later called "NVT ASCII", or, more casually, "network ASCII")
   included the requirement that Carriage Return followed by Line Feed
   (CRLF) be the common representation for ending lines of text (given
   that some participating "Host" operating systems used the one
   natively, some the other, at least one used both, and a few used
   neither (preferring variable-length lines with counts or special
   delimiters or markers instead)) and specified conventions for some
   other characters. Also, since NVT ASCII was restricted to seven-bit
   characters, use of the high-order bit in octets was reserved for the
   transmission of control signaling information.

   At a very high level, the concept was that a system could use
   whatever character coding and line representations were appropriate
   locally, but text transmitted over the network as text must conform
   to the single "network virtual terminal" convention. Virtually all
   early Internet protocols that presume transfer of "text" assume this
   virtual terminal model, although different ones assume or limit it in
   different ways. Telnet, the command stream and ASCII Type in FTP
   [RFC0542], the message stream in SMTP transfer [RFC2821], and the
   strings passed to finger [RFC0742] and whois [RFC0954] are the
   classic examples. More recently, HTTP [RFC1945] [RFC2616] follows
   the same general model but permits 8-bit data and leaves the line end
   sequence unspecified (the latter has been the source of a significant
   number of problems).

639

640Appendix B. The ASCII NVT Definition

641

642 The main body of this specification is intended as an update to, and

643 internationalized version of, the Net-ASCII definition. The

644 specification is self-contained in that parts of the Net-ASCII

645 definition that are no longer recommended are not included above.

646 Because Net-ASCII evolved somewhat over time and there has been

647 debate about which specification is the "official" Net-ASCII, it is

648 appropriate to review the key elements of that definition here. This

649 review is informal with regard to the contents of Net-ASCII and

650 should not be considered as a normative update or summary of the

651 earlier specifications (Section 2 does specify some normative updates

652 to those specifications and some comments below are consistent with

653 it).

654

655 The first part of the section titled "THE NVT PRINTER AND KEYBOARD"

656 in RFC 854 [RFC0854] is generally, although not universally,

657 considered to be the normative definition of the (ASCII) Network

658 Virtual Terminal and hence of Net-ASCII. It includes not only the

659 graphic ASCII characters but a number of control characters. The

660 latter are given Internet-specific meanings that are often more

661 specific than the definitions in the ASCII specification. In today's

662 usage, and for the present specification, the following

663 clarifications and updates to that list should be noted. Each one is

664 accompanied by a brief explanation of the reason why the original

665 specification is no longer appropriate.

666

   1.  The "defined but not required" codes -- BEL (U+0007), BS
       (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the
       undefined control codes ("C0") SHOULD NOT be used unless required
       by exceptional circumstances. Either their original "network
       printer" definitions are no longer in general use, common
       practice has evolved away from the formats specified there, or
       their use to simulate characters that are better handled by
       Unicode is no longer appropriate. While the appearance of some
       of these characters on the list may seem surprising, BS now has
       an ambiguous interpretation in practice (erasing in some systems
       but not in others), the width associated with HT varies with the
       environment, and VT and FF do not have a uniform effect with
       regard to either vertical positioning or the associated
       horizontal position result. Of course, telnet escapes are not
       considered part of the data stream and hence are unaffected by
       this provision.

   2.  In Net-ASCII, CR MUST NOT appear except when immediately followed
       by either NUL or LF, with the latter (CR LF) designating the "new
       line" function. Today and as specified above, CR should
       generally appear only when followed by LF. Because page layout
       is better done in other ways, because NUL has a special
       interpretation in some programming languages, and to avoid other
       types of confusion, CR NUL should preferably be avoided as
       specified above.

   3.  LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
       sequences (e.g., CR LF CR LF).

   4.  The historical NVT documents do not call out either "bare LF" (LF
       without CR) or HT for special treatment. Both have generally
       been understood to be problematic. In the case of LF, there is a
       difference in interpretation as to whether its semantics imply
       "go to same position on the next line" or "go to the first
       position on the next line" and interoperability considerations
       suggest not depending on which interpretation the receiver
       applies. At the same time, misinterpretation of LF is less
       harmful than misinterpretation of "bare" CR: in the CR case, text
       may be erased or made completely unreadable; in the LF one, the
       worst consequence is a very funny-looking display. Obviously, HT
       is problematic because there is no standard way to transmit
       intended tab position or width information in running text.
       Again, the harm is unlikely to be great if HT is simply
       interpreted as one or more spaces, but, in general, it cannot be
       relied upon to format information.

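   As an informal illustration of the four items above, the following
   Python sketch scans a decoded string for the constructs they
   discourage. It is not part of this specification. The names
   find_discouraged_sequences and DISCOURAGED_CONTROLS, the report
   format, and the decision not to report a NUL that does not follow CR
   are all assumptions made for the example.

```python
# Illustrative sketch only; names and report format are assumptions.

# C0 controls (U+0001-U+001F) other than LF and CR. LF and CR are
# handled by the sequence rules below; a NUL on its own is not reported.
DISCOURAGED_CONTROLS = frozenset(range(0x01, 0x20)) - {0x0A, 0x0D}


def find_discouraged_sequences(text: str) -> list[tuple[int, str]]:
    """Return (index, description) pairs for discouraged constructs."""
    issues = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if cp in DISCOURAGED_CONTROLS:
            # Item 1: BEL, BS, HT, VT, FF, and the undefined C0 codes.
            issues.append((i, f"discouraged control U+{cp:04X}"))
        elif ch == "\r":
            # Item 2: CR should appear only when followed by LF.
            if next_ch == "\x00":
                issues.append((i, "CR NUL"))
            elif next_ch != "\n":
                issues.append((i, "bare CR"))
        elif ch == "\n" and prev_ch != "\r":
            # Item 4: LF without a preceding CR.
            issues.append((i, "bare LF"))
    return issues
```

   An LF CR pair that is not part of CR LF CR LF is always reported by
   the bare-LF rule or the bare-CR rule, so item 3 needs no separate
   test in this sketch.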

   It is worth noting that the telnet IAC character (an octet consisting
   of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that
   particular octet cannot appear in a valid UTF-8 string. However,
   while few of them have been used, telnet permits other
   command-introducer characters whose bit sequences in an octet may be
   part of valid UTF-8 characters. While it causes no ambiguity in
   UTF-8, Unicode assigns a graphic character ("Latin Small Letter Y
   with Diaeresis") to U+00FF (octets C3 BF in UTF-8). Some caution is
   clearly in order in this area.

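   The two observations about octet FF can be checked with a few lines
   of Python. This is an informal sketch, not part of this
   specification; it relies only on the built-in str.encode and
   bytes.decode methods, and the names IAC and contains_iac_octet are
   assumptions made for the example.

```python
IAC = 0xFF  # telnet "Interpret As Command" octet


def contains_iac_octet(text: str) -> bool:
    """Report whether the UTF-8 encoding of text contains octet FF."""
    return IAC in text.encode("utf-8")


# U+00FF (Latin Small Letter Y with Diaeresis) encodes as two octets,
# neither of which is FF.
encoded = "\u00ff".encode("utf-8")
print(encoded.hex(" "))  # prints: c3 bf

# A lone FF octet is never valid UTF-8, so a strict decoder rejects it.
try:
    bytes([IAC]).decode("utf-8")
    lone_ff_is_valid = True
except UnicodeDecodeError:
    lone_ff_is_valid = False
print(lone_ff_is_valid)  # prints: False
```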

Appendix C. The Line-Ending Problem

   The definition of how a line ending should be denoted in plain text
   strings on the wire for the Internet has been controversial from even
   before the introduction of NVT. Some have argued that recipients
   should be required to interpret almost anything that a sender might
   intend as a line ending as actually a line ending. Others have
   pointed out that this would lead to some ambiguities of
   interpretation and presentation and would violate the principle that
   we should minimize the number of forms that are permitted on the wire
   in order to promote interoperability and eliminate the "every
   recipient needs to understand every sender format" problem. The
   design of this specification, like that of NVT, takes the latter
   approach. Its designers believe that there is little point in a
   standard if it is to specify "anyone can do whatever they like and
   the receiver just needs to cope".

   A further discussion of the nature and evolution of the line-ending
   problem appears in Section 5.8 of the Unicode Standard [Unicode] and
   is suggested for additional reading. If we were starting with the
   Internet today, it would probably be sensible to follow the
   recommendation there and use LS (U+2028) exclusively, in preference
   to CRLF. However, the installed base of CRLF and the importance of
   backward compatibility with NVT and protocols that assume it make
   that impossible, so it is necessary to continue using CRLF as the
   "New Line Function" ("NLF", see the terminology section in that
   reference).

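   As a sender-side illustration of the approach taken here, the
   following Python sketch converts several line-ending conventions that
   may occur in local text to CRLF before transmission. It is not part
   of this specification. The function name to_crlf and the set of local
   conventions it recognizes (CR LF, bare CR, bare LF, NEL, and LS) are
   assumptions chosen for the example, and the sketch does not perform
   the normalization discussed in Section 3.

```python
import re

# Local line-ending forms recognized by this sketch (an assumption):
# CR LF, bare CR, bare LF, NEL (U+0085), and LS (U+2028). CR LF is
# listed first so that an existing CR LF pair is matched as one unit.
_LOCAL_LINE_ENDINGS = re.compile("\r\n|\r|\n|\u0085|\u2028")


def to_crlf(local_text: str) -> bytes:
    """Return local_text as UTF-8 octets with each line ending as CR LF."""
    wire_text = _LOCAL_LINE_ENDINGS.sub("\r\n", local_text)
    return wire_text.encode("utf-8")
```

   Placing the conversion in the sender keeps a single form on the wire,
   so a receiver of such a stream only has to recognize CR LF.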

Appendix D. A Note about Related Future Work

   Consideration should be given to a Telnet (or SSH [RFC4251]) option
   to specify this type of stream and an FTP extension [RFC0959] to
   permit a new "Unicode text" data TYPE.

References

Normative References

   [ISO10646]        International Organization for Standardization,
                     "Information Technology - Universal Multiple-Octet
                     Coded Character Set (UCS) - Part 1: Architecture
                     and Basic Multilingual Plane",
                     ISO/IEC 10646-1:2000, October 2000.

   [NFC]             Davis, M. and M. Duerst, "Unicode Standard Annex
                     #15: Unicode Normalization Forms", October 2006,
                     <http://www.unicode.org/reports/tr15/>.

   [RFC2119]         Bradner, S., "Key words for use in RFCs to Indicate
                     Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3629]         Yergeau, F., "UTF-8, a transformation format of ISO
                     10646", STD 63, RFC 3629, November 2003.

   [RFC5234]         Crocker, D. and P. Overell, "Augmented BNF for
                     Syntax Specifications: ABNF", STD 68, RFC 5234,
                     January 2008.

   [Unicode]         The Unicode Consortium, "The Unicode Standard,
                     Version 5.0", 2007.

                     Boston, MA, USA: Addison-Wesley. ISBN
                     0-321-48091-0

   [Unicode32]       The Unicode Consortium, "The Unicode Standard,
                     Version 3.0", 2000.

                     (Reading, MA, Addison-Wesley, 2000. ISBN
                     0-201-61633-5). Version 3.2 consists of the
                     definition in that book as amended by the Unicode
                     Standard Annex #27: Unicode 3.1
                     (http://www.unicode.org/reports/tr27/) and by the
                     Unicode Standard Annex #28: Unicode 3.2
                     (http://www.unicode.org/reports/tr28/).

Informative References

   [ASCII]           American National Standards Institute (formerly
                     United States of America Standards Institute), "USA
                     Code for Information Interchange", ANSI X3.4-1968,
                     1968.

                     ANSI X3.4-1968 has been replaced by newer versions
                     with slight modifications, but the 1968 version
                     remains definitive for the Internet. ISO 646
                     International Reference Version (IRV)
                     [ISO.646.1991] is usually considered equivalent to
                     ASCII.

   [ISO.646.1991]    International Organization for Standardization,
                     "Information technology - ISO 7-bit coded character
                     set for information interchange", ISO Standard 646,
                     1991.

   [NamedSequences]  The Unicode Consortium, "NamedSequences-4.1.0.txt",
                     2005, <http://www.unicode.org/Public/UNIDATA/
                     NamedSequences.txt>.

   [RFC0020]         Cerf, V., "ASCII format for network interchange",
                     RFC 20, October 1969.

   [RFC0097]         Melvin, J. and R. Watson, "First Cut at a Proposed
                     Telnet Protocol", RFC 97, February 1971.

   [RFC0137]         O'Sullivan, T., "Telnet Protocol - a proposed
                     document", RFC 137, April 1971.

   [RFC0139]         O'Sullivan, T., "Discussion of Telnet Protocol",
                     RFC 139, May 1971.

   [RFC0318]         Postel, J., "Telnet Protocols", RFC 318,
                     April 1972.

   [RFC0542]         Neigus, N., "File Transfer Protocol", RFC 542,
                     August 1973.

   [RFC0698]         Mock, T., "Telnet extended ASCII option", RFC 698,
                     July 1975.

   [RFC0742]         Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
                     December 1977.

   [RFC0854]         Postel, J. and J. Reynolds, "Telnet Protocol
                     Specification", STD 8, RFC 854, May 1983.

   [RFC0954]         Harrenstien, K., Stahl, M., and E. Feinler,
                     "NICNAME/WHOIS", RFC 954, October 1985.

   [RFC0959]         Postel, J. and J. Reynolds, "File Transfer
                     Protocol", STD 9, RFC 959, October 1985.

   [RFC1945]         Berners-Lee, T., Fielding, R., and H. Nielsen,
                     "Hypertext Transfer Protocol -- HTTP/1.0",
                     RFC 1945, May 1996.

   [RFC2277]         Alvestrand, H., "IETF Policy on Character Sets and
                     Languages", BCP 18, RFC 2277, January 1998.

   [RFC2616]         Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
                     Masinter, L., Leach, P., and T. Berners-Lee,
                     "Hypertext Transfer Protocol -- HTTP/1.1",
                     RFC 2616, June 1999.

   [RFC2781]         Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
                     ISO 10646", RFC 2781, February 2000.

   [RFC2821]         Klensin, J., "Simple Mail Transfer Protocol",
                     RFC 2821, April 2001.

   [RFC3454]         Hoffman, P. and M. Blanchet, "Preparation of
                     Internationalized Strings ("stringprep")",
                     RFC 3454, December 2002.

   [RFC3491]         Hoffman, P. and M. Blanchet, "Nameprep: A
                     Stringprep Profile for Internationalized Domain
                     Names (IDN)", RFC 3491, March 2003.

   [RFC3912]         Daigle, L., "WHOIS Protocol Specification",
                     RFC 3912, September 2004.

   [RFC4251]         Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
                     Protocol Architecture", RFC 4251, January 2006.

   [RFC4690]         Klensin, J., Faltstrom, P., Karp, C., and IAB,
                     "Review and Recommendations for Internationalized
                     Domain Names (IDNs)", RFC 4690, September 2006.

Authors' Addresses

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA 02140
   USA

   Phone: +1 617 491 5735
   EMail: john-ietf@jck.com


   Michael A. Padlipsky
   8011 Stewart Ave.
   Los Angeles, CA 90045
   USA

   Phone: +1 310-670-4288
   EMail: the.map@alum.mit.edu

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.
