Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track
12
13
14 Unicode Format for Network Interchange
15
16Status of This Memo
17
18 This document specifies an Internet standards track protocol for the
19 Internet community, and requests discussion and suggestions for
20 improvements. Please refer to the current edition of the "Internet
21 Official Protocol Standards" (STD 1) for the standardization state
22 and status of this protocol. Distribution of this memo is unlimited.
23
24Abstract
25
26 The Internet today is in need of a standardized form for the
27 transmission of internationalized "text" information, paralleling the
28 specifications for the use of ASCII that date from the early days of
29 the ARPANET. This document specifies that format, using UTF-8 with
30 normalization and specific line-ending sequences.
31
32Table of Contents
33
   1. Introduction
      1.1. Requirement for a Standardized Text Stream Format
      1.2. Terminology
   2. Net-Unicode Definition
   3. Normalization
   4. Versions of Unicode
   5. Applicability and Stability of this Specification
      5.1. Use in IETF Applications Specifications
      5.2. Unicode Versions and Applicability
   6. Security Considerations
   7. Acknowledgments
   Appendix A. History and Context
   Appendix B. The ASCII NVT Definition
   Appendix C. The Line-Ending Problem
   Appendix D. A Note about Related Future Work
   References
      Normative References
      Informative References
52
53
54
55
56
57
58Klensin & Padlipsky Standards Track [Page 1]
59
60RFC 5198 Network Unicode March 2008
61
62
631. Introduction
64
651.1. Requirement for a Standardized Text Stream Format
66
67 Historically, Internet protocols have been largely ASCII-based and
68 references to "text" in protocols have assumed ASCII text and
69 specifically text in Network Virtual Terminal ("NVT") or "Network
70 ASCII" form (see Appendix A and Appendix B). Protocols and formats
71 that have moved beyond ASCII have included arrangements to
72 specifically identify the character set and often the language being
73 used.
74
75 In our more internationalized world, "text" clearly no longer equates
76 unambiguously to "network ASCII". Fortunately, however, we are
77 converging on Unicode [Unicode] [ISO10646] as a single international
78 interchange character coding and no longer need to deal with per-
79 script standards for character sets (e.g., one standard for each of
80 Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
81 languages that are usually considered to share a script, such as
82 French, German, or Swedish). Unfortunately, though, while it is
83 certainly time to define a Unicode-based text type for use as a
84 common text interchange format, "use Unicode" involves even more
85 ambiguity than "use ASCII" did decades ago.
86
87 Unicode identifies each character by an integer, called its "code
88 point", in the range 0-0x10ffff. These integers can be encoded into
89 byte sequences for transmission in at least three standard and
90 generally-recognized encoding forms, all of which are completely
91 defined in The Unicode Standard and the documents cited below:
92
93 o UTF-8 [RFC3629] defines a variable-length encoding that may be
94 applied uniformly to all code points.
95
96 o UTF-16 [RFC2781] encodes the range of Unicode characters whose
97 code points are less than 65536 straightforwardly as 16-bit
98 integers, and provides a "surrogate" mechanism for encoding larger
99 code points in 32 bits.
100
101 o UTF-32 (also known as UCS-4) simply encodes each code point as a
102 32-bit integer.
103
104 Older forms and nomenclature, such as the 16-bit UCS-2, are now
105 strongly discouraged.
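
 The difference between the three encoding forms listed above can be
 seen by counting the bytes each one needs for the same code point.
 The following is a minimal sketch in Python, not part of this
 specification; the function name encoded_sizes is hypothetical, and
 the big-endian codec variants are chosen only so that no byte order
 mark is prepended to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes each encoding form uses for one character."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" codecs emit no byte order mark, so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# One code point from each UTF-8 length class.
for cp in (0x41, 0xE0, 0x2126, 0x1F600):
    print(f"U+{cp:04X}", encoded_sizes(chr(cp)))
```

 UTF-8 grows from one to four bytes as the code point value rises,
 UTF-16 uses two bytes below 65536 and four bytes (a surrogate pair)
 above it, and UTF-32 always uses four.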
106
107 As with ASCII, any of these forms may be used with different line-
108 ending conventions. That flexibility can be an additional source of
109 confusion with, e.g., index (offset) references into documents based
110 on character counts.
111
119 This document proposes to establish "Net-Unicode" as a new
120 standardized text transmission form for the Internet, to serve as an
121 internationalized alternative for NVT ASCII when specified in new --
122 and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is
123 chosen for the coding because it has good compatibility properties
124 with ASCII and for other reasons discussed in the existing IETF
125 character set policy [RFC2277]. "Net-Unicode" is specified in
126 Section 2; the subsequent sections of the document provide background
127 and explanation.
128
129 Whenever there is a choice, Unicode SHOULD be used with the text
130 encoding specified here. This combination is preferred to the
131 double-byte encoding of "extended ASCII" [RFC0698] or the assorted
132 per-language or per-country character coding systems.
133
1341.2. Terminology
135
136 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
137 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
138 document are to be interpreted as described in [RFC2119].
139
1402. Net-Unicode Definition
141
142 The Network Unicode format (Net-Unicode) is defined as follows.
143 Parts of this definition are deliberately informal, providing
144 guidance for specific profiles or rules in the protocols that
145 reference this one rather than firm rules that apply globally.
146
147 1. Characters MUST be encoded in UTF-8 as defined in [RFC3629].
148
 2. If the protocol has the concept of "lines", line-endings MUST be
 indicated by the sequence Carriage-Return (CR, U+000D) followed
 by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD
 NOT appear except when followed by LF. The only other context in
 which CR is permitted is the combination CR NUL, which is not
 recommended (see the discussion at the end of this section).
156
 3. The control characters (U+0000 to U+001F and U+007F to U+009F)
 SHOULD generally be avoided. Space (SP, U+0020), CR, LF, and Form
 Feed (FF, U+000C) are exceptions to this principle, but use of all
 but the first requires care as discussed elsewhere in this
 document. The so-called "C1 Controls" (U+0080 through U+009F),
 which did not appear in ASCII, MUST NOT appear.
164
165 FF should be used only with caution: it does not have a standard
166 and universal interpretation and, in particular, if its use
175 assumes a page length, such assumptions may not be appropriate in
176 international contexts (e.g., considering 8.5x11 inch paper
177 versus A4). Other control characters are used to affect display
178 format, control devices, or to structure files. None of those
179 uses is appropriate for streams of plain text.
180
181 4. Before transmission, all character sequences SHOULD be normalized
182 according to Unicode normalization form "NFC" (see Section 3).
183
184 5. As suggested in Section 6 of RFC 3629, the Byte Order Mark
185 ("BOM") signature MUST NOT appear at the beginning of these text
186 strings.
187
188 6. Systems conforming to this specification MUST NOT transmit any
189 string containing any code point that is unassigned in the
190 version of Unicode on which they are dependent. The version of
191 NFC and the version of Unicode used by that system MUST be
192 consistent.
193
194 The use of LF without CR is questionable; see Appendix B for more
195 discussion. The newer control characters IND (U+0084) and NEL ("Next
196 Line", U+0085) might have been used to disambiguate the various line-
197 ending situations, but, because their use has not been established on
198 the Internet, because many protocols require CRLF, and because IND
 and NEL fall within the "C1 Controls" group (see item 3 above), they MUST
200 NOT be used. Similar observations apply to the yet newer line and
201 paragraph separators at U+2028 and U+2029 and any future characters
202 that might be defined to serve these functions. For this
203 specification and protocols that depend on it, lines end in CRLF and
204 only in CRLF. Anything that does not end in CRLF is either not a
205 line or is severely malformed.
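
 A sending host that stores text with some other local convention
 therefore has to rewrite its line endings before transmission, and a
 receiver can split lines on CRLF alone. The following Python sketch
 illustrates both directions. The names to_crlf and split_lines are
 hypothetical, and the sketch assumes that the local text uses LF,
 CRLF, or bare CR as its line ending; a host with another convention
 would need a different mapping.

```python
import re

# Assumption: local text marks line ends with CRLF, bare CR, or bare LF.
# CRLF is listed first so an existing CRLF is not rewritten twice.
_LOCAL_EOL = re.compile(r"\r\n|\r|\n")


def to_crlf(text: str) -> str:
    """Rewrite every local line ending as CRLF before transmission."""
    return _LOCAL_EOL.sub("\r\n", text)


def split_lines(stream: str) -> list:
    """Split received text on CRLF only; a bare CR or LF ends nothing."""
    return stream.split("\r\n")


print(repr(to_crlf("one\ntwo\r\nthree\rfour")))
print(split_lines("one\r\ntwo\nstill two\r\n"))
```

 In the second call the bare LF stays inside the second element, since
 only CRLF is treated as a line ending.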
206
207 The NVT specification contained a number of additional provisions,
208 e.g., for the optional use of backspacing and "bare CR" (sent as CR
209 NUL) to generate overstruck character sequences. The much greater
210 number of precomposed characters in Unicode, the availability of
211 combining characters, and the growing use of markup conventions of
212 various types to show, e.g., emphasis (rather than attempting to do
213 that via the use of special characters), should make such sequences
214 largely unnecessary. These sequences SHOULD be avoided if at all
215 possible. However, because they were optional in NVT applications
216 and this specification is an NVT superset, they cannot be prohibited
217 entirely. The most important of these rules is that CR MUST NOT
218 appear unless it is immediately followed by LF (indicating end of
219 line) or NUL. Because NUL (an octet whose value is all zeros, i.e.,
220 %x00 in the notation of [RFC5234]) is hostile to programming
221 languages that use that character as a string delimiter, the CR NUL
222 sequence SHOULD be avoided for that reason as well.
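
 The rules of this section can be gathered into a single checking
 routine. The following Python sketch is one possible reading, not a
 reference implementation: the names NetUnicodeError and
 check_net_unicode are hypothetical, MUST-level violations raise an
 exception while SHOULD-level ones are returned as warnings, and
 "unassigned" is approximated by the general category "Cn" as reported
 by the Unicode data bundled with the Python runtime (a category that
 also covers noncharacters).

```python
import unicodedata


class NetUnicodeError(ValueError):
    """A MUST-level rule of Section 2 was violated (hypothetical name)."""


def check_net_unicode(data: bytes) -> list:
    """Raise NetUnicodeError on MUST violations; return SHOULD-level warnings."""
    try:
        text = data.decode("utf-8", errors="strict")          # item 1
    except UnicodeDecodeError as exc:
        raise NetUnicodeError(f"invalid UTF-8: {exc}") from exc
    if text.startswith("\ufeff"):                              # item 5
        raise NetUnicodeError("BOM at start of string")

    warnings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        prev = text[i - 1] if i > 0 else ""
        if 0x80 <= cp <= 0x9F:                                 # item 3, C1
            raise NetUnicodeError(f"C1 control U+{cp:04X} at index {i}")
        if unicodedata.category(ch) == "Cn":                   # item 6
            raise NetUnicodeError(f"unassigned U+{cp:04X} at index {i}")
        if ch == "\r":                                         # item 2
            nxt = text[i + 1:i + 2]
            if nxt not in ("\n", "\x00"):
                raise NetUnicodeError(f"bare CR at index {i}")
            if nxt == "\x00":
                warnings.append(f"CR NUL at index {i}")
        elif ch == "\n":
            if prev != "\r":
                warnings.append(f"LF without CR at index {i}")
        elif ch == "\x00" and prev == "\r":
            pass                                               # reported with its CR
        elif cp < 0x20 or cp == 0x7F:                          # item 3, other controls
            warnings.append(f"control U+{cp:04X} at index {i}")
    if unicodedata.normalize("NFC", text) != text:             # item 4
        warnings.append("text is not in NFC")
    return warnings


print(check_net_unicode("caf\u00e9\r\n".encode("utf-8")))
print(check_net_unicode("cafe\u0301\r\n".encode("utf-8")))
```

 The first call returns an empty list. The second returns a single
 warning, because the combining sequence is valid UTF-8 but is not in
 NFC.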
223
2313. Normalization
232
233 There are cases where strings of Unicode are fundamentally
234 equivalent, essentially representing the same text. These are called
235 "canonical equivalents" in the Unicode Standard. For example, the
236 following pairs of strings are canonically equivalent:
237
238 U+2126 OHM SIGN
239 U+03A9 GREEK CAPITAL LETTER OMEGA
240
241 U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
242 U+00E0 LATIN SMALL LETTER A WITH GRAVE
243
244 Comparison of strings becomes much easier if any such cases are
245 always represented by a single unique form. The Unicode Consortium
246 specifies a normalization form, known as NFC [NFC], which provides
247 the necessary mappings and mechanisms to convert all canonically
248 equivalent sequences to a single unique form. Typically, this form
249 produces precomposed characters for any sequences that can be
250 represented in that fashion. It also reorders other combining marks
251 so that they have a unique and unambiguous order.
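
 The two pairs of canonically equivalent strings given above can be
 checked directly. The following Python sketch uses the standard
 unicodedata module; the helper name nfc is hypothetical, and the
 results depend on the Unicode data bundled with the runtime, although
 these particular mappings have been stable for a long time.

```python
import unicodedata


def nfc(text: str) -> str:
    """Convert text to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


ohm, omega = "\u2126", "\u03a9"                 # OHM SIGN, GREEK CAPITAL OMEGA
decomposed, precomposed = "a\u0300", "\u00e0"   # a + grave, a with grave

# Raw comparison fails; comparison after NFC succeeds.
print(ohm == omega, nfc(ohm) == nfc(omega))                        # False True
print(decomposed == precomposed, nfc(decomposed) == precomposed)   # False True

# Combining marks typed in different orders normalize to one order.
print(nfc("a\u0301\u0323") == nfc("a\u0323\u0301"))                # True
```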
252
253 Of the various normalization forms defined as part of Unicode, NFC is
254 closest to actual use in practice, minimizes side-effects due to
255 considering characters equivalent that may not be equivalent in all
256 situations, and typically requires the least work when converting
257 from non-Unicode encodings.
258
 Section 2 requires that, except in very unusual circumstances, all
 Net-Unicode strings be transmitted in normalized form. However, some
 implementations of applications may rely on operating system
 libraries over which they have little control. That fact, together
 with adherence to the robustness principle, suggests that receivers
 of such strings should be prepared to receive unnormalized ones and
 should not react excessively when they do.
266
2674. Versions of Unicode
268
269 Unicode changes and expands over time. Large blocks of space are
270 reserved for future expansion. New versions, which appear at regular
271 intervals, add new scripts and characters. Occasionally they also
272 change some property definitions. In retrospect, one of the
273 advantages of ASCII [ASCII] when it was chosen was that the code
274 space was full when the Standard was first published. There was no
275 practical way to add characters or change code point assignments
276 without being obviously incompatible.
277
287 While there are some security issues if people deliberately try to
288 trick the system (see Section 6), Unicode version changes should not
289 have a significant impact on the text stream specification of this
290 document for the following reasons:
291
292 o The transformation between Unicode code table positions and the
293 corresponding UTF-8 code is algorithmic; it does not depend on
294 whether a code point has been assigned or not.
295
296 o The normalization recommended here, NFC (see Section 3), performs
297 a very limited set of mappings, much more limited than those of
298 the more extensive NFKC used in, e.g., Nameprep [RFC3491].
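
 The first point above can be made concrete: a UTF-8 encoder needs
 only bit operations on the code point value and never consults a
 table of assigned characters. The following Python sketch follows the
 bit layout of RFC 3629; the function name utf8_encode is hypothetical,
 and production code would normally use the built-in codec instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using bit manipulation alone."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The result agrees with the built-in codec whether or not the code
# point is assigned in the runtime's version of Unicode.
for cp in (0x41, 0xE0, 0x2126, 0x1F600, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
print(utf8_encode(0x2126).hex())
```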
299
300 The NFC tables may be updated over time as new characters are added,
301 but the Unicode Consortium has guaranteed the stability of all NFC
302 strings. That is, if a string does not contain any unassigned
303 characters, and it is normalized according to NFC, it will always be
304 normalized according to all future versions of the Unicode Standard.
305 The stability of the Net-Unicode format is thus guaranteed when any
306 implementation that converts text into Net-Unicode format does not
307 permit unassigned characters.
308
309 Because Unicode code points that are reserved for private use do not
310 have standard definitions or normalization interpretations, they
311 SHOULD be avoided in strings intended for Internet interchange.
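
 A converter that produces Net-Unicode could screen its input for both
 kinds of code point before transmission. The following Python sketch
 shows one way to do so. The function name disallowed_code_points is
 hypothetical, and the test relies on the general categories "Cn" and
 "Co" as reported by the Unicode data bundled with the runtime; "Cn"
 covers noncharacters as well as code points that are merely not yet
 assigned.

```python
import unicodedata


def disallowed_code_points(text: str) -> list:
    """List (index, code point, reason) for each problematic character."""
    found = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category == "Cn":
            # Unassigned in the runtime's Unicode version (or a noncharacter).
            found.append((i, ord(ch), "unassigned"))
        elif category == "Co":
            # Private use: no standard definition or normalization behavior.
            found.append((i, ord(ch), "private use"))
    return found


for index, cp, reason in disallowed_code_points("ok\ue000"):
    print(f"index {index}: U+{cp:04X} ({reason})")
```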
312
313 Were Unicode to be changed in a way that violated these assumptions,
314 i.e., that either invalidated the byte string order specified in RFC
315 3629 or that changed the stability of NFC as stated above, this
316 specification would not apply. Put differently, this specification
317 applies only to versions of Unicode starting with version 5.0 and
318 extending to, but not including, any version for which changes are
319 made in either the UTF-8 definition or to NFC stability. Such
320 changes would violate established Unicode policies and are hence
321 unlikely, but, should they occur, it would be necessary to evaluate
322 them for compatibility with this specification and other Internet
323 uses of NFC.
324
325 If the specification of a protocol references this one, strings that
326 are received by that protocol and that appear to be UTF-8 and are not
327 otherwise identified (e.g., by charset labeling) SHOULD be treated as
328 using UTF-8 in conformance with this specification.
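
 This document does not define what it means for a string to "appear
 to be UTF-8". One plausible test, sketched below in Python, is
 whether the octets pass strict UTF-8 decoding. The function names are
 hypothetical, and returning None for anything else is an assumption
 of this sketch that leaves the fallback policy to the referencing
 protocol.

```python
def appears_to_be_utf8(data: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def interpret_unlabeled(data: bytes):
    """Decode unlabeled octets as UTF-8 when they look like UTF-8."""
    if appears_to_be_utf8(data):
        return data.decode("utf-8")
    return None  # assumption: the caller decides what to do otherwise


print(interpret_unlabeled("na\u00efve".encode("utf-8")))
print(interpret_unlabeled(b"na\xefve"))  # a lone 0xEF is not valid UTF-8
```

 Pure ASCII input always passes this test, which is consistent with
 the compatibility properties that motivated the choice of UTF-8.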
329
3435. Applicability and Stability of this Specification
344
3455.1. Use in IETF Applications Specifications
346
 During the development of this specification, there was some
 confusion about where it would be useful given that, e.g., the
 individual MIME media types used in email and with HTTP have their
 own rules about UTF-8 character types and normalization, and the
 application transport protocols impose their own conventions about
 line endings. There are three answers.

 The first is that, in retrospect, it would have been better to have
 those protocols and content types standardized in the way specified
 here, even though it is certainly too late to change them at this
 time.

 The second is that several protocols depend on either the original
 Telnet design or other arrangements requiring a standard,
 interoperable, string definition without specific content-labels of
 one sort or another. Whois [RFC3912] is an example member of this
 group. As consideration is given to upgrading them for non-ASCII
 use, this specification provides a normative reference that offers
 the same stability that NVT has provided the ASCII forms.

 The third is that having a preferred standard Internet definition
 for Unicode text streams -- rather than just one for transmission
 codings -- may help improve the specification and interoperability
 of protocols to be developed in the future.

 This specification is intended for use by other specifications that
 have not yet defined how to use Unicode. It is not intended for use
 with specifications that already allow the use of UTF-8 and
 precisely define that use.
370
3715.2. Unicode Versions and Applicability
372
373 The IETF faces a practical dilemma with regard to versions of
374 Unicode. Each new version brings with it new characters and
375 sometimes new combining characters. Version 5.0 introduces the new
376 concept of sequences of characters named as if they were individual
 characters (see [NamedSequences]). The normalization represented by
 NFC is stable if all strings are transmitted and stored in normalized
 form, if corrections are never made to character definitions or
 normalization tables, and if unassigned code points are never used.
 The last condition is important because an unassigned code point always
 normalizes to itself. However, if the same code point is assigned to
383 a character in a future version, it may participate in some other
384 normalization mapping (some specific difficulties in this regard are
385 discussed in [RFC4690]). It is worth noting that transmission in
386 normalized form is not required by either the IETF's UTF-8 Standard
387 [RFC3629] or by standards dependent on the current version of
388 Stringprep [RFC3454].
389
399 All would be well with this as described in Section 4 except for one
400 problem: Applications typically do not perform their own conversions
401 to Unicode and may not perform their own normalizations but instead
402 rely on operating system or language library functions -- functions
403 that may be upgraded or otherwise changed without changes to the
404 application code itself. Consequently, there may be no plausible way
405 for an application to know which version of Unicode, or which version
406 of the normalization procedures, it is utilizing, nor is there any
407 way by which it can guarantee that the two will be consistent.
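
 Some language libraries do expose the version of the Unicode data
 they carry, although many environments do not. As a hedged
 illustration, the following Python sketch reads the
 unidata_version attribute of the standard unicodedata module; the
 helper name library_unicode_version is hypothetical. The value
 changes when the runtime is upgraded, with no change to application
 code, which is exactly the situation described above.

```python
import unicodedata


def library_unicode_version() -> tuple:
    """Return the runtime's Unicode data version as a tuple of integers."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


version = library_unicode_version()
print("Unicode data version in this runtime:", version)

# An application could at least refuse to run against data older than
# the version this specification assumes.
if version < (5, 0):
    raise RuntimeError("Unicode data predates version 5.0")
```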
408
 Because of per-version changes in definitions and tables, Stringprep
 and documents depending on it are now tied to Unicode Version 3.2
 [Unicode32]. Full interoperability of Internet Standard UTF-8
 [RFC3629], when used with normalization as specified here, depends
 on neither the normalization definitions nor the definition of UTF-8
 itself changing after Unicode Version 5.0. These assumptions seem
 fairly safe, but they are still assumptions. This specification is
 linked to the latest available version of Unicode, version 5.0
 [Unicode], and to broader concepts of version independence based on
 specific assumptions and conditions. It could reasonably have been
 tied instead, like Stringprep and Nameprep, to Unicode 3.2
 [Unicode32] or to some more recent intermediate version. However, in
 addition to the obvious disadvantages of having different IETF
 standards tied to different versions of Unicode, the library-based
 application implementation behavior described above makes these
 version linkages nearly meaningless in practice.
425
426 In theory, one can get around this problem in four ways:
427
428 1. Freeze on a particular version of Unicode and try to insist that
429 applications enforce that version by, e.g., containing lists of
430 unassigned characters and prohibiting their use. Of course, this
431 would prohibit evolution to include newly-added scripts and the
432 tables of unassigned code points would be cumbersome.
433
434 2. Require that every Unicode "text" string or file start with a
435 version indication, somewhat akin to the "byte order mark"
436 indicator. It is unlikely that this provision would be
437 practical. More important, it would require that each
438 application implementation be prepared to either support multiple
439 normalization tables and versions or that it reject text from
440 Unicode versions with which it was not prepared to deal.
441
442 3. Devise a different set of normalization rules that would, e.g.,
443 guarantee that no character assigned to a previously-unassigned
444 code point in Unicode was ever normalized to anything but itself
445 and use those rules instead of NFC. It is not clear whether or
446 not such a set of rules is possible or whether some other
455 completely stable set of rules could be devised, perhaps in
456 combination with restrictions on the ways in which characters
457 were added in future versions of Unicode.
458
459 4. Devise a normalization process that is otherwise equivalent to
460 NFC but that rejects code points that are unassigned in the
461 current version of Unicode, rather than mapping those code points
462 to themselves. This would still leave some risk of incompatible
463 corrections in Unicode and possibly a few edge cases, but it is
464 probably stable enough for Internet use in the overwhelming
465 number of cases. This process has been discussed in the Unicode
466 Consortium under the name "Stable NFC".
467
468 None of these approaches seems ideal: the ideal procedure would be as
469 stable and predictable as ASCII has been. But that level is simply
470 not feasible as long as Unicode continues to evolve by the addition
471 of new code points and scripts. The fourth option listed above
472 appears to be a reasonable compromise.
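
 The fourth option can be illustrated with a short Python sketch. It
 is an approximation written for this discussion, not the Unicode
 Consortium's definition of "Stable NFC": the function name stable_nfc
 is hypothetical, and "unassigned" is taken to mean general category
 "Cn" in the Unicode data bundled with the runtime.

```python
import unicodedata


def stable_nfc(text: str) -> str:
    """Normalize to NFC, rejecting code points unassigned in this runtime."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            # Ordinary NFC would map this code point to itself; a later
            # Unicode version might normalize it differently.
            raise ValueError(
                f"unassigned code point U+{ord(ch):04X} at index {i}")
    return unicodedata.normalize("NFC", text)


print(stable_nfc("a\u0300") == "\u00e0")   # True
```

 Because the check runs before normalization, a rejected string is
 never transmitted in a form whose normalization could later change.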
473
4746. Security Considerations
475
476 This specification provides a standard form for the use of Unicode as
477 "network text". Most of the same security issues that apply to
478 UTF-8, as discussed in [RFC3629], apply to it, although it should be
479 slightly less subject to some risks by virtue of requiring NFC
480 normalization and generally being somewhat more restrictive.
481 However, shifts in Unicode versions, as discussed in Section 5.2, may
482 introduce other security issues.
483
484 Programs that receive these streams should use extreme caution about
485 assuming that incoming data are normalized, since it might be
486 possible to use unnormalized forms, as well as invalid UTF-8, as part
487 of an attack. In particular, firewalls and other systems that
488 interpret UTF-8 streams should be developed with the clear knowledge
489 that an attacker may deliberately send unnormalized text, for
490 instance, to avoid detection by naive text-matching systems.
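
 The evasion described above is easy to reproduce. In the following
 Python sketch, a hypothetical blocklist holds a word in NFC; a sender
 who transmits the same word with a combining accent defeats a naive
 substring match, while a matcher that normalizes first does not miss
 it. The names BLOCKED, naive_match, and normalized_match are
 illustrative only.

```python
import unicodedata

BLOCKED = ["caf\u00e9"]   # hypothetical blocklist, stored in NFC


def naive_match(text: str) -> bool:
    """Compare code points as received, without normalization."""
    return any(word in text for word in BLOCKED)


def normalized_match(text: str) -> bool:
    """Normalize incoming text to NFC before comparing."""
    canonical = unicodedata.normalize("NFC", text)
    return any(word in canonical for word in BLOCKED)


evasive = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT
print(naive_match(evasive), normalized_match(evasive))   # False True
```

 Normalizing on receipt addresses only canonical equivalence. It does
 not by itself protect against invalid UTF-8, which has to be rejected
 during decoding.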
491
492 NVT contains a requirement, of necessity repeated here (see
493 Section 2), that the CR character be immediately followed by either
494 LF or ASCII NUL (an octet with all bits zero). NUL may be
495 problematic for some programming languages that use it as a string
496 terminator, and hence a trap for the unwary, unless caution is used.
497 This may be an additional reason to avoid the use of CR entirely,
498 except in sequence with LF, as suggested above.
499
500 The discussion about Unicode versions above (see Section 4 and
501 Section 5.2) makes several assumptions about future versions of
502 Unicode, about NFC normalization being applied properly, and about
511 UTF-8 being processed and transmitted exactly as specified in RFC
512 3629. If any of those assumptions are not correct, then there are
513 cases in which strings that would be considered equivalent do not
514 compare equal. Robust code should be prepared for those
515 possibilities.
516
5177. Acknowledgments
518
519 Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
520 suggestions about Unicode normalization that led to the format
521 described here, and especially to Mark for providing the paragraphs
 that describe the role of NFC. Thanks also to Mark, Doug Ewell, and
 Asmus Freytag for corrected text describing Unicode transmission
524 forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin
525 Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern
526 Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George
527 Michaelson, Chris Newman, and Marcos Sanz for a number of helpful
528 comments and clarification requests.
529
567Appendix A. History and Context
568
 This appendix contains a review of prior work in the ARPANET and
 Internet to establish a standard text type, work that establishes the
 context and motivation for the approach taken in this document. The
 text is explanatory rather than normative: nothing in this appendix is
 intended to change or update any current specification. Those who
 are uninterested in this review and analysis can safely skip it.
576
577 One of the earlier application design decisions made in the
578 development of ARPANET, a decision that was carried forward into the
579 Internet, was the decision to standardize on a single and very
580 specific coding for "text" to be passed across the network [RFC0020].
581 Hosts on the network were then responsible for translating or mapping
582 from whatever character coding conventions were used locally to that
583 common intermediate representation, with sending hosts mapping to it
584 and receiving ones mapping from it to their local forms as needed.
585 It is interesting to note that at the time the ARPANET was being
586 developed, participating host operating systems used at least three
587 different character coding standards: the antiquated BCD (Binary
588 Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
589 (Extended BCD Interchange Code), and the then-still emerging ASCII
590 (American Standard Code for Information Interchange). Since the
591 ARPANET was an "open" project and EBCDIC was intimately linked to a
592 particular hardware vendor, the original Network Working Group agreed
593 that its standard should be ASCII. That ASCII form was precisely
594 "7-bit ASCII in an 8-bit field", which was in effect a compromise
595 between hosts that were natively 7-bit oriented (e.g., with five
596 seven-bit characters in a 36-bit word), those that were 8-bit
597 oriented (using eight-bit characters) and those that placed the
598 seven-bit ASCII characters in 9-bit fields with two leading zero bits
599 (four characters in a 36-bit word).
600
601 More standardization was suggested in the first preliminary
602 description of the Telnet protocol [RFC0097]. With the iterations of
603 that protocol [RFC0137] [RFC0139] and the drawing together of an
604 essentially formal definition somewhat later [RFC0318], a standard
605 abstraction, the Network Virtual Terminal (NVT) was established. NVT
606 character-coding conventions (initially called "Telnet ASCII" and
607 later called "NVT ASCII", or, more casually, "network ASCII")
 included the requirement that Carriage Return followed by Line Feed
 (CRLF) be the common representation for ending lines of text and
 specified conventions for some other characters. The CRLF requirement
 reflected the fact that some participating "Host" operating systems
 used the one natively, some the other, at least one used both, and a
 few used neither (preferring variable-length lines with counts or
 special delimiters or markers instead). Also, since NVT ASCII was
 restricted to seven-bit
623 characters, use of the high-order bit in octets was reserved for the
624 transmission of control signaling information.
625
626 At a very high level, the concept was that a system could use
627 whatever character coding and line representations were appropriate
628 locally, but text transmitted over the network as text must conform
629 to the single "network virtual terminal" convention. Virtually all
630 early Internet protocols that presume transfer of "text" assume this
631 virtual terminal model, although different ones assume or limit it in
632 different ways. Telnet, the command stream and ASCII Type in FTP
633 [RFC0542], the message stream in SMTP transfer [RFC2821], and the
634 strings passed to finger [RFC0742] and whois [RFC0954] are the
635 classic examples. More recently, HTTP [RFC1945] [RFC2616] follows
636 the same general model but permits 8-bit data and leaves the line end
637 sequence unspecified (the latter has been the source of a significant
638 number of problems).
639
640Appendix B. The ASCII NVT Definition
641
642 The main body of this specification is intended as an update to, and
643 internationalized version of, the Net-ASCII definition. The
644 specification is self-contained in that parts of the Net-ASCII
645 definition that are no longer recommended are not included above.
646 Because Net-ASCII evolved somewhat over time and there has been
647 debate about which specification is the "official" Net-ASCII, it is
648 appropriate to review the key elements of that definition here. This
649 review is informal with regard to the contents of Net-ASCII and
650 should not be considered as a normative update or summary of the
651 earlier specifications (Section 2 does specify some normative updates
652 to those specifications and some comments below are consistent with
653 it).
654
655 The first part of the section titled "THE NVT PRINTER AND KEYBOARD"
656 in RFC 854 [RFC0854] is generally, although not universally,
657 considered to be the normative definition of the (ASCII) Network
658 Virtual Terminal and hence of Net-ASCII. It includes not only the
659 graphic ASCII characters but a number of control characters. The
660 latter are given Internet-specific meanings that are often more
661 specific than the definitions in the ASCII specification. In today's
662 usage, and for the present specification, the following
663 clarifications and updates to that list should be noted. Each one is
664 accompanied by a brief explanation of the reason why the original
665 specification is no longer appropriate.
666
667 1. The "defined but not required" codes -- BEL (U+0007), BS
668 (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the
669 undefined control codes ("C0") SHOULD NOT be used unless required
670 by exceptional circumstances. Either their original "network
679 printer" definitions are no longer in general use, common
680 practice has evolved away from the formats specified there, or
681 their use to simulate characters that are better handled by
682 Unicode is no longer appropriate. While the appearance of some
683 of these characters on the list may seem surprising, BS now has
684 an ambiguous interpretation in practice (erasing in some systems
685 but not in others), the width associated with HT varies with the
686 environment, and VT and FF do not have a uniform effect with
687 regard to either vertical positioning or the associated
688 horizontal position result. Of course, telnet escapes are not
689 considered part of the data stream and hence are unaffected by
690 this provision.
691
692 2. In Net-ASCII, CR MUST NOT appear except when immediately followed
693 by either NUL or LF, with the latter (CR LF) designating the "new
694 line" function. Today and as specified above, CR should
695 generally appear only when followed by LF. Because page layout
696 is better done in other ways, because NUL has a special
697 interpretation in some programming languages, and to avoid other
698 types of confusion, CR NUL should preferably be avoided as
699 specified above.
700
701 3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
702 sequences (e.g., CR LF CR LF).
703
704 4. The historical NVT documents do not call out either "bare LF" (LF
705 without CR) or HT for special treatment. Both have generally
706 been understood to be problematic. In the case of LF, there is a
707 difference in interpretation as to whether its semantics imply
708 "go to same position on the next line" or "go to the first
709 position on the next line" and interoperability considerations
710 suggest not depending on which interpretation the receiver
711 applies. At the same time, misinterpretation of LF is less
712 harmful than misinterpretation of "bare" CR: in the CR case, text
713 may be erased or made completely unreadable; in the LF one, the
714 worst consequence is a very funny-looking display. Obviously, HT
715 is problematic because there is no standard way to transmit
716 intended tab position or width information in running text.
717 Again, the harm is unlikely to be great if HT is simply
718 interpreted as one or more spaces, but, in general, it cannot be
719 relied upon to format information.
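
   The four points above can be approximated by a simple scan of a
   decoded string.  The following Python sketch is illustrative only
   and is not part of this specification; the function name
   find_line_and_control_issues and the wording of its messages are
   assumptions introduced for the example.  It flags the discouraged
   C0 controls of item 1, CR NUL and any other CR not followed by LF
   (item 2), and bare LF (item 4).  An LF CR pair that is not part of
   CR LF CR LF (item 3) necessarily contains a bare LF or a CR that is
   not followed by LF, so it needs no separate rule.

```python
import re

# Item 1: BEL, BS, HT, VT, FF and the undefined C0 codes.
# NUL, LF, and CR are deliberately excluded from this class.
DISCOURAGED_C0 = re.compile(r"[\x01-\x09\x0b\x0c\x0e-\x1f]")
CR_NUL = re.compile(r"\r\x00")           # item 2: preferably avoided
BARE_CR = re.compile(r"\r(?![\n\x00])")  # CR followed by neither LF nor NUL
BARE_LF = re.compile(r"(?<!\r)\n")       # item 4: LF not preceded by CR


def find_line_and_control_issues(text):
    """Return a sorted list of (offset, description) pairs for `text`."""
    issues = []
    for m in DISCOURAGED_C0.finditer(text):
        issues.append((m.start(), "discouraged control U+%04X" % ord(m.group())))
    for m in CR_NUL.finditer(text):
        issues.append((m.start(), "CR NUL"))
    for m in BARE_CR.finditer(text):
        issues.append((m.start(), "CR not followed by LF"))
    for m in BARE_LF.finditer(text):
        issues.append((m.start(), "bare LF"))
    return sorted(issues)
```

   A string that uses only CR LF line endings and no other control
   codes, such as "a\r\nb\r\n\r\n", yields an empty list.  Whether a
   flagged item should be rejected, repaired, or passed through is a
   policy decision that this sketch leaves to the caller.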
720
721 It is worth noting that the telnet IAC character (an octet consisting
722 of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that
723 particular octet cannot appear in a valid UTF-8 string. However,
724 while few of them have been used, telnet permits other command-
725 introducer characters whose bit sequences in an octet may be part of
726 valid UTF-8 characters. While it causes no ambiguity in UTF-8,
   Unicode assigns a graphic character ("Latin Small Letter Y with
   Diaeresis") to U+00FF (octets C3 BF in UTF-8).  Some caution is
   clearly in order in this area.
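
   The distinction between the octet FF and the code point U+00FF can
   be checked directly.  The Python sketch below is illustrative only;
   the helper names encodes_without_iac and iac_decodes_as_utf8 are
   assumptions made for the example.  It shows that the UTF-8 encoding
   of U+00FF is a two-octet sequence that does not contain FF, and that
   a lone FF octet is rejected by a UTF-8 decoder.

```python
IAC = 0xFF  # telnet "Interpret As Command" octet


def encodes_without_iac(text):
    """True if the UTF-8 form of `text` contains no FF octet."""
    return IAC not in text.encode("utf-8")


def iac_decodes_as_utf8():
    """True only if a lone FF octet were accepted as UTF-8."""
    try:
        bytes([IAC]).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


y_diaeresis = "\u00ff"                  # Latin Small Letter Y with Diaeresis
encoded = y_diaeresis.encode("utf-8")   # two octets, neither of them FF
```

   Here encoded.hex() is "c3bf", encodes_without_iac(y_diaeresis) is
   True, and iac_decodes_as_utf8() is False.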
738
739Appendix C. The Line-Ending Problem
740
741 The definition of how a line ending should be denoted in plain text
742 strings on the wire for the Internet has been controversial from even
743 before the introduction of NVT. Some have argued that recipients
744 should be required to interpret almost anything that a sender might
745 intend as a line ending as actually a line ending. Others have
746 pointed out that this would lead to some ambiguities of
747 interpretation and presentation and would violate the principle that
748 we should minimize the number of forms that are permitted on the wire
749 in order to promote interoperability and eliminate the "every
750 recipient needs to understand every sender format" problem. The
751 design of this specification, like that of NVT, takes the latter
752 approach. Its designers believe that there is little point in a
753 standard if it is to specify "anyone can do whatever they like and
754 the receiver just needs to cope".
755
   A further discussion of the nature and evolution of the line-ending
   problem appears in Section 5.8 of the Unicode Standard [Unicode] and
   is suggested for additional reading.  If we were starting with the
   Internet today, it would probably be sensible to follow the
   recommendation there and use LS (U+2028) exclusively, in preference
   to CRLF.  However, the installed base of use of CRLF and the
   importance of forward compatibility with NVT and protocols that
   assume it make that impossible, so it is necessary to continue using
   CRLF as the "New Line Function" ("NLF", see the terminology section
   in that reference).
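
   Because only one form is permitted on the wire, the burden of
   conversion falls on the sender.  The Python sketch below is one
   possible way to do that and is not part of this specification; the
   function name to_wire_line_endings is an assumption, as is the
   choice to treat a lone CR or a lone LF in locally stored text as a
   line ending.  That choice is reasonable only when the input comes
   from a local file or editor, not when it is already wire-format
   text.

```python
import re

# CRLF is listed first so that an existing CR LF pair is matched as a
# single unit instead of being expanded twice.
_LOCAL_NEWLINES = re.compile(r"\r\n|\r|\n")


def to_wire_line_endings(text):
    """Rewrite local line endings in `text` as CR LF."""
    return _LOCAL_NEWLINES.sub("\r\n", text)


def to_wire_octets(text):
    """Convert line endings, then encode as UTF-8 for transmission."""
    return to_wire_line_endings(text).encode("utf-8")
```

   The conversion is idempotent: applying to_wire_line_endings to text
   that already uses CR LF throughout leaves it unchanged.  The sketch
   deals only with line endings and does not perform the
   normalization that this specification also calls for.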
766
767Appendix D. A Note about Related Future Work
768
769 Consideration should be given to a Telnet (or SSH [RFC4251]) option
770 to specify this type of stream and an FTP extension [RFC0959] to
771 permit a new "Unicode text" data TYPE.
772
791References
792
793Normative References
794
   [ISO10646]       International Organization for Standardization,
                    "Information Technology - Universal Multiple-Octet
                    Coded Character Set (UCS) - Part 1: Architecture
                    and Basic Multilingual Plane",
                    ISO/IEC 10646-1:2000, October 2000.
800
801 [NFC] Davis, M. and M. Duerst, "Unicode Standard Annex
802 #15: Unicode Normalization Forms", October 2006,
803 <http://www.unicode.org/reports/tr15/>.
804
805 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
806 Requirement Levels", BCP 14, RFC 2119, March 1997.
807
808 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
809 10646", STD 63, RFC 3629, November 2003.
810
811 [RFC5234] Crocker, D. and P. Overell, "Augmented BNF for
812 Syntax Specifications: ABNF", STD 68, RFC 5234,
813 January 2008.
814
   [Unicode]        The Unicode Consortium, "The Unicode Standard,
                    Version 5.0", Boston, MA, USA: Addison-Wesley,
                    ISBN 0-321-48091-0, 2007.
820
821 [Unicode32] The Unicode Consortium, "The Unicode Standard,
822 Version 3.0", 2000.
823
824 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-
825 61633-5). Version 3.2 consists of the definition
826 in that book as amended by the Unicode Standard
827 Annex #27: Unicode 3.1
828 (http://www.unicode.org/reports/tr27/) and by the
829 Unicode Standard Annex #28: Unicode 3.2
830 (http://www.unicode.org/reports/tr28/).
831
847Informative References
848
   [ASCII]          American National Standards Institute (formerly
                    United States of America Standards Institute),
                    "USA Code for Information Interchange",
                    ANSI X3.4-1968, 1968.

                    ANSI X3.4-1968 has been replaced by newer versions
                    with slight modifications, but the 1968 version
                    remains definitive for the Internet.  ISO 646
                    International Reference Version (IRV)
                    [ISO.646.1991] is usually considered equivalent to
                    ASCII.
860
861 [ISO.646.1991] International Organization for Standardization,
862 "Information technology - ISO 7-bit coded character
863 set for information interchange", ISO Standard 646,
864 1991.
865
   [NamedSequences] The Unicode Consortium, "NamedSequences-4.1.0.txt",
                    2005,
                    <http://www.unicode.org/Public/UNIDATA/NamedSequences.txt>.
869
870 [RFC0020] Cerf, V., "ASCII format for network interchange",
871 RFC 20, October 1969.
872
873 [RFC0097] Melvin, J. and R. Watson, "First Cut at a Proposed
874 Telnet Protocol", RFC 97, February 1971.
875
876 [RFC0137] O'Sullivan, T., "Telnet Protocol - a proposed
877 document", RFC 137, April 1971.
878
879 [RFC0139] O'Sullivan, T., "Discussion of Telnet Protocol",
880 RFC 139, May 1971.
881
882 [RFC0318] Postel, J., "Telnet Protocols", RFC 318,
883 April 1972.
884
885 [RFC0542] Neigus, N., "File Transfer Protocol", RFC 542,
886 August 1973.
887
888 [RFC0698] Mock, T., "Telnet extended ASCII option", RFC 698,
889 July 1975.
890
891 [RFC0742] Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
892 December 1977.
893
903 [RFC0854] Postel, J. and J. Reynolds, "Telnet Protocol
904 Specification", STD 8, RFC 854, May 1983.
905
906 [RFC0954] Harrenstien, K., Stahl, M., and E. Feinler,
907 "NICNAME/WHOIS", RFC 954, October 1985.
908
909 [RFC0959] Postel, J. and J. Reynolds, "File Transfer
910 Protocol", STD 9, RFC 959, October 1985.
911
912 [RFC1945] Berners-Lee, T., Fielding, R., and H. Nielsen,
913 "Hypertext Transfer Protocol -- HTTP/1.0",
914 RFC 1945, May 1996.
915
916 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
917 Languages", BCP 18, RFC 2277, January 1998.
918
919 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
920 Masinter, L., Leach, P., and T. Berners-Lee,
921 "Hypertext Transfer Protocol -- HTTP/1.1",
922 RFC 2616, June 1999.
923
924 [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
925 ISO 10646", RFC 2781, February 2000.
926
927 [RFC2821] Klensin, J., "Simple Mail Transfer Protocol",
928 RFC 2821, April 2001.
929
930 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
931 Internationalized Strings ("stringprep")",
932 RFC 3454, December 2002.
933
934 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A
935 Stringprep Profile for Internationalized Domain
936 Names (IDN)", RFC 3491, March 2003.
937
938 [RFC3912] Daigle, L., "WHOIS Protocol Specification",
939 RFC 3912, September 2004.
940
941 [RFC4251] Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
942 Protocol Architecture", RFC 4251, January 2006.
943
944 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB,
945 "Review and Recommendations for Internationalized
946 Domain Names (IDNs)", RFC 4690, September 2006.
947
959Authors' Addresses
960
961 John C Klensin
962 1770 Massachusetts Ave, #322
963 Cambridge, MA 02140
964 USA
965
966 Phone: +1 617 491 5735
967 EMail: john-ietf@jck.com
968
969
970 Michael A. Padlipsky
971 8011 Stewart Ave.
972 Los Angeles, CA 90045
973 USA
974
975 Phone: +1 310-670-4288
976 EMail: the.map@alum.mit.edu
977
1015Full Copyright Statement
1016
1017 Copyright (C) The IETF Trust (2008).
1018
1019 This document is subject to the rights, licenses and restrictions
1020 contained in BCP 78, and except as set forth therein, the authors
1021 retain all their rights.
1022
1023 This document and the information contained herein are provided on an
1024 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1025 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
1026 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
1027 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
1028 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1029 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1030
1031Intellectual Property
1032
1033 The IETF takes no position regarding the validity or scope of any
1034 Intellectual Property Rights or other rights that might be claimed to
1035 pertain to the implementation or use of the technology described in
1036 this document or the extent to which any license under such rights
1037 might or might not be available; nor does it represent that it has
1038 made any independent effort to identify any such rights. Information
1039 on the procedures with respect to rights in RFC documents can be
1040 found in BCP 78 and BCP 79.
1041
1042 Copies of IPR disclosures made to the IETF Secretariat and any
1043 assurances of licenses to be made available, or the result of an
1044 attempt made to obtain a general license or permission for the use of
1045 such proprietary rights by implementers or users of this
1046 specification can be obtained from the IETF on-line IPR repository at
1047 http://www.ietf.org/ipr.
1048
1049 The IETF invites any interested party to bring to its attention any
1050 copyrights, patents or patent applications, or other proprietary
1051 rights that may cover technology that may be required to implement
1052 this standard. Please address the information to the IETF at
1053 ietf-ipr@ietf.org.