Network Working Group                                       D. Goldsmith
Request for Comments: 2152                          Apple Computer, Inc.
Obsoletes: RFC 1642                                             M. Davis
Category: Informational                                   Taligent, Inc.

A Mail-Safe Transformation Format of Unicode
20 This memo provides information for the Internet community. This memo
21 does not specify an Internet standard of any kind. Distribution of
22 this memo is unlimited.
The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
amended) jointly define a character set (hereafter referred to as
Unicode) which encompasses most of the world's writing systems.
However, Internet mail (STD 11, RFC 822) currently supports only
7-bit US-ASCII as a character set. MIME (RFC 2045 through 2049) extends
Internet mail to support different media types and character sets,
and thus could support Unicode in mail messages. MIME neither defines
Unicode as a permitted character set nor specifies how it would be
encoded, although it does provide for the registration of additional
character sets over time.
37 This document describes a transformation format of Unicode that
38 contains only 7-bit ASCII octets and is intended to be readable by
39 humans in the limiting case that the document consists of characters
40 from the US-ASCII repertoire. It also specifies how this
41 transformation format is used in the context of MIME and RFC 1641,
42 "Using Unicode with MIME".
Although other transformation formats of Unicode exist and could
conceivably be used in this context (most notably UTF-8, also known
as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
octets in the range decimal 128 through 255 to encode Unicode
characters outside the US-ASCII range. Thus, in the context of mail,
those octets must themselves be encoded. This requires putting text
through two successive encoding processes, and leads to a significant
expansion of characters outside the US-ASCII range, putting
non-English speakers at a disadvantage. For example, using UTF-8 together
58Goldsmith & Davis Informational [Page 1]
60RFC 2152 UTF-7 May 1997
63 with the Quoted-Printable content transfer encoding of MIME
64 represents US-ASCII characters in one octet, but other characters may
65 require up to nine octets.
UTF-7 encodes Unicode characters as US-ASCII, together with
shift sequences to encode characters outside that range. For this
purpose, one of the characters in the US-ASCII repertoire is reserved
for use as a shift character.
Many mail gateways and systems cannot handle the entire US-ASCII
character set (those based on EBCDIC, for example), and so UTF-7
contains provisions for encoding characters within US-ASCII in a way
that all mail systems can accommodate.
UTF-7 should normally be used only in the context of 7 bit
transports, such as mail. In other contexts, straight Unicode or
UTF-8 is preferred.
83 See RFC 1641, "Using Unicode with MIME" for the overall specification
84 on usage of Unicode transformation formats with MIME.
88 First, the definition of Unicode:
90 The 16 bit character set Unicode is defined by "The Unicode
91 Standard, Version 2.0". This character set is identical with the
92 character repertoire and coding of the international standard
93 ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
94 Subset=300; Implementation Level=3, including the first 7
95 amendments to 10646 plus editorial corrections.
97 Note. Unicode 2.0 further specifies the use and interaction of
98 these character codes beyond the ISO standard. However, any valid
99 10646 sequence is a valid Unicode sequence, and vice versa;
100 Unicode supplies interpretations of sequences on which the ISO
101 standard is silent as to interpretation.
103 Next, some handy definitions of US-ASCII character subsets:
105 Set D (directly encoded characters) consists of the following
106 characters (derived from RFC 1521, Appendix B, which no longer
107 appears in RFC 2045): the upper and lower case letters A through Z
108 and a through z, the 10 digits 0-9, and the following nine special
109 characters (note that "+" and "=" are omitted):
119 Character ASCII & Unicode Value (decimal)
130 Set O (optional direct characters) consists of the following
131 characters (note that "\" and "~" are omitted):
133 Character ASCII & Unicode Value (decimal)
155 Rationale. The characters "\" and "~" are omitted because they are
156 often redefined in variants of ASCII.
158 Set B (Modified Base 64) is the set of characters in the Base64
159 alphabet defined in RFC 2045, excluding the pad character "="
Rationale. The pad character "=" is excluded because UTF-7 is designed
for use within header fields as set forth in RFC 2047. Since the only
readable encoding in RFC 2047 is "Q" (based on RFC 2045's
Quoted-Printable), the "=" character is not available for use (without a lot
of escape sequences). This was very unfortunate but unavoidable. The
"=" character could otherwise have been used as the UTF-7 escape
character as well (rather than using "+").
183 Note that all characters in US-ASCII have the same value in Unicode
184 when zero-extended to 16 bits.
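
The three character sets can be written down compactly as a sketch. The tables listing the special characters of Set D and Set O are not reproduced in this copy, so the literals below are an assumption taken from the published RFC 2152 tables and should be checked against them. The names SET_D, SET_O and SET_B are illustrative only.

```python
import string

# Set D: letters, digits, and nine special characters.
# "+" and "=" are deliberately left out.
# ASSUMPTION: the nine specials follow the published RFC 2152 table.
SET_D = frozenset(string.ascii_letters + string.digits + "'(),-./:?")

# Set O: optional direct characters. "\" and "~" are deliberately left out.
# ASSUMPTION: this list follows the published RFC 2152 table.
SET_O = frozenset('!"#$%&*;<=>@[]^_`{|}')

# Set B: the Base64 alphabet of RFC 2045 without the pad character "=".
SET_B = frozenset(string.ascii_letters + string.digits + "+/")

if __name__ == "__main__":
    print(len(SET_D), len(SET_O), len(SET_B))
```

Keeping the sets as frozensets makes the membership tests used by an encoder or decoder cheap and keeps the definitions immutable.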
A UTF-7 stream represents 16-bit Unicode characters using 7-bit
US-ASCII octets as follows:
Rule 1: (direct encoding) Unicode characters in set D above may be
encoded directly as their ASCII equivalents. Unicode characters in
Set O may optionally be encoded directly as their ASCII
equivalents, bearing in mind that many of these characters are
illegal in header fields, or may not pass correctly through some
mail gateways.
Rule 2: (Unicode shifted encoding) Any Unicode character sequence
may be encoded using a sequence of characters in set B, when
preceded by the shift character "+" (US-ASCII character value
decimal 43). The "+" signals that subsequent octets are to be
interpreted as elements of the Modified Base64 alphabet until a
character not in that alphabet is encountered. Such characters
include control characters such as carriage returns and line
feeds; thus, a Unicode shifted sequence always terminates at the
end of a line. As a special case, if the sequence terminates with the
character "-" (US-ASCII decimal 45) then that character is
absorbed; other terminating characters are not absorbed and are
processed normally.
211 Note that if the first character after the shifted sequence is "-"
212 then an extra "-" must be present to terminate the shifted
213 sequence so that the actual "-" is not itself absorbed.
215 Rationale. A terminating character is necessary for cases where
216 the next character after the Modified Base64 sequence is part of
217 character set B or is itself the terminating character. It can
218 also enhance readability by delimiting encoded sequences.
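
The terminator rule can be sketched as a small predicate. The function name needs_explicit_terminator is hypothetical, and treating the end of the text as a case where the "-" is optional is this sketch's reading of the rule rather than a requirement stated here.

```python
import string

# Set B: the Modified Base64 alphabet (Base64 without "=").
SET_B = frozenset(string.ascii_letters + string.digits + "+/")


def needs_explicit_terminator(next_char):
    """Return True if a "-" must be written after a shifted sequence.

    next_char is the character that follows the shifted sequence in the
    encoded output, or None when the text ends there.
    """
    if next_char is None:
        # Nothing follows, so nothing can be misread as Base64 (assumption).
        return False
    # A following Set B character would be swallowed into the sequence,
    # and a following "-" would be absorbed as the terminator.
    return next_char in SET_B or next_char == "-"
```

For instance, a shifted sequence followed by "1" or "-" needs the extra "-", while one followed by "." or a line break does not.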
Also as a special case, the sequence "+-" may be used to encode
the character "+". A "+" character followed immediately by any
character other than members of set B or "-" is an ill-formed
sequence.
236 Unicode is encoded using Modified Base64 by first converting
237 Unicode 16-bit quantities to an octet stream (with the most
238 significant octet first). Surrogate pairs (UTF-16) are converted
239 by treating each half of the pair as a separate 16 bit quantity
240 (i.e., no special treatment). Text with an odd number of octets is
241 ill-formed. ISO 10646 characters outside the range addressable via
242 surrogate pairs cannot be encoded.
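
As an illustration of the octet-stream step, the following sketch relies on Python's "utf-16-be" codec, which writes the most significant octet first and emits each half of a surrogate pair as its own 16-bit quantity. The helper names to_octets and from_octets are hypothetical.

```python
def to_octets(text: str) -> bytes:
    """Serialize text as 16-bit quantities, most significant octet first."""
    return text.encode("utf-16-be")


def from_octets(octets: bytes) -> str:
    """Rebuild text from the octet stream, rejecting an odd octet count."""
    if len(octets) % 2:
        raise ValueError("ill-formed: odd number of octets")
    return octets.decode("utf-16-be")


if __name__ == "__main__":
    # NOT IDENTICAL TO (2262) followed by ALPHA (0391)
    print(to_octets("\u2262\u0391").hex())
    # A character above FFFF becomes two 16-bit surrogate halves.
    print(to_octets("\U00010000").hex())
```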
Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
in the UCS-2 form are serialized as octets, the most significant
octet appears first. This is also in keeping with common network
practice of choosing a canonical format for transmission.
249 Rationale. The policy for code point allocation within ISO 10646
250 and Unicode is that the repertoires be kept synchronized. No code
251 points will be allocated in ISO 10646 outside the range
252 addressable by surrogate pairs.
254 Next, the octet stream is encoded by applying the Base64 content
255 transfer encoding algorithm as defined in RFC 2045, modified to
256 omit the "=" pad character. Instead, when encoding, zero bits are
257 added to pad to a Base64 character boundary. When decoding, any
258 bits at the end of the Modified Base64 sequence that do not
259 constitute a complete 16-bit Unicode character are discarded. If
260 such discarded bits are non-zero the sequence is ill-formed.
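
A minimal sketch of Modified Base64 follows. It assumes the input octets already hold whole 16-bit characters; the names mb64_encode and mb64_decode are illustrative and are not defined by this memo.

```python
import base64
import string

# Base64 alphabet in value order (0..63), without the "=" pad character.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")


def mb64_encode(octets: bytes) -> str:
    """Base64-encode and drop the "=" padding.

    Standard Base64 already fills the last character with zero bits,
    so only the trailing pad characters need to be removed.
    """
    return base64.b64encode(octets).decode("ascii").rstrip("=")


def mb64_decode(chars: str) -> bytes:
    """Decode Modified Base64 into octets holding whole 16-bit units."""
    value = 0
    for c in chars:
        value = (value << 6) | ALPHABET.index(c)  # ValueError if not in Set B
    total_bits = 6 * len(chars)
    leftover = total_bits % 16          # bits that do not complete a character
    if value & ((1 << leftover) - 1):
        raise ValueError("ill-formed: discarded bits are non-zero")
    units = total_bits // 16
    return (value >> leftover).to_bytes(2 * units, "big")
```

Accumulating the bits in one integer keeps the discard check to a single mask, at the cost of being unsuitable for very long sequences.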
262 Rationale. The pad character "=" is not used when encoding
263 Modified Base64 because of the conflict with its use as an escape
264 character for the Q content transfer encoding in RFC 2047 header
265 fields, as mentioned above.
267 Rule 3: The space (decimal 32), tab (decimal 9), carriage return
268 (decimal 13), and line feed (decimal 10) characters may be
269 directly represented by their ASCII equivalents. However, note
270 that MIME content transfer encodings have rules concerning the use
271 of such characters. Usage that does not conform to the
272 restrictions of RFC 822, for example, would have to be encoded
273 using MIME content transfer encodings other than 7bit or 8bit,
274 such as quoted-printable, binary, or base64.
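
Taken together, Rules 1 through 3 suggest a decoder along the following lines. This is a sketch under stated assumptions, not a reference implementation: the name utf7_decode is hypothetical, every octet other than "+" is passed through unchanged, and no check is made that directly encoded characters belong to Set D or Set O.

```python
import string

# Base64 alphabet in value order, without "=" (Set B).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def _decode_run(run: str) -> str:
    """Decode one Modified Base64 run into 16-bit Unicode characters."""
    value = 0
    for c in run:
        value = (value << 6) | B64.index(c)
    leftover = (6 * len(run)) % 16
    if value & ((1 << leftover) - 1):
        raise ValueError("ill-formed: discarded bits are non-zero")
    units = (6 * len(run)) // 16
    return (value >> leftover).to_bytes(2 * units, "big").decode("utf-16-be")


def utf7_decode(data: str) -> str:
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch != "+":
            out.append(ch)              # Rules 1 and 3: direct characters
            i += 1
            continue
        j = i + 1
        while j < len(data) and data[j] in B64:
            j += 1                      # Rule 2: collect the shifted run
        run = data[i + 1:j]
        if not run:
            if j < len(data) and data[j] == "-":
                out.append("+")         # "+-" encodes a literal "+"
                i = j + 1
                continue
            raise ValueError('ill-formed: "+" not followed by Set B or "-"')
        out.append(_decode_run(run))
        if j < len(data) and data[j] == "-":
            j += 1                      # a terminating "-" is absorbed
        i = j
    return "".join(out)
```

Any other terminating character is left in place and handled by the next loop iteration, which matches the rule that only "-" is absorbed.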
Given this set of rules, Unicode characters which may be encoded via
rules 1 or 3 take one octet per character, and other Unicode
characters are encoded on average with 2 2/3 octets per character
plus one octet to switch into Modified Base64 and an optional octet
to switch out.
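
The 2 2/3 figure is simply 16 bits per character divided by 6 bits per Modified Base64 character. A small sketch of the resulting cost of one shifted sequence, with a hypothetical helper name, might look like this:

```python
from fractions import Fraction

# 16 bits per Unicode character, 6 bits per Modified Base64 character.
OCTETS_PER_CHAR = Fraction(16, 6)   # equals 8/3, i.e. 2 2/3


def shifted_cost(n_chars: int, explicit_terminator: bool = True) -> int:
    """Octets used by one shifted sequence of n_chars 16-bit characters."""
    bits = 16 * n_chars
    base64_chars = -(-bits // 6)    # ceiling: zero bits pad the last character
    shift_in = 1                    # the "+" character
    shift_out = 1 if explicit_terminator else 0   # the optional "-"
    return shift_in + base64_chars + shift_out


if __name__ == "__main__":
    for n in (1, 2, 3, 30):
        print(n, shifted_cost(n), float(shifted_cost(n)) / n)
```

For short runs the shift characters and the padding dominate; the per-character cost approaches 2 2/3 only as the run grows.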
Example. The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>."
(hexadecimal 0041,2262,0391,002E) may be encoded as follows:

A+ImIDkQ.

Example. The Unicode sequence "Hi Mom -<WHITE SMILING FACE>-!"
(hexadecimal 0048, 0069, 0020, 004D, 006F, 006D, 0020, 002D, 263A,
002D, 0021) may be encoded as follows:

Hi Mom -+Jjo--!

Example. The Unicode sequence representing the Han characters for
the Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) may be
encoded as follows:

+ZeVnLIqe-
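
The three example sequences above can be worked through with a small encoder sketch. Several choices here are assumptions rather than requirements: the Set D and Set O literals follow the published RFC 2152 tables, Set O characters are written directly unless use_set_o is False, and a shifted sequence that ends the text is always closed with "-". The name utf7_encode is hypothetical.

```python
import base64
import string

# ASSUMPTION: special characters as listed in the published RFC 2152 tables.
SET_D = frozenset(string.ascii_letters + string.digits + "'(),-./:?")
SET_O = frozenset('!"#$%&*;<=>@[]^_`{|}')
SET_B = frozenset(string.ascii_letters + string.digits + "+/")
RULE_3 = frozenset(" \t\r\n")


def utf7_encode(text: str, use_set_o: bool = True) -> str:
    direct = SET_D | RULE_3 | (SET_O if use_set_o else frozenset())
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in direct:
            out.append(ch)                  # Rules 1 and 3
            i += 1
        elif ch == "+":
            out.append("+-")                # special case for "+"
            i += 1
        else:
            # Rule 2: gather the whole run of characters that need shifting.
            j = i
            while j < len(text) and text[j] not in direct and text[j] != "+":
                j += 1
            octets = text[i:j].encode("utf-16-be")
            run = base64.b64encode(octets).decode("ascii").rstrip("=")
            out.append("+" + run)
            nxt = text[j] if j < len(text) else None
            # Close with "-" when the next character would otherwise be
            # read as Base64 or absorbed, and (by choice) at end of text.
            if nxt is None or nxt in SET_B or nxt == "-":
                out.append("-")
            i = j
    return "".join(out)


if __name__ == "__main__":
    print(utf7_encode("A\u2262\u0391."))
    print(utf7_encode("Hi Mom -\u263a-!"))
    print(utf7_encode("\u65e5\u672c\u8a9e"))
```

Passing use_set_o=False shifts characters such as "@" and the double quote as well, which is the conservative choice for gateways that may not pass Set O characters.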
Use of Character Set UTF-7 Within MIME
309 Character set UTF-7 is safe for mail transmission and therefore may
310 be used with any content transfer encoding in MIME (except where line
311 length and line break restrictions are violated). Specifically, the 7
312 bit encoding for bodies and the Q encoding for headers are both
313 acceptable. The MIME character set tag is UTF-7. This signifies any
314 version of Unicode equal to or greater than 2.0.
316 Example. Here is a text portion of a MIME message containing the
317 Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (hexadecimal 0048,
318 0069, 0020, 004D, 006F, 006D, 0020, 263A, 0021).
320 Content-Type: text/plain; charset=UTF-7
324 Example. Here is a text portion of a MIME message containing the
325 Unicode sequence representing the Han characters for the Japanese
326 word "nihongo" (hexadecimal 65E5,672C,8A9E).
328 Content-Type: text/plain; charset=UTF-7
332 Example. Here is a text portion of a MIME message containing the
333 Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (hexadecimal
334 0041,2262,0391,002E).
343 Content-Type: text/plain; charset=utf-7
Example. Here is a text portion of a MIME message containing the
Unicode sequence "Item 3 is <POUND SIGN>1." (hexadecimal 0049,
0074, 0065, 006D, 0020, 0033, 0020, 0069, 0073, 0020, 00A3, 0031,
002E).
352 Content-Type: text/plain; charset=UTF-7
356 Note that to achieve the best interoperability with systems that may
357 not support Unicode or MIME, when preparing text for mail
358 transmission line breaks should follow Internet conventions. This
359 means that lines should be short and terminated with the proper SMTP
360 CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and
361 PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP
362 line breaks. Ideally, this would be handled transparently by a
363 Unicode-aware user agent.
365 This preparation is not absolutely necessary, since UTF-7 and the
366 appropriate MIME content transfer encoding can handle text that does
367 not follow Internet conventions, but readability by systems without
368 Unicode or MIME will be impaired. See RFC 2045 for a discussion of
369 mail interoperability issues.
371 Lines should never be broken in the middle of a UTF-7 shifted
372 sequence, since such sequences may not cross line breaks. Therefore,
373 UTF-7 encoding should take place after line breaking. If a line
374 containing a shifted sequence is too long after encoding, a MIME
375 content transfer encoding such as Quoted Printable can be used to
376 encode the text. Another possibility is to perform line breaking and
377 UTF-7 encoding at the same time, so that lines containing shifted
378 sequences already conform to length restrictions.
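
The "break lines first, then encode" order can be sketched as follows. This example uses Python's built-in "utf-7" codec, whose choices about optional characters and terminators may differ from the examples in this memo. The function name, the wrap width of 60, and the limit of 76 characters are assumptions made for illustration; this memo only says that lines should be short.

```python
import textwrap


def wrap_then_encode(text: str, width: int = 60, limit: int = 76):
    """Break text into lines first, then UTF-7 encode each line on its own.

    Returns a list of (encoded_line, fits) pairs, where fits is False when
    the encoded line exceeds the assumed length limit and would need a
    content transfer encoding such as Quoted Printable.
    """
    result = []
    for line in textwrap.wrap(text, width=width):
        # Encoding line by line guarantees no shifted sequence spans a break.
        encoded = line.encode("utf-7").decode("ascii")
        result.append((encoded, len(encoded) <= limit))
    return result


if __name__ == "__main__":
    sample = "Hi Mom \u263a! " * 10
    for encoded, fits in wrap_then_encode(sample, width=20):
        print(fits, encoded)
```

Because each line is encoded independently, every shifted sequence is opened and closed within its own line, at the cost of possibly re-entering the shifted state at the start of the next line.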
382 In this section we will motivate the introduction of UTF-7 as opposed
383 to the alternative of using the existing transformation formats of
384 Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before
385 discussing this, it will be useful to list some assumptions about
386 character frequency within typical natural language text strings that
387 we use to estimate typical storage requirements:
389 1. Most Western European languages use roughly 7/8 of their letters
390 from US-ASCII and 1/8 from Latin 1 (ISO-8859-1).
399 2. Most non-Roman alphabet-based languages (e.g., Greek) use about
400 1/6 of their letters from ASCII (since white space is in the 7-bit
401 area) and the rest from their alphabets.
3. East Asian ideographic-based languages (including Japanese) use
essentially all of their characters from the Han or CJK syllabary
area.
407 4. Non-directly encoded punctuation characters do not occur
408 frequently enough to affect the results.
Notice that current 8 bit standards, such as ISO-8859-x, require use
of a content transfer encoding. For comparison with the subsequent
discussion, the costs break down as follows (note that many of these
figures are approximate since they depend on the exact composition of
the text):
418 Text type Average octets/character
421 8859-x in Quoted Printable
423 Text type Average octets/character
425 Western European 1.25
Note also that Unicode encoded in Base64 takes a constant 2.67 octets
per character. For purposes of comparison, we will look at UTF-8 in
Base64 and Quoted Printable, and UTF-7. Also note that fixed overhead
for long strings is relative to 1/n, where n is the encoded string
length in octets.
436 Text type Average octets/character
439 Some Alphabetics 2.44
455 UTF-8 in Quoted Printable
457 Text type Average octets/character
459 Western European 1.63
460 Some Alphabetics 5.17
465 Text type Average octets/character
We feel that the UTF-8 in Quoted Printable option is not viable due
to the very large expansion of all text except Western European. This
would only be viable in texts consisting of large expanses of
US-ASCII or Latin characters with occasional other characters
interspersed. We would prefer to introduce one encoding that works
reasonably well for all users.
We also feel that UTF-8 in Base64 has high expansion for
non-Western-European users, and is less desirable because it cannot be
read directly, even when the content is largely US-ASCII. The base
encoding of UTF-7 gives competitive results and is readable for ASCII
text.
483 UTF-7 gives results competitive with ISO-8859-x, with access to all
484 of the Unicode character set. We believe this justifies the
485 introduction of a new transformation format of Unicode.
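
Several of the figures quoted in this comparison can be reproduced from the frequency assumptions listed earlier. The sketch below additionally assumes that Quoted Printable spends three octets ("=XX") on each non-ASCII octet, that Base64 expands data by 4/3, and that Latin 1 and Greek letters outside US-ASCII take two octets in UTF-8. The variable names are illustrative.

```python
from fractions import Fraction as F

QP_OCTETS = 3            # "=XX" for each non-ASCII octet (assumption)
BASE64_RATIO = F(4, 3)   # four output characters per three input octets


def average(ascii_share, ascii_cost, other_cost):
    """Weighted average octets per character."""
    return ascii_share * ascii_cost + (1 - ascii_share) * other_cost


WESTERN = F(7, 8)        # share of letters from US-ASCII, assumption 1
ALPHABETIC = F(1, 6)     # share of letters from US-ASCII, assumption 2

iso8859_qp_western = average(WESTERN, 1, QP_OCTETS)           # 1.25
unicode_base64 = 2 * BASE64_RATIO                             # about 2.67
utf8_base64_alpha = average(ALPHABETIC, 1, 2) * BASE64_RATIO  # about 2.44
utf8_qp_western = average(WESTERN, 1, 2 * QP_OCTETS)          # 1.625, listed as 1.63
utf8_qp_alpha = average(ALPHABETIC, 1, 2 * QP_OCTETS)         # about 5.17

if __name__ == "__main__":
    for name, value in [
        ("8859-x in QP, Western European", iso8859_qp_western),
        ("Unicode in Base64", unicode_base64),
        ("UTF-8 in Base64, Some Alphabetics", utf8_base64_alpha),
        ("UTF-8 in QP, Western European", utf8_qp_western),
        ("UTF-8 in QP, Some Alphabetics", utf8_qp_alpha),
    ]:
        print(name, float(value))
```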
511 As an alternative to use of UTF-7, it might be possible to intermix
512 Unicode characters with other character sets using an existing MIME
513 mechanism, the multipart/mixed content type, ignoring for the moment
514 the issues with line breaks (thanks to Nathaniel Borenstein for
515 suggesting this). For instance (repeating an earlier example):
517 Content-type: multipart/mixed; boundary=foo
518 Content-Disposition: inline
521 Content-type: text/plain; charset=us-ascii
525 Content-type: text/plain; charset=UNICODE-2-0
526 Content-transfer-encoding: base64
530 Content-type: text/plain; charset=us-ascii
535 Theoretically, this removes the need for UTF-7 in message bodies
536 (multipart may not be used in header fields). However, we feel that
537 as use of the Unicode character set becomes more widespread,
538 intermittent use of specialized Unicode characters (such as dingbats
539 and mathematical symbols) will occur, and that text will also
540 typically include small snippets from other scripts, such as
541 Cyrillic, Greek, or East Asian languages (anything in the Roman
542 script is already handled adequately by existing MIME character
543 sets). Although the multipart technique works well for large chunks
544 of text in alternating character sets, we feel it does not adequately
545 support the kinds of uses just discussed, and so we still believe the
546 introduction of UTF-7 is justified.
550 The UTF-7 encoding allows Unicode characters to be encoded within the
551 US-ASCII 7 bit character set. It is most effective for Unicode
552 sequences which contain relatively long strings of US-ASCII
553 characters interspersed with either single Unicode characters or
554 strings of Unicode characters, as it allows the US-ASCII portions to
555 be read on systems without direct Unicode support.
557 UTF-7 should only be used with 7 bit transports such as mail. In
558 other contexts, use of straight Unicode or UTF-8 is preferred.
569 Many thanks to the following people for their contributions,
570 comments, and suggestions. If we have omitted anyone it was through
571 oversight and not intentionally.
Appendix A -- Examples
625 Here is a longer example, taken from a document originally in Big5
626 code. It has been condensed for brevity. There are two versions: the
627 first uses optional characters from set O (and so may not pass
628 through some mail gateways), and the second does not.
630 Content-type: text/plain; charset=utf-7
632 Below is the full Chinese text of the Analects (+itaKng-).
634 The sources for the text are:
636 "The sayings of Confucius," James R. Ware, trans. +U/BTFw-:
637 +ZYeB9FH6ckh5Pg-, 1980. (Chinese text with English translation)
639 +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-: +Ti1XC2b4Xpc-, 1990.
641 "The Chinese Classics with a Translation, Critical and Exegetical
642 Notes, Prolegomena, and Copius Indexes," James Legge, trans., Taipei:
Southern Materials Center Publishing, Inc., 1991. (Chinese text with
English translation)
646 Big Five and GB versions of the text are being made available
649 Neither the Big Five nor GB contain all the characters used in this
650 text. Missing characters have been indicated using their Unicode/ISO
651 10646 code points. "U+-" followed by four hexadecimal digits
652 indicates a Unicode/10646 code (e.g., U+-9F08). There is no good
653 solution to the problem of the small size of the Big Five/GB
654 character sets; this represents the solution I find personally most
659 I have tried to minimize this problem by using variant characters
660 where they were available and the character actually in the text was
661 not. Only variants listed as such in the +XrdxmVtXUXg- were used.
665 John H. Jenkins +TpVPXGBG- jenkins@apple.com 5 January 1993
668 Content-type: text/plain; charset=utf-7
670 Below is the full Chinese text of the Analects (+itaKng-).
679 The sources for the text are:
681 +ACI-The sayings of Confucius,+ACI- James R. Ware, trans. +U/BTFw-:
682 +ZYeB9FH6ckh5Pg-, 1980. (Chinese text with English translation)
684 +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-: +Ti1XC2b4Xpc-, 1990.
686 +ACI-The Chinese Classics with a Translation, Critical and Exegetical
687 Notes, Prolegomena, and Copius Indexes,+ACI- James Legge, trans.,
688 Taipei: Southern Materials Center Publishing, Inc., 1991. (Chinese
689 text with English translation)
691 Big Five and GB versions of the text are being made available
694 Neither the Big Five nor GB contain all the characters used in this
695 text. Missing characters have been indicated using their Unicode/ISO
696 10646 code points. +ACI-U+-+ACI- followed by four hexadecimal digits
697 indicates a Unicode/10646 code (e.g., U+-9F08). There is no good
698 solution to the problem of the small size of the Big Five/GB
699 character sets+ADs- this represents the solution I find personally
704 I have tried to minimize this problem by using variant characters
705 where they were available and the character actually in the text was
706 not. Only variants listed as such in the +XrdxmVtXUXg- were used.
709 John H. Jenkins +TpVPXGBG- jenkins+AEA-apple.com 5 January 1993
Security Considerations
737 Security issues are not discussed in this memo.
[UNICODE 2.0] "The Unicode Standard, Version 2.0", The Unicode
              Consortium, Addison-Wesley, 1996. ISBN 0-201-48345-9.

[ISO 10646]   ISO/IEC 10646-1:1993(E) Information Technology--Universal
              Multiple-octet Coded Character Set (UCS). See also
              amendments 1 through 7, plus editorial corrections.

[RFC-1641]    Goldsmith, D., and M. Davis, "Using Unicode with MIME",
              RFC 1641, Taligent, Inc., July 1994.

[US-ASCII]    Coded Character Set--7-bit American Standard Code for
              Information Interchange, ANSI X3.4-1986.

[ISO-8859]    Information Processing -- 8-bit Single-Byte Coded Graphic
              Character Sets -- Part 1: Latin Alphabet No. 1, ISO
              8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2,
              1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988.
              Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5:
              Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6:
              Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7:
              Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
              Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin
              alphabet No. 5, ISO 8859-9, 1990.

[RFC822]      Crocker, D., "Standard for the Format of ARPA Internet
              Text Messages", STD 11, RFC 822, UDEL, August 1982.

[MIME]        Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
              Postel, "MIME (Multipurpose Internet Mail Extensions)
              Parts One through Five", RFC 2045, 2046, 2047, 2048, and
              2049, November 1996.
777 2 Infinite Loop, MS: 302-2IS
782 EMail: goldsmith@apple.com
793 10201 N. DeAnza Blvd.
794 Cupertino, CA 95014-2233
798 EMail: mark_davis@taligent.com