Network Working Group                                         F. Yergeau
Request for Comments: 3629                             Alis Technologies
Category: Standards Track


              UTF-8, a transformation format of ISO 10646
 
Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   ISO/IEC 10646-1 defines a large character set called the Universal
   Character Set (UCS) which encompasses most of the world's writing
   systems.  The originally proposed encodings of the UCS, however, were
   not compatible with many current applications and protocols, and this
   has led to the development of UTF-8, the object of this memo.  UTF-8
   has the characteristic of preserving the full US-ASCII range,
   providing compatibility with file systems, parsers and other software
   that rely on US-ASCII values but are transparent to other values.
   This memo obsoletes and replaces RFC 2279.
 
Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
   2.  Notational conventions . . . . . . . . . . . . . . . . . . . .  3
   3.  UTF-8 definition . . . . . . . . . . . . . . . . . . . . . . .  4
   4.  Syntax of UTF-8 Byte Sequences . . . . . . . . . . . . . . . .  5
   5.  Versions of the standards  . . . . . . . . . . . . . . . . . .  6
   6.  Byte order mark (BOM)  . . . . . . . . . . . . . . . . . . . .  6
   7.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . .  8
   8.  MIME registration  . . . . . . . . . . . . . . . . . . . . . .  9
   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 10
   10. Security Considerations  . . . . . . . . . . . . . . . . . . . 10
   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
   12. Changes from RFC 2279  . . . . . . . . . . . . . . . . . . . . 11
   13. Normative References . . . . . . . . . . . . . . . . . . . . . 12
 
Yergeau                     Standards Track                     [Page 1]

RFC 3629                         UTF-8                     November 2003
 
   14. Informative References . . . . . . . . . . . . . . . . . . . . 12
   15. URI's  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
   16. Intellectual Property Statement  . . . . . . . . . . . . . . . 13
   17. Author's Address . . . . . . . . . . . . . . . . . . . . . . . 13
   18. Full Copyright Statement . . . . . . . . . . . . . . . . . . . 14
 
1.  Introduction

   ISO/IEC 10646 [ISO.10646] defines a large character set called the
   Universal Character Set (UCS), which encompasses most of the world's
   writing systems.  The same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementers.  Up to the present time, changes in Unicode and
   amendments and additions to ISO/IEC 10646 have tracked each other, so
   that the character repertoires and code point assignments have
   remained in sync.  The relevant standardization committees have
   committed to maintain this very useful synchronism.

   ISO/IEC 10646 and Unicode define several encoding forms of their
   common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.  In an
   encoding form, each character is represented as one or more encoding
   units.  All standard UCS encoding forms except UTF-8 have an encoding
   unit larger than one octet, making them hard to use in many current
   applications and protocols that assume 8 or even 7 bit characters.

   UTF-8, the object of this memo, has a one-octet encoding unit.  It
   uses all bits of an octet, but has the quality of preserving the full
   US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
   octet having the normal US-ASCII value, and any octet with such a
   value can only stand for a US-ASCII character, and nothing else.
 
   UTF-8 encodes UCS characters as a varying number of octets, where the
   number of octets, and the value of each, depend on the integer value
   assigned to the character in ISO/IEC 10646 (the character number,
   a.k.a. code position, code point or Unicode scalar value).  This
   encoding form has the following characteristics (all values are in
   hexadecimal):

   o  Character numbers from U+0000 to U+007F (US-ASCII repertoire)
      correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
      consequence is that a plain ASCII string is also a valid UTF-8
      string.
 
   o  US-ASCII octet values do not appear otherwise in a UTF-8 encoded
      character stream.  This provides compatibility with file systems
      or other software (e.g., the printf() function in C libraries)
      that parse based on US-ASCII values but are transparent to other
      values.

   o  Round-trip conversion is easy between UTF-8 and other encoding
      forms.

   o  The first octet of a multi-octet sequence indicates the number of
      octets in the sequence.

   o  The octet values C0, C1, F5 to FF never appear.

   o  Character boundaries are easily found from anywhere in an octet
      stream.

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
      same as if ordered by character numbers.  Of course this is of
      limited interest since a sort order based on character numbers is
      almost never culturally valid.

   o  The Boyer-Moore fast search algorithm can be used with UTF-8 data.

   o  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e., the probability that a string of
      characters in any other encoding appears as valid UTF-8 is low,
      diminishing with increasing string length.
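
   Three of the characteristics listed above (boundary finding, length
   from the first octet, and sort order) can be illustrated with a
   minimal Python sketch.  The helper names char_start and
   sequence_length are hypothetical and not part of this memo, and the
   input is assumed to be valid UTF-8.

```python
def char_start(data: bytes, i: int) -> int:
    """Index of the first octet of the character that contains data[i]."""
    # Octets of the form 10xxxxxx never start a character; back up past them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


def sequence_length(first_octet: int) -> int:
    """Number of octets announced by the first octet of a sequence."""
    if first_octet <= 0x7F:
        return 1
    if 0xC2 <= first_octet <= 0xDF:
        return 2
    if 0xE0 <= first_octet <= 0xEF:
        return 3
    if 0xF0 <= first_octet <= 0xF4:
        return 4
    # 80-BF are trailing octets; C0, C1 and F5-FF never appear.
    raise ValueError("not a first octet: %02X" % first_octet)


data = "A\u2262\u0391.".encode("utf-8")     # 41 E2 89 A2 CE 91 2E
start = char_start(data, 3)                 # offset 3 holds the octet A2
print(start, sequence_length(data[start]))  # prints "1 3"

# Sorting by octet values gives the same order as sorting by
# character numbers.
words = ["z", "\u00e9", "\u65e5", "\U000233b4", "a", "\u00e9a"]
by_number = sorted(words, key=lambda w: [ord(c) for c in w])
by_octets = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_number == by_octets)               # prints "True"
```

   In valid UTF-8 the backward scan takes at most three steps, since no
   sequence is longer than four octets.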
 
   UTF-8 was devised in September 1992 by Ken Thompson, guided by design
   criteria specified by Rob Pike, with the objective of defining a UCS
   transformation format usable in the Plan9 operating system in a non-
   disruptive manner.  Thompson's design was stewarded through
   standardization by the X/Open Joint Internationalization Group XOJIG
   (see [FSS_UTF]), bearing the names FSS-UTF (variant FSS/UTF), UTF-2
   and finally UTF-8 along the way.

2.  Notational conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   UCS characters are designated by the U+HHHH notation, where HHHH is a
   string of 4 to 6 hexadecimal digits representing the character
   number in ISO/IEC 10646.
 
3.  UTF-8 definition

   UTF-8 is defined by the Unicode Standard [UNICODE].  Descriptions and
   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.  The
   only octet of a "sequence" of one has the higher-order bit set to 0,
   the remaining 7 bits being used to encode the character number.  In a
   sequence of n octets, n>1, the initial octet has the n higher-order
   bits set to 1, followed by a bit set to 0.  The remaining bit(s) of
   that octet contain bits from the number of the character to be
   encoded.  The following octet(s) all have the higher-order bit set to
   1 and the following bit set to 0, leaving 6 bits in each to contain
   bits from the character to be encoded.

   The table below summarizes the format of these different octet types.
   The letter x indicates bits available for encoding bits of the
   character number.

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 
   Encoding a character to UTF-8 proceeds as follows:

   1.  Determine the number of octets required from the character number
       and the first column of the table above.  It is important to note
       that the rows of the table are mutually exclusive, i.e., there is
       only one valid way to encode a given character.

   2.  Prepare the high-order bits of the octets as per the second
       column of the table.

   3.  Fill in the bits marked x from the bits of the character number,
       expressed in binary.  Start by putting the lowest-order bit of
       the character number in the lowest-order position of the last
       octet of the sequence, then put the next higher-order bit of the
       character number in the next higher-order position of that octet,
       etc.  When the x bits of the last octet are filled in, move on to
       the next to last octet, then to the preceding one, etc. until all
       x bits are filled in.
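
   The three encoding steps above can be sketched in Python as follows.
   This is an illustration under stated assumptions, not a reference
   implementation: the function name utf8_encode is hypothetical, and
   the rejection of U+D800..U+DFFF anticipates the prohibition on
   surrogate character numbers that this memo states separately.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character number as a UTF-8 octet sequence."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("character number out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate character numbers are not encodable")
    # Step 1: pick the row of the table from the character number.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        n, lead = 2, 0xC0        # 110xxxxx
    elif cp <= 0xFFFF:
        n, lead = 3, 0xE0        # 1110xxxx
    else:
        n, lead = 4, 0xF0        # 11110xxx
    out = bytearray(n)
    # Steps 2 and 3: fill the last octet first, six bits at a time,
    # each trailing octet carrying the prefix 10.
    for i in range(n - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # The bits that remain fit in the x positions of the first octet.
    out[0] = lead | cp
    return bytes(out)


print(" ".join("%02X" % b for b in utf8_encode(0x2262)))   # E2 89 A2
```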
 
   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.  CESU-8 operates similarly to UTF-8 but encodes
   the UTF-16 code values (16-bit quantities) instead of the character
   number (code point).  This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8.
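
   The difference between the two encodings for a character number
   above 0xFFFF can be seen in a short Python sketch.  The helper names
   are hypothetical; the CESU-8 side simply applies the three-octet
   pattern to each UTF-16 code value, as described in the paragraph
   above, and Python's built-in strict "utf-8" codec is used as the
   UTF-8 reference.

```python
def three_octets(unit: int) -> bytes:
    """Apply the three-octet bit pattern to a 16-bit value."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def cesu8_supplementary(cp: int) -> bytes:
    """CESU-8-style octets for a character number above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a character number above U+FFFF")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)     # first UTF-16 code value
    low = 0xDC00 | (v & 0x3FF)    # second UTF-16 code value
    return three_octets(high) + three_octets(low)


def strict_utf8_ok(octets: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def show(octets: bytes) -> str:
    return " ".join("%02X" % b for b in octets)


cesu = cesu8_supplementary(0x233B4)
utf8 = "\U000233b4".encode("utf-8")
print(show(cesu), strict_utf8_ok(cesu))   # ED A1 8C ED BE B4 False
print(show(utf8), strict_utf8_ok(utf8))   # F0 A3 8E B4 True
```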
 
   Decoding a UTF-8 character proceeds as follows:

   1.  Initialize a binary number with all bits set to 0.  Up to 21 bits
       may be needed.

   2.  Determine which bits encode the character number from the number
       of octets in the sequence and the second column of the table
       above (the bits marked x).

   3.  Distribute the bits from the sequence to the binary number, first
       the lower-order bits from the last octet of the sequence and
       proceeding to the left until no x bits are left.  The binary
       number is now equal to the character number.
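
   The bit distribution in the steps above can be sketched in Python as
   follows.  The function name decode_bits is hypothetical.  The sketch
   assumes that its argument is exactly one well-formed sequence and
   performs no validation at all, so it is not safe to use on untrusted
   input by itself.

```python
def decode_bits(seq: bytes) -> int:
    """Collect the x bits of one sequence into a character number."""
    n = len(seq)
    number = 0                                   # step 1
    # Step 2: the first octet carries 7 x bits when n == 1,
    # otherwise 7 - n of them (5, 4 or 3).
    first_mask = 0x7F if n == 1 else 0x7F >> n
    # Step 3: start with the last octet and proceed to the left.
    shift = 0
    for octet in reversed(seq[1:]):
        number |= (octet & 0x3F) << shift
        shift += 6
    number |= (seq[0] & first_mask) << shift
    return number


print("U+%04X" % decode_bits(bytes.fromhex("E289A2")))     # U+2262
print("U+%04X" % decode_bits(bytes.fromhex("F0A38EB4")))   # U+233B4
```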
 
   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.  For instance, a naive implementation may
   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
   invalid sequences may have security consequences or cause other
   problems.  See Security Considerations (Section 10) below.
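
   One way to add the required protection is sketched below in Python.
   The function name decode_checked is hypothetical, and the set of
   checks (prefix bits, shortest form, no surrogates, upper limit of
   U+10FFFF) is this sketch's reading of the rules in this section.  It
   decodes a single sequence whose length the caller has already
   determined.

```python
LEAD_MASK = {2: 0xE0, 3: 0xF0, 4: 0xF8}      # prefix bits of the first octet
LEAD_BITS = {2: 0xC0, 3: 0xE0, 4: 0xF0}      # expected value of those bits
MINIMUM = {2: 0x80, 3: 0x800, 4: 0x10000}    # smallest number per length


def decode_checked(seq: bytes) -> int:
    """Decode one sequence, raising ValueError if it is invalid."""
    n = len(seq)
    if n == 1:
        if seq[0] > 0x7F:
            raise ValueError("invalid single octet")
        return seq[0]
    if n not in LEAD_MASK:
        raise ValueError("sequences are 1 to 4 octets long")
    if (seq[0] & LEAD_MASK[n]) != LEAD_BITS[n]:
        raise ValueError("first octet does not announce %d octets" % n)
    number = seq[0] & ~LEAD_MASK[n] & 0xFF
    for octet in seq[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("trailing octet is not of the form 10xxxxxx")
        number = (number << 6) | (octet & 0x3F)
    if number < MINIMUM[n]:
        raise ValueError("overlong sequence")
    if 0xD800 <= number <= 0xDFFF:
        raise ValueError("surrogate character number")
    if number > 0x10FFFF:
        raise ValueError("beyond U+10FFFF")
    return number


for octets in ("F0A38EB4", "C080", "EDA18C"):
    try:
        print(octets, "U+%04X" % decode_checked(bytes.fromhex(octets)))
    except ValueError as err:
        print(octets, "rejected:", err)
```

   With these checks, C0 80 is rejected as overlong and ED A1 8C as a
   surrogate, while F0 A3 8E B4 decodes to U+233B4.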
 
4.  Syntax of UTF-8 Byte Sequences

   For the convenience of implementors using ABNF, a definition of UTF-8
   in ABNF syntax is given here.

   A UTF-8 string is a sequence of octets representing a sequence of UCS
   characters.  An octet sequence is valid UTF-8 only if it matches the
   following syntax, which is derived from the rules for encoding UTF-8
   and is expressed in the ABNF of [RFC2234].

   UTF8-octets = *( UTF8-char )
   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1      = %x00-7F
   UTF8-2      = %xC2-DF UTF8-tail
   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
   UTF8-tail   = %x80-BF
 
   NOTE -- The authoritative definition of UTF-8 is in [UNICODE].  This
   grammar is believed to describe the same thing Unicode describes, but
   does not claim to be authoritative.  Implementors are urged to rely
   on the authoritative source, rather than on this ABNF.
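
   As an illustration only, the ABNF of this section can be transcribed
   into a regular expression over octets.  The sketch below uses
   Python's re module; the names UTF8_OCTETS and is_valid_utf8 are
   hypothetical, UTF8-1 is taken as %x00-7F and UTF8-tail as %x80-BF,
   and, like the ABNF itself, the expression does not claim to be
   authoritative.

```python
import re

ALTERNATIVES = [
    rb"[\x00-\x7F]",                      # UTF8-1
    rb"[\xC2-\xDF][\x80-\xBF]",           # UTF8-2
    rb"\xE0[\xA0-\xBF][\x80-\xBF]",       # UTF8-3, first alternative
    rb"[\xE1-\xEC][\x80-\xBF]{2}",
    rb"\xED[\x80-\x9F][\x80-\xBF]",       # excludes U+D800..U+DFFF
    rb"[\xEE-\xEF][\x80-\xBF]{2}",
    rb"\xF0[\x90-\xBF][\x80-\xBF]{2}",    # UTF8-4, first alternative
    rb"[\xF1-\xF3][\x80-\xBF]{3}",
    rb"\xF4[\x80-\x8F][\x80-\xBF]{2}",    # stops at U+10FFFF
]

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + rb"|".join(ALTERNATIVES) + rb")*")


def is_valid_utf8(octets: bytes) -> bool:
    """True if the whole octet string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(octets) is not None


print(is_valid_utf8(bytes.fromhex("41E289A2CE912E")))   # True
print(is_valid_utf8(bytes.fromhex("C080")))             # False
print(is_valid_utf8(bytes.fromhex("EDA18CEDBEB4")))     # False
```

   Because the alternatives differ in their first octet or in the range
   of the second, at most one of them can match at any position.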
 
5.  Versions of the standards

   ISO/IEC 10646 is updated from time to time by publication of
   amendments and additional parts; similarly, new versions of the
   Unicode standard are published over time.  Each new version obsoletes
   and replaces the previous one, but implementations, and more
   significantly data, are not updated instantly.

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data.  In 1996, Amendment 5 to
   the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
   the Korean Hangul block, thereby making any previous data containing
   Hangul characters invalid under the new version.  Unicode 2.0 has the
   same difference from Unicode 1.1.  The justification for allowing
   such an incompatible change was that there were no major
   implementations and no significant amounts of data containing Hangul.
   The incident has been dubbed the "Korean mess", and the relevant
   committees have pledged to never, ever again make such an
   incompatible change (see Unicode Consortium Policies [1]).

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME charset labels, to be discussed in MIME
   registration (Section 8).
 
6.  Byte order mark (BOM)

   The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
   informally as "BYTE ORDER MARK" (abbreviated "BOM").  This character
   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
   the BOM name hints at a second possible usage of the character:  to
   prepend a U+FEFF character to a stream of UCS characters as a
   "signature".  A receiver of such a serialized stream may then use the
   initial character as a hint that the stream consists of UCS
   characters and also to recognize which UCS encoding is involved and,
   with encodings having a multi-octet encoding unit, as a way to
   recognize the serialization order of the octets.  UTF-8 having a
   single-octet encoding unit, this last function is useless and the BOM
   will always appear as the octet sequence EF BB BF.
 
   It is important to understand that the character U+FEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a signature.  When interpreted as a signature,
   the Unicode standard suggests that an initial U+FEFF character may be
   stripped before processing the text.  Such stripping is necessary in
   some cases (e.g., when concatenating two strings, because otherwise
   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
   SPACE" at the connection point), but might affect an external process
   at a different layer (such as a digital signature or a count of the
   characters) that is relying on the presence of all characters in the
   stream.  It is therefore RECOMMENDED to avoid stripping an initial
   U+FEFF interpreted as a signature without a good reason, to ignore it
   instead of stripping it when appropriate (such as for display) and to
   strip it only when really necessary.
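
   A receiver that wants to recognize the signature without altering
   the stored octets could proceed along the lines of the Python sketch
   below.  The function name split_signature is hypothetical;
   codecs.BOM_UTF8 is the standard library constant for the octets
   EF BB BF.  The input is left untouched, so whether to ignore or
   strip the signature remains the caller's decision.

```python
import codecs


def split_signature(octets: bytes):
    """Return (has_signature, rest) for a stream that may start with a BOM."""
    if octets.startswith(codecs.BOM_UTF8):        # the octets EF BB BF
        return True, octets[len(codecs.BOM_UTF8):]
    return False, octets


stream = "\ufeffA\ufeffB".encode("utf-8")
has_signature, rest = split_signature(stream)
text = rest.decode("utf-8")

print(has_signature)           # True: the stream starts with EF BB BF
print(text == "A\ufeffB")      # True: the interior U+FEFF is kept as is
print(len(stream))             # 8: the original octets are not modified
```

   Only the initial U+FEFF is treated as a signature here; the second
   one, in the middle of the stream, stays in the decoded text as a
   zero-width non-breaking space.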
 
   U+FEFF in the first position of a stream MAY be interpreted as a
   zero-width non-breaking space, and is not always a signature.  In an
   attempt at diminishing this uncertainty, Unicode 3.2 adds a new
   character, U+2060 "WORD JOINER", with exactly the same semantics and
   usage as U+FEFF except for the signature function, and strongly
   recommends its exclusive use for expressing word-joining semantics.
   Eventually, following this recommendation will make it all but
   certain that any initial U+FEFF is a signature, not an intended "ZERO
   WIDTH NO-BREAK SPACE".

   In the meantime, the uncertainty unfortunately remains and may affect
   Internet protocols.  Protocol specifications MAY restrict usage of
   U+FEFF as a signature in order to reduce or eliminate the potential
   ill effects of this uncertainty.  In the interest of striking a
   balance between the advantages (reduction of uncertainty) and
   drawbacks (loss of the signature function) of such restrictions, it
   is useful to distinguish a few cases:
 
   o  A protocol SHOULD forbid use of U+FEFF as a signature for those
      textual protocol elements that the protocol mandates to be always
      UTF-8, the signature function being totally useless in those
      cases.

   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.

   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol does not
      provide character encoding identification mechanisms, when a ban
      would be unenforceable, or when it is expected that
      implementations of the protocol will not be in a position to
      always use the mechanisms properly.  The latter two cases are
      likely to occur with larger protocol elements such as MIME
      entities, especially when implementations of the protocol will
      obtain such entities from file systems, from protocols that do not
      have encoding identification mechanisms for payloads (such as FTP)
      or from other protocols that do not guarantee proper
      identification of character encoding (such as HTTP).
 
   When a protocol forbids use of U+FEFF as a signature for a certain
   protocol element, then any initial U+FEFF in that protocol element
   MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE".  When a
   protocol does NOT forbid use of U+FEFF as a signature for a certain
   protocol element, then implementations SHOULD be prepared to handle a
   signature in that element and react appropriately: using the
   signature to identify the character encoding as necessary and
   stripping or ignoring the signature as appropriate.
 
7.  Examples

   The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL
   TO><ALPHA>." is encoded in UTF-8 as follows:

       --+--------+-----+--
       41 E2 89 A2 CE 91 2E
       --+--------+-----+--

   The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
   meaning "the Korean language") is encoded in UTF-8 as follows:

       --------+--------+--------
       ED 95 9C EA B5 AD EC 96 B4
       --------+--------+--------

   The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo",
   meaning "the Japanese language") is encoded in UTF-8 as follows:

       --------+--------+--------
       E6 97 A5 E6 9C AC E8 AA 9E
       --------+--------+--------
 
   The character U+233B4 (a Chinese character meaning 'stump of tree'),
   prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

       --------+-----------
       EF BB BF F0 A3 8E B4
       --------+-----------
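
   The octet sequences for the character sequences in this section can
   be reproduced with any conforming encoder.  The Python sketch below
   uses the language's built-in "utf-8" codec; the helper name utf8_hex
   is hypothetical.

```python
def utf8_hex(text: str) -> str:
    """Octets of the UTF-8 encoding of text, in upper-case hexadecimal."""
    return " ".join("%02X" % octet for octet in text.encode("utf-8"))


print(utf8_hex("A\u2262\u0391."))         # 41 E2 89 A2 CE 91 2E
print(utf8_hex("\ud55c\uad6d\uc5b4"))     # ED 95 9C EA B5 AD EC 96 B4
print(utf8_hex("\u65e5\u672c\u8a9e"))     # E6 97 A5 E6 9C AC E8 AA 9E
print(utf8_hex("\ufeff\U000233b4"))       # EF BB BF F0 A3 8E B4
```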
 
8.  MIME registration

   This memo serves as the basis for registration of the MIME charset
   parameter for UTF-8, according to [RFC2978].  The charset parameter
   value is "UTF-8".  This string labels media types containing text
   consisting of characters from the repertoire of ISO/IEC 10646
   including all amendments at least up to amendment 5 of the 1993
   edition (Korean block), encoded to a sequence of octets using the
   encoding scheme outlined above.  UTF-8 is suitable for use in MIME
   content types under the "text" top-level type.

   It is noteworthy that the label "UTF-8" does not contain a version
   identification, referring generically to ISO/IEC 10646.  This is
   intentional, the rationale being as follows:

   A MIME charset label is designed to give just the information needed
   to interpret a sequence of bytes received on the wire into a sequence
   of characters, nothing more (see [RFC2045], section 2.2).  As long as
   a character set standard does not change incompatibly, version
   numbers serve no purpose, because one gains nothing by learning from
   the tag that newly assigned characters may be received that one
   doesn't know about.  The tag itself doesn't teach anything about the
   new characters, which are going to be received anyway.

   Hence, as long as the standards evolve compatibly, the apparent
   advantage of having labels that identify the versions is only that,
   apparent.  But there is a disadvantage to such version-dependent
   labels: when an older application receives data accompanied by a
   newer, unknown label, it may fail to recognize the label and be
   completely unable to deal with the data, whereas a generic, known
   label would have triggered mostly correct processing of the data,
   which may well not contain any new characters.

   Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
   change, in principle contradicting the appropriateness of a version
   independent MIME charset label as described above.  But the
   compatibility problem can only appear with data containing Korean
   Hangul characters encoded according to Unicode 1.1 (or equivalently
   ISO/IEC 10646 before amendment 5), and there is arguably no such data
   to worry about, this being the very reason the incompatible change
   was deemed acceptable.
 
   In practice, then, a version-independent label is warranted, provided
   the label is understood to refer to all versions after Amendment 5,
   and provided no incompatible change actually occurs.  Should
   incompatible changes occur in a later version of ISO/IEC 10646, the
   MIME charset label defined here will stay aligned with the previous
   version until and unless the IETF specifically decides otherwise.

9.  IANA Considerations

   The entry for UTF-8 in the IANA charset registry has been updated to
   point to this memo.
 
10.  Security Considerations

   Implementers of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences.  It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   A particularly subtle form of this attack can be carried out against
   a parser which performs security-critical validity checks against the
   UTF-8 encoded form of its input, but interprets certain illegal octet
   sequences as characters.  For example, a parser might prohibit the
   NUL character when encoded as the single-octet sequence 00, but
   erroneously allow the illegal two-octet sequence C0 80 and interpret
   it as a NUL character.  Another example might be a parser which
   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
   illegal octet sequence 2F C0 AE 2E 2F.  This last exploit has
   actually been used in a widespread virus attacking Web servers in
   2001; thus, the security threat is very real.
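
   The "/../" example above can be reproduced with a short Python
   sketch.  The function lenient_decode is a hypothetical, deliberately
   incautious decoder written only to show the problem; the strict
   behaviour is that of Python's built-in "utf-8" codec.

```python
def lenient_decode(octets: bytes) -> str:
    """Deliberately unsafe: accepts overlong two-octet sequences."""
    out = []
    i = 0
    while i < len(octets):
        b = octets[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(octets):
            # No shortest-form check: C0 AE is taken for U+002E.
            out.append(chr(((b & 0x1F) << 6) | (octets[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("not handled in this sketch")
    return "".join(out)


def strict_utf8_ok(octets: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


attack = bytes.fromhex("2F C0 AE 2E 2F")

print(b"/../" in attack)                  # False: an octet-level filter passes
print(lenient_decode(attack) == "/../")   # True: the path appears after decoding
print(strict_utf8_ok(attack))             # False: a strict decoder rejects C0 AE
```

   The check on the encoded form and the lenient decoder disagree about
   what the input contains; rejecting the illegal sequence before any
   further processing removes the disagreement.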
 
   Another security issue occurs when encoding to UTF-8: the ISO/IEC
   10646 description of UTF-8 allows encoding character numbers up to
   U+7FFFFFFF, yielding sequences of up to 6 bytes.  There is therefore
   a risk of buffer overflow if the range of character numbers is not
   explicitly limited to U+10FFFF or if buffer sizing doesn't take into
   account the possibility of 5- and 6-byte sequences.
 
   Security may also be impacted by a characteristic of several
   character encodings, including UTF-8: the "same thing" (as far as a
   user can tell) can be represented by several distinct character
   sequences.  For instance, an e with acute accent can be represented
   by the precomposed U+00E9 E ACUTE character or by the canonically
   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE).  Even though
   UTF-8 provides a single byte sequence for each character sequence,
   the existence of multiple character sequences for "the same thing"
   may have security consequences whenever string matching, indexing,
   searching, sorting, regular expression matching and selection are
   involved.  An example would be string matching of an identifier
   appearing in a credential and in access control list entries.  This
   issue is amenable to solutions based on Unicode Normalization Forms,
   see [UAX15].
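
   The e-with-acute example can be checked with Python's standard
   unicodedata module, as in the minimal sketch below.  Which
   normalization form to apply, and at which point in a protocol, is a
   design decision that this sketch does not address.

```python
import unicodedata

precomposed = "\u00e9"      # U+00E9
decomposed = "e\u0301"      # U+0065 U+0301


def show(text: str) -> str:
    return " ".join("%02X" % b for b in text.encode("utf-8"))


print(show(precomposed))    # C3 A9
print(show(decomposed))     # 65 CC 81

# The two strings differ when compared directly ...
print(precomposed == decomposed)                                # False
# ... but compare equal once both are in the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```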
 
11.  Acknowledgements

   The following have participated in the drafting and discussion of
   this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
   Mark Davis, Martin J. Duerst, Patrick Faltstrom, Ned Freed, David
   Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
   Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung,
   Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John
   Gardiner Myers, Chris Newman, Dan Oscarsson, Roozbeh Pournader,
   Murray Sargent, Markus Scherer, Keld Simonsen, Arnold Winkler,
   Kenneth Whistler and Misha Wolf.
 
12.  Changes from RFC 2279

   o  Restricted the range of characters to 0000-10FFFF (the UTF-16
      accessible range).

   o  Made Unicode the source of the normative definition of UTF-8,
      keeping ISO/IEC 10646 as the reference for characters.

   o  Straightened out terminology.  UTF-8 now described in terms of an
      encoding form of the character number.  UCS-2 and UCS-4 almost
      disappeared.

   o  Turned the note warning against decoding of invalid sequences into
      a normative MUST NOT.

   o  Added a new section about the UTF-8 BOM, with advice for
      protocols.

   o  Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.

   o  Added an ABNF syntax for valid UTF-8 octet sequences.

   o  Expanded Security Considerations section, in particular impact of
      Unicode normalization.
 
13.  Normative References

   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

   [ISO.10646] International Organization for Standardization,
               "Information Technology - Universal Multiple-octet coded
               Character Set (UCS)", ISO/IEC Standard 10646,  comprised
               of ISO/IEC 10646-1:2000, "Information technology --
               Universal Multiple-Octet Coded Character Set (UCS) --
               Part 1: Architecture and Basic Multilingual Plane",
               ISO/IEC 10646-2:2001, "Information technology --
               Universal Multiple-Octet Coded Character Set (UCS) --
               Part 2:  Supplementary Planes" and ISO/IEC 10646-
               1:2000/Amd 1:2002, "Mathematical symbols and other
               characters".

   [UNICODE]   The Unicode Consortium, "The Unicode Standard -- Version
               4.0",  defined by The Unicode Standard, Version 4.0
               (Boston, MA, Addison-Wesley, 2003.  ISBN 0-321-18578-1),
               April 2003, <http://www.unicode.org/unicode/standard/
               versions/enumeratedversions.html#Unicode_4_0_0>.

14.  Informative References

   [CESU-8]    Phipps, T., "Unicode Technical Report #26: Compatibility
               Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26,
               <http://www.unicode.org/unicode/reports/tr26/>.

   [FSS_UTF]   X/Open Company Ltd., "X/Open Preliminary Specification --
               File System Safe UCS Transformation Format (FSS-UTF)",
               May 1993, <http://wwwold.dkuug.dk/jtc1/sc22/wg20/docs/

   [RFC2045]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
               Extensions (MIME) Part One: Format of Internet Message
               Bodies", RFC 2045, November 1996.

   [RFC2234]   Crocker, D. and P. Overell, "Augmented BNF for Syntax
               Specifications: ABNF", RFC 2234, November 1997.

   [RFC2978]   Freed, N. and J. Postel, "IANA Charset Registration
               Procedures", BCP 19, RFC 2978, October 2000.
 
   [UAX15]     Davis, M. and M. Duerst, "Unicode Standard Annex #15:
               Unicode Normalization Forms",  An integral part of The
               Unicode Standard, Version 4.0.0, April 2003, <http://
               www.unicode.org/unicode/reports/tr15>.

   [US-ASCII]  American National Standards Institute, "Coded Character
               Set - 7-bit American Standard Code for Information
               Interchange", ANSI X3.4, 1986.

15.  URI's

   [1]  <http://www.unicode.org/unicode/standard/policies.html>
 
16.  Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.

17.  Author's Address

   100, boul. Alexis-Nihon, bureau 600

   Phone: +1 514 747 2547

   EMail: fyergeau@alis.com
 
18.  Full Copyright Statement

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.