
[idn] draft-ietf-idn-requirements-04.txt (revised)



Dear IDN,

One more set of changes (dates changed from 28 September and 28
February to 04 October and 04 March):


1.  Section 1.1

OLD:

Characters mentioned in this document are identified by their position
in the Unicode [UNICODE] character set. The notation U+12AB, for
example, indicates the character at position 12AB (hexadecimal) in the
Unicode character set. Note that the use of this notation is not an
indication of a requirement to use Unicode.

Examples quoted in this document should be considered as a method to
further explain the meanings and principles adopted by the document. It
is not a requirement for the protocol to satisfy the examples.

A character is a member of a set of elements used for organization,
control, or representation of data.

A coded character is a character with its coded representation.

A coded character set ("CCS") is a set of unambiguous rules that
establishes a character set and the relationship between the characters
of the set and their coded representation.

A graphic character, or glyph, is a visual representation of a character
which can be handwritten, printed, or displayed.  Control functions do
not have associated glyphs.

A character encoding scheme or "CES" is a mapping from one or more
coded character sets to a set of octets. Some CESs are associated with
a single CCS; for example, UTF-8 [RFC2279] applies only to ISO 10646.
Other CESs, such as ISO 2022, are associated with many CCSs.

A charset is a method of mapping a sequence of octets to a sequence of
abstract characters. A charset is, in effect, a combination of one or
more CCS with a CES. Charset names are registered by the IANA according
to procedures documented in [RFC2278].

A language is a way that humans interact. In written form, a language
is expressed in characters. The same set of characters can often be
used in many languages, and many languages can be expressed using
different scripts. A particular charset MAY have different glyphs
(shapes) depending on the language being used.

NEW:  

A language is a way that humans interact. In computerised form, a text
in a written language can be expressed as a string of characters.
The same set of characters can often be used for many written languages,
and many written languages can be expressed using different scripts.
The same characters are often shown with somewhat different glyphs (shapes)
for display of a text depending on the font used, the automatic shaping
applied, or the automatic formation of ligatures. In addition, the same
characters can be shown with somewhat different glyphs (shapes) for display
of a text depending on the language being used, even within the same font
or through automatic font change.

A character is a member of a set of elements used for organization,
control, or representation of textual data.

A graphic character is a character, other than a control function,
that has a visual representation normally handwritten, printed, or
displayed.

Characters mentioned in this document are identified by their position
in the Unicode [UNICODE] character set.  This character set is also
known as the UCS [ISO10646]. The notation U+12AB, for example, indicates
the character at position 12AB (hexadecimal) in the Unicode character
set.  Note that the use of this notation is not an indication of a
requirement to use Unicode.
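
As an illustration only (not part of the proposed draft text), the U+
notation can be produced and read back with a few lines of Python. The
helper names u_notation and from_u_notation are hypothetical and chosen
for this sketch.

```python
def u_notation(ch):
    """Return the U+ notation for a single character, e.g. 'U+12AB'."""
    # At least four hexadecimal digits, upper case, as in the text above.
    return "U+%04X" % ord(ch)


def from_u_notation(text):
    """Return the character named by a string such as 'U+12AB'."""
    if not text.upper().startswith("U+"):
        raise ValueError("expected a string of the form U+XXXX")
    return chr(int(text[2:], 16))


if __name__ == "__main__":
    print(u_notation("a"))
    print(u_notation(from_u_notation("U+12AB")))
```

The sketch uses Python's own Unicode string type for convenience, which
says nothing about what the protocol itself is required to use.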

Examples quoted in this document should be considered as a method to
further explain the meanings and principles adopted by the document. It
is not a requirement for the protocol to satisfy the examples.

Unicode Technical Report 17 [UTR17] defines a character encoding
model in several levels (much of the text below is quoted from
Unicode Technical Report 17 [UTR17]):

1. An abstract character repertoire (ACR) is defined as the set of
   abstract characters to be encoded, normally a familiar alphabet
   or symbol set. The word abstract just means that these objects
   are defined by convention (such as the 26 letters of the English
   alphabet, uppercase and lowercase forms). Examples: the ASCII
   repertoire, the Latin-15 repertoire, the JIS X 0208 repertoire,
   the UCS repertoire (of a particular version).

2. A coded character set (CCS) is defined to be a mapping from a
   set of abstract characters to the set of non-negative integers.
   This range of integers need not be contiguous. An abstract
   character is defined to be in a coded character set if the coded
   character set maps from it to an integer. That integer is said
   to be the code point for the abstract character. That abstract
   character is then an encoded character. Examples: ASCII, Latin-15,
   JIS X 0208, the UCS.

3. A character encoding form (CEF) is a mapping from the set of integers
   used in a CCS to the set of sequences of code units. A code unit
   is an integer occupying a specified binary width in a computer
   architecture, such as a septet, an octet, or a 16-bit unit. The
   encoding form enables character representation as actual data in
   a computer. The sequences of code units do not necessarily have the
   same length. Examples: ASCII, Latin-15, Shift-JIS, UTF-16, UTF-8.

4. A character encoding scheme (CES) is a mapping of code units into
   serialized octet sequences. Character encoding schemes are relevant
   to the issue of cross-platform persistent data involving code units
   wider than a byte, where byte-swapping may be required to put data
   into the byte polarity canonical for a particular platform.

   The CES may involve two or more CCS's, and may include code units
   (e.g. single shifts, SI/SO, or escape sequences) that are not part
   of the CCS per se, but which are defined by the character encoding
   architecture and which may require an external registry of particular
   values (as for the ISO 2022 escape sequences). In such a case, the
   CES is called a compound CES. (A CES that only involves a single
   CCS is called a simple CES.)

   Examples: ASCII, Latin-15, Shift-JIS, UTF-16BE, UTF-16LE, UTF-8.

5. The mapping from an abstract character repertoire (ACR) to a serialised
   sequence of octets is called a Character Map (CM). A simple character
   map thus implicitly includes a CCS, a CEF, and a CES, mapping from
   abstract characters to code units to octets. A compound character
   map includes a compound CES, and thus includes more than one CCS
   and CEF. In that case, the abstract character repertoire for the
   character map is the union of the repertoires covered by the coded
   character sets involved.

   Character Maps are the things that in the IAB architecture get IANA
   charset identifiers. A sequence of encoded characters must be
   unambiguously mapped onto a sequence of octets by the charset. The
   charset must be specified in all instances, as in Internet
   protocols, where textual content is treated as an ordered sequence
   of octets, and where the textual content must be reconstructible
   from that sequence of octets.  Charset names are registered by the
   IANA according to procedures documented in [RFC2278]. In many cases,
   the same name is used for both a character map and for a character
   encoding scheme, such as UTF-16BE. Typically this is done for simple
   character maps when such usage is clear from context.

6. A transfer encoding syntax (TES) is a reversible transform of encoded
   data which may (or may not) include textual data represented in
   one or more character encoding schemes.  Examples: 8bit,
   Quoted-Printable, BASE64, UTF-7 (defunct), (UTF-5, and RACE).
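
For readers who want to see the levels side by side, the following is a
minimal Python sketch (an illustration, not part of the proposed draft
text). It takes one abstract character and shows its code point in the
UCS (CCS), its UTF-16 code units (CEF), the serialized octets for
UTF-16BE, UTF-16LE and UTF-8 (CES), and a BASE64 transform of the UTF-8
octets (TES). The function name encoding_levels and the dictionary keys
are hypothetical; the codec names are those of the Python standard
library.

```python
import base64
import struct


def encoding_levels(ch):
    """Show one character at each level of the UTR17 model."""
    # CCS: abstract character -> non-negative integer (code point).
    code_point = ord(ch)

    # CES: code units -> serialized octets, in both byte orders.
    utf16be = ch.encode("utf-16-be")
    utf16le = ch.encode("utf-16-le")
    utf8 = ch.encode("utf-8")

    # CEF: recover the 16-bit code units from the big-endian octets.
    # One code point may need one or two UTF-16 code units.
    count = len(utf16be) // 2
    code_units = list(struct.unpack(">%dH" % count, utf16be))

    # TES: a reversible transform applied to already-encoded data.
    transfer = base64.b64encode(utf8)

    return {
        "code_point": code_point,
        "utf16_code_units": code_units,
        "utf16be_octets": utf16be,
        "utf16le_octets": utf16le,
        "utf8_octets": utf8,
        "base64_of_utf8": transfer,
    }


if __name__ == "__main__":
    for name, value in encoding_levels("\u00e5").items():
        print(name, value)
```

For U+00E5 the two UTF-16 schemes yield the same single code unit but
different octet sequences, which is the distinction between a CEF and a
CES drawn in items 3 and 4. A character outside the BMP, such as
U+10400, yields two UTF-16 code units.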


2.  Section 1.2 (last paragraph)

OLD:  Those names are limited to the ASCII upper and lower-case
characters (interpreted in a case-independent fashion), the digits, and
the hyphen.

NEW:  Those names are limited to the upper- and lower-case letters a-z
(interpreted in a case-independent fashion), the digits, and the
hyphen-minus, all in ASCII.
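
The restriction described in the NEW text can be sketched as a simple
check in Python (an illustration only, not part of the proposed draft
text). The names is_ldh_label and same_name are hypothetical. The sketch
covers only the character repertoire and the case-independent
comparison; it assumes nothing about length limits or about where a
hyphen-minus may appear in a name.

```python
import string

# Letters a-z in both cases, the digits, and the hyphen-minus, all ASCII.
LDH_CHARACTERS = set(string.ascii_letters + string.digits + "-")


def is_ldh_label(label):
    """True if the label is non-empty and uses only LDH characters."""
    return len(label) > 0 and all(c in LDH_CHARACTERS for c in label)


def same_name(a, b):
    """Compare two LDH labels in a case-independent fashion."""
    if not (is_ldh_label(a) and is_ldh_label(b)):
        raise ValueError("both labels must consist of LDH characters")
    # For ASCII-only input, lower() folds exactly A-Z to a-z.
    return a.lower() == b.lower()


if __name__ == "__main__":
    print(is_ldh_label("example-1"))
    print(is_ldh_label("b\u00fccher"))
    print(same_name("Example", "eXAMPLE"))
```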


3.  References

OLD:

[UNICODE]   The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Described at
            http://www.unicode.org/unicode/standard/versions/
            Unicode3.0.html

[US-ASCII]  Coded Character Set -- 7-bit American Standard Code for
            Information Interchange, ANSI X3.4-1986.

[UTR15]     "Unicode Normalization Forms", Unicode Technical Report
            #15, http://www.unicode.org/unicode/reports/tr15/,
            Nov 1999, M. Davis & M. Duerst, Unicode Consortium.

[UTR21]     "Case Mappings", Unicode Technical Report #21,
            http://www.unicode.org/unicode/reports/tr21/, Dec 1999,
            M. Davis, Unicode Consortium.  Approved status.

NEW:

[ISO10646]  ISO/IEC 10646-1:2000 (note that an amendment 1 is in
            preparation), ISO/IEC 10646-2 (in preparation), plus
            corrigenda and amendments to these standards.

[UNICODE]   The Unicode Consortium, "The Unicode Standard". Described at
            http://www.unicode.org/unicode/standard/versions/.

[UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
            10646-1:2000. Described at
            http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.

[US-ASCII]  Coded Character Set -- 7-bit American Standard Code for
            Information Interchange, ANSI X3.4-1986; also: ISO/IEC
            646 (IRV).

[UAX15]     "Unicode Normalization Forms", Unicode Standard Annex #15,
            http://www.unicode.org/unicode/reports/tr15/, 2000-08-31,
            M. Davis and M. Duerst, Unicode Consortium.

[UTR17]     "Character Encoding Model", Unicode Technical Report #17,
            http://www.unicode.org/unicode/reports/tr17/, 2000-08-31,
            K. Whistler and M. Davis, Unicode Consortium.

[UTR21]     "Case Mappings", Unicode Technical Report #21,
            http://www.unicode.org/unicode/reports/tr21/, 2000-09-12,
            M. Davis, Unicode Consortium.


4.  Acknowledgements

OLD: Karlsson Kent <keka@im.se>

NEW: Kent Karlsson <keka@im.se>