[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [idn] Re: Legacy charset conversion in draft-ietf-idn-idna-08.txt(in ksc5601-1987)



On Tue, 28 May 2002, Doug Ewell wrote:

> Soobok Lee <lsb at postel3 dot postel dot co dot kr> wrote:
> 
> > What if  proper and stable implementation of legacy encodings in
> > IDNA is not a tangilbe object due to their frequent updates/
> > additions ?
> 
> How "frequent" are updates to legacy encodings, or updates to the
> mapping tables between legacy encodings and Unicode?
> 

FYI, I enclose the following tables from iconv library documentations.
See Korean section below.
About recent additions (1998,2002), you can visit
http://www.xfree86.org/pipermail/i18n/2002-April/003209.html


Soobok Lee
===================================================================

.* New charsets (most from libiconv)

. + Japanese
    EUC-JP (csEUCPkdFmtJapanese, EUC_JP,
      Extended_UNIX_Code_Packed_Format_for_Japanese);
    ISO-2022-JP (csISO2022JP); ISO-2022-JP-1; ISO-2022-JP-2
(csISO2022JP2);
    JIS_C6220-1969-ro (csISO14JISC6220ro, ISO646-JP, iso-ir-14, jp);
    JIS_X0201 (csHalfWidthKatakana, JIS0201, JISX0201-1976,
JISX0201.1976-0,
       X0201);
    JIS_X0208 (csISO87JISX0208, ISO-IR-87, JIS0208, JIS_X0208.1983-0,
       JIS_X0208.1983-1, JIS_X0208-1990-0, JIS_X0208.1983-1, X0208);
    JIS_X0212 (csISO159JISX02121990, ISO-IR-159, JIS0212,
JIS_X0212.1990-0,
      JIS_X0212-1990, X0212);
    SJIS (csShiftJIS, MS_KANJI, SHIFT-JIS).

. + Chinese
    BIG5 (BIG-5, BIG-FIVE, BIGFIVE, CN-BIG5 csBig5);
    EUC-CN (CN-GB, csGB2312, EUC_CN, GB2312); EUC-TW (csEUCTW, EUC_TW);
    HZ (HZ-GB-2312); ISO-2022-CN (csISO2022CN); ISO-2022-CN-EXT;
    GB_1988-80 (cn, csISO57GB1988, ISO646-CN, iso-ir-57);
    GB_2312-80 (CHINESE, csISO58GB231280, GB2312.1980-0, ISO-IR-58);
    ISO-IR-165 (CN-GB-ISOIR165).

. + Korean
    JOHAB (CP1361); EUC-KR (csEUCKR, EUC_KR); GBK (CP936);
    ISO-2022-KR (csISO2022KR);
    KSC_5601 (CP949, csKSC56011987, ISO-IR-149, KOREAN, KSC5601.1987-0,
      KS_C_5601-1987, KS_C_5601-1989, KSX1001:1992).

. + Vietnamese (independently of libiconv)
    TCVN; VIQR; VISCII; VNI; VPS.

. + Other languages
    ARMSCII-8; Georgian-Academy; Georgian-PS; WINDOWS-874 (CP874);
    MuleLao-1; CP1133 (IBM-CP1133); CP1258 (WINDOWS-1258);
    TIS-620 (ISO-IR-166, TIS620, TIS620.2529-1, TIS620-0, TIS620.2533-0,
      TIS620.2533-1).

. + Apple specifics
    MacArabic; MacCentralEurope; MacCroatian; MacCyrillic; MacGreek;
    MacHebrew; MacIceland; MacRomania; MacThai; MacTurkish; MacUkraine

. + Unicode
    JAVA; UCS-2-INTERNAL; UCS-2LE (UnicodeLITTLE); UCS-2-SWAPPED; UCS-4BE;
    UCS-4-INTERNAL; UCS-4LE; UCS-4-SWAPPED; UTF-16BE; UTF-16LE.

. + Others
    CP932; CP949 (UHC); CP950; CP866 (866, csIBM866, IBM866).
    ISO-8859-16 (ISO-IR-226, ISO_8859-16:2000).

. + Recode internal
    :libiconv: (:)   [so option -x: avoids going through libiconv]

.* New aliases (from libiconv) [list to be revised]
   csASCII (for ANSI_X3.4-1968); csHPRoman8 (for hp-roman8);
   csISOLatin1 (for ISO-8859-1); csISOLatin2 (for ISO-8859-2);
   csISOLatin3 (for ISO-8859-3); csISOLatin4 (for ISO-8859-4);
   csISOLatin5 (for ISO-8859-9);
   csISOLatin6 and ISO_8859-10:1992 (for ISO-8859-10);
   csISOLatinArabic (for ISO-8859-6); csISOLatinCyrillic (for ISO-8859-5);
   csISOLatinGreek (for ISO-8859-7); csISOLatinHebrew (for ISO-8859-8);
   csKOI8R (for KOI8-R); csPC850Multilingual (for IBM850);
   csUCS4 (for ISO-10646-UCS-4);
   csUnicode, csUnicode11, UCS-2BE, UnicodeBIG (for ISO-10646-UCS-2);
   csUnicode11UTF7 (for UNICODE-1-1-UTF-7);
   csVISCII and VISCII1.1-1 (for VISCII);
   ISO-IR-179 (for ISO-8859-13); csMacintosh and MacRoman (for macintosh);
   TCVN5712-1, TCVN5712-1:1993 and TCVN-5712 (for TCVN).

.* New surfaces
   tree (experimental).

* Version 3.5 - Francois Pinard, 1999-05.

.* Incompatible changes
. + A double dot `..' should now be used instead of a colon `:'.
. + Option --force (-f) is needed to pursue recoding despite errors.
. + There is no more quoting for special characters within charsets names.
. + Auto check (`-a') and popen (`-o') options have been withdrawn.
. + Some charsets and aliases were deleted, see `Charsets & aliases'
below.

.* Extended features
. + Program messages are available in localised form for many languages.
. + Long character names are available in French, if LANGUAGE is set to
`fr'.
. + A new request syntax allows for recode chaining, and for surfaces.
. + Option --header-file (-h) accepts a language parameter, and Perl is
new.
. + Full charset listings now show the UCS-2 value for characters.
. + Option --known=PAIRS (-k) also accepts octal and hexadecimal numbers.
. + Option --list (-l) better sorts charsets and aliases, also fully
written.
. + Charset `RFC1345' implements mnemonic+ascii+38, and is now reversible.
. + HTML is not limited anymore to Latin-1, HTML 4.0 entities are
supported.

.* New features
. + Euro support.
. + Updated RFC 1345 set of tables, from Keld Simonsen.
. + Some African charsets and transliterated forms.
. + Conversions for ISO 10646 and Unicode.
. + Combining or explosion of UCS-2 diacriticized characters and
ligatures.
. + Implementation of surfaces, see `Surfaces & aliases' below.
. + Mixed mode for recoding only comments and strings in C sources or PO
files.
. + A stand-alone recoding library gets installed, often as a shared
library.
. + Option --find-subsets (-T) lists charsets which are subsets of
another.
. + The library may generate testing data, and study character
frequencies.

.* Charsets & aliases

. + New ISO 10646 and Unicode charsets
.  - combined-UCS-2: pseudo-charset.
.  - count-characters: pseudo-charset.
.  - dump-with-names: pseudo-charset.
.  - ISO-10646-UCS-2 (UNICODE-1-1, BMP, rune, u2).
.  - ISO-10646-UCS-4 (10646, ISO-10646, UCS-4, u4).
.  - UNICODE-1-1-UTF-7 (TF-7, u7).
.  - UTF-8 (UTF-2, UTF-FSS, FSS_UTF, TF-8, u8).
.  - UTF-16 (Unicode, TF-16, u6).

. + RFC 1345.bis matters
.  - Deleted charsets
     dk-us, us-dk (because of &duplicate which `recode' does not handle
yet).
.  - New charsets
     baltic (alias is iso-ir-179); CP1250 (1250, ms-ee, windows-1250);
     CP1251 (1251, ms-cyrl, windows-1251);
     CP1252 (1252, ms-ansi, windows-1252);
     CP1253 (1253, ms-greek, windows-1253);
     CP1254 (1254, ms-turk, windows-1254);
     CP1255 (1255, ms-hebr, windows-1255);
     CP1256 (1256, ms-arab, windows-1256);
     CP1257 (1257, WinBaltRim, windows-1257);
     CWI (CWI-2, cp-hu); EBCDIC-IS-FRISS (friss);
     GOST_19768-87 with aliases of previous GOST_19768-74;
     IBM256 (256, CP256, EBCDIC-INT1); IBM875 (875, CP875, EBCDIC-Greek);
     IBM1004 (1004, CP1004, os2latin1); IBM1047 (1047, CP1047);
     ISO-8859-13 (ISO_8859-13:1998, iso-baltic, iso-ir-179a, l7, latin7);
     ISO-8859-14 (ISO_8859-14:1998, iso-celtic, iso-ir-199, l8, latin8);
     ISO-8859-15 (ISO_8859-15:1998, iso-ir-203, l9, latin9);
     KOI-7; KOI-8 (GOST_19768-74); KOI8-R; KOI8-RU; KOI8-U;
     macintosh_ce (macce); mac-is;
     NeXTSTEP (next) yet previous `recode' had it outside RFC 1345.
.  - Alias promoted to charset (with previous charset becoming alias)
     ISO-646.basic (with ISO-646.basic:1983); ISO-646.irv
(ISO-646.irv:1983);
     ISO_5427-ext (ISO_5427:1981); ISO_5428 (ISO_5428:1980);
     ISO-8859-1 (ISO_8859-1:1987); ISO-8859-2 (ISO_8859-2:1987);
     ISO-8859-3 (ISO_8859-3:1988); ISO-8859-4 (ISO_8859-4:1988);
     ISO-8859-5 (ISO_8859-5:1988); ISO-8859-6 (ISO_8859-6:1987);
     ISO-8859-7 (ISO_8859-7:1987); ISO-8859-8 (ISO_8859-8:1988);
     ISO-8859-9 (ISO_8859-9:1989); ISO-8859-10 (latin6);
     NC_NC00-10 (NC_NC00-10:81); sami (latin-lap).
.  - New aliases
     037 (for charset IBM037); 038 (IBM038); 273 (IBM273); 274 (IBM274);
     275 (IBM275); 278 (IBM278); 280 (IBM280); 281 (IBM281); 284 (IBM284);
     285 (IBM285); 290 (IBM290); 297 (IBM297); 367 (ANSI_X3.4-1968);
     420 (IBM420); 423 (IBM423); 424 (IBM424); 500, 500V1 (IBM500);
     819 (ISO-8859-1); 864 (IBM864); 868 (IBM868); 870 (IBM870);
     871 (IBM871); 880 (IBM880); 891 (IBM891); 903 (IBM903); 905 (IBM905);
     912, CP912, IBM912 (ISO-8859-2); 918 (IBM918); 1026 (IBM1026);
     ECMA-113, ECMA-113:1986 (ECMA-Cyrillic); GOST_19768-74 (KOI8);
     ISO_8859-N (ISO-8859-N) for N = 1 through 10 and 13 through 15;
     ISO_8859-10:1993 (ISO-8869-10); iso-ir-170 (INVARIANT);
     KOI8_L2 (CSN_369103); pclatin2, pcl2 (IBM852); SS636127
(SEN_850200_B).

. + New African charsets
    AFRL1-101-BPI_OCIL (t-francais, t-fra);
    AFRFUL-102-BPI_OCIL (bambara, bra, ewondo, fulfulde);
    AFRFUL-103-BPI_OCIL (t-bambara, t-bra, t-ewondo, t-fulfulde);
    AFRLIN-104-BPI_OCIL (lingala, lin, sango, wolof);
    AFRLIN-105-BPI_OCIL (t-lingala, t-lin, t-sango, t-wolof).

. + Extra miscellaneous charsets
     KEYBCS2 (Kamenicky); CORK (T1); KOI-8_CS2.

. + New HTML pseudo-charsets
    HTML_1.1 (h1);  HTML_2.0 (RFC 1866, 1866, h2); HTML-i18n (RFC 2070);
    HTML_3.2 (h3) reimplemented; HTML_4.0 (h4, HTML, h);
    deleted aliases HTF, 8859, ISO 8859, Entities, SGML, WWW, w3.

.* Surfaces & aliases
   Base64 (64, b64); Quoted-Printable (qp, Quote-Printable);
   21-Permutation (swabytes); 4321-Permutation; CR; CR-LF (cl);
   Decimal-1 (d, d1); Decimal-2 (d2), Decimal-4 (d4);
   Hexadecimal-1 (x, x1); Hexadecimal-2 (x2); Hexadecimal-4 (x4);
   Octal-1 (o, o1); Octal-2 (o2); Octal-4 (o4).
   data; test7; test8; test15; test16.