
[idn] BIG5 <-> Unicode roundtrip compatibility



I enclose an excerpt from
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT.

It says that some Traditional Chinese (TC) characters X cannot be converted
into UCS characters without losing round-trip compatibility. That is, in a
round-trip conversion   X --> UCS(X) --> X',   it may turn out that X != X'.

Any legacy-encoded IRIs (IDNs) that include such an X may fail to compare
as equal once they have been converted to or from Unicode.
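
To make the failure concrete, here is a minimal Python sketch of the round trip
X --> UCS(X) --> X'. It uses a hand-built toy table rather than a real codec.
The helper names (build_reverse, roundtrip) are hypothetical, and the Unicode
value 0x5341 is an illustrative assumption, not a value copied from BIG5.TXT.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

# Policy A: map the conflicting code 0xA2CC to the same code point as 0xA451.
# (The value 0x5341 is an assumption made for illustration.)
MAP_TO_DUPLICATE = {
    0xA451: 0x5341,
    0xA2CC: 0x5341,
}

# Policy B: leave 0xA2CC unmapped, so it decodes to U+FFFD.
MAP_TO_REPLACEMENT = {
    0xA451: 0x5341,
}


def build_reverse(forward):
    """Invert a Big5 -> Unicode table; the first entry listed wins a collision."""
    reverse = {}
    for big5, ucs in forward.items():
        reverse.setdefault(ucs, big5)
    return reverse


def roundtrip(big5, forward):
    """Return X' for X --> UCS(X) --> X', or None if UCS(X) has no way back."""
    ucs = forward.get(big5, REPLACEMENT)
    return build_reverse(forward).get(ucs)


x = 0xA2CC
x_dup = roundtrip(x, MAP_TO_DUPLICATE)    # comes back as a different Big5 code
x_rep = roundtrip(x, MAP_TO_REPLACEMENT)  # cannot be converted back at all
```

Under either policy X != X', so two names that were byte-for-byte distinct
(or identical) in BIG5 can compare differently after passing through Unicode.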

I would appreciate it if anyone could present the history of the BIG5 versions
and their round-trip compatibility problems in more detail.

Soobok Lee
--------------------------------------------------------------------------------


# Name:             BIG5 to Unicode table (complete)
# Unicode version:  1.1
# Table version:    0.0d3
# Table format:     Format A
# Date:             11 February 1994
#
# Copyright (c) 1991-1994 Unicode, Inc.  All Rights reserved.

(snip)
 
# If you have carefully considered the fact that the mappings in
# this table are only one possible set of mappings between BIG5 and
# Unicode and have no normative status, but still feel that you
# have located an error in the table that requires fixing, you may
# report any such error to errata@unicode.org.
#
# WARNING!  It is currently impossible to provide round-trip compatibility
# between BIG5 and Unicode.  
#
# A number of characters are not currently mapped because
# of conflicts with other mappings.  They are as follows:
#
#       BIG5        Description                    Comments
#
#       0xA15A      SPACING UNDERSCORE             duplicates A1C4
#       0xA1C3      SPACING HEAVY OVERSCORE        not in Unicode
#       0xA1C5      SPACING HEAVY UNDERSCORE       not in Unicode
#       0xA1FE      LT DIAG UP RIGHT TO LOW LEFT   duplicates A2AC
#       0xA240      LT DIAG UP LEFT TO LOW RIGHT   duplicates A2AD
#       0xA2CC      HANGZHOU NUMERAL TEN           conflicts with A451 mapping
#       0xA2CE      HANGZHOU NUMERAL THIRTY        conflicts with A4CA mapping
#
# We currently map all of these characters to U+FFFD REPLACEMENT CHARACTER.
# It is also possible to map these characters to their duplicates, or to
# the user zone.  
# 
# Notes:
#
# 1. In addition to the above, there is some uncertainty about the
#       mappings in the range C6A1 - C8FE, and F9DD - F9FE.  The ETEN
# version of BIG5 organizes the former range differently, and adds
# additional characters in the latter range.  The correct mappings
# these ranges need to be determined.
#
# 2.  There is an uncertainty in the mapping of the Big Five character
# 0xA3BC.  This character occurs within the Big Five block of tone marks
# for bopomofo and is intended to be the tone mark for the first tone in
# Mandarin Chinese.  We have selected the mapping U+02C9 MODIFIER LETTER
# MACRON (Mandarin Chinese first tone) to reflect this semantic.  
# However, because bopomofo uses the absense of a tone mark to indicate
# the first Mandarin tone, most implementations of Big Five represent
# this character with a blank space, and so a mapping such as U+2003 EM
# SPACE might be preferred.  
#
# Format:  Three tab-separated columns
# Column #1 is the BIG5 code (in hex as 0xXXXX)
# Column #2 is the Unicode (in hex as 0xXXXX)
# Column #3  is the Unicode name (follows a comment sign, '#')
# The official names for Unicode characters U+4E00
# to U+9FA5, inclusive, is "CJK UNIFIED IDEOGRAPH-XXXX",
# where XXXX is the code point.  Including all these
# names in this file increases its size substantially
# and needlessly.  The token "<CJK>" is used for the
# name of these characters.  If necessary, it can be
# expanded algorithmically by a parser or editor.
#
# The entries are in BIG5 order
#
#
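
The three-column format described in the header above could be read with a
short parser along the following lines. This is only a sketch: the function
names (parse_mapping, unmapped_codes) are hypothetical, and the SAMPLE lines
are illustrative stand-ins written in the stated format, not rows quoted from
the table.

```python
# Illustrative input: tab-separated Big5 code, Unicode value, '#' + name.
SAMPLE = (
    "# The entries are in BIG5 order\n"
    "0xA140\t0x3000\t# IDEOGRAPHIC SPACE\n"
    "0xA15A\t0xFFFD\t# REPLACEMENT CHARACTER\n"
    "0xA451\t0x5341\t# <CJK>\n"
)


def parse_mapping(text):
    """Return a list of (big5, unicode, name) tuples from Format A text."""
    entries = []
    for line in text.splitlines():
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        big5_hex, ucs_hex, comment = line.split("\t", 2)
        ucs = int(ucs_hex, 16)  # int() accepts the leading "0x" with base 16
        name = comment.lstrip("#").strip()
        # Expand the "<CJK>" token algorithmically, as the header suggests.
        if name == "<CJK>":
            name = "CJK UNIFIED IDEOGRAPH-%04X" % ucs
        entries.append((int(big5_hex, 16), ucs, name))
    return entries


def unmapped_codes(entries):
    """Big5 codes sent to U+FFFD, i.e. those that cannot round-trip."""
    return [big5 for big5, ucs, _ in entries if ucs == 0xFFFD]


entries = parse_mapping(SAMPLE)
lossy = unmapped_codes(entries)
```

Listing the codes that map to U+FFFD is one simple way to find the characters
for which a converter built from such a table loses information.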