
[idn] Re: Fwd: Unicode letter ballot



Simon Josefsson said:

> > sigh.... all 5 are beyond-BMP characters added recently.
> > if we could go back in time, we could have implemented a policy of not
> > accepting characters as stable before 2 unicode versions had gone
> > by....
> >
> > proofreading takes time.
> 
> Even two Unicode releases doesn't guarantee that the tables are
> correct.  

Of course. But in this case, because of the sensitivity of this
problem, 5 independent audits of the Plane 2 CJK compatibility
characters by CJK experts have converged on the same answer.
There are 5 mistakes (4 clerical and 1 visual) in the current
mappings for that recently added set of 542 supplementary characters
on Plane 2.
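[For reference, the set in question is the CJK Compatibility
Ideographs Supplement block, U+2F800..U+2FA1D. A quick Python check
(my illustration, not part of the ballot discussion) confirms the
count and the fact that every character in the block carries a
singleton canonical mapping to a unified ideograph -- the mappings
under audit:]

```python
import unicodedata

# CJK Compatibility Ideographs Supplement: U+2F800..U+2FA1D inclusive.
block = [chr(cp) for cp in range(0x2F800, 0x2FA1E)]
assert len(block) == 542

for ch in block:
    # Each compatibility ideograph normalizes away to a single
    # unified Han character -- its canonical mapping.
    mapped = unicodedata.normalize('NFC', ch)
    assert mapped != ch      # never stable under normalization
    assert len(mapped) == 1  # always a singleton target
```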

[There are also other known issues in the mappings, whereby
one Han variant or another might be a "better" mapping, but there
is also consensus among the experts that none of those issues
rise to the level of blatant *mistakes* that must be corrected in
the current tables -- which is why none of them is being
balloted now, or will be in the future.]

The U.S. national body and the UTC are both also pushing
back hard on the proposed addition of another 122 CJK compatibility
characters because the CJK mapping experts have discovered errors
in *that* table as well. In this case, having learned our lesson,
the UTC is trying to be proactive and ensure that all errors are
removed from the table *before* such an addition is standardized.

> The only proper solution I can see is to stop modifying
> published decomposition tables.  When mistakes are discovered, new
> character codes with proper decompositions should be added and the old
> character codes declared obsolete -- which is option B in the vote,

This will lead to other interoperability problems. The 542
supplementary characters in question (and all of the ones involving
the errors) are CNS compatibility characters. They are there to
provide round-trip mappings to the CNS 11643 standard. If you
"obsolete" 5 code points and then add 5 new ones, then it is
inevitable that CNS mapping tables will get updated to use the
new code points instead of the old ones (and there will be some
inconsistency in the mappings, because of the duplications, during
this transition) -- because the old code points get normalized
away to nonsense characters. This will undoubtedly lead to further
problems, including for IDNA string matching, as one member of each
duplicated pair normalizes one way, and the other -- apparently
identical -- normalizes another way. And you can't escape the
problem by just adding the 5 obsolete code points to the
stringprep prohibited list, because that, *too*, would
destabilize your specification: a string that was valid before
you did so would be invalid afterward.
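[To make the matching hazard concrete, here is a minimal Python
sketch -- my illustration of the current, correct mapping, not of
the proposed change -- showing how a compatibility ideograph and
its canonical counterpart render identically yet compare equal only
because normalization folds them together:]

```python
import unicodedata

# U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 canonically maps to
# U+4E3D, a unified ideograph with an identical appearance.
compat = '\U0002F800'
unified = '\u4E3D'

assert compat != unified                                # raw code points differ
assert unicodedata.normalize('NFC', compat) == unified  # normalization folds them

# Stringprep/nameprep applies NFKC, so two labels containing these
# "identical" characters match only while this mapping stays stable.
assert (unicodedata.normalize('NFKC', compat)
        == unicodedata.normalize('NFKC', unified))
```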

> but unfortunately neither IDN nor IETF has any voting powers (which
> suggest a methodological problem).

Why would IDN have voting powers here? You don't expect the
UTC or SC2/WG2 to have voting powers in an IDN working group, do you?

As for the IETF, the UTC and the IETF have a *liaison* relationship.
The UTC immediately informed the IETF liaison about this ballot, because it
knew this was an important issue that IETF participants are
concerned about. That is why this discussion has migrated over
to interested parties on the IDN list who have worked on IDNA
and stringprep.

But the buck has to stop somewhere. Ultimately the UTC and WG2
are responsible for the CJK compatibility character mapping
tables. So those committees have to take the relevant votes,
and if they end up standardizing errors, also have to take
the relevant knocks when they go to fix the errors.

To influence the actual *voting* on this (or other issues),
one works through the UTC voting member representatives in
the UTC case, or one works through the national bodies
participating in SC2 in the SC2/WG2 case.

--Ken