
Re: [idn] Determining equivalence in Unicode DNS names





On Mon, 21 Jan 2002 22:55:34 -0500 John C Klensin <klensin@jck.com>
writes:
> Liana (and others),
> 
> Let's try to review how we have gotten here, since I partially
> disagree with Patrik (the disagreement leads, however, to the
> same conclusion, only more strongly).
> 
> --On Sunday, 20 January, 2002 23:01 -0800 liana Ye
> <liana.ydisg@juno.com> wrote:
> 
> >> So, this wg decided that we will use one and only one
> >> matching rule, just  like we decided to use only one
> >> character set. Both of these  (the rule and the charset) are
> >> created in the Unicode Consortium.
> 
> Actually, I don't believe this is correct --or that the language
> isn't precise enough-- so let me try to restate it.  The fact
> that there is only one matching rule is a consequence of the
> design of the DNS: largely because of the binary nature of the
> underlying  architecture, the only matching rule that is
> possible involves a bit-wise comparison on each octet in turn.
> Unless the labels are assumed to be "binary" (i.e., the
> case-mapping rule does not apply) that bit-wise comparison is
> made under a mask that collapses ASCII upper-case characters
> onto their lower-case counterparts.
> 

Agreed, and this is what has given the DNS its stability over the
past years.
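
To make the rule John describes concrete, here is a minimal sketch of
an octet-by-octet comparison under an ASCII case mask. The function
names are made up for illustration and are not taken from any DNS
implementation; the sketch only assumes that the octets 0x41-0x5A
(A-Z) are folded onto 0x61-0x7A (a-z) and that every other octet is
compared exactly.

```python
def dns_label_key(label: bytes) -> bytes:
    """Fold ASCII A-Z onto a-z; leave every other octet untouched."""
    # (b | 0x20) sets the bit that separates ASCII upper case from lower case.
    return bytes(b | 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def dns_labels_match(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet under the ASCII case mask."""
    return dns_label_key(a) == dns_label_key(b)


if __name__ == "__main__":
    print(dns_labels_match(b"Example", b"eXAMPLE"))
    # The mask knows nothing about non-ASCII letters: the UTF-8 octets
    # of U+00C9 and U+00E9 are compared exactly, so they do not match.
    print(dns_labels_match("\u00c9".encode("utf-8"), "\u00e9".encode("utf-8")))
```

The second call shows why anything beyond ASCII case folding has to
happen before a name reaches this comparison.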


> Now, we can map or transform all sorts of things together before
> they get injected into the actual DNS.  But, as far as the DNS
> is concerned, there is only one mapping rule, or two if "binary
> labels" are handled differently from "character labels", but
> that is it.
> 

I am not suggesting any other way. The "character labels" are
proposed to be handled the same way, under the one current
matching rule in the DNS, NOT two rules.

> Are there ways around this?  Of course there are.  The WG has
> looked at many of them and given up on them long ago.  For
> example:
> 
> (i) One could introduce a completely new set of query types for
> labels that might contain non-ASCII strings, and apply a
> different matching rule for them.  Unfortunately, this would be
> hugely complex -- we are having quite enough problems with the
> three address-type records/queries we have today -- and might
> take forever to deploy.
> 
> (ii) One could introduce a single new label and query type whose
> sole purpose was to provide a new form of alias -- like CNAME
> only with Unicode (encoded somehow) in the label.  The WG looked
> at variants on that theme early on and decided to not pursue
> them.  In hindsight, the approach essentially implies "within
> DNS layering" and has all of the disadvantages of an "above DNS"
> layering scheme and few of the advantages.
> 
> (iii) One could adopt the "new class" model, using that class to
> impose a somewhat different set of matching rules by redefining
> the query types.  That was, you will recall, proposed and
> rejected (or, more accurately, ignored), I think mostly because
> people were concerned about the length of time it would take to
> deploy.
>
> (iv) One could try to use EDNS to specify alternate mapping
> rules or conventions.  Again, the problem is deployment,
> aggravated by some rather complex issues involving caching and
> secondary servers.  And, again, the WG considered some set of
> options in this group and couldn't get consensus around moving
> forward with them.
> 
> One thing that all of these --except, possibly, the last-- have
> in common is that, while they would permit a _different_ matching
> rule, none of them permit per-language or per-script rules.  You
> still get only one (or two).
> 
> > The decision to use one and only one matching rule is 
> > false, because no ONE rule can deal with 
> > hundreds of different scripts, no matter how strongly 
> > you defend that stand.  
> 
> The rule doesn't deal with scripts at all.  It doesn't deal with
> languages either.  It deals only with strings of bits and deals
> with them in a mathematically simplistic way.
> 
> > The charset appears in the form of a table, which gives 
> > us a better view of these chars and we have a base 
> > to work with. But when you try to use only ONE rule to 
> > deal with all of the chars, the rule has to be biased toward 
> > only one type of script, currently Latin.
> 
> No.  The only "bias" is the ability to use a bit masking
> algorithm for ASCII upper-lower case matching.  There are severe
> and complex problems in dealing with the range of Latin scripts.
> 
> >  If we let this 
> > pass from this IDN wg, then all of us here are arguing
> > for nothing, as the "Prohibit CDN code points" is the 
> > only way out suggested by Kenny Huang.   
>  
> > Isn't it time for the WG to wake up from the "one rule 
> > for all" scope?
> > 
> > The only way out is to use multiple rules to treat multiple 
> > scripts, where each rule is identified by a tag. 
> 
> And where are you going to put the tags?  Again, the WG has been
> around this question many times.  There isn't a logical place to
> put the tags unless one further shortens the fraction of the
> label-length-space available for user-visible information.

The tag should be transparent to the user, and yes, should be
a part of the user-visible label.

> There are difficult problems in doing the tagging from a user
> interface basis.  Language (or even "script") tagging would
> prohibit many combinations of characters for which there is
> clear demand, and the WG has been reluctant or unwilling to
> start making the policy decisions that "one label, one script"
> implies.  
> 

An uncontrolled mix of character scripts is dangerous, as the 
discussion on the list has already shown.  There must be some
type of control over which scripts can be combined, to rule out 
combinations that do not make sense. "One label, one script" is 
not an accurate description of the tagging method.  I would rather 
call it "one label, one language", i.e. ja- means Japanese using 
the two scripts kana and Kanji; kr- means Korean using the 
Hangul and Hanja scripts; vi- means Vietnamese using the 
Latin and Han scripts. If some group wants a Latin and 
Greek mix, and there is such a language tag in [ISO639], 
then we can provide that.  Chinese, Latin or Armenian 
can be "one label, one script", since Latin, Chinese and 
Arabic are already used by multiple languages. 
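
As a rough sketch of the "one label, one language" check described
above: the table below, the function name and the tag-hyphen-body
layout are assumptions made for illustration, not the idn-map
specification. Python's standard library has no script property, so
the sketch approximates a character's script by the prefix of its
Unicode character name.

```python
import unicodedata

# Hypothetical table: language tag -> character-name prefixes standing
# in for the scripts that language may use in a label.
ALLOWED_SCRIPTS = {
    "ja": ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH"),
    "kr": ("HANGUL", "CJK UNIFIED IDEOGRAPH"),
    "vi": ("LATIN", "CJK UNIFIED IDEOGRAPH"),
}


def label_fits_tag(tagged_label: str) -> bool:
    """Return True if every character after 'tag-' is allowed for that tag."""
    tag, sep, body = tagged_label.partition("-")
    prefixes = ALLOWED_SCRIPTS.get(tag)
    if not sep or not body or prefixes is None:
        return False
    # Simplification: digits and hyphens in the body are rejected too.
    return all(unicodedata.name(ch, "").startswith(prefixes) for ch in body)


if __name__ == "__main__":
    print(label_fits_tag("ja-\u65e5\u672c\u8a9e"))  # Kanji under the ja- tag
    print(label_fits_tag("ja-\ud55c\uad6d"))        # Hangul under the ja- tag
```

A real table would have to come from an agreed list of tags and
scripts; the point here is only that the check is a lookup on the tag
followed by a per-character test.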

> The bottom line is that, if you really think you know how to do
> this, we are anxiously awaiting the Internet Draft.  But it
> needs to cover all of those complex cases with mixed-language
> labels and scripts that can't be neatly mapped onto languages
> and vice versa.

If there is enough interest in the idea, I will submit the next
version of idn-map today.

> My own belief is that many of these things require, not a
> complex set of binary matching rules, but enough of a notion of
> distance functions to talk, not about "matching" but about
> similarities and differences.  Cyrillic, Greek, and Latin "A"
> look similar, and are "sort of" the same letter (same origins,
> overlapping pronunciations), but don't "match".  Greek
> lower-case omega "looks somewhat like" Latin "w", but it isn't
> the same letter (even "sort of").  Really getting TC<->SC
> mappings right may require context, distance/probability
> functions, or both (the latter if there isn't enough context).
> We can posit doing those things in other sorts of systems, but
> not in the DNS, even with a broader set of matching rules.
> 
> Now, I know you don't like that answer, and that saddens me.
> But wishing, and even unhappiness, won't change the mathematics
> that underlie DNS matching.
> 
>      john

That is precisely the reason we need multiple rules in IDN, so 
that we can keep one rule and one charset in the DNS.  
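
To illustrate John's "A" example: the three look-alike capitals are
distinct code points, and they stay distinct even after Unicode
NFKC normalization and case folding. Using NFKC plus case folding as
the pre-DNS mapping step is an assumption made for this sketch, not a
statement of what the WG has adopted.

```python
import unicodedata

# Latin, Greek and Cyrillic capital "A": visually similar, different code points.
LOOKALIKES = {
    "Latin": "\u0041",
    "Greek": "\u0391",
    "Cyrillic": "\u0410",
}


def mapped(ch: str) -> str:
    """Apply NFKC normalization, then case folding, to one character."""
    return unicodedata.normalize("NFKC", ch).casefold()


if __name__ == "__main__":
    for script, ch in LOOKALIKES.items():
        folded = mapped(ch)
        print(f"{script}: U+{ord(ch):04X} -> U+{ord(folded):04X}")
    # Three different results: similarity is not something this mapping sees.
    print(len({mapped(ch) for ch in LOOKALIKES.values()}))
```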

Liana