
[idn] RE: Normalisation and ASCII fallbacks

> -----Original Message----- 
> From: Dan Oscarsson [mailto:Dan.Oscarsson@trab.se] 
> Sent: Tuesday, February 15, 2000 11:53 AM 
> To: idn@ops.ietf.org; keka@im.se 
> Subject: Normalisation and ASCII fallbacks 
> 
> 
> Kent wrote: 
> >I suggest ignoring UTR 21.  Just downcase according to the 'default' 
> >(non-normative) in the main property table for Unicode 3.0. 
> > 
> >Correction:  Normalisation has to be done (in principle) AFTER a case 
> >change of any form, otherwise the result might not be in any one of 
> >the defined normal forms.  (I can supply a detailed argument if you 
> >like.) 
> 
> Yes, please. 

Concerning normalisation forms C and KC: 

Some examples will demonstrate this (look at the "SpecialCasing.txt" file 
that is part of the "Unicode character database"): 

normalise(C, <J><combining caron above>) = <J><combining caron above>  # nf C

tolower(<J><combining caron above>) = <j><combining caron above>       # non-norm.
normalise(C, <j><combining caron above>) = <j with caron above>        # nf C

normalise(C, <sharp s><combining caron above>) = <sharp s><combining caron above>
toupper(<sharp s><combining caron above>) = <S><S><combining caron above>
normalise(C, <S><S><combining caron above>) = <S><S with caron above>

normalise(C, <J><combining caron above>) = <J><combining caron above>
utr21(<J><combining caron above>) = <j><combining caron above>
normalise(C, <j><combining caron above>) = <j with caron above>

And so on for several other instances. 
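The same effect can be seen with Python's stdlib unicodedata module, which implements these normal forms (a minimal sketch; \u030C is COMBINING CARON, \u01F0 is the precomposed small j with caron, \u00DF is sharp s):

```python
import unicodedata

# <J><combining caron above>: already in form C, since Unicode has no
# precomposed capital J-with-caron.
s = "J\u030C"
assert unicodedata.normalize("NFC", s) == s

# Lowercasing takes the string out of form C: <j><combining caron above>
# recomposes to the precomposed <j with caron> under NFC.
lowered = s.lower()                                    # "j\u030C"
assert unicodedata.normalize("NFC", lowered) == "\u01F0"

# <sharp s><combining caron above> uppercases to <S><S><caron>,
# which NFC then composes to <S><S with caron> (U+0160).
t = "\u00DF\u030C"
assert unicodedata.normalize("NFC", t) == t
uppered = t.upper()                                    # "SS\u030C"
assert unicodedata.normalize("NFC", uppered) == "S\u0160"
```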

> 
> I used UTR 21 because I thought it was more correct, but from what you 
> say maybe it is best to go the simplest way for many programmers. It 
> might also be best with simple rules. 

Though UTR 21 went from "proposed draft" to "draft" just a few days 
ago, I think it's fundamentally flawed, and will suggest improvements. 
(Along the lines I have suggested on this list.) 

> To do case folding to lower case, just use the main property table 
> for Unicode 3.0. This has one to one mapping only which is 
> very nice as it simplifies data handling. 
> 
> 
> In DNS international text will be used both in domain names and in 
> text data in records (like in the TXT or HINFO record). 
> The simplest way would be to say: use normalisation form C 
> everywhere.  This way everything is the same in all places and you 
> do not have to worry about different forms at different places. 

I think it is sufficient to normalise for comparison purposes only. 
Text that will not take part in any lookup need not be normalised 
at all, or if it is, normalisation form C should be used, NOT form 
KC for that. 

But for the lookup not to be too slow, the down-cased and normalised 
version of the names should be stored.  I would hope storing that 
belongs to the internal workings of a DNS implementation, not 
manifest in the protocol.  Is that so? 



> The next matter is doing a lookup in DNS.  Here it would be nice to 
> match names as a human would want, ignoring cosmetic differences. 
> To just say: case fold the form C data to lower case and then 
> compare, would be the simplest.  But that might make a lot of names 
> that many humans consider the same compare as different. 

I'm not sure I follow your reasoning here.  My reasoning is, 
as it has been, that if case does not matter (everyone seems 
to want case insensitive, right?), compatibility variation 
should not matter either.  This is not to say that case 
or compatibility variants are immaterial in general, just 
that they would (and should) be in the case of IDNs. 

> I removed form KC from my draft because Harald Alvestrand said it was 
> difficult. But maybe that is wrong? 
> Can you convert from form C to form KC without going through a 
> decomposing step? 

No, except when you can detect that the string is already in form KC. 
Note that: 

D: (nominally) decompose recursively according to canonical 
   decompositions and then reorder combining marks to canonical order 

C: (nominally) D followed by (slightly complicated) canonical composition 

KD: (nominally) decompose recursively according to canonical AND 
    compatibility decompositions and then reorder combining marks 
    to canonical order 

KC: (nominally) KD followed by (slightly complicated) canonical composition 

D and KD are very similar in (nominal) complexity.  So are C and KC. 
Of course one need not actually decompose, and recompose, for strings 
that are already in the target normal form. 
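The four forms can be sketched with Python's unicodedata (the "fi" ligature U+FB01 carries a compatibility decomposition, é a canonical one):

```python
import unicodedata

s = "\u00E9\uFB01"   # é followed by the "fi" ligature

# D: canonical decomposition only; the compatibility ligature survives.
assert unicodedata.normalize("NFD", s) == "e\u0301\uFB01"
# C: D followed by canonical composition; s was already in form C.
assert unicodedata.normalize("NFC", s) == s
# KD: canonical AND compatibility decomposition.
assert unicodedata.normalize("NFKD", s) == "e\u0301fi"
# KC: KD followed by canonical composition.  Note that C -> KC cannot
# skip the decomposing step: the ligature only falls apart there.
assert unicodedata.normalize("NFKC", s) == "\u00E9fi"
```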

I would add to this (see SpecialCasing.txt re sigma): 

LKD: map to lowercase, respecting SIGMA to sigma/final sigma, then KD 

LKC: map to lowercase, respecting SIGMA to sigma/final sigma, then KC 

(none of the other conceivable combinations make much sense, IMHO) 
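A minimal sketch of the proposed LKC, assuming Python, whose str.lower() already applies the SIGMA rule from SpecialCasing.txt (the name lkc is mine, not an existing API):

```python
import unicodedata

def lkc(s: str) -> str:
    # Hypothetical "LKC" as proposed above: lowercase first (Python's
    # str.lower() maps SIGMA to sigma or final sigma by context),
    # then normalise to form KC.
    return unicodedata.normalize("NFKC", s.lower())

# Final sigma: U+03A3 lowercases to U+03C2 at word end, U+03C3 medially.
assert lkc("\u039F\u03A3") == "\u03BF\u03C2"   # OMICRON SIGMA -> ο + final ς
# Case change followed by normalisation: J + caron -> precomposed ǰ.
assert lkc("J\u030C") == "\u01F0"
```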



> I prefer to say form C everywhere (and in original case), because it 
> allows cosmetic to be preserved. And that is important for 
> many people. 

For non-compared texts: don't normalise, or normalise to form D or C. 

For lookup: 
If compatibility differences are to be preserved (questionable in 
many cases) then most certainly case distinctions must be preserved too, 
i.e. case (and more) sensitive lookup.  But for case insensitive lookup, 
use form LKC (as described above). 
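Case insensitive lookup along these lines might look as follows (a sketch only; the zone dict and lookup_key function are illustrative, not part of any DNS implementation):

```python
import unicodedata

def lookup_key(name: str) -> str:
    # Illustrative LKC key for case insensitive lookup:
    # lowercase, then normalise to form KC.
    return unicodedata.normalize("NFKC", name.lower())

# Hypothetical zone data, stored under the pre-computed LKC key so
# lookups need not renormalise the stored names every time.
zone = {lookup_key("\u00C5land.example"): "192.0.2.1"}   # "Åland.example"

# A query written with decomposed lowercase a + combining ring above
# yields the same key and finds the same record.
assert zone.get(lookup_key("a\u030Aland.example")) == "192.0.2.1"
```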

> Still it would be nice to say: when doing a lookup, the data should 
> be normalised from form C to KC, lower cased (or maybe this before 
> normalising) and maybe additional simplification added before 
> doing the comparison. 
> But, can this be done easily?  And are we ready to define the 
> matching rules so they need not be changed tomorrow? 

I don't know about "tomorrow", but I do hope this subissue will 
stabilise soon. 

> There might be one more problem, more systems than DNS may need to 
> compare domain names. 

Their lookups may be fuzzier, in the sense that "same" names need 
not compare equal.  The only impact should be performance (more downloads). 

> For example a web browser needs to compare 
> domain names to know if two URLs point to the same place.  With 
> form KC and other rules, all that needs to go into every program 
> that compares domain names. 
> Though for many programs just handling the folding to lower case 
> would be enough. 
> 
> - 
> In my draft I gave a way to map the internal UTF-8 encoded data 
> in DNS to ASCII (or a larger subset of UCS) for all characters not 
> handled by the subset.  This was done for two reasons: 
> 1) To allow users to see and enter names not supported by their 
> local character set. 
> 2) To allow "old DNS" to get ASCII only names that also can work 
> with old SMTP and other protocols. 
> 
> Here Kent said: 
> >1) "The" local character set?  Whose local character set?  Such 
> >things are these days usually personal preferences, or rather just 
> >personal defaults swiftly overridden by "charset=", heuristics, or 
> >plain temporary change.  Setting one non-UCS character encoding as 
> >"the" local one for an entire organisation or the like is highly 
> >inappropriate. 
> 
> The local character set is the one the user is using. 

I'm using about a handful of different encodings right now, 
on a single computer screen (several windows).  With a single 
menu selection I can target this message to be sent to you in 
any one of about 30 encodings, including UTF-8 (though internally 
UCS-2 is used).  (Just for fun, I'm targeting Big-5 (Traditional 
Chinese).  Here are two Chinese characters: 一丁.  I'm using a very 
common platform to do this...) 

Now, which encoding did you say?? 

> If the user is in locale 

I'm not using any (POSIX?) 'locale' at all.  There are, to some 
extent, similar notions on this system, but no 'locale' per se. 

> ASCII, it will be used.  If the user is in ISO 8859-15 
> it will be used.  But of course, just like you have today, you have 
> a problem if a person of one locale gives the name to a person in 
> another locale without remapping it to fit that locale. 
> I did not think that you should set one for the entire 
> organisation.  And I do not think the problem is easy. 
> While the user using ISO 8859-15 wants domain names displayed using 
> ISO 8859-15, if possible, when sending e-mail the domain names must 
> either be in ASCII (using old SMTP) or UTF-8 (using a future SMTP). 
> Even if we say that a person using ISO 8859-15 maybe does not 
> need (want) to use a domain name in Korean (the page is in Korean 
> and cannot be displayed) 

I can have it *displayed* nicely.  Me not being able to *read* it is 
another matter...  

                Kind regards 
                /kent k