
Re: [idn] case folding



The answer here is for DNS; there is no "searching" involved.

We are more interested in "matching".

-James Seng

Andrew Hodgson wrote:
> 
> I have struggled with KC form for the purpose of general text searching
> (which may be different in some respects from identifier matching).  Some of
> the compatibility mappings seem absolutely necessary for sensible
> searching - for example the <wide> and <narrow> mappings.  Others I feel
> cannot be used.  One particular example is:
> 
>  DIGIT 5 + VULGAR FRACTION ONE QUARTER
> 
> in KC I believe this becomes
> 
>  DIGIT 5 + DIGIT 1 + FRACTION SLASH + DIGIT 4
> 
> This now will be found by a search for "51".   I'm not sure what the
> semantics of FRACTION SLASH are.  Have I changed "5 and a quarter" into "51
> over 4"?
> 
> There are other examples of this in KC.  Most involve digits and a loss of
> clear term separation.
> 
>     Andrew Hodgson
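
The fraction example above can be reproduced with a few lines of code. This is a minimal sketch using Python's standard unicodedata module; the choice of Python and the variable names are assumptions made for illustration, not something taken from the thread.

```python
import unicodedata

# DIGIT FIVE followed by VULGAR FRACTION ONE QUARTER (U+00BC)
text = "5\u00bc"
nfkc = unicodedata.normalize("NFKC", text)

# Compatibility normalisation expands the fraction into three characters
names = [unicodedata.name(ch) for ch in nfkc]
print(names)

# A naive substring search for "51" matches only after NFKC
found_before = "51" in text
found_after = "51" in nfkc
print(found_before, found_after)

# The width mappings come from the same normal form
wide_a = unicodedata.normalize("NFKC", "\uff21")  # FULLWIDTH LATIN CAPITAL LETTER A
print(wide_a)
```

The same normal form that usefully merges the wide and narrow variants also produces the "51" match, which is the trade-off described above.
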
> 
> -----Original Message-----
> From: Mark Davis <markdavis@ispchannel.com>
> To: Multiple Recipients of Unicore <unicore@unicode.org>
> Cc: idn@ops.ietf.org <idn@ops.ietf.org>
> Date: June 20, 2000 10:51 AM
> Subject: Re: [idn] case folding
> 
> >NFKC is not appropriate for general data, such as in XML. However, I agree
> with Kent that it may be appropriate for identifiers and in DNS names, since
> it erases some distinctions that are not relevant for them. For example,
> there is little purpose to distinguishing half-width and full-width
> characters in identifiers.
> >
> >Before tossing out NFKC, I'd like to see examples of some cases that the
> contras believe are problems.
> >
> >Mark
> >
> >Karlsson Kent - keka wrote:
> >
> >>
> >>
> >> I don't see what good it would be to distinguish between, e.g. "fi" and
> "<fi>"
> >> (where <fi> is the fi ligature character), and between a and (fullwidth)
> a, etc.
> >> Especially since we do not plan to distinguish between a and A (e.g.).
> >>
> >> Mostly the problem is with characters with codes in the range U+F900 to
> >> U+FFFF (but occasional others *too*).  If using KC is going too far, it
> is also
> >> clear that normal form C is insufficient, not only because it (like all
> of the
> >> defined normal forms) is case preserving, but also because undesirable
> >> compatibility distinctions are maintained by NF C (and NF D).  Some
> >> of the compatibility distinctions may be desirable to maintain, while
> most
> >> are certainly undesirable to maintain as distinctions for identifier
> comparisons,
> >> even case sensitive identifier name comparisons, and much more so for
> case
> >> insensitive ones.
> >>
> >> One could disallow certain compatibility (letter) characters to be part
> of an IDN,
> >> but that would have the undesirable effect of producing errors (of some
> kind, like
> >> "not found") for IDNs that look quite ok to the user, being nearly or
> fully
> >> indistinguishable as glyph displays relative to the "proper" name.
> >>
> >> Another possibility is for the "case folding" data file to provide
> "folding" *also*
> >> of undesirable (for identifier name identity) compatibility distinctions
> to a
> >> common "folded" character (or character sequence).  Then case folding
> >> followed by normalisation to NF C would be quite ok, in my view.
> >>
> >>                 Kind regards
> >>                 /kent k
> >>
> >> > -----Original Message-----
> >> > From: Martin J. Duerst [mailto:duerst@w3.org]
> >> > Sent: Thursday, June 15, 2000 12:41 PM
> >> > To: unicore@unicode.org; Multiple Recipients of Unicore
> >> > Cc: idn@ops.ietf.org; unicore@unicode.org
> >> > Subject: Re: [idn] case folding
> >> >
> >> >
> >> > I agree with Ken that using KC is going too far.
> >> >
> >> > Regards,   Martin.
> >> >
> >> > At 00/06/14 01:07 -0800, James Seng wrote:
> >> > >I believe this is an important issue. Maybe someone from
> >> > Unicore can share
> >> > >some insight?
> >> > >
> >> > >ps: Yes, this is what we meant by "sharing the blame" :-)
> >> > >
> >> > >-James Seng
> >> > >
> >> > > > Karlsson Kent - keka wrote:
> >> > > >
> >> > > > ...
> >> > > >
> >> > > > > But the problem is that since the beginning they should have
> >> > > > > considered d-o^ng and D-O^NG different labels!
> >> > > > > We cannot do this for ASCII roman letters [A-Za-z],
> >> > since we must
> >> > > > > retain backward compatibility, but nobody can stop us
> >> > from saying
> >> > > > > "if in a label there is a character outside [0-9A-Za-z-], then
> >> > > > > that label is case sensitive".
> >> > > >
> >> > > > The idea of letting a-z be case insensitive, but "everything else"
> >> > > > be case sensitive is a very strange idea.  When would the
> >> > case mapping
> >> > > > of A-Z (to lowercase) take place?  If it's done before
> >> > normalisation
> >> > > > (to normal form C or KC according to UTR 15), then any
> >> > and arbitrary
> >> > > > characters that are diacritised A-Z may or may not be
> >> > mapped to lowercase.
> >> > > > That is because there may be letters there in decomposed
> >> > form, e.g.,
> >> > > > A+¨ would be mapped to a+¨ and then normalised to
> >> >  ä whereas Ä would
> >> > > > not be touched.
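
A sketch of the ordering problem just described, assuming Python and a hypothetical ascii_lower helper that implements the "only A-Z is case insensitive" rule. Two canonically equivalent spellings of A with diaeresis end up different when the ASCII-only case mapping runs before normalisation to form C.

```python
import unicodedata

def ascii_lower(s: str) -> str:
    """Hypothetical rule: lowercase only A-Z, leave everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def lower_then_nfc(s: str) -> str:
    return unicodedata.normalize("NFC", ascii_lower(s))

decomposed = "A\u0308"   # A + COMBINING DIAERESIS
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS

# The two inputs are canonically equivalent...
same_before = (unicodedata.normalize("NFC", decomposed)
               == unicodedata.normalize("NFC", precomposed))

# ...but the decomposed one is lowercased and the precomposed one is not.
from_decomposed = lower_then_nfc(decomposed)
from_precomposed = lower_then_nfc(precomposed)
print(same_before, ascii(from_decomposed), ascii(from_precomposed))
```
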
> >> > > >
> >> > > > Even requiring that case mapping be done after normalisation
> >> > > > can produce strange effects.  Say that some orthography
> >> > required the
> >> > > > use of p with diaeresis.  There is no such precomposed letter. So
> >> > > > even if an IDN with a P with diaeresis is mapped to lowercase
> >> > > > after normalisation, then Ä would not be touched but P+¨ would be
> >> > > > mapped to p+¨.  This kind of inconsistency I find unbearable.
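
The reverse order can be sketched the same way, again assuming Python and a hypothetical ascii_lower helper for the "A-Z only" rule. After normalisation to form C, a letter with a precomposed form escapes the case mapping, while one without a precomposed form (P with diaeresis) does not.

```python
import unicodedata

def ascii_lower(s: str) -> str:
    """Hypothetical rule: lowercase only A-Z, leave everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def nfc_then_lower(s: str) -> str:
    return ascii_lower(unicodedata.normalize("NFC", s))

a_diaeresis = "A\u0308"  # composes to U+00C4 under NF C
p_diaeresis = "P\u0308"  # no precomposed form exists, stays as two characters

a_result = nfc_then_lower(a_diaeresis)
p_result = nfc_then_lower(p_diaeresis)
print(ascii(a_result), ascii(p_result))
```

Here a_result is still the uppercase precomposed letter, whereas p_result has a lowercase base letter.
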
> >> > > >
> >> > > > Unfortunately it seems we have to carry on with case
> >> > insensitivity,
> >> > > > for backwards compatibility reasons.  This means that case
> >> > > > insensitivity has to be generalised to more than a-z, in some way
> >> > > > that is an international compromise (since case mapping is in some
> >> > > > parts dependent on orthography).  The only other option is to make
> >> > > > IDNs case sensitive also for a-z.  An option I find preferable,
> >> > > > since it greatly simplifies things, but I guess it's not
> >> > acceptable
> >> > > > for other reasons.
> >> > > >
> >> > > > A problem with current UTR 21 is the following:
> >> > > >
> >> > > > Say that we use normalisation form KC (which is
> >> > reasonable, given that
> >> > > > case insensitivity is desired, it would be strange to maintain all
> >> > > > of the compatibility distinctions).
> >> > > >
> >> > > > If one first does normalisation to KC, and then "case fold", the
> >> > > > result might not be in any normal form (D, C, KD, KC) at
> >> > all. E.g. ß+´
> >> > > > (which is in NF KC), case folds to ss+´ which is not in
> >> > NF KC since
> >> > > > there is a precomposed s with acute.
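
The sharp s case can be checked with a short sketch. It assumes Python, and uses str.casefold() as a stand-in for the UTR 21 case folding discussed here (that substitution is an assumption). The is_nfkc helper is made up for this example.

```python
import unicodedata

def is_nfkc(s: str) -> bool:
    """True if s is unchanged by normalisation to form KC."""
    return unicodedata.normalize("NFKC", s) == s

source = "\u00df\u0301"     # SHARP S + COMBINING ACUTE ACCENT
source_is_normal = is_nfkc(source)

folded = source.casefold()  # sharp s folds to "ss"
folded_is_normal = is_nfkc(folded)

# Renormalising composes the second s with the acute accent.
renormalised = unicodedata.normalize("NFKC", folded)
print(source_is_normal, folded_is_normal, ascii(renormalised))
```
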
> >> > > >
> >> > > > If one instead first does "case folding", and then
> >> > normalise to form
> >> > > > KC, the result may contain uppercase letters that have a "case
> >> > > > folding" to a lowercase letter.  E.g. BLACK-LETTER CAPITAL H has
> >> > > > no mapping to lowercase. Then normalise to NF KC, and you will be
> >> > > > left with a capital H (whereas other H-es will be mapped to h).
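
The black-letter H case, sketched under the same assumptions (Python, with str.casefold() standing in for the case folding table):

```python
import unicodedata

black_letter_h = "\u210c"  # BLACK-LETTER CAPITAL H

# Case folding first: this character has no case mapping, so nothing happens.
folded = black_letter_h.casefold()
unchanged_by_folding = folded == black_letter_h

# Normalising to NF KC afterwards maps it to a plain capital H.
fold_then_nfkc = unicodedata.normalize("NFKC", folded)

# An ordinary H going through the same two steps ends up lowercase.
plain_h = unicodedata.normalize("NFKC", "H".casefold())
print(unchanged_by_folding, fold_then_nfkc, plain_h)
```
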
> >> > > >
> >> > > > I think that first doing case folding together with compatibility
> >> > > > decomposition, and then do canonical composition (see UTR
> >> > 15), would
> >> > > > produce a reasonable and stable result that both ignores
> >> > compatibility
> >> > > > distinctions as well as case distinctions.  But UTR 21
> >> > does not (yet)
> >> > > > say that that's how to do it.
> >> > > >
> >> > > >                 Kind regards
> >> > > >                 /kent k
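
One possible reading of the ordering proposed above (compatibility decomposition together with case folding, then canonical composition) is sketched below. The function name fold_identifier is hypothetical, str.casefold() is an assumed stand-in for the case folding data, and this is not an algorithm taken from UTR 21. The stability check covers only the listed samples and is not a proof.

```python
import unicodedata

def fold_identifier(label: str) -> str:
    """Hypothetical helper: decompose, case fold, then recompose."""
    decomposed = unicodedata.normalize("NFKD", label)
    folded = decomposed.casefold()
    # NF KC decomposes once more (in case folding introduced anything
    # decomposable) and then applies canonical composition.
    return unicodedata.normalize("NFKC", folded)

samples = [
    "\u00df\u0301",  # sharp s + combining acute
    "\u210c",        # BLACK-LETTER CAPITAL H
    "H",
    "\u00c4",        # precomposed A with diaeresis
    "A\u0308",       # decomposed A with diaeresis
    "P\u0308",       # P with diaeresis (no precomposed form)
    "\ufb01",        # fi ligature
]
results = {s: fold_identifier(s) for s in samples}

# Applying the function a second time should change nothing.
stable = all(fold_identifier(r) == r for r in results.values())
print({ascii(k): ascii(v) for k, v in results.items()}, stable)
```

On these samples the problem cases from earlier in the message come out consistently: both H variants fold to h, and both spellings of A with diaeresis fold to the same string.
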
> >> >
> >> >
> >
> >