
Re: Matching and comparison



At 09:47 00/01/20 -0800, Paul Hoffman / IMC wrote:
> At 05:47 PM 1/20/00 +0900, Martin J. Duerst wrote:
> > > Unless we can show a need for case-insensitivity *in the
> > > internationalized characters*, we shouldn't force it.
> >
> >The largest need, already discussed, is clearly that a lot of people
> >don't want to have to register ibm/ibM/iBm/iBM/Ibm/IbM/IBm/IBM to
> >make sure nobody else registers. And three-letter companies still
> >have an easy job.
> 
> That will always be a problem, regardless of what we do with case 
> sensitivity. Using the same logic, the D$B—S(Jst company would not only have to 
> register D$B—S(Jst.com, it would have to register D$B•S(Jst.com, D$B“S(Jst.com, 
> D$B‘S(Jst.com, D$B•S(Jst.com, and D$B‘S(Jst.com, not to mention about a dozen more that 
> my Eudora MUA didn't want to type for me.

Paul - My Japanese version of Eudora doesn't let me read your
variants, but I think I can guess them. I wouldn't know why
I would want to register all these. Maybe I would want to
register Durst.com (because that's what somebody might type
if they don't have an appropriate keyboard, although I would
have to think more than twice, because Durst in German means
'thirst', and my name is not at all related to that) and Duerst.com
because that's how people writing in German might try to type it if
they don't have an appropriate keyboard. I wouldn't know why
I would have to register the rest; if I thought I had to do
it, I would as well need to register Dorst, Durzt, Turzd, and
so on.

> And this is just the European 
> scripts; I think that Indic and Arabic scripts would have very similar 
> problems.

Similar, yes, but not as big as you think.


> We shouldn't pretend to fix the "too many similar names" problem by only 
> talking about capitalization.

Definitely not. But we should not throw all 'similar names' problems
in the same pot. Some of them are very productive (in particular
casing), some are much less productive. Some are highly regular
(e.g. casing, although it's not completely regular), some need
much more human judgement (e.g. traditional/simplified Chinese).
Some have the potential for spoofing on type-in (e.g. the
Unicode canonical equivalences), others have less potential for
spoofing.
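
To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module. The function names comparison_key and
case_variants are hypothetical, chosen only for illustration; nothing here
is a proposed matching rule. It shows how productive casing is (eight
spellings of a three-letter name), that case folding is not completely
regular, and that canonically equivalent strings compare unequal until
they are normalized.

```python
import itertools
import unicodedata


def comparison_key(name):
    """Hypothetical matching key: NFC-normalize, case-fold, NFC again."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


def case_variants(word):
    """Return every upper/lower-case spelling of word as a set."""
    choices = [{ch.lower(), ch.upper()} for ch in word]
    return {"".join(combo) for combo in itertools.product(*choices)}


# Casing is highly productive: 2**3 spellings of a three-letter name.
variants = case_variants("ibm")
print(len(variants))                           # 8
print({comparison_key(v) for v in variants})   # {'ibm'}

# ...but not completely regular: folding can change the length.
print(comparison_key("Stra\u00dfe"))           # strasse

# Canonical equivalence: precomposed vs. decomposed u-umlaut.
precomposed = "D\u00fcrst"
decomposed = "Du\u0308rst"
print(precomposed == decomposed)                                   # False
print(comparison_key(precomposed) == comparison_key(decomposed))   # True
```

Note that Durst and Duerst are not caught by any of this; those are
exactly the variants that need human judgement.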


> >Telling people that in a URI, domain names are case-insensitive,
> >but file names are/may be case-sensitive is already hard. Telling
> >them that a name is case-insensitive if it is ASCII only, and
> >case-sensitive otherwise would be a really hard job.
> 
> Indeed. Telling them about anything having to do with internationalization 
> will be.

Not if done the right way. We are not trying to teach the Americans
Chinese or Japanese. Chinese will understand Chinese, and so on.

> >I think we can postpone the casing issue if we agree that there
> >are no requirements in that area, i.e. if we think that we
> >can live with any solution (case-folding or not). But that's
> >not what you are saying, and that's not what I'm saying,
> >so I suggest that we put the points we came up with
> >(would like to be able to have the names in the appropriate
> >casing, would prefer not to have a strange break between
> >names containing only ASCII and others, would like to avoid
> >exponentially growing registrations to cover equivalents).
> 
> I think it would be good for us to list some of the known trickiness of 
> similar-looking script issues. So far, we have casing and Latin vowels with 
> diacritics. Looking through my I believe we also have to list Latin 
> consonants with diacritics, bidirectional names, similar-looking 
> punctuation marks, Arabic joiners,

Do you mean ZWJ/ZWNJ? There are some cases for spoofing, indeed,
and it's impossible to just rule them out because they may be
needed for Farsi (Persian).
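
A small sketch of why the joiners cannot simply be stripped, again
assuming Python's unicodedata module. The helper joiner_positions and the
label "example" are made up for illustration. A ZWNJ inserted into an
ASCII label yields a different string that may look identical, while the
Persian word used here is normally written with a ZWNJ.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def joiner_positions(label):
    """Return (index, character name) for every ZWJ/ZWNJ in label."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(label)
        if ch in (ZWNJ, ZWJ)
    ]


# Spoofing case: a different string that may render identically.
plain = "example"
spoof = "exam" + ZWNJ + "ple"
print(plain == spoof)            # False
print(joiner_positions(spoof))   # [(4, 'ZERO WIDTH NON-JOINER')]

# Legitimate case: Persian "mi-khaham", written with a ZWNJ after "mi".
persian = "\u0645\u06cc" + ZWNJ + "\u062e\u0648\u0627\u0647\u0645"
print(joiner_positions(persian))  # [(2, 'ZERO WIDTH NON-JOINER')]
```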

> Devanagari dependent and independent vowels

Why? It's perfectly clear where to use one or the other.

> (and conjunct formations, and half-forms...),

Some special ways of affecting conjunct formation from the
character codes have to be looked at. But general conjunct
formation is just a display issue.

> and Tamil vowel splitting.

Dealt with by Unicode TR #15. There is one problem, namely
that Tamil letter LLA (U+0BB3) and Tamil AU length mark (U+0BD7)
look the same, but this can be solved by disallowing U+0BD7,
because in the cases where this can really appear, it will
be removed by applying canonical normalization anyway.
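
The following sketch checks this with Python's unicodedata module, which
reflects whatever Unicode version ships with the interpreter. The helper
label_allowed is a hypothetical illustration of the "disallow U+0BD7"
idea, not a specified rule. Tamil vowel sign AU (U+0BCC) decomposes
canonically into U+0BC6 followed by U+0BD7, so after NFC the length mark
disappears wherever it follows U+0BC6.

```python
import unicodedata

LLA = "\u0bb3"             # TAMIL LETTER LLA
AU_LENGTH_MARK = "\u0bd7"  # TAMIL AU LENGTH MARK
VOWEL_SIGN_E = "\u0bc6"    # TAMIL VOWEL SIGN E
VOWEL_SIGN_AU = "\u0bcc"   # TAMIL VOWEL SIGN AU
KA = "\u0b95"              # TAMIL LETTER KA, used as a sample consonant

print(unicodedata.name(LLA))             # TAMIL LETTER LLA
print(unicodedata.name(AU_LENGTH_MARK))  # TAMIL AU LENGTH MARK

# The two-part vowel sign is canonically equivalent to E + length mark.
print(unicodedata.decomposition(VOWEL_SIGN_AU))  # 0BC6 0BD7

# After NFC the length mark is absorbed into the composed vowel sign.
split = KA + VOWEL_SIGN_E + AU_LENGTH_MARK
composed = unicodedata.normalize("NFC", split)
print(composed == KA + VOWEL_SIGN_AU)    # True
print(AU_LENGTH_MARK in composed)        # False


def label_allowed(label):
    """Hypothetical check: reject a label that still contains U+0BD7 after NFC."""
    return AU_LENGTH_MARK not in unicodedata.normalize("NFC", label)


print(label_allowed(split))                # True
print(label_allowed(KA + AU_LENGTH_MARK))  # False
```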


> I probably missed about a dozen other tricky issues; Martin is 
> much more versed in these things than I am.


Well, I'll keep trying.          Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org