
Re: Matching and comparison



At 07:42 00/01/21 -0800, Paul Hoffman / IMC wrote:
> At 12:33 PM 1/21/00 +0900, Martin J. Duerst wrote:
> >Paul - My Japanese version of Eudora doesn't let me read your
> >variants, but I think I can guess them. I wouldn't know why
> >I would want to register all these.
> 
> For the same reason that IBM would want to register "ibM.com" and 
> "iBm.com" and so on: because they are similar to their "main" domain name. 
> You were the one who brought it up, yes?

Well, Durst and the variants you listed are similar to you,
because you are not familiar with them. They are not
similar to me, because I'm familiar with them.

I guess names differing by case are similar to most people.


> >  Maybe I would want to
> >register Durst.com (because that's what somebody might type
> >if they don't have an appropriate keyboard,
> 
> Keyboard? Why are we concerned about keyboards? Today, essentially no one 
> has a keyboard that doesn't do lowercase. Further, what about people who 
> enter domain names with non-keyboard entry systems such as pens or voice?

Sorry, I wrote keyboard, I should have written 'input device'.
Same as writing 'mailer' instead of 'mail user agent', maybe.

What's important here is not the physical keyboard (most systems
nowadays have software remapping), but the range of characters
that can be input by a user without having to go through a manual.


> > > We shouldn't pretend to fix the "too many similar names" problem by only
> > > talking about capitalization.
> >
> >Definitely not. But we should not throw all 'similar names' problems
> >in the same pot. Some of them are very productive (in particular
> >casing), some are much less productive.
> 
> I don't see what you mean by "productive" here. "Solvable"?

Productive in the sense it is used in linguistics:

How many variants can you produce with that phenomenon?

For casing, it's 2**n, and n can easily be 10 or more.

For ignoring accents or not (important probably for French),
it's 2**n, but n is usually very small.

So the number of variants differs a lot from case to case. Depending
on that number, various solutions can be appropriate.
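
The counting argument above can be sketched in a few lines of Python.
This is only an illustration: the function names and the sample labels
("ibm", the French word "élève") are assumptions, and accent folding
is approximated by taking the base letter of the canonical decomposition.

```python
import unicodedata
from itertools import product


def case_variants(label):
    """Return every upper/lower-case spelling of a label."""
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in label
    ]
    return ["".join(combo) for combo in product(*choices)]


def accent_variants(label):
    """Return every spelling with each accented letter kept or stripped."""
    choices = []
    for ch in label:
        # Assumption: "ignoring the accent" means keeping only the base
        # letter of the canonical decomposition (NFD).
        base = unicodedata.normalize("NFD", ch)[0]
        choices.append((ch, base) if base != ch else (ch,))
    return ["".join(combo) for combo in product(*choices)]


print(len(case_variants("ibm")))                  # 2**3 = 8
print(2 ** 10)                                    # 1024 for a 10-letter name
print(len(accent_variants("\u00e9l\u00e8ve")))    # 2**2 = 4
```

With casing, n is the number of letters in the name; with accents, n is
only the number of accented letters, which is why it stays small.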


> >  Some are highly regular
> >(e.g. casing, although it's not completely regular), some need
> >much more human judgement (e.g. traditional/simplified Chinese).
> >Some have the potential for spoofing on type-in (e.g. the
> >Unicode canonical equivalences), others have less potential for
> >spoofing.
> 
> In my mind, Latin capitalization has very low potential for spoofing: 
> almost none of the capital letters look like their lower-case equivalences. 

Whether the letters look the same or not is one factor affecting spoofing.
Whether you can make people believe they are the same or not is another.

Let's take two cases:

1) IBM.com getting spoofed by IB<Mu>.com  (Greek mu)
2) IBM.com getting spoofed by IBm.com.

1) works if I get that in a mail and just click on it.
2) works in the above case, and it works if somebody types
   it in from a billboard.

2) wouldn't work in either case if people knew that case is a
   significant difference. But currently those who know, know that case
   is *in*significant. And those who don't have a clue won't be
   careful enough.
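
A small sketch of the difference between the two cases, assuming the
spoof in 1) uses GREEK CAPITAL LETTER MU (U+039C) and that name matching
folds only the ASCII letters A-Z, as today's DNS does. The helper name
`dns_ascii_equal` is hypothetical.

```python
import unicodedata


def dns_ascii_equal(a, b):
    """Compare two names, folding only ASCII A-Z to a-z."""
    def fold(s):
        return "".join(
            chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s
        )
    return fold(a) == fold(b)


genuine = "IBM.com"
case_variant = "IBm.com"
mu_variant = "IB\u039c.com"   # third letter is Greek, not Latin

# Under case-insensitive matching, 2) is simply the same name.
print(dns_ascii_equal(genuine, case_variant))   # True

# 1) is a different name that merely looks the same.
print(dns_ascii_equal(genuine, mu_variant))     # False
print(unicodedata.name(mu_variant[2]))          # GREEK CAPITAL LETTER MU
```

So 2) only becomes a spoofing problem once case is made significant,
whereas 1) is a distinct name under any case rule.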


> Visually spoofing using similar-looking diacritics seems like a much bigger 
> issue for Latin characters, and I believe that Arabic and Indic characters 
> have similar problems.

It can be an issue. But I think you should be aware of the fact
that users who use these diacritics are sensitized to differences
that others may easily ignore.


> > > >Telling people that in an URI, domain names are case-insensitive,
> > > >but file names are/may be case-sensitive is already hard. Telling
> > > >them that a name is case-insensitive if it is ASCII only, and
> > > >case-sensitive otherwise would be a really hard job.
> > >
> > > Indeed. Telling them about anything having to do with internationalization
> > > will be.
> >
> >Not if done the right way. We are not trying to teach the Americans
> >Chinese or Japanese. Chinese will understand Chinese, and so on.
> 
> But we will have to tell everyone (or at least developers) enough to help 
> enter internationalized characters that end users don't understand. That 
> is, if I see a URL with hiragana in it, I should at least have a chance of 
> entering it correctly even if I don't understand Japanese.

I don't think an implementer of a DNS front-end should have to
care about this. This is an operating system/window system issue.
On a Mac, you can easily install all kinds of input methods.
MS Windows comes with a lot of keyboards already available,
though at the moment installing a Japanese input method
on a US MS Windows system is difficult or impossible as
far as I know. But that will change rather soon.

Next time we meet, I'll give you a chance to try and enter
some hiragana, or if you want kanji, on my system. I think
you will face various problems:

- Know which script the thing is from, and find the right
  keyboard or input method.

- Know how to activate the relevant input method and get
  a panel where you can select the Hiragana by just picking
  it. That's easy, but only if somebody explains it.

- Trying to figure out whether the thing before you on paper
  and the thing on the screen are the same. If you don't know
  the script, that takes a bit of time.

- Trying to improve on a hunt-and-peck speed of one character
  per minute or so. That will take a lot of time.

- Not for Hiragana, but e.g. for Indic scripts, trying to
  figure out what you see on paper from various components.

The above is not meant to say that this is what people should do.
Quite the contrary. I don't think it's part of any IDNS
project to teach Americans Hiragana, or to teach Japanese
Arabic. What we want is to make sure that Japanese (and
people familiar with Japanese) can use Japanese domain names,
and so on.


> >Some special ways of affecting conjunct formation from the
> >character codes have to be looked at. But general conjunct
> >formation is just a display issue.
> 
> Exactly right. And it needs to be dealt with.

What do you want to specify in an IDNS protocol?
Wouldn't saying 'behave as specified or implied by
ISO 10646/Unicode' be enough?


> > > and Tamil vowel splitting.
> >
> >Dealt with by Unicode TR #15.
> 
> Only if the protocol only uses Unicode. :-)

Which I don't think we should specify in the req doc,
but which I'm sure we will come back to later.


> >  There is one problem, namely
> >that Tamil letter LLA (U+0BB3) and Tamil AU length mark (U+0BD7)
> >look the same, but this can be solved by disallowing U+0BD7,
> >because in the cases where this can really appear, it will
> >be removed by applying canonical normalization anyway.
> 
> Yes, exactly! Almost all of these problems will be dealt with fully by 
> applying canonical normalization of each character set allowed in the 
> domain name part. But that is not yet considered a requirement by this group.

I wouldn't make 'apply canonical normalization' a requirement.
If you want the relevant requirements, I suggest we take them
from http://www.w3.org/TR/WD-charreq, sections 2 and 3
(or just point to them). Those were, indirectly, the requirements
for canonical normalization.
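
For the Tamil example quoted above, the effect of canonical
normalization can be checked with Python's standard unicodedata module.
This is a minimal sketch, assuming NFC is the normalization form used;
the helper `allowed_after_nfc` is a hypothetical name, not something
proposed in any draft.

```python
import unicodedata

AU_LENGTH_MARK = "\u0bd7"   # TAMIL AU LENGTH MARK
VOWEL_SIGN_E = "\u0bc6"     # TAMIL VOWEL SIGN E
VOWEL_SIGN_AU = "\u0bcc"    # TAMIL VOWEL SIGN AU
LETTER_KA = "\u0b95"        # TAMIL LETTER KA
LETTER_LLA = "\u0bb3"       # TAMIL LETTER LLA


def allowed_after_nfc(label):
    """Reject a label if U+0BD7 survives canonical composition."""
    return AU_LENGTH_MARK not in unicodedata.normalize("NFC", label)


# Where the length mark legitimately appears, NFC composes it away.
legit = LETTER_KA + VOWEL_SIGN_E + AU_LENGTH_MARK
print(unicodedata.normalize("NFC", legit) == LETTER_KA + VOWEL_SIGN_AU)
print(allowed_after_nfc(legit))

# A stray length mark, which could be confused with LLA, is left over
# after normalization and can therefore be disallowed.
stray = LETTER_LLA + AU_LENGTH_MARK
print(allowed_after_nfc(stray))
```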


Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org