Re: Character equivalence mapping (was: Re: [idn] SLC minutes)



You do not seem to have read the material I suggested. There was quite a
discussion on this list on why it is hopeless AND counterproductive to try
to identify all the characters whose glyphs could be visually
indistinguishable to a user, and to create equivalence classes based upon
that identification.

Hopeless: For this to be used in IDN, one would have to collect a massive
amount of data; and all at once -- the mappings can't simply change over
time. There are a huge number of questionable cases, where for accuracy one
would have to survey a substantial range of fonts in the common sizes used
on different platforms to determine visual distinguishability. Examples are
the quotation dash versus the CJK character for "one", the katakana KA
versus the CJK character for "power", etc. And in many of those cases, one
would still have
to make a judgment call, since even if the pixels are always somewhat
different, users may perceive the glyphs as being the same.

Counterproductive: As I noted earlier, the N / V problem arises hundreds of
times. A lowercase Greek nu in common fonts (e.g. Arial Unicode MS) is
visually indistinguishable from a Latin v, and an uppercase Greek NU from a
Latin N. Combine that with case folding, which already ties NU to nu, N to n
and V to v, and N, V, n, v, NU, nu all end up in the same equivalence class,
so a company could not register NIA.com if VIA.com were already registered.
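
To make the chain concrete, here is a minimal sketch in Python. It is
illustrative only: the two "visual" pairs are the ones mentioned in this
thread, not an authoritative confusables table, and the helper names
(build_classes, collide) are made up for the example. It merges pairs with a
small union-find and then compares labels class by class.

```python
# Hypothetical illustration of how case folding plus visual equivalence
# chains N and V into one class. The pair lists are assumptions taken
# from this thread, not a real confusables table.

def build_classes(chars, pairs):
    """Union-find: merge every pair, return a char -> representative map."""
    parent = {c: c for c in chars}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in pairs:
        parent[find(a)] = find(b)
    return {c: find(c) for c in chars}


GREEK_NU_UPPER = "\u039d"  # GREEK CAPITAL LETTER NU
GREEK_NU_LOWER = "\u03bd"  # GREEK SMALL LETTER NU

chars = ["N", "n", "V", "v", "I", "i", "A", "a",
         GREEK_NU_UPPER, GREEK_NU_LOWER]

# Case folding: each uppercase letter is equivalent to its lowercase form.
case_pairs = [(c, c.lower()) for c in chars if c != c.lower()]

# Visual pairs named in this thread (assumed, font-dependent).
visual_pairs = [("N", GREEK_NU_UPPER), ("v", GREEK_NU_LOWER)]


def collide(label_a, label_b, classes):
    """True if two labels match character by character under the classes."""
    if len(label_a) != len(label_b):
        return False
    return all(classes.get(x, x) == classes.get(y, y)
               for x, y in zip(label_a, label_b))


case_only = build_classes(chars, case_pairs)
case_and_visual = build_classes(chars, case_pairs + visual_pairs)

print(collide("NIA", "VIA", case_only))        # case folding alone
print(collide("NIA", "VIA", case_and_visual))  # case folding + visual pairs
```

With case folding alone the two labels stay distinct; once the two visual
pairs are added, the classes merge transitively (N ~ NU ~ nu ~ v ~ V) and the
labels collide.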


Although it is certainly "in character" for this list to endlessly (and
pointlessly) repeat earlier discussions, I'd suggest you look at the email
archives on this topic, since this subject has arisen many times. After you
have read them, if you have any new information it might be worth continuing
the discussion.

Mark

P.S. Unfortunately, there appears to be no web interface for viewing the
archive, so it is a bit clumsy to search for particular topics. I think this
topic was discussed at length sometime in 2000, with repeated comments
through 2001.

—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
Sent: Thursday, January 03, 2002 10:39


> From: "Edmon" <edmon@neteka.com>
> To: "Mark Davis" <mark.davis@macchiato.com>
> Subject: Re:
> Date: Thu, 3 Jan 2002 13:48:13 -0500
>
> Hi Mark,
>
> But I am not suggesting any transliteration or transformation.  I am only
> suggesting that some characters which might be "perceived" or "confused" as
> equivalent/identical be collected together and regarded as "equivalent"
> during the DNS name matching process within the DNS server.  It should not
> hinder the ability to have different characters or representations send
> different data over the wire and to maintain the original information.
>
> Take for example the current DNS: neither uppercase nor lowercase is the
> "primary" case; they are considered equal.  This means that you can have a
> domain that is DoMaIn.CoM, as well as a domain that is domain.COM, but
> during the matching process, the DNS will declare that they are the same.
> It is not that one is mapped to (or declared to be the "same as")
> domain.com or DOMAIN.COM; they are simply considered equivalent.  (At least
> that is what it's supposed to do, I think.)
>
> So, what is the complexity, apart from coming up with the exact list?  The
> "exact list", I believe, should be a living document and be revised over
> time, but we need to start somewhere.  It should be relatively
> uncomplicated to come up with a list of Latin-Greek-Cyrillic equivalent
> characters.  In fact, I don't mind working on a table that contains the
> ones that could be "perceived" as equivalent to start with, and then
> letting other people scrutinize each one's equivalence in form.  Or do you
> know if there is such a list already?
>
> Edmon
>
>
>
> ----- Original Message -----
> From: "Mark Davis" <mark.davis@macchiato.com>
> To: <edmon@neteka.com>
> Cc: "Kenneth Whistler" <kenw@sybase.com>
> Sent: Thursday, January 03, 2002 12:00 PM
> Subject: Re:
>
>
> > It is not that simple. I suggest you review the messages on this topic in
> > the archive for this list, and also read both of the following:
> >
> > http://www.unicode.org/unicode/standard/where/
> > http://www.unicode.org/unicode/reports/tr17/
> >
> > Mark
> > —————
> >
> > Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > Sent: Thursday, January 03, 2002 08:03
> >
> >
> > > From: "Edmon" <edmon@neteka.com>
> > > To: "Mark Davis" <mark.davis@macchiato.com>
> > > Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)
> > > Date: Thu, 3 Jan 2002 11:12:53 -0500
> > >
> > > Hi Mark
> > >
> > > From: "Mark Davis" <mark.davis@macchiato.com>
> > > > 2. Moreover, stop and think about the implications; using both case
> > > > folding and visual confusability would have some very unpleasant
> > > > consequences. For example, it would force the ASCII letters N and V
> > > > to be in the same equivalence class:
> > >
> > > I will argue that <nu> and v have some subtle difference, while <ALPHA>
> > > and A, and <NU> and N, are truly identical.  Perhaps "character
> > > equivalence" was not a good choice of words; let's try "character
> > > identicality".  I think there are some characters that we can truly say
> > > are "identical" under scrutiny.  Do you think so?
> > >
> > > Edmon