Re: Character equivalence mapping (was: Re: [idn] SLC minutes)

To: "Mark Davis" <mark.davis@macchiato.com>
Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)
From: "Edmon" <edmon@neteka.com>
Date: Thu, 3 Jan 2002 17:51:36 -0500
Cc: <idn@ops.ietf.org>
References: <003201c19474$16030b60$0601a8c0@neteka.com> <55201632.1010058315@P2> <004701c1947d$6c394540$0601a8c0@neteka.com> <003401c19489$83f5d0c0$08d8ea0c@c1340594a>

First of all, allow me to say that I am not proposing to have Character
Equivalence preparations in the "protocol".
In fact, I intend this topic to be more "operational", and I have mentioned
that this might not be the best list on which to discuss it.  But I did want
to clarify what I said during the SLC meeting, as noted in the minutes
(which is what started this discussion), and the parties interested in this
topic are likely to be on this list too.
I apologize for lingering on the discussion, but allow me to say the
following:

I do believe very much that, since this issue sparks so much contention, we
should archive the problem and the discussions better, perhaps in an
informational document pointing out that there is a set of codepoints that
may be perceived as equivalent.  This would include issues for
Latin-Greek-Cyrillic (LGC) characters as well as at least JPCHAR, HangulChar
and Traditional/Simplified (T/S) Chinese.  Because we do not have much
information on Arabic and the other General Scripts, I said that this
document should be a living document.  If it becomes apparent that some
Character Equivalence preparation is beneficial for these scripts, then we
can amend the document.

The document will contain a discussion of the issues surrounding LGC and
CJK characters in Unicode, respectively, as well as a set of tables listing
the characters that may be perceived as equivalent.
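
As a rough illustration of what such a table could look like in practice,
here is a minimal sketch that stores the table as pairs of characters and
derives equivalence classes from them.  The pairs, the PAIRS name and the
build_classes function are all hypothetical and chosen only for
illustration; they are not a vetted list.  The sketch also shows the concern
raised in the quoted message below: once case folding and a nu/v pair are
both in the table, N and V land in the same class.

```python
# Hypothetical table of character pairs that might be perceived as
# equivalent. Illustrative entries only, not a vetted list.
PAIRS = [
    ("A", "\u0391"),  # Latin A / Greek capital ALPHA
    ("N", "\u039D"),  # Latin N / Greek capital NU
    ("v", "\u03BD"),  # Latin v / Greek small nu (font-dependent)
    # Case pairs, as in existing case-insensitive matching
    ("A", "a"), ("N", "n"), ("V", "v"),
    ("\u0391", "\u03B1"), ("\u039D", "\u03BD"),
]


def build_classes(pairs):
    """Group characters into equivalence classes using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for ch in list(parent):
        classes.setdefault(find(ch), set()).add(ch)
    return list(classes.values())


classes = build_classes(PAIRS)
n_class = next(c for c in classes if "N" in c)
print("V" in n_class)  # equivalence is transitive, so pairs chain together
```

Because equivalence is transitive, each pair added to the table can merge
whole classes, so a table of this kind would need to be reviewed as a whole
rather than entry by entry.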

With this document, a zone operator (whether a local zone manager or a TLD
manager) can choose one of three approaches to deal with the issue (and can
choose whether to implement the mappings, and which mappings to use, for
their own zone):
1. do nothing
2. multiple registrations
3. consolidate characters before name matching within the name server
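
A minimal sketch of approach 3, assuming a per-zone table that maps each
character to a class representative.  ZONE_MAP, match_key and names_match
are hypothetical names used only for illustration, and the table entries are
examples rather than a proposed list.  The names themselves are left
untouched; only the comparison key is consolidated.

```python
# Hypothetical per-zone table: character -> class representative.
# Illustrative entries only; each zone would choose its own mappings.
ZONE_MAP = {
    "\u0391": "a",  # Greek capital ALPHA
    "\u0410": "a",  # Cyrillic capital A
    "\u039D": "n",  # Greek capital NU
}


def match_key(name, zone_map):
    """Return the key used for comparison; the original name is unchanged."""
    consolidated = "".join(zone_map.get(ch, ch) for ch in name)
    # Simplification: str.lower() also lowercases non-ASCII letters, which
    # goes beyond the ASCII-only case insensitivity of the existing DNS.
    return consolidated.lower()


def names_match(a, b, zone_map):
    """Compare two names after consolidating equivalent characters."""
    return match_key(a, zone_map) == match_key(b, zone_map)


# With an empty table this reduces to case-insensitive matching.
print(names_match("DoMaIn.CoM", "domain.COM", {}))
# Greek NU + "IA" is treated as matching Latin "NIA" under this table.
print(names_match("\u039DIA.com", "nia.COM", ZONE_MAP))
```

A zone that chooses approach 1 would simply use an empty table, and a zone
that chooses approach 2 could use the same table at registration time to
enumerate the variants to register.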

It would alarm me if anyone thought this useless, because these are real
issues that an implementor should be aware of.

Edmon


----- Original Message -----
From: "Mark Davis" <mark.davis@macchiato.com>
To: "Edmon" <edmon@neteka.com>
Cc: <idn@ops.ietf.org>
Sent: Thursday, January 03, 2002 2:04 PM
Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)


> You do not seem to have read the material I suggested. There was quite a
> discussion on this list on why it is hopeless AND counterproductive to try
> to identify all the characters whose glyphs could be visually
> indistinguishable to a user, and create equivalence classes based upon
> that identification.
>
> Hopeless: For this to be used in IDN, one would have to collect a massive
> amount of data; and all at once -- the mappings can't simply change over
> time. There are a huge number of questionable cases, where for accuracy one
> would have to survey a substantial range of fonts in the common sizes used
> on different platforms to determine visual distinguishability. Examples are
> quotation dash and the CJK character for "one", the katakana KA and the CJK
> character for power, etc. And in many of those cases, one would still have
> to make a judgment call, since even if the pixels are always somewhat
> different, users may perceive the glyphs as being the same.
>
> Counterproductive: As I noted earlier, the N / V problem arises hundreds of
> times -- a lowercase Greek nu in common fonts (e.g. Arial Unicode MS) is
> visually indistinguishable from a Latin v. That causes N, V, n, v, NU, nu to
> be part of the same equivalence class, so a company could not register
> NIA.com if VIA.com were already registered.
>
>
> Although it is certainly "in character" for this list to endlessly (and
> pointlessly) repeat earlier discussions, I'd suggest you look at the email
> archives on this topic, since this subject has arisen many times. After you
> have read them, if you have any new information it might be worth continuing
> the discussion.
>
> Mark
>
> P.S. Unfortunately, there appears to be no web interface for viewing the
> archive, so it is a bit clumsy to search for particular topics. I think this
> topic was discussed at length sometime in 2000, with repeated comments
> through 2001.
>
> —————
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> Sent: Thursday, January 03, 2002 10:39
>
>
> > Message-ID: <005101c19487$36192d40$0601a8c0@neteka.com>
> > From: "Edmon" <edmon@neteka.com>
> > To: "Mark Davis" <mark.davis@macchiato.com>
> > References: <20020103160357.XKGT27210.rwcrgwc53.attbi.com@c001.snv.cp.net>
> >  <008b01c1947d$03a4a830$08d8ea0c@c1340594a>
> > Subject: Re:
> > Date: Thu, 3 Jan 2002 13:48:13 -0500
> >
> > Hi Mark,
> >
> > But I am not suggesting any transliteration or transformation.  I am only
> > suggesting that some characters which might be "perceived" or "confused"
> > as equivalent/identical be collected together and regarded as
> > "equivalent" during the DNS name matching process within the DNS server.
> > It should not hinder the ability to have different characters or
> > representations send different data over the wire and to maintain the
> > original information.
> >
> > Take for example the current DNS: neither uppercase nor lowercase is the
> > "primary" case; they are considered equal.  This means that you can have
> > a domain that is DoMaIn.CoM, as well as a domain that is domain.COM, but
> > during the matching process, the DNS will declare that they are the same.
> > It is not that either is mapped to (or declared to be the "same as")
> > domain.com or DOMAIN.COM; they are simply considered equivalent.  (At
> > least that is what it is supposed to do, I think.)
> >
> > So, what is the complexity except for coming up with the exact list?  The
> > "exact list" I believe should be a living document and be revised over
> > time, but we need to start somewhere.  It should be relatively
> > uncomplicated to come up with a list for Latin-Greek-Cyrillic equivalent
> > characters.  In fact, I don't mind working on a table that contains ones
> > that could be "perceived" as equivalent to start with, and allowing other
> > people to scrutinize each one's equivalence in form.  Or do you know if
> > there is such a list already?
> >
> > Edmon
> >
> >
> >
> > ----- Original Message -----
> > From: "Mark Davis" <mark.davis@macchiato.com>
> > To: <edmon@neteka.com>
> > Cc: "Kenneth Whistler" <kenw@sybase.com>
> > Sent: Thursday, January 03, 2002 12:00 PM
> > Subject: Re:
> >
> >
> > > It is not that simple. I suggest you review the messages on this topic
> > > in the archive for this list, and also read both of the following:
> > >
> > > http://www.unicode.org/unicode/standard/where/
> > > http://www.unicode.org/unicode/reports/tr17/
> > >
> > > Mark
> > > —————
> > >
> > > Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> > > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> > >
> > > http://www.macchiato.com
> > >
> > > ----- Original Message -----
> > > Sent: Thursday, January 03, 2002 08:03
> > >
> > >
> > > > Message-ID: <002f01c19471$822bac00$0601a8c0@neteka.com>
> > > > From: "Edmon" <edmon@neteka.com>
> > > > To: "Mark Davis" <mark.davis@macchiato.com>
> > > > References: <200201030151.RAA03854@birdie.sybase.com>
> > > >  <00aa01c1940f$ac0fee30$08d8ea0c@c1340594a>
> > > > Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)
> > > > Date: Thu, 3 Jan 2002 11:12:53 -0500
> > > >
> > > > Hi Mark
> > > >
> > > > From: "Mark Davis" <mark.davis@macchiato.com>
> > > > > 2. Moreover, stop and think about the implications; using both
> > > > > case folding and visual confusability would have some very
> > > > > unpleasant consequences. For example, it would force the ASCII
> > > > > letters N and V to be in the same equivalence class:
> > > >
> > > > I will argue that <nu> and v have some subtle difference while
> > > > <ALPHA> and A, <NU> and N, are truly identical.  Perhaps character
> > > > equivalence was not a good choice of words; let's try Character
> > > > Identicality.  I think there are some characters that we can truly
> > > > say are "identical" under scrutiny.  Do you think so?
> > > >
> > > > Edmon

