
Re: [idn] Problems in normalisation and matching



Hi Dan,

I remember that the "dot issue" was extensively discussed by the Nameprep
Design Team. It was decided that dots other than U+002E should be included
because there are IMEs which generate these dots in place of the normal dot
(it becomes a hassle to switch in and out of the IME just for the dot). Now,
some may say IMEs are out of scope, but on the other hand, we really don't
need to rehash a topic which has been concluded. Let's move forward.

If we can agree on the above, then the "many problems" you point out are
really just a misunderstanding of the Nameprep/IDNA relationship.

First, Nameprep/Stringprep is designed to handle domain names on a _per
label_ basis. Before an IDN goes through Nameprep, it has already been broken
up into its individual labels, so Nameprep isn't the place to fix this.

The place where IDNs get broken down into labels is IDNA. What IDNA now
specifies is that, to break an IDN down into its labels, you look for this
set of separators (U+002E, U+3002, U+FF0E, U+FF61). (See IDNA Requirement 1)
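The label-splitting step described above can be sketched roughly as follows.
This is only an illustration, not text from the IDNA draft; the function name
split_labels is a hypothetical helper, and only the four separator code
points come from the paragraph above.

```python
import re

# The four label separators listed above (IDNA Requirement 1):
# U+002E full stop, U+3002 ideographic full stop,
# U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
LABEL_SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(name):
    """Split a domain name into labels on any of the four separators.

    Hypothetical helper for illustration; it does no validation and
    does not run Nameprep on the resulting labels.
    """
    return LABEL_SEPARATORS.split(name)


if __name__ == "__main__":
    # An IME may have produced U+3002 and U+FF0E instead of U+002E.
    print(split_labels("www\u3002example\uFF0Ecom"))
```

Each resulting label would then be passed to Nameprep separately, which is
why Nameprep itself never sees a separator.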

Comparison is also done on a per-label basis. Two IDNs are considered
equivalent if and only if all their individual labels are equivalent. The
separators are also irrelevant during comparison. (See IDNA Requirement 4)
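A minimal sketch of per-label comparison, under loud assumptions: the
function prepare_label below is NOT Nameprep. It is a simplified stand-in
(NFKC plus Unicode case folding) used only so the example runs with the
standard library; a real implementation would apply the actual Nameprep
profile to each label. The names prepare_label and names_equivalent are
hypothetical.

```python
import re
import unicodedata

# Separators recognised when splitting (IDNA Requirement 1).
LABEL_SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def prepare_label(label):
    # ASSUMPTION: simplified stand-in for Nameprep, for illustration only.
    return unicodedata.normalize("NFKC", label).casefold()


def names_equivalent(name_a, name_b):
    """Compare two names label by label.

    The separators are discarded by the split, so which of the four
    dots was typed has no effect on the result.
    """
    labels_a = [prepare_label(x) for x in LABEL_SEPARATORS.split(name_a)]
    labels_b = [prepare_label(x) for x in LABEL_SEPARATORS.split(name_b)]
    return labels_a == labels_b


if __name__ == "__main__":
    print(names_equivalent("WWW.Example.com", "www\u3002example\uFF0Ecom"))
    print(names_equivalent("www.example.com", "www.example.org"))
```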

If the individual labels need to be pieced back together into an FQDN, then
IDNA has already clearly specified that U+002E should be used. (See IDNA
Requirement 2)
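Reassembly is the simple direction. As a sketch (join_labels is a
hypothetical name, and the labels are assumed to have been processed
already):

```python
def join_labels(labels):
    """Rebuild a dotted name from labels using only U+002E.

    Hypothetical helper: whatever separators appeared in the input,
    the reassembled name carries the plain full stop.
    """
    return "\u002E".join(labels)


if __name__ == "__main__":
    print(join_labels(["www", "example", "com"]))
```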

-James Seng

----- Original Message -----
From: "Dan Oscarsson" <Dan.Oscarsson@trab.se>
To: <idn@ops.ietf.org>
Sent: Sunday, June 30, 2002 8:49 PM
Subject: [idn] Problems in normalisation and matching


> After having taken a close look at IDNA and stringprep
> I see many problems in the handling of characters in domain names.
>
> - In IDNA it says that in domain names more than one dot (full stop)
> must be recognized as a label separator.
> While this might be a natural thing to do in some cases,
> it is much cleaner if just U+002E is allowed. It simplifies
> parsing a lot and is closer to what programs are used to.
> Also there are many more dots in UCS that could as well
> be used.
>
> In general I can see three basic contexts for domain names:
> 1) free text
> 2) standard form in protocols
> 3) comparing form.
>
> In 1) free form you could write the text any way you like.
>
> In 2) the form must be normalised.
> Here only one "dot" should be allowed to separate labels
> in a domain name, so that one well-defined character
> separates labels.
> Stringprep/nameprep/idna defines one normalisation that
> is not a good fit here. Its NFKC, lower casing and character
> mapping destroy too much of the original name.
> NFKC has several mappings; those related to letters
> are mostly fine but those related to other types
> of characters like symbols or accents are doubtful.
> If symbols and accents should be allowed in domain names,
> they should only use NFC. NFKC is irregular in its handling
> of accents: some are expanded from the compact form and some
> are retained in the compact form.
> I would recommend NFC to be used, probably with some of
> the elementary compatibility mappings for letters
> added (like U+212B to U+00E5 and all fullwidth letters
> to standard width).
>
>
> In 3) I think Stringprep/nameprep/idna goes too far.
> Domain names should be compared case insensitively
> using simple case folding instead of the full + some Turkish
> folding used by IDNA today. For example small letter sharp s
> should not be folded into ss.
> This would simplify domain name matching a lot and
> make it quicker and easier to implement.
> ( though we need a way to do approximate matching
> in programs interacting with the user. In this case
> small letter sharp s could be matched to ss, and
> other complex matching rules be used. But this
> should not be the normal way for DNS ).
> Is it good that IDNA goes too far and makes people think
> this is the way to go? It will make it difficult to
> fix later on.
>
>
>    Dan
>
>