
[idn] Problems in normalisation and matching



After having taken a close look at IDNA and stringprep,
I see many problems in the handling of characters in domain names.

- IDNA says that in domain names more than one dot (full stop)
character must be recognized as a label separator.
While this might be a natural thing to do in some cases,
it is much cleaner if just U+002E is allowed. It simplifies
parsing a lot and is closer to what programs are used to.
Also, there are many more dots in UCS that could just as well
be used.
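
To illustrate the difference in parsing effort, here is a minimal
Python sketch. It assumes the four separators that IDNA as published
in RFC 3490 lists (U+002E, U+3002, U+FF0E, U+FF61); the function
names are made up for this example.

```python
import re

# Assumption: the four label separators listed by IDNA (RFC 3490):
# full stop, ideographic full stop, fullwidth full stop,
# halfwidth ideographic full stop.
IDNA_DOTS = "\u002E\u3002\uFF0E\uFF61"


def split_labels_idna(name):
    """Split on any of the IDNA dots (hypothetical helper)."""
    return re.split("[" + IDNA_DOTS + "]", name)


def split_labels_strict(name):
    """Split on U+002E only, as proposed above (hypothetical helper)."""
    return name.split("\u002E")


mixed = "www\u3002example\uFF0Ecom"
print(split_labels_idna(mixed))    # three labels
print(split_labels_strict(mixed))  # one label, the other dots are kept
```

With U+002E as the only separator, every existing parser that
splits on "." keeps working unchanged.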

In general I can see three basic contexts for domain names:
1) free text
2) standard form in protocols
3) comparison form.

In 1), free text, you can write the name any way you like.

In 2) the form must be normalised.
Here only one "dot" should be allowed to separate labels
in a domain name, so that one well-defined character
separates labels.
Stringprep/nameprep/IDNA defines a normalisation that
does not fit here. Its NFKC, lower casing and character
mapping destroy too much of the original name.
NFKC has several kinds of mappings; those related to letters
are mostly fine, but those related to other types
of characters, like symbols or accents, are doubtful.
If symbols and accents are to be allowed in domain names,
they should only go through NFC. NFKC is irregular in its handling
of accents: some are expanded from the compact form and some
are retained in the compact form.
I would recommend that NFC be used, probably with some of
the elementary compatibility mappings for letters
added (like U+212B to U+00E5 and all fullwidth letters
to standard width).
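
As a rough sketch of what "NFC plus a few letter mappings" could
look like, the Python below applies NFC and then narrows only the
fullwidth Latin letters. The function name and the choice of exactly
which mappings to add are assumptions for illustration, not a
worked-out proposal. Note that NFC by itself already replaces U+212B
with U+00C5, so reaching U+00E5 is then a matter of case folding.

```python
import unicodedata


def nfc_plus_letters(label):
    """NFC, then fullwidth Latin letters to standard width.

    Hypothetical helper; the set of extra mappings is an assumption.
    """
    out = []
    for ch in unicodedata.normalize("NFC", label):
        cp = ord(ch)
        # Fullwidth A-Z (U+FF21..U+FF3A) and a-z (U+FF41..U+FF5A)
        # sit at a fixed offset of 0xFEE0 from their ASCII forms.
        if 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
            ch = chr(cp - 0xFEE0)
        out.append(ch)
    return "".join(out)


# Symbols: NFKC rewrites the trade mark sign as two letters,
# the sketch above leaves it alone.
print(ascii(unicodedata.normalize("NFKC", "\u2122")))  # 'TM'
print(ascii(nfc_plus_letters("\u2122")))               # '\u2122'

# Accents: NFKC expands ACUTE ACCENT (U+00B4) to SPACE + U+0301,
# but keeps GRAVE ACCENT (U+0060) as it is.
print(ascii(unicodedata.normalize("NFKC", "\u00B4")))  # ' \u0301'
print(ascii(unicodedata.normalize("NFKC", "\u0060")))  # '`'
```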


In 3) I think Stringprep/nameprep/IDNA goes too far.
Domain names should be compared case-insensitively
using simple case folding instead of the full plus some Turkish
folding that IDNA uses today. For example, small letter sharp s
should not be folded into ss.
This would simplify domain name matching a lot and
make it quicker and easier to implement.
(We do need a way to do approximate matching
in programs interacting with the user. In this case
small letter sharp s could be matched to ss, and
other complex matching rules could be used. But this
should not be the normal way for DNS.)
Is it good that IDNA goes too far and makes people think
this is the way to go? It will make it difficult to
fix later on.
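
The sharp s case can be shown with a small Python sketch. Python's
str.casefold() performs full case folding; the standard library has
no simple case folding, so the helper below only approximates it by
lower casing one character at a time and refusing any mapping that
changes the length. That approximation, and the function names, are
assumptions made for this example.

```python
def approx_simple_fold(label):
    """Approximate simple (one-to-one) case folding.

    Hypothetical helper: it is not driven by the Unicode
    CaseFolding.txt data, it only mimics the one-to-one property.
    """
    out = []
    for ch in label:
        low = ch.lower()
        # Keep the original character if lower casing would
        # produce more than one character.
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


def same_name_simple(a, b):
    """Compare with the approximate simple folding."""
    return approx_simple_fold(a) == approx_simple_fold(b)


def same_name_full(a, b):
    """Compare with full case folding (sharp s becomes ss)."""
    return a.casefold() == b.casefold()


print(same_name_simple("Example.COM", "example.com"))         # True
print(same_name_simple("Stra\u00DFe.example", "STRASSE.example"))  # False
print(same_name_full("Stra\u00DFe.example", "STRASSE.example"))    # True
```

The simple variant never changes the length of a label, which is
what makes it cheap to implement in a DNS-style comparison.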


   Dan