Re: [idn] Comments on IDNA/stringprep/nameprep



Kent Karlsson <kentk@md.chalmers.se> wrote:

> 1. stringprep and nameprep should be rejoined to a hostnameprep. They
> are only about host name preparation, not any other name preparation.

I think they are about domain name preparation.  Some domain names are
textual but not host names, and don't obey the host name syntax rules.
For example, _ldap._tcp.foo.net (see RFC 2782 about SRV records).  I see
no reason why IDNA should restrict its attention to host names when it
can work perfectly well for all textual domain names.
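
To make the distinction concrete, here is a minimal Python sketch of the
classic host name label syntax (letters, digits, hyphens).  The function
name and the regular expression are my own illustration, not text from
any draft:

```python
import re

# One host name label: letters, digits, and hyphens only, 1 to 63
# characters, with no hyphen at either end.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_host_name(name):
    """Return True if every dot-separated label obeys host name syntax."""
    return all(LDH_LABEL.fullmatch(label) for label in name.split("."))

print(is_host_name("www.foo.net"))
print(is_host_name("_ldap._tcp.foo.net"))
```

The SRV-style name fails the check because of the underscores, yet it is
still a perfectly good textual domain name.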

That said, there could still be arguments for recombining the stringprep
and nameprep documents.  But you haven't yet presented any.

> 2. hostnameprep should be applied to the *entire* hostname; i.e. the
> entire name should be 'mapped' in the same way *before* it is parsed
> into parts.

Can you present any arguments favoring that approach?  I can think of a
few problems with it:

Sometimes programs count the number of dots in a name.  Nameprep can
change the number of dots (see the "Co." character, for example).  This
would be asking for trouble.
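
A small Python sketch of the effect, using plain NFKC normalization as a
stand-in for the nameprep mapping step (an approximation: nameprep does
more than NFKC).  I am assuming the "Co." character meant here is U+33C7
SQUARE CO; the helper name is hypothetical:

```python
import unicodedata

def dot_counts(name):
    """Return the number of dots before and after NFKC normalization."""
    mapped = unicodedata.normalize("NFKC", name)
    return name.count("."), mapped.count(".")

# "acme" followed by U+33C7 SQUARE CO, then ".example"
name = "acme\u33c7.example"
before, after = dot_counts(name)
print(before, after)
```

Any program that counted the dots before the mapping would get a
different answer afterwards.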

After running nameprep on the whole string, then splitting into labels,
individual labels might be invalid Unicode strings.  For example, a
label might start with a combining character.  So you'd have to do
another round of checking on the individual labels anyway.
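
As a rough illustration (again with NFKC standing in for the full
nameprep mapping, and with a hypothetical helper name), this Python
sketch normalizes the whole string first and then looks for labels that
begin with a combining mark:

```python
import unicodedata

def labels_starting_with_mark(name):
    """Normalize the whole name, split it, and return the labels whose
    first character is a combining mark (general category M*)."""
    mapped = unicodedata.normalize("NFKC", name)
    return [
        label
        for label in mapped.split(".")
        if label and unicodedata.category(label[0]).startswith("M")
    ]

# U+0301 COMBINING ACUTE ACCENT directly after the dot
suspect = labels_starting_with_mark("foo.\u0301bar.net")
print(len(suspect))
```

The whole-string pass accepts the input, but the second label is not
something you would want to treat as a standalone string.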

In general, software often operates on labels individually, splitting
names, joining names, comparing labels, etc.  If IDNA were to do
anything to create interdependencies and interactions between labels, it
would be asking for trouble.

> 3. Various FULL STOPs should be mapped to FULL STOP

This is motivated by the previous suggestion, so we can put it aside for
now.

> 4. Various Pd (punctuation dash) should be mapped to HYPHEN-MINUS by
> hostnameprep.

Presumably this is to avoid visual ambiguity.  But the addition of the
Unicode repertoire creates a great many visual ambiguities (for example,
several characters look exactly like B).  Does this one really warrant
special treatment?  Why?
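
For instance, this Python sketch takes three capital letters that are
usually drawn identically and shows that NFKC keeps them distinct.  The
choice of these three characters is mine, purely for illustration:

```python
import unicodedata

# Latin B, Greek capital beta, Cyrillic capital ve
LOOKALIKES = ["B", "\u0392", "\u0412"]

def distinct_after_nfkc(chars):
    """Return the set of distinct strings left after NFKC normalization."""
    return {unicodedata.normalize("NFKC", ch) for ch in chars}

for ch in LOOKALIKES:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
print(len(distinct_after_nfkc(LOOKALIKES)))
```

Normalization does nothing for these, so mapping the dashes would fix
only one small corner of the problem.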

> 5. Symbols/punctuation/dingbats (except the hyphen-like dashes) should
> not be allowed [in host names]...  Punctuation in particular, in
> contexts where hostnames are embedded, may in future syntaxes use
> non-ASCII punctuation adjacent to the hostname.

That's a reasonable argument.  I argued something similar myself long
ago, but I didn't persuade the group.  Maybe it's too late now.

> 6. Hangul syllables (with conjoining characters, not non-conjoining
> compatibility characters) that represent the same syllable must be
> mapped to the same representation.  Due to unfortunate historic
> reasons, this no longer happens automatically with NFKC (though
> in drafts of NFKC it did).  Mappings should be added so that
> "syllabically" equivalent Hangul conjoining characters are mapped to a
> common representation.

I know nothing about this issue, but it sounds like you want the IETF to
redo Unicode work.  The Unicode folks know more about characters than we
do; I'm skeptical that we could do a better job on that sort of thing.

> 7. No document associated with hostnameprep should make any further
> restrictions on domain/host names than hostnameprep itself.

That's simply impossible.  Nameprep has no way of knowing whether the
name is too long.  You can't know that until after Punycode has been
applied.  Also, nameprep is not designed to apply special rules at the
beginning/end of the string, like prohibiting leading/trailing hyphens,
or prohibiting the IDNA prefix.  Those checks are better left where they
are now, in ToASCII.
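
A minimal Python sketch of the length problem, using the standard
library's "punycode" codec.  It skips nameprep entirely, and the prefix
"xn--" and the helper names are assumptions made for the example:

```python
MAX_LABEL_OCTETS = 63   # DNS limit on a single label
ACE_PREFIX = "xn--"     # assumed IDNA prefix, for illustration only

def to_ace_label(label):
    """Encode one label with Punycode if it contains non-ASCII characters."""
    if all(ord(ch) < 128 for ch in label):
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def fits_in_dns(label):
    """The length can only be checked on the encoded form."""
    return len(to_ace_label(label)) <= MAX_LABEL_OCTETS

label = "a" * 59 + "\u00fc"   # 60 characters, well under 63
print(len(label), len(to_ace_label(label)), fits_in_dns(label))
```

The label looks short enough before encoding, but the prefix, the
delimiter, and the encoded tail push it over the limit, which is why the
check has to sit after Punycode.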

> 8. Note: The SC/TC issue cannot be solved at a near-impossible-
> to-change (once deployed) technical level, but should instead be
> solved at a policy level

I tend to agree.

> 9. User interfaces that encounter mixed script hostname *parts* should
> be recommended to "flag" them (balloon warning, color differentiate,
> make blinking, bounce automatic registrations, ...).

Generally a good idea, although some scripts might be mixed often.
Maybe just talk about "labels containing combinations of characters
likely to be misleading", and give a couple of examples, and leave it to
the software developers to figure out what in general is misleading and
what isn't.
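
As one example of what a developer might try, here is a crude Python
heuristic that guesses a script from the first word of each letter's
Unicode character name.  That is my own shortcut, not a real script
property lookup, and the function names are hypothetical:

```python
import unicodedata

def scripts_in_label(label):
    """Guess the scripts used in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def looks_mixed(label):
    """Flag a label whose letters appear to come from several scripts."""
    return len(scripts_in_label(label)) > 1

# "paypal" with U+0430 CYRILLIC SMALL LETTER A in second position
print(sorted(scripts_in_label("p\u0430ypal")), looks_mixed("p\u0430ypal"))
```

A check this blunt would also flag ordinary Japanese labels that mix
kana and ideographs, which is exactly why the decision about what counts
as misleading is better left to the software.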

AMC