
RE: [idn] Comments on IDNA/stringprep/nameprep





> -----Original Message-----
> From: Adam M. Costello
> Sent: den 8 februari 2002 04:06
> To: IETF idn working group
> Subject: Re: [idn] Comments on IDNA/stringprep/nameprep
> 
> 
> Kent Karlsson <kentk@md.chalmers.se> wrote:
> 
> > 1. stringprep and nameprep should be rejoined to a hostnameprep. They
> > are only about host name preparation, not any other name preparation.
> 
> I think they are about domain name preparation.  Some domain names are
> textual but not host names, and don't obey the host name syntax rules.
> For example, _ldap._tcp.foo.net (see RFC 2782 about SRV records).  I see
> no reason why IDNA should restrict its attention to host names when it
> can work perfectly well for all textual domain names.

A *delta* of the hostnameprep would/should be used for SRV records;
i.e. it would use hostnameprep with a specified modification:
"LOW LINE is not prohibited", plus whatever other changes are needed
for SRV records compared to hostnames, listing only the differences
(that is, a delta), not the full result of applying the delta.
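
A minimal sketch of what "only listing the differences" could look like.
Everything here is hypothetical: the base prohibited set is a stand-in (not
the real nameprep tables), and ProfileDelta, SRV_DELTA and is_allowed are
names invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the base profile's prohibited characters.
HOSTNAMEPREP_PROHIBITED = frozenset("_ !\"#$%&'()*+,/:;<=>?@[\\]^`{|}~")


@dataclass(frozen=True)
class ProfileDelta:
    """Only the differences relative to a base profile."""
    unprohibit: frozenset = frozenset()
    prohibit: frozenset = frozenset()

    def apply(self, base: frozenset) -> frozenset:
        # Remove what the delta permits, add what it newly forbids.
        return (base - self.unprohibit) | self.prohibit


# The SRV delta states one change: "LOW LINE is not prohibited".
SRV_DELTA = ProfileDelta(unprohibit=frozenset("_"))
SRV_PROHIBITED = SRV_DELTA.apply(HOSTNAMEPREP_PROHIBITED)


def is_allowed(label: str, prohibited: frozenset) -> bool:
    """True if no character of the label is in the prohibited set."""
    return not any(ch in prohibited for ch in label)
```

With these assumed tables, "_ldap" is rejected under the base set and
accepted under the SRV set, while the document for SRV names only has to
spell out SRV_DELTA.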


> That said, there could still be arguments for recombining the stringprep
> and nameprep documents.  But you haven't yet presented any.

I'm not sure why the document was split into "stringprep" and a
"profile".  At least I would find a "hostnameprep" with deltas
for similar domain name preps a much more manageable approach.



> > 2. hostnameprep should be applied to the *entire* hostname; i.e. the
> > entire name should be 'mapped' in the same way *before* it is parsed
> > into parts.
> 
> Can you present any arguments favoring that approach?  I can 

Point 3 was the motivation.  Also: most users are likely to see a
hostname as an entity that would be treated the same by the system.
If fullwidth letters are perfectly ok to enter (mapped to normal width
by nameprep), why is FULLWIDTH FULL STOP not ok?  I think most users
that encounter that or similar (IDEOGRAPHIC FULL STOP in particular)
will find the currently specified behaviour idiosyncratic.


> think of a
> few problems with it:
> 
> Sometimes programs count the number of dots in a name.  Nameprep can
> change the number of dots (see the "Co." character, for example).
> This would be asking for trouble.

The "counting of dots" for an IDN should be done (as if) after the
mapping+NFKC, not before.
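
As an illustration only, here is the difference between counting before and
after normalization, using Python's standard unicodedata module. Only the
NFKC step is shown, not the full nameprep mapping; the function names are
made up for this sketch.

```python
import unicodedata


def count_dots_raw(name: str) -> int:
    """Count FULL STOPs in the name as entered."""
    return name.count(".")


def count_dots_prepared(name: str) -> int:
    """Count FULL STOPs (as if) after NFKC has been applied."""
    return unicodedata.normalize("NFKC", name).count(".")


# U+33C7 SQUARE CO has the compatibility decomposition "Co.",
# so NFKC introduces a dot that was not in the input.
square_co_name = "example\u33c7com"

# U+FF0E FULLWIDTH FULL STOP normalizes to U+002E FULL STOP.
fullwidth_name = "www\uff0eexample.com"
```

Counting on the raw strings gives 0 and 1 dots; counting after NFKC gives 1
and 2.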



> After running nameprep on the whole string, then splitting into labels,
> individual labels might be invalid Unicode strings.  For example, a
> label might start with a combining character.  So you'd have to do
> another round of checking on the individual labels anyway.

Yes, so?  "The other round" here does completely different things
than the "first round".  What's the problem?
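
A rough sketch of such a second round, assuming the only per-label check is
"does the label start with a combining mark". The function name is
hypothetical and NFKC stands in for the whole first round.

```python
import unicodedata


def labels_with_leading_mark(name: str) -> list:
    """First round: normalize the whole name.
    Second round: split into labels and report those that
    begin with a combining mark (general category M*)."""
    prepared = unicodedata.normalize("NFKC", name)
    return [
        label
        for label in prepared.split(".")
        if label and unicodedata.category(label[0]).startswith("M")
    ]
```

For "a.\u0301b.c" this reports the middle label, since U+0301 COMBINING
ACUTE ACCENT directly follows the dot; for "cafe\u0301.example" it reports
nothing, because the accent composes with the preceding letter.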


> In general, software often operates on labels individually, splitting
> names, joining names, comparing labels, etc.  If IDNA were to do
> anything to create interdependencies and interactions between labels, it
> would be asking for trouble.

Where was that implied in my suggestion?  As I said, parsing into parts
must be done as if after hostnameprep (most implementations may actually
do it after hostnameprep).  



> > 3. Various FULL STOPs should be mapped to FULL STOP
> 
> This is motivated by the previous suggestion,

No, it's the other way around: this point motivates point 2.
See above.
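
A small sketch of points 2 and 3 taken together: map dot-like characters to
FULL STOP on the entire name, then split. Which characters count as "various
FULL STOPs" is an assumption here; the three listed are only examples.

```python
# Assumed set of dot-like characters, each mapped to U+002E FULL STOP.
DOT_LIKE = {
    0x3002: ".",  # IDEOGRAPHIC FULL STOP
    0xFF0E: ".",  # FULLWIDTH FULL STOP
    0xFF61: ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def split_labels(name: str) -> list:
    """Map the whole name first, then parse it into parts."""
    return name.translate(DOT_LIKE).split(".")
```

With this mapping, "www\uff0eexample\u3002com" splits into the same three
labels as "www.example.com".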


> > 4. Various Pd (punctuation dash) should be mapped to HYPHEN-MINUS by
> > hostnameprep.
> 
> Presumably this is to avoid visual ambiguity. But there are many many
> visual ambiguities created by the addition of the Unicode repertoire
> (for example, several characters that look exactly like B).  Does this
> one really warrant special treatment?  Why?

If you had read my point 4 in its entirety, you would have found
the motivation: "Future keyboards may generate HYPHEN rather than HYPHEN-MINUS
(except perhaps in "programming language mode", which few will use).
At least, hostnameprep should not prevent such a development."
What is not clear about that motivation?  It is not the business of
the IDN WG to prevent otherwise possibly desired developments of
keyboards.  The current suggestion puts up an unnecessary stumbling
block for this at least, which is inappropriate for this WG to do.
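
For illustration, such a mapping could be sketched as below. Mapping every
character of general category Pd is an assumption made for brevity; the
suggestion is about "various" Pd characters, and a real table would list
them explicitly.

```python
import unicodedata


def map_dashes(text: str) -> str:
    """Replace every Pd (dash punctuation) character with HYPHEN-MINUS."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )
```

Here "my\u2010host" (with U+2010 HYPHEN) becomes "my-host", while U+2212
MINUS SIGN is left alone because its category is Sm, not Pd.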


> > 5. Symbols/punctuation/dingbats (except the hyphen-like dashes) should
> > not be allowed [in host names]...  Punctuation in particular, in
> > contexts where hostnames are embedded, may in future syntaxes use
> > non-ASCII punctuation adjacent to the hostname.
> 
> That's a reasonable argument.  I argued something similar myself long
> ago, but I didn't persuade the group.  Maybe it's too late now.

Again, it is not the business of the IDN WG to put up stumbling blocks
for developments that may otherwise be desirable.  It is quite
conceivable that future designers of syntaxes that embed host names
may wish to do so using non-ASCII punctuation and symbols adjacent
to the hostname (before being ACE encoded).  With the current proposal
that is, without reason, prevented.


> > 6. Hangul syllables (with conjoining characters, not non-conjoining
> > compatibility characters) that represent the same syllable must be
> > mapped to the same representation.  Due to unfortunate historic
> > reasons, this no longer happens automatically with NFKC (though
> > in drafts of NFKC it did).  Mappings should be added so that
> > "syllabically" equivalent Hangul conjoining characters are mapped to a
> > common representation.
> 
> I know nothing about this issue, but it sounds like you want the IETF to
> redo Unicode work.  The Unicode folks know more about characters than we
> do; I'm skeptical that we could do a better job on that sort of thing.

This is due to a compatibility mapping removal.  The compatibility mapping
was present in Unicode 2.1, but was removed for Unicode 3.0, not for
Hangul specific reasons, but for technical trouble with the NFKC definition.
A better, and for Hangul more reasonable resolution would have been to
make those compatibility mappings canonical instead, which would have
been appropriate both for Hangul and for the NFKC definition issue.

This is again an issue of not putting up stumbling blocks for future
possible developments.  Future Hangul keyboards *may* (if so desired)
generate just single letter jamos (Hangul is an alphabetic script
with 14 consonant letters and 11 vowel letters; the alphabet *itself*
does not have hundreds/thousands of characters).
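
A sketch of the kind of mismatch meant here, using Python's unicodedata. The
pair below (a doubled KIYEOK versus the single SSANGKIYEOK jamo) is an
assumed example of "syllabically" equivalent input, chosen for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


# One leading-consonant jamo: U+1101 HANGUL CHOSEONG SSANGKIYEOK, then A.
single_jamo = "\u1101\u1161"

# The same consonant typed as two single letters: KIYEOK + KIYEOK, then A.
doubled_jamo = "\u1100\u1100\u1161"

# Current NFKC composes the first into one syllable but does not
# unify the second with it.
same_after_nfkc = nfkc(single_jamo) == nfkc(doubled_jamo)
```

With today's normalization data same_after_nfkc is False, so two inputs a
user may regard as the same syllable end up as different names unless
additional mappings are added.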



> > 7. No document associated with hostnameprep should make any further
> > restrictions on domain/host names than hostnameprep itself.
> 
> That's simply impossible.  Nameprep has no way of knowing whether the
> name is too long.  You can't know that until after Punycode has been
> applied.  Also, nameprep is not designed to apply special rules at the
> beginning/end of the string, like prohibiting leading/trailing hyphens,
> or prohibiting the IDNA prefix.  Those checks are better left where they
> are now, in ToASCII.

Let me rephrase:  "No document associated with hostnameprep should
make any further restrictions on the principal makeup of domain/host
names than hostnameprep itself. They may make size or tagging restrictions
though (those aspects would normally not be directly visible to the user)."
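
To illustrate why a size restriction sits outside the prep step: the length
that matters is only known after ACE encoding. The sketch below uses Python's
built-in "punycode" codec; the "xn--" prefix and the 63-octet label limit are
stated as assumptions, and the function names are invented.

```python
ACE_PREFIX = "xn--"      # assumption: the tag used to mark encoded labels
MAX_LABEL_OCTETS = 63    # assumption: the usual DNS label limit


def ace_length(label: str) -> int:
    """Length of the label as it would appear on the wire."""
    if label.isascii():
        return len(label)
    return len(ACE_PREFIX) + len(label.encode("punycode"))


def too_long(label: str) -> bool:
    return ace_length(label) > MAX_LABEL_OCTETS
```

A label of 62 copies of U+00FC has only 62 characters, yet it is too long
once encoded, because the prefix plus at least one digit per non-ASCII
character already exceeds 63 octets.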



> > 9. User interfaces that encounter mixed script hostname *parts* should
> > be recommended to "flag" them (balloon warning, color differentiate,
> > make blinking, bounce automatic registrations, ...).
> 
> Generally a good idea, although some scripts might be mixed often.
> Maybe just talk about "labels containing combinations of characters
> likely to be misleading", and give a couple of examples, and leave it to
> the software developers to figure out what in general is misleading and
> what isn't.

That's fine by me.
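
A crude sketch of how a user interface might detect a mixed script label.
Python's standard library has no script property, so the first word of each
letter's Unicode character name is used as a stand-in; that heuristic and
both function names are assumptions, not a recommendation.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    return len(scripts_in_label(label)) > 1
```

This flags "p\u0430ypal" (Latin with a Cyrillic U+0430) but not "paypal". It
would also flag combinations that are routinely mixed, such as kana with CJK
ideographs, which is the reason for leaving the final judgement to the
software developers.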


		/Kent K

> AMC