Re: [idn] Re: IDNA: is the specification proper,adequate, and complete?



--On Tuesday, 18 June, 2002 08:05 -0700 Mark Davis
<mark@macchiato.com> wrote:

>> The bad news is that, if this character [form] is observed in
>> a written string, or a non-Unicode coding system that does not
>> distinguish between Eszett and "small letter [Greek] beta",
>> one really needs to have language context in order to
>> determine which Unicode character to map it to, especially
>> because, given stringprep, the consequences of getting it
>> wrong are likely to be very significant.
> 
> A few items here.
> 
> 1. While there are many Unicode characters with very similar
> shapes, beta and eszed -- in normal fonts -- are really no
> more alike than y and gamma (or, for that matter, capital I,
> lowercase L, and 1!!). Beta typically has a descender, while
> eszed does not. See
> http://www.macchiato.com/unicode/beta-eszed.htm for a GIF
> image from three fonts.

As you, and I, and others, have noted in the past, there are far
better examples than beta and eszett (or eszed if you prefer).
But there are certainly fonts (which I assume you would classify
as "not normal") -- designed in isolation for the two languages
-- whose interpretations of the two characters, if placed in
text of the other language, would not catch the eye of a casual
reader as being out of place.   And I would assume that, in the
long history of manual typesetting, there have been instances of
Eszett being substituted for beta (probably in German texts
containing mathematics because a lazy typesetter was disinclined
to walk across to a different font).  None of this, of course,
challenges your fundamental argument, with which I agree and
which I hope my note was careful not to challenge.

> 2. There are no "non-Unicode coding systems" that unify beta
> and eszed; the language issue is irrelevant.

Sure there are.  We call some of them "books".  Transcription of
a language into printed form involves a coding system.  And I
have to assume, although I can claim no personal knowledge, that
German schoolchildren, brought up looking at Eszett, have to be
taught, when they encounter mathematical notation that uses
Greek characters (if not sooner), that it is important to notice
either the context or the descender -- that the two characters
are not the same.  These distinctions, including getting used to
the variations and similarities in different fonts of I-l-1, are
bits of pattern recognition that lay people --as distinct from
font or character set experts-- rapidly learn, within their own
language and script contexts, to distinguish from context or by
relatively subtle clues.  I can't even spell out Arabic or Thai
scripts because I don't have enough experience with the right
set of clues -- my loss, but these are learned skills.

But this isn't the point, so whether there are, or are not,
coded character sets that unify the two is not the point either
(I'll defer to your knowledge and experience on this subject,
since I haven't studied the question, but statements that sound
like universal negatives always scare me).  Your third comment
_is_ a key part of the point.

> 3. As I pointed out some time ago on this list, it is *not*
> rocket science to provide a user interface that makes it very
> clear to people if there are mixed scripts in a domain name;
> and also simple to extend it to other confusables. For a demo,
> see
> http://www.macchiato.com/utc/show_script.html.
> 
> I have recommended and continue to recommend that the IDNA
> documents contain some wording on this, something like:
> 
> "To help prevent confusion between characters that are visually
> similar, it is recommended that implementations provide visual
> indications where a domain name contains multiple scripts. Such
> mechanisms can also be used to show when a name contains a
> mixture of simplified and traditional characters, or to
> distinguish zero and one from O and l."

Mark,

I ultimately have only three problems with IDNA and the IDN
proposals taken as a group.  The first is a technical one and
applies to IDNA specifically.  The others are problems that
would probably extend to any in-DNS substitute:

(i) It makes assertions about applicability that I believe are
over-broad and risky, to no good benefit.  I think we are
nearing consensus on that one and, in any event, that reviewing
it again wouldn't accomplish much.

(ii) It is addressed to, and solves, a very narrow problem.  We
(for some definition of "we") have not been explicit, in an
Internet context, as to what that problem is.  I believe that we
should be explicit.  Then, having carefully described that
problem, we need to carefully evaluate the question of
whether the benefits of solving it outweigh the risks to the use
of the DNS in the Internet community that it might pose.  If we
conclude that we can't reasonably do that evaluation (e.g.,
because it isn't an IETF problem), then I think we are still
obligated to delineate the issues and risks to the best of our
ability -- at least to the extent of writing down the
implications of problems and issues we already know about.

(iii) A number of items of knowledge and recommendations have
surfaced in the working group -- of which your suggestion above
is an excellent example -- that could be used to reduce or
eliminate some of those risks to the DNS as a piece of usable
Internet infrastructure.  I think they need to be written down
as part of WG output, if only because "this risk can be
ameliorated if one does so-and-so" is a much more satisfactory
statement than "there is this horrible problem and we should
consider stopping progress until someone has a solution".

Of course, if "mixed scripts in domain names" are considered
good things, warning when they occur won't help much.   But
_that_ one, I would contend, is not an IETF problem although I
think it would be wise and responsible for us to point out that
mixed script labels pose challenges that homogeneous ones do not.
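
As a rough illustration of the kind of mixed-script check suggested
in item 3 above, here is a minimal Python sketch.  It is not taken
from the demo page or from any IDNA document.  The helper names
(rough_script, scripts_in_label, is_mixed_script) are hypothetical,
and the script of a character is only approximated from the first
word of its Unicode character name, which is an assumption of this
sketch and not the Unicode Script property.

```python
import unicodedata


def rough_script(ch):
    """Approximate the script of one character (heuristic, see above).

    Uses the first word of the Unicode character name, e.g.
    "GREEK SMALL LETTER BETA" -> "GREEK". Digits and the hyphen are
    treated as script-neutral and yield None.
    """
    if ch.isdigit() or ch == "-":
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in_label(label):
    """Return the set of approximate scripts used in a single label."""
    return {s for s in map(rough_script, label) if s is not None}


def is_mixed_script(label):
    """True if the label draws on more than one approximate script."""
    return len(scripts_in_label(label)) > 1


# Eszett (U+00DF) is a Latin letter, so this label is homogeneous.
print(is_mixed_script("stra\u00dfe"))
# Greek beta (U+03B2) in an otherwise Latin label is flagged.
print(is_mixed_script("stra\u03b2e"))
```

A user interface could use a check of this sort to decide when to
show a visual indication, leaving the decision about whether the
name is acceptable to the user or to registry policy.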

regards,
     john