Re: [idn] Document Status?



Disclaimer:  In a few places in this message, when I ask if an alternate
phrasing would be less confusing, I am not offering to make changes to
the draft.  Paul and Patrik and I would have to discuss it, and I would
understand if they think it's too late for that.

"JFC (Jefsey) Morfin" <jefsey@jefsey.com> wrote:

> You also refers to external knowledge such as "punnycode".

Punycode (one n).  It is used only inside the ToASCII and ToUnicode
operations.  If you are implementing those operations, or trying to
understand their internals, you will need to refer to the Punycode spec
and the Nameprep spec.  Otherwise, you can regard ToASCII and ToUnicode
as abstract operations, and still understand IDNA.

> Should we not have a clear terminology section defining every word or
> concept we are going to use?

Yes.  That is what we intend section 2 to be.

> All I know is semantically "international names" could only means the 
> names which are unaltered by the ACE process.                         

Which ACE process?  There are two operations, ToASCII and ToUnicode,
which do very different things.  What they do is stated in the first
paragraphs of the sections where they are defined (4.1 and 4.2):

    The ToASCII operation takes a sequence of Unicode code points that
    make up one label and transforms it into a sequence of code points
    in the ASCII range (0..7F). If ToASCII succeeds, the original
    sequence and the resulting sequence are equivalent labels.

    The ToUnicode operation takes a sequence of Unicode code points that
    make up one label and returns a sequence of Unicode code points. If
    the input sequence is a label in ACE form, then the result is
    an equivalent internationalized label that is not in ACE form,
    otherwise the original sequence is returned unaltered.
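
For readers who want something concrete, here is a rough Python sketch
of the two operations as described in those paragraphs.  It is an
illustration under assumptions, not the draft's normative steps: it
borrows nameprep() and the "punycode" codec from Python's standard
library (which follow the published RFCs and may differ from the drafts
discussed here), it uses the IESG-- prefix from this thread, and it
skips the optional ASCII hostname checks.  The helper names is_ascii,
has_ace_prefix, to_ascii and to_unicode are made up for this sketch.

```python
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "IESG--"  # prefix used in this thread's examples (assumption)


def is_ascii(s: str) -> bool:
    return all(ord(c) < 0x80 for c in s)


def has_ace_prefix(s: str) -> bool:
    # The prefix is compared without regard to ASCII case.
    return s[:len(ACE_PREFIX)].lower() == ACE_PREFIX.lower()


def to_ascii(label: str) -> str:
    """Return an ASCII label equivalent to `label`, or raise UnicodeError."""
    if not is_ascii(label):
        label = nameprep(label)  # case-folds, normalizes, may reject
    if not is_ascii(label):
        if has_ace_prefix(label):
            raise UnicodeError("non-ASCII label begins with the ACE prefix")
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if not 1 <= len(label) <= 63:
        raise UnicodeError("label length must be between 1 and 63")
    return label


def to_unicode(label: str) -> str:
    """Never fails: return the decoded label, or `label` itself unaltered."""
    try:
        work = label if is_ascii(label) else nameprep(label)
        if not has_ace_prefix(work):
            return label
        decoded = work[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        if to_ascii(decoded).lower() != work.lower():
            return label  # round trip does not match: not an ACE label
        return decoded
    except UnicodeError:
        return label


print(to_ascii("helloworld"))               # ASCII input comes back unchanged
print(to_ascii("\u00e0"))                   # small a with grave becomes an ACE label
print(to_unicode("IESG--0ca") == "\u00e0")  # and the ACE label decodes back
```

Note that to_unicode catches every failure and hands back its input,
which is how "ToUnicode never fails" shows up in this sketch.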

> "Here I understand they mean the Unicode scripting which can be ACEd",
> so should it not be at least be "internationalizable" (no action
> occurred yet)?

Various arguments can be made that we are misusing the word
"internationalized".  For better or worse, we use it to mean roughly
"able to support the use of non-ASCII characters".  In any case, we give
definitions for "internationalized domain name" and "internationalized
label", so you can forget any preconceived ideas about what
"internationalized" ought to mean, and trust the definitions.

> Also I understand that that the RFC deals with the ACE
> process. Terminology section should therefore
>
> 1. define the ACE process

I'm not sure what you mean by "the ACE process".  There are two
operations, ToASCII and ToUnicode, which are complex enough to warrant
their own section (section 4).

> 2. define the group of the applications requiring ACE

I think what is really needed is not the set of applications requiring
ACE, but rather the set of places where ACE is needed.  That is already
there in the terminology section, under "IDN-unaware domain name slot".
Requirement 2 of section 3 (3.1 in the forthcoming idna-11 draft) states
that only ASCII characters are permitted in IDN-unaware domain name
slots.

> 3. define the international names : left unchanged by the process

Which process?  The labels left unchanged by ToASCII are simply the
ASCII labels, so we don't need a special term for that.  The labels left
unchanged by ToUnicode are the non-ACE labels, so we already have a term
for that.

> 4. define the non ACEable names

I'm not sure what you mean by ACEable.  Every ASCII label is non-ACEable
in the sense that ToASCII will not alter it.  Some ASCII labels are ACE
labels themselves, and some are not.  But all of them are left unchanged
by ToASCII.

For every valid non-ASCII label, there is an equivalent ACE label.

Maybe you are asking for a definition of valid IDNs?  That is given.

> 5. define the ACE labels

Done.

> 6. define the class of all the labels (ACE and International) which
> can be processed by an application requiring ACE

I don't know what you mean by "an application requiring ACE".  We do
give a definition of IDN.

> >    An "internationalized label" is a label composed of characters from
> >    the Unicode character set; note, however, that not every string of
> >    Unicode characters can be an internationalized label.
> 
> to me this is an Unicode label/script?  No action occurred yet?

An internationalized label is a sequence of Unicode characters.  Given
a Unicode string, can it be an internationalized label?  The answer is
provided in the definition of IDN:

    An "internationalized domain name" (IDN) is a domain name for which
    the ToASCII operation (see section 4) can be applied to each label
    without failing.

"can be applied", not "has been applied".  A Unicode string X can be an
internationalized label if and only if ToASCII(X) does not fail.

Hmmm, one fact about internationalized labels is that every ASCII label
that can be used in ASCII domain names is also an internationalized
label that can be used in internationalized domain names.  In other
words, the set of internationalized labels is an extension (superset) of
the set of valid ASCII labels.  Would it have been helpful if this fact
were stated more prominently?
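
The "if and only if ToASCII(X) does not fail" test can be written down
directly.  The sketch below is only an approximation under stated
assumptions: to_ascii is a simplified stand-in for the draft's ToASCII,
built on Python's standard-library nameprep() and "punycode" codec, with
the IESG-- prefix from this thread and without the optional ASCII
hostname checks.  The function names are hypothetical.

```python
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "IESG--"  # prefix used in this thread's examples (assumption)


def is_ascii(s: str) -> bool:
    return all(ord(c) < 0x80 for c in s)


def to_ascii(label: str) -> str:
    """Simplified ToASCII: return an ASCII label or raise UnicodeError."""
    if not is_ascii(label):
        label = nameprep(label)
    if not is_ascii(label):
        if label[:len(ACE_PREFIX)].lower() == ACE_PREFIX.lower():
            raise UnicodeError("non-ASCII label begins with the ACE prefix")
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if not 1 <= len(label) <= 63:
        raise UnicodeError("label length must be between 1 and 63")
    return label


def is_internationalized_label(label: str) -> bool:
    """True if ToASCII *can be* applied; nothing has to have been applied."""
    try:
        to_ascii(label)
    except UnicodeError:
        return False
    return True


print(is_internationalized_label("helloworld"))  # plain ASCII label
print(is_internationalized_label("\u00e0"))      # non-ASCII label
print(is_internationalized_label(""))            # empty label: ToASCII fails
print(is_internationalized_label("a" * 64))      # too long: ToASCII fails
```

In this sketch every ASCII label of legal length passes, which is the
superset property described above.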

> >    To allow internationalized labels to be handled by existing
> >    applications, IDNA uses an "ACE label" (ACE stands for ASCII
> >    Compatible Encoding), which can be represented using only ASCII
> >    characters but is equivalent to a label containing non-ASCII
> >    characters.
>
> IMHO, let stop at "ACE label". And let define what an ACE label,
> without "can" which implies there would be other ways.

I think the word "represented" is confusing you.  In my mind, "Foo"
cannot be represented using only lowercase ASCII letters, but "foo" is
an equivalent label that can be represented using only lowercase ASCII
letters.  Similarly, <sono><supiido><de> cannot be represented using
only ASCII characters, but IESG--d9juau41awczczp is an equivalent label
that can be represented using only ASCII characters.

Unfortunately, the document is not consistent in its usage of the word
"represent".  Would this sentence be less confusing if worded as:

    ...IDNA uses an "ACE label" (ACE stands for ASCII Compatible
    Encoding), which is composed of ASCII characters but is equivalent
    to a label containing non-ASCII characters.

?

That's less precise (because there are non-ASCII characters that can
be represented by ASCII characters, like the fullwidth characters
FF01..FF5E), but I suppose the imprecision might be tolerable, since the
rigorous definition immediately follows.

> All this added explanation is external and only adds to the text.  Let
> us try to compact it into one single initial crystal clear definition?

The only crystal clear definition is this one:

    More rigorously, an ACE label is defined to be any label that the
    ToUnicode operation would alter.

However, here's another attempt at providing some intuition:  All
non-ASCII internationalized labels are intended to be displayed to
users without change.  The same is true of most ASCII labels.  But
there are some ASCII labels, called ACE labels, that are not intended
to be displayed directly to users.  ACE labels begin with the ACE
prefix and look like gibberish.  Every ACE label is equivalent to a
non-ASCII label, which is what is intended to be displayed instead.
For every non-ASCII internationalized label there is an equivalent ACE
label.  [I'm being loose with the term "ASCII".  Unicode characters that
are compatibly equivalent to ASCII characters (like those fullwidth
characters) count as ASCII for the purposes of this paragraph.]

> >   More rigorously, an ACE label is defined to be any label that the
> >   ToUnicode operation would alter.
>
> So an ACE label is here defined negatively.  Feeling is that it means
> "when ToUnicode will fail".

I don't see what's negative about that definition.  I see no mention of
failure.  Indeed, the second paragraph of the ToUnicode section (4.2)
says "ToUnicode never fails."

> When we mean that the ToUnicode (sucessfully) transform into an ACE
> label.

ToUnicode can never output an ACE label.  The first paragraph of the
ToUnicode section says:

    If the input sequence is a label in ACE form, then the result is
    an equivalent internationalized label that is not in ACE form,
    otherwise the original sequence is returned unaltered.

Here is a math-notation version of the definition of ACE label:

    { ACE labels } = { X : X != ToUnicode(X) }

> >    For every internationalized label that cannot be directly
> >    represented in ASCII, there is an equivalent ACE label.
>
> My first reading was puzzling: "whenever the ACE process will not
> work, there will be an pre-existing equivalent ACE label".  This
> obviously does not make any sense.

Again, I think the word "represented" is confusing you, although I
thought the word "directly" would help.  Would it be less confusing if
it were reworded the same way as before [again using the loose sense of
ASCII]:

    For every internationalized label containing non-ASCII characters,
    there is an equivalent ACE label.

?

> >    In IDNA, equivalence of labels is defined in terms of the ToASCII
> >    operation, which constructs an ASCII form for a given label.
> 
> this means ACE label.

No, ToASCII(helloworld) == helloworld, which is not an ACE label.  The
output of ToASCII is always an ASCII label, but not always an ACE label.
The output is an ACE label only if the input was non-ASCII, or the input
was an ACE label.  [Here again I'm using the loose sense of ASCII.]

> >   Labels are defined to be equivalent if and only if their ASCII
> >   forms produced by ToASCII match using a case-insensitive ASCII
> >   comparison.
>
> Then International names - ie non modified names by the ACE process
> (now reduced to ToASCII only(?), should it not also be reversible and
> the ToUNICODE results in the original scripting?) - cannot be compared
> (I know it is wrong, but this is what I read here.

That does not follow.  The definition of equivalence says:

    For all X,Y  define  X ~ Y  to mean  ToASCII(X) matches ToASCII(Y)
    under a case-insensitive ASCII comparison

You are wondering about "non modified names", that is, labels Z for
which Z == ToASCII(Z).  That does not impede us from performing the
comparison.  We compute ToASCII(X) (which returns X), and we compute
ToASCII(Y) (which returns Y), and we compare the results using a
case-insensitive ASCII comparison (which we can do because the output of
ToASCII is an ASCII string).
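
Here is the comparison spelled out as a small Python sketch.  As before
it rests on assumptions: to_ascii is a simplified stand-in for ToASCII
using Python's standard-library nameprep() and "punycode" codec, the
prefix is the IESG-- used in this thread, and the names to_ascii and
equivalent are invented for the illustration.

```python
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "IESG--"  # prefix used in this thread's examples (assumption)


def is_ascii(s: str) -> bool:
    return all(ord(c) < 0x80 for c in s)


def to_ascii(label: str) -> str:
    """Simplified ToASCII: return an ASCII label or raise UnicodeError."""
    if not is_ascii(label):
        label = nameprep(label)
    if not is_ascii(label):
        if label[:len(ACE_PREFIX)].lower() == ACE_PREFIX.lower():
            raise UnicodeError("non-ASCII label begins with the ACE prefix")
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    if not 1 <= len(label) <= 63:
        raise UnicodeError("label length must be between 1 and 63")
    return label


def equivalent(x: str, y: str) -> bool:
    """Compare the ToASCII outputs, ignoring ASCII case."""
    return to_ascii(x).lower() == to_ascii(y).lower()


# <sono><supiido><de> written as code points, to keep this file ASCII
sono = "\u305d\u306e\u30b9\u30d4\u30fc\u30c9\u3067"

print(equivalent("example", "EXAMPLE"))    # labels that ToASCII leaves alone
print(equivalent("\u00c0", "\u00e0"))      # capital and small a with grave
print(equivalent(sono, to_ascii(sono)))    # a label and its own ACE form
print(equivalent("IESG--3ba", "\u00e0"))   # not equivalent
```

The first call shows the point made above: both inputs come back from
to_ascii unchanged, and the comparison still works.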

> >    Traditional ASCII labels
> 
> What is "traditional".  Has it been defined?

Just a regular English word, not a term.  "Traditional ASCII labels"
means "the ASCII labels that we've all grown familiar with over the
years".

> >    already have a notion of equivalence: upper case and lower case
> >    are considered equivalent.  The IDNA notion of equivalence is an
> >    extension of the old notion.
>
> Old notion?  Is that the correct wording?  Is that not the DNS current
> and stable notion?

Yes, the current notion is old, having been with us for many years.
Would "older notion" be less alarming?

> > The Japanese phrase <sono><supiido><de> (pretend I wrote it using
> > kana, which are non-ASCII characters) could be an internationalized
> > label.  It is not an ACE label, because it cannot be represented in
> > ASCII.
>
> Well I thought that ACE label resulted from the ACE process and
> not the labels left identical by the ACE process (IMHO both ways:
> "iesg---name.com" is perfect ASCII, but if ToUNICODEd it will have no
> meaning).

ToUnicode will not alter the label IESG---name (remember that ToUnicode
takes a single label as input).  The output will be IESG---name.  In
this particular case, the Punycode decoding step will fail.  But see
below for a more interesting example, IESG--world.

> > If you feed it to ToUnicode, it will not be altered, because the
> > check for the ACE prefix will fail.
>
> If the registering process has prevented the creation of the "iesg--"
> names.

I was speaking of <sono><supiido><de>, which does not begin with the ACE
prefix.  It's true that if someone registers IESG--world, it will be
displayed by IDNA-conformant applications as three Chinese characters
(U+53DF U+53E0 U+53D9), because IESG--world is an ACE label.  Maybe
the user really wants a label that will display as "IESG--world", in
which case the user will be dismayed at the behavior of IDNA-conformant
applications.  Or maybe the user really intends to register the Chinese
label, but used the ACE form in the registration process because the
browser wouldn't let them enter Chinese characters, or because a clever
clipboard decided to paste the ACE form, or because the registrar has
not yet upgraded to support IDNA.  Those are all legitimate scenarios;
people who compute the ACE form themselves and register it directly
rather than relying on the registrar to compute it might in fact know
what they're doing.  Registrars that support IDNA and receive ACE forms
as input would probably do well to display them back in both ACE and
non-ACE forms and ask for confirmation.

When several labels are equivalent, you can register any one of them
and you own them all.  If I register "example.com" or "Example.com"
or "EXAMPLE.com", no one else can register any of them; they're all
mine.  Similarly, no matter whether I register <sono><supiido><de> or
IESG--d9juau41awczczp, no one else can register either one; they're both
mine.

> The punnycode is no part of the document and should be introduced.

It is mentioned at the end of section 1, in the steps of ToASCII, and
in the steps of ToUnicode.  Nameprep is mentioned in exactly the same
places, and nowhere else.  You don't need to understand them to
understand IDNA; you can think of ToASCII and ToUnicode as abstract
operations.

> > The label IESG--3ba is not an ACE label, even though it begins with
> > the ACE prefix and the Punycode part is valid, because it is not
> > equivalent to any non-ASCII label (because it is not nameprepped;
>
> is namepreparation part of the process. If yes it has to be included
> in the ACE or ToASCII conditions above. If not this restriction
> does not apply as such, it can only be noted. There may be a lot of
> variations in the Unicode scripting the users/developers may want to
> result in the same ACE_label.
>
> > it decodes to a capital A with grave accent).  If you feed it to
> > ToUnicode, it will not be altered, because the comparison in step 7
> > will fail.
>
> You understand that this is as long as DNS does not support &agrave;
> but that other applications may. The definition given above talking
> about "existing applications" - there are a lot of existing
> application supporting it, and at any given time in the future there
> will be more.
>
> So it means that we also want to define the ACE character set.

I don't understand what your concern is here.

IESG--3ba is not an ACE label because it is not equivalent to
any non-ASCII label.  We have defined equivalence.  IESG--3ba is
equivalent to X if and only if ToASCII(IESG--3ba) matches ToASCII(X).
ToASCII(IESG--3ba) is IESG--3ba, because ToASCII does not alter ASCII
labels.  So we're looking for a non-ASCII label X such that ToASCII(X)
is IESG--3ba.  There is no such X.  If you want to understand why there
is no such X, you'll need to examine the internal details of ToASCII,
and then you'll discover that ToASCII performs Nameprep, and Nameprep
can never output a capital A with grave (and to understand why that is
so, you'll need to examine the internals of Nameprep, and then you'll
discover that Nameprep performs case-folding).

By the way, even though there is no ACE label that decodes to a capital
A with grave, that doesn't mean ToASCII can't accept a capital A with
grave as input.  It can, and Nameprep will map it to small a with
grave.  The output of ToASCII will be IESG--0ca, which is an ACE label,
which ToUnicode will transform into small a with grave.  Notice that
ToUnicode(ToASCII(X)) is not always X; for non-ASCII X it is Nameprep(X)
(which is equivalent to X).

Another way to see that IESG--3ba is not an ACE label is to observe
that ToUnicode does not alter it.  If you want to understand why
ToUnicode does not alter it, you'll need to examine the internal details
of ToUnicode, and then you'll discover that ToUnicode decodes it to
capital A with grave, then applies ToASCII, which produces IESG--0ca (as
mentioned above), and then ToUnicode notices that IESG--0ca does not
match IESG--3ba, and so ToUnicode returns the original input.
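
The decode, re-encode, compare sequence just described can be traced
with a few lines of Python.  This is only a sketch under assumptions: it
uses Python's standard-library nameprep() and "punycode" codec (which
follow the published RFCs), the IESG-- prefix from this thread, and a
made-up helper trace_to_unicode that handles only ASCII inputs starting
with the prefix and decoding to a non-ASCII label.

```python
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "IESG--"  # prefix used in this thread's examples (assumption)


def trace_to_unicode(label: str):
    """Return (decoded, re-encoded, match) for a label starting with ACE_PREFIX.

    Assumes the decoded label is non-ASCII, so re-encoding always goes
    through Nameprep and Punycode.
    """
    # Strip the prefix and decode the Punycode part.
    decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    # Re-apply the non-ASCII branch of ToASCII: Nameprep, encode, add prefix.
    reencoded = ACE_PREFIX + nameprep(decoded).encode("punycode").decode("ascii")
    # Compare with the input, ignoring ASCII case.
    return decoded, reencoded, reencoded.lower() == label.lower()


for label in ("IESG--3ba", "IESG--0ca"):
    decoded, reencoded, match = trace_to_unicode(label)
    print(label, [hex(ord(c)) for c in decoded], reencoded, match)
# prints:
# IESG--3ba ['0xc0'] IESG--0ca False
# IESG--0ca ['0xe0'] IESG--0ca True
```

A False in the last column means ToUnicode would return the input
unaltered, so that input is not an ACE label.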

AMC