
Re: [idn] Re: IDN WG Last Call on two major changes to Stringprep

To: <paul.hoffman@imc.org>,<Marc.Blanchet@viagenie.qc.ca>,"Simon Josefsson" <jas@extundo.com>
Subject: Re: [idn] Re: IDN WG Last Call on two major changes to Stringprep
From: "Mark Davis" <mark@macchiato.com>
Date: Fri, 26 Jul 2002 21:17:47 -0700
Cc: "IETF/IDN WG" <idn@ops.ietf.org>,"Matitiahu Allouche" <matial@il.ibm.com>
References: <012501c2258d$44c78ca0$7400a8c0@JAMESSONYVAIO> <ilur8hpn91v.fsf@latte.josefsson.org>

There are two issues: (a) rationale, and (b) NFKC interaction.

For the rationale, I include an old email below. It proposed a
somewhat more complicated solution than eventually appeared in the
text; I think what is in StringPrep is better, if only because it is
simpler.

For the NFKC interaction, you bring up a good point. If the goal is to
always have a consistent appearance (both on keyboard entry, and when
displaying a name fetched from the server), the conditions on the
string should hold both before and after StringPrep. In practice, this
only has an effect in a few isolated cases, since there are very few
characters whose BIDI class changes under normalization.
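
To make the NFKC point concrete, here is a small Python sketch (mine,
not part of StringPrep) that reports whether a string's sequence of
BIDI classes differs before and after NFKC. The function names are
made up for illustration, and the answers depend on the Unicode
version shipped with your Python, which is newer than the one
StringPrep is based on.

```python
import unicodedata

def bidi_classes(s):
    """Return the list of Unicode BIDI classes, one per character."""
    return [unicodedata.bidirectional(ch) for ch in s]

def bidi_changes_under_nfkc(s):
    """True if NFKC alters the sequence of BIDI classes of s."""
    normalized = unicodedata.normalize("NFKC", s)
    return bidi_classes(s) != bidi_classes(normalized)

# U+2135 ALEF SYMBOL has class L, but NFKC maps it to U+05D0 (class R).
print(bidi_changes_under_nfkc("\u2135"))
# Plain Hebrew and plain ASCII letters are unaffected.
print(bidi_changes_under_nfkc("\u05d0"), bidi_changes_under_nfkc("abc"))
```

U+2135, the case raised in the quoted message, is one of the few
characters for which this returns True.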

Mark

----- Original Message -----
From: "Mark Davis" <mark@macchiato.com>
To: <ietf-bidi@imc.org>; "Paul Hoffman / IMC" <phoffman@imc.org>
Sent: Saturday, September 15, 2001 19:03
Subject: Re: First attempt at problem statement


BIDI IDN

I will try to recap some of the discussions in the ad hoc on BIDI IDN
last week.

The BIDI algorithm is designed to deal with normal text. Within any
string, sequences of LTR (left to right) and RTL characters will
always appear in the correct order. However, the order of other
characters (such as a period) will depend on their context. For
details, see http://www.unicode.org/reports/tr9/.

URLs are not normal text, and thus may have odd display. This is
complicated by the fact that the overall paragraph direction has an
effect on the display. Whether the URL is displayed in a RTL or LTR
context will change the order of the components. [In this and other
examples we will use the convention that uppercase letters stand for
right-to-left characters: Arabic, Hebrew, etc.] Example:

Memory:         http://SOME.LARGE.mixed.CORP.org
Display (LTR):  http://EGRAL.EMOS.mixed.PROC.org
Display (RTL):  org.PROC.mixed.EGRAL.EMOS//:http

Notice that "SOME.LARGE" always appears in RTL order: the period
adopts the direction of the surrounding characters. Characters on
boundaries (such as "//:") take on the overall display direction.

This can lead to problems. For example:

(1) characters in different fields may mix across fields:

Memory:         http://SOME.veryLARGE.CORP.org
Display (LTR):  http://EMOS.veryPROC.EGRAL.org
Display (RTL):  org.PROC.EGRALvery.EMOS//:http


(2) two different sequences of characters in the same field can have
the same order when displayed. Thus a user would not know how to type
a URL that he sees printed.

Memory1:        http://123CORP.org
Memory2:        http://CORP123.org
Display (LTR):  http://123PROC.org


The following are proposed requirements.

1. Consistent fields.

(a) If you have fields such as http://XXX.YYY.ZZZ, characters from the
same field should not be displayed in different fields and vice versa.
(b) It should be possible to deduce the order of the backing-store
fields from the order of the display fields.

2. Order within fields.

Within a field, it should be possible to deduce the order of the
backing-store characters from the order of the display characters.

3. No algorithm change

This should require no change to BIDI algorithm. Using a separate
algorithm for display of URLs would be difficult, since they are found
within flowing text. Getting people to update the BIDI algorithm would
also be quite difficult (and changes to make URLs work might have
repercussions on other text).

4. Simplicity

Whatever solution we have should have a simple algorithm (according to
Paul).

Ideally, some reasonable restrictions on the contents of a field would
meet all of these requirements.


 * * *

During the meeting, I thought that the most straightforward method was
to force the periods to be LTR. However, after looking at the results
in both a RTL and LTR context, I concluded that Mati's approach would
be better. The results both in terms of field order and order within a
field would still be determinate. With any complete URL (with http://,
ftp://, etc.) it would be easy to recognize the order in context,
since the position of those initial letters would show the order. The
only bad case would be where the end of the string reversed because of
its surroundings, e.g.:

Memory:         the url http://SOME.LARGE.mixed.CORP.org is the one
Display (LTR):  the url http://EGRAL.EMOS.mixed.PROC.org is the one
Display (RTL):  org is the one.PROC.mixed.EGRAL.EMOS//: the url http

Thus in flowing text, users would be advised to bracket any BIDI
URL with RLM (or embed the URL), e.g.:

Memory:         the url <RLM>http://SOME.LARGE.mixed.CORP.org<RLM> is the one
Display (LTR):  the url http://EGRAL.EMOS.mixed.PROC.org is the one
Display (RTL):  is the one org.PROC.mixed.EGRAL.EMOS//:http the url

Software, such as email clients, that recognizes URLs could do this
automatically.
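
As a rough sketch of what such software might do, the following
Python brackets anything that looks like a URL with RLM (U+200F). The
URL pattern and the function name are placeholders of mine; a real
mail client would use its own URL recognizer and might embed the URL
instead.

```python
import re

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Deliberately crude URL pattern, for illustration only.
URL_RE = re.compile(r"(?:https?|ftp)://\S+")

def bracket_urls_with_rlm(text):
    """Insert an RLM immediately before and after each URL-like run."""
    return URL_RE.sub(lambda m: RLM + m.group(0) + RLM, text)

marked = bracket_urls_with_rlm("the url http://a.example.org is the one")
print(marked.encode("unicode_escape").decode("ascii"))
```

Text without a URL passes through unchanged.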


Here is what I have now for an explicit algorithm to be added as a
step to NamePrep, after the regular Prohibition step.

A. Characters are classified into LTR, RTL, DIG, OTH.

These categories are drawn from the BIDI algorithm. The precise lists
of characters in each category would be added to NamePrep as an
appendix. The composition is as follows (See
http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types).

LTR   := L ; # including LRM

RTL   := R | AL ;

DIG   := EN | AN ;

OTH := all other characters: NSM, ON, etc.

Note: The characters in categories LRM, RLM, LRO, RLO, LRE, RLE, PDF,
B, S, and some other BIDI categories are prohibited anyway.
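
For illustration, the classification in (A) could be sketched in
Python as below. This looks up the BIDI class through the standard
unicodedata module instead of the fixed appendix tables that NamePrep
would carry, so results follow whatever Unicode version your Python
ships. The function name is my own.

```python
import unicodedata

def bidi_category(ch):
    """Map one character to LTR, RTL, DIG, or OTH by its BIDI class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    if bidi in ("EN", "AN"):
        return "DIG"
    return "OTH"  # NSM, ON, CS, ES, and everything else

for ch in ("a", "\u05d0", "\u0627", "7", "\u0661", "."):
    print(f"U+{ord(ch):04X}", bidi_category(ch))
```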


B. In any field that contains any RTL characters:
B0. no LTR characters can occur.
B1. a sequence of characters of type DIG can only occur at the end.
B2. a sequence of characters of type OTH can occur only between
characters of type RTL.
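
One compact way to read the three conditions in (B) is as a pattern
over the sequence of character types. The Python sketch below assumes
each character has already been reduced to one letter (L, R, D, or O
for LTR, RTL, DIG, OTH); that encoding and the names are mine. A field
with no RTL character is not constrained by (B) at all.

```python
import re

# RTL runs separated by OTH runs, then an optional trailing DIG run.
RTL_FIELD = re.compile(r"R(?:O*R)*D*")

def satisfies_rule_b(types):
    """types: string over 'L', 'R', 'D', 'O', one letter per character."""
    if "R" not in types:
        return True  # (B) only applies to fields containing RTL
    return RTL_FIELD.fullmatch(types) is not None

for sample in ("RORRD", "LLDO", "RL", "RDR", "RO", "OR"):
    print(sample, satisfies_rule_b(sample))
```

The pattern starts with R because neither a digit nor an OTH
character can legally come first once the field contains RTL.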


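The following is an example of an algorithm that implements (B). EOS
stands for "end of string". The states are B, L, R, O, and D; the
table entries T (true) and F (false) are terminal.

1. Let S be B (the begin state)
2. Get the next character and its type (LTR, RTL, DIG, or OTH); if
   there are no more characters, the type is EOS
3. Let S be Map[S, type]
4. If S = T or F, exit with OK or FAIL respectively
5. Goto Step 2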

Map is defined by the following table:

  T  LTR RTL DIG OTH EOS
S +---------------------
B |   L,  R,  L,  L,  F   // begin
L |   L,  F,  L,  L,  T   // left
R |   F,  R,  D,  O,  T   // right
O |   F,  R,  F,  O,  F   // right + other
D |   F,  F,  D,  F,  T   // right + digit


Mark
__________
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "Simon Josefsson" <jas@extundo.com>
To: <paul.hoffman@imc.org>; <Marc.Blanchet@viagenie.qc.ca>
Cc: "IETF/IDN WG" <idn@ops.ietf.org>
Sent: Friday, July 26, 2002 20:28
Subject: [idn] Re: IDN WG Last Call on two major changes to Stringprep


> Quoting the draft:
>
> ,----
> | In any profile that specifies bidirectional character handling, all
> | three of the following requirements MUST be met:
> ...
> | 2) If a string contains any Right-to-Left character (defined as
> | belonging to Unicode bidirectional categories "R" and "AL"), the string
> | MUST NOT contain any Left-to-Right character (defined as belonging to
> | Unicode bidirectional category "L").
> |
> | 3)  If a string contains any Right-to-Left character (as defined above),
> | a Right-to-Left character MUST be the first character of the string, and
> | a Right-to-Left character MUST be the last character of the string.
> `----
>
> There is little rationale for the last two requirements.  Without
> knowing the rationale, it is difficult to understand how to implement
> this, not to speak of understanding and evaluating the specification.
>
> It is not difficult to construct various strings that violate these
> requirements, but seem like valid identifiers to me (e.g., U+05D0
> U+0966, contemplate it being written by a mathematically inclined
> writer in India).  Why is U+05D0 a R/AL character but U+2135 not?
> U+2135 is NFKC'd into U+05D0.  It thus seems like the identifier is a
> valid IDN if NFKC is not used, but if NFKC is used, it is not a valid
> identifier.  A bidi user thus seems to require NFKC not to be used in
> order to have the bidi string accepted.

