Re: [idn] Editorial comments on stringprep



Patrik Fältström <paf@cisco.com> wrote:

> Historically, Unicode Consortium has always stated that mapping to
> lowercase were more consistent and better than mapping to uppercase.
> At least for Unicode before version 2. Don't remember any exact
> reasons, but we can dig it up if you want us to.

and later:

> Anyway, my point is that previously UTC has always recommended
> mapping to lowercase, and people seems to be happy with it.

I couldn't remember UTC ever saying such a thing, so when Mark Davis
<mark@macchiato.com> wrote:

> For most case-insensitive binary comparison, the recommendation is
> to use case folding, as defined in the Standard. For more info, see
> http://www.unicode.org/unicode/reports/tr21/.

naturally I took a look.  The only thing I found that could be construed
as a preference for lowercase was in the CaseFolding details:

> Each equivalence class is completely disjoint from all the others,
> and together they form a partition of the entire Unicode code space.
> From each class, one representative element (a single lowercase
> letter where possible) is chosen to be the common form.
> [CaseFolding] thus contains the mappings from other characters in
> the equivalence characters to their common forms.

This passage describes an *internal process* only.  It has nothing to
do with the Unicode Consortium or Technical Committee preferring
lowercase forms over uppercase for user-visible operations, and it
definitely does *not* claim that mapping to lowercase is more consistent
or "better" than mapping to uppercase.  No explanation is given for why
"a single lowercase letter where possible" was chosen as the
representative.  It appears to be an arbitrary choice.
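
To make the "common form" idea concrete, here is a minimal sketch in
Python.  It assumes that Python's built-in str.casefold() is an
acceptable stand-in for the CaseFolding data; that is my assumption, not
something taken from TR21, and the helper names are purely illustrative.

```python
# Two case-equivalence classes, written as escapes so the members are unambiguous.
K_CLASS = ["K", "k", "\u212a"]                 # K, k, KELVIN SIGN
SIGMA_CLASS = ["\u03a3", "\u03c3", "\u03c2"]   # capital, small, and final sigma


def common_forms(chars):
    """Return the set of case-folded forms of the given characters."""
    return {c.casefold() for c in chars}


def lowercase_forms(chars):
    """Return the set of plain lowercase mappings of the given characters."""
    return {c.lower() for c in chars}


if __name__ == "__main__":
    for name, members in (("K", K_CLASS), ("sigma", SIGMA_CLASS)):
        print(name, "folded:", sorted(common_forms(members)))
        print(name, "lowercased:", sorted(lowercase_forms(members)))
```

Folding collapses each class to a single representative, while plain
lowercasing leaves final sigma distinct from small sigma.  So the folded
form is a comparison key, not the same thing as "mapping to lowercase."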

> There are also a number of codepoints which are lowercase which
> doesn't have uppercase versions.

Which ones?  I can think of a character that looks uppercase but has no
lowercase form (U+04C0 CYRILLIC LETTER PALOCHKA).  But such letters,
despite their appearance, are neither uppercase nor lowercase; they are
caseless, and immune to the effects of any casing operation.

> Last, some codepoints (like the german sharp-s, ß) turns to "SS" in
> uppercase, and my guess is (with my limited knowledge of German,
> only 2 years of studies) that one when comparing don't want that
> similarities.

German speakers are forced to deal with that mapping every day.  It is a
natural part of the language.
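
For anyone who wants to see the mapping in action, the following is a
small sketch in Python, again assuming that the built-in str methods
reflect the Unicode case mappings closely enough for illustration.  The
function name is hypothetical.

```python
# The sharp-s example: compare lowercase, uppercase, and case folding.
WORD = "stra\u00dfe"   # "strasse" spelled with U+00DF LATIN SMALL LETTER SHARP S

lowered = WORD.lower()      # lowercase mapping leaves sharp s alone
uppered = WORD.upper()      # full uppercase mapping expands sharp s to "SS"
folded = WORD.casefold()    # case folding expands sharp s to "ss"


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison by folding both sides (illustrative helper)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(lowered, uppered, folded)
    print(same_ignoring_case(WORD, "STRASSE"))
```

Note that lowercasing alone would not match the sharp-s spelling against
"STRASSE", whereas folding both sides does.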

> And, personally, I rather see bq-asdqwe123 than BQ-ASDQWE the few
> times I hope I see a domain name used in protocols natively in its
> ACE encoding.

No argument there.  All-lowercase is widely recognized as being easier
to read than all-uppercase, primarily because of the greater variation
in letterforms.  But again, there doesn't seem to be any evidence that
the Unicode Consortium has made any of the claimed statements about
"preferring" lowercase or about the mapping to lowercase being more
"consistent."  Please dig up the relevant references, if possible.

-Doug Ewell
 Fullerton, California