
[idn] Legacy charset conversion in draft-ietf-idn-idna-08.txt



From section 4 "Conversion operations" of draft-ietf-idn-idna-08.txt
(also discussed in section 6.7):

,----
| An application converts a domain name put into an IDN-unaware slot or
| displayed to a user. This section specifies the steps to perform in the
| conversion, and the ToASCII and ToUnicode operations.
| 
| The input to ToASCII or ToUnicode is a single label that is a sequence
| of Unicode code points (remember that all ASCII code points are also
| Unicode code points). If a domain name is represented using a character
| set other than Unicode or US-ASCII, it will first need to be transcoded
| to Unicode.
`----

This last sentence seems to sweep a practical problem under the rug.
Most systems aren't Unicode based today, so in fact most systems will
have to implement this unspecified transcoding.  The Unicode
Consortium has not specified how to transform Unicode to/from legacy
encodings.  There are some unofficial mappings for ISO 8859-1 charsets
at www.unicode.org/Public/MAPPINGS/, but even unofficial mappings for
other charsets (in particular CJK) are not present.

Real-world scenario: My machine uses ISO-8859-1.  I enter 0xB5.  How
is this transcoded into Unicode?  As U+00B5 or as U+03BC?  There are
many similar examples.
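
To make the 0xB5 question concrete, here is a minimal sketch using
Python's bundled codecs.  The helper name transcode_byte is my own
invention for illustration, and the codec tables are just one
implementation's choice, not anything the draft specifies.  It shows
that one common converter maps ISO-8859-1 0xB5 to U+00B5, while the
Greek mu that a user might have meant is U+03BC, which is what
ISO-8859-7 byte 0xEC decodes to.

```python
import unicodedata


def transcode_byte(value: int, charset: str) -> str:
    """Hypothetical helper: decode a single legacy byte with a named codec."""
    return bytes([value]).decode(charset)


# The byte from the scenario above, decoded with Python's ISO-8859-1 table.
micro = transcode_byte(0xB5, "iso-8859-1")

# The Greek small letter mu, as it is encoded in ISO-8859-7 (byte 0xEC).
greek_mu = transcode_byte(0xEC, "iso-8859-7")

for ch in (micro, greek_mu):
    # Print the code point and its Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two results look alike on screen but are different code points.
print("same code point:", micro == greek_mu)
```

A different converter is free to pick the other code point, which is
exactly the gap left open by "it will first need to be transcoded".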

I think the third paragraph of the security considerations should
state more clearly that IDNA is in fact vulnerable to the attack when
machines, like most machines on the Internet, use legacy encodings.

Some high-level insight on the problem:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv