
[idn] character tables



However, one avenue that might be worth exploring some more is to check each registry's character table (for those that have one) and see what the Unicode category is for each character. The Japanese Katakana middle dot U+30FB has the category "Pc" which means "punctuation, connector" and LDH's hyphen U+002D has the category "Pd" which means "punctuation, dash".

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
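
As a rough illustration of the lookup described above, here is a minimal Python sketch that reports the General Category of a code point using the standard `unicodedata` module. The helper name `category_of` is made up for this example. Note that the answer depends on the Unicode version bundled with the Python interpreter, so it may not match the value quoted above from the UnicodeData.txt of the time.

```python
import unicodedata


def category_of(code_point: int) -> str:
    """Return the two-letter Unicode General Category for a code point."""
    return unicodedata.category(chr(code_point))


# The two characters mentioned in the message: LDH hyphen and Katakana middle dot.
for cp in (0x002D, 0x30FB):
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}: {category_of(cp)}")
```

`unicodedata.unidata_version` tells you which version of the Unicode data the interpreter is using, which is worth recording next to any results.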

If it turns out that all or most of the registries that have tables are using characters with only a small number of Unicode categories, then we may wish to consider moving IDNA to that set of categories (disallowing all others). This would keep the registries happy while keeping *some* of the phishy characters out of DNS.
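
To make the proposed investigation concrete, here is a small Python sketch, not a definitive tool, that tallies the General Categories used by a registry table and lists the characters falling outside a candidate set of allowed categories. The function names, the sample table, and the candidate category set are all assumptions invented for this example; real input would come from the registries' own tables.

```python
from collections import Counter
import unicodedata


def category_histogram(code_points):
    """Count how many characters of the table fall into each General Category."""
    return Counter(unicodedata.category(chr(cp)) for cp in code_points)


def outside_allowed(code_points, allowed_categories):
    """Return the sorted code points whose category is not in the allowed set."""
    return sorted(
        cp for cp in set(code_points)
        if unicodedata.category(chr(cp)) not in allowed_categories
    )


# Hypothetical registry table: hyphen, two Latin letters, a digit, a Hiragana letter.
sample_table = [0x002D, 0x0061, 0x0062, 0x0031, 0x3042]

# Hypothetical candidate set of categories to permit.
candidate_categories = {"Ll", "Lo", "Nd"}

print(category_histogram(sample_table))
print([f"U+{cp:04X}" for cp in outside_allowed(sample_table, candidate_categories)])
```

Running the histogram over every registry's table and taking the union of the categories that appear would show whether the set is in fact small.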

Even if we do not end up prohibiting a larger number of characters in nameprep-bis, it might still be a good idea to have the results of the investigation proposed above, since these Unicode character categories could then be entered into the guidelines for the registries.


So, these two sub-projects (nameprep-bis and registry table investigation) could proceed in parallel. I think it would be good to divide and conquer, since one person cannot do all of this. Perhaps we could invite volunteers to work on sub-projects?

As I indicate at nameprep.org, I found some character tables at the IANA site, but I found even more at the GNU libidn site. The tables do not all use the same format yet, it seems, so one of the first things to do is to agree on a single machine-readable format. We would then also need the latest and most official tables from the registries themselves (instead of possibly out-of-date IANA tables and possibly embellished unofficial GNU libidn tables).

http://nameprep.org/#related-work
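
For what it is worth, normalizing the tables could look something like the Python sketch below. The input notations it accepts (`U+30FB`, `0x30FB`, bare hex, ranges written with `..` or `-`, and `#` comments) and the one-code-point-per-line output are assumptions made up for illustration; they are not the actual IANA or GNU libidn formats, which would have to be checked first.

```python
import re

# One entry per line: a hex code point, optionally followed by a range end.
# Assumed notations only; real registry tables may differ.
_ENTRY = re.compile(
    r"^(?:U\+|0x)?([0-9A-Fa-f]{4,6})"
    r"(?:\s*(?:\.\.|-)\s*(?:U\+|0x)?([0-9A-Fa-f]{4,6}))?$"
)


def parse_table(text):
    """Parse a table in one of the assumed notations into a set of code points."""
    code_points = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"unrecognized table line: {raw!r}")
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        if end < start:
            raise ValueError(f"descending range: {raw!r}")
        code_points.update(range(start, end + 1))
    return code_points


def to_canonical(code_points):
    """Write code points one per line as U+XXXX, sorted, as a common format."""
    return "\n".join(f"U+{cp:04X}" for cp in sorted(code_points))


print(to_canonical(parse_table("U+002D\n0x0061..0x0063  # letters\n30FB\n")))
```

Once every table is in one format, comparing the IANA, libidn, and registry-supplied versions reduces to set differences.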

Erik