
Re: [idn] homograph attacks



Hi Michel,

I don't think we are so far off. My concern is that many people are abusing the term 'language' for these tables. I am just saying that creating an exclusive subset of Latin characters in a European context is not necessarily a bad idea, but it will result in future problems because people will always discover that a few characters are missing from the subset.


I do not think that a few characters being missing from a particular version of a language table is that big a deal. After all, characters are still being added to Unicode. If the registry finds that there is a real demand for the missing characters, it can decide to revise the table with those characters included with little effort (as compared to shrinking a table.)

It is reasonably easy for .de to establish a table as they did, and again that is ok. It is much more challenging for a worldwide TLD such as .com to establish registration rules.

Agreed. However, I think it is worthwhile for gTLDs and relevant language experts to work together on language tables that are appropriate for use in a gTLD context and can be shared by other registries. In some cases it may not be so difficult, since ccTLDs that share a common language often work together when establishing the set of characters to be allowed in their own IDN rollouts (e.g. .at/.ch/.de for German and .ca/.ch/.fr for French).

Typically script is a much better selector than language to establish those tables and associated rules.


A script is a very convenient selector for publishing tables and in other applications such as localization. I don't agree that it applies "typically", although in some cases it does work quite well (.museum and .pl are fine examples.)

Even among the 92 characters that DENIC allows, all of which are from the Latin script, there is a visual resemblance between U+00D0 (the capital version of U+00F0) and U+0110 (the capital version of U+0111), as Roozbeh pointed out at the last ICANN IDN workshop. This would be an intra-script conflict (though the characters belong to different code blocks), and it is arguably less urgent than the IDN-posing-as-ASCII-domain attack. Still, it illustrates the importance of restricting the characters to the smallest practicable subset in IDN implementations. For a gTLD, the language tag comes in handy: we could apply different language tables to IDNs according to their intended language. Using a conservative subset of characters for each language table is part of the deal, in order to reduce the possibilities for phishing attacks.
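
To make the language-tag idea concrete, here is a rough Python sketch of a per-language check. It is only an illustration: the table contents, the tags "de" and "is", and the function names are made up for this example and are not DENIC's or any other registry's actual table.

```python
import unicodedata

# Basic letters, digits and hyphen shared by every table in this sketch.
BASE = set("abcdefghijklmnopqrstuvwxyz0123456789-")

# Hypothetical, deliberately small language tables keyed by language tag.
# Real registry tables are published by each registry and will differ.
LANGUAGE_TABLES = {
    "de": BASE | set("\u00e4\u00f6\u00fc"),          # a-umlaut, o-umlaut, u-umlaut
    "is": BASE | set("\u00e1\u00f0\u00e9\u00ed\u00f3"
                     "\u00fa\u00fd\u00fe\u00e6\u00f6"),  # includes U+00F0 (eth)
}


def rejected_characters(label, lang):
    """Return the characters of `label` that are not in the table for `lang`."""
    table = LANGUAGE_TABLES[lang]
    # Compare in lowercase, so U+00D0 is checked as U+00F0
    # and U+0110 is checked as U+0111.
    return [ch for ch in label.lower() if ch not in table]


def is_label_allowed(label, lang):
    """A label is acceptable only if every character is in the language table."""
    return not rejected_characters(label, lang)


if __name__ == "__main__":
    # U+00D0 and U+0110 look alike, but only the eth is in the "is" table.
    for ch in ("\u00d0", "\u0110"):
        print("U+%04X" % ord(ch), unicodedata.name(ch),
              "allowed in 'is':", is_label_allowed(ch, "is"))
```

With tables like these, a label containing U+0110 or U+0111 is simply refused when the registrant tags it as "is", so the look-alike pair cannot both occur under the same language tag. The smaller each table, the fewer such pairs remain to worry about.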

Best regards,
wil.