
[idn] upstream and downstream



Regarding the IDN spoofing issue, I'd like to offer the following analysis, and hope that people will point out anything that's missing.

For IDN names, there are two distinct times:

1. registration time
2. lookup time

At each of these times, we start with textual symbols, assign them code points (Unicode), perform nameprep, and encode into Punycode. After a lookup, we might also display the name to the user. There are other details and various cases to consider, but this is the big picture. These are the basic stages that a name progresses through.
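These stages can be sketched with Python's built-in "idna" codec, which implements IDNA 2003 (nameprep followed by Punycode encoding). The label below is just an illustration:

```python
# Registration/lookup pipeline sketch using Python's built-in "idna" codec,
# which applies nameprep and then Punycode with the xn-- prefix.

label = "Bücher"  # textual symbols, as typed by the user

# nameprep (case folding etc.) and Punycode encoding both happen here
ascii_label = label.encode("idna")
print(ascii_label)  # b'xn--bcher-kva'

# At lookup/display time, the reverse mapping recovers the Unicode form
print(ascii_label.decode("idna"))  # 'bücher' (nameprep lowercased it)
```

Note that the round trip is lossy in one direction: nameprep folds the uppercase "B" away, which is exactly the kind of many-to-one mapping discussed below.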

Now, the IDN spoofing issue is one of look-alikes. We can attack this look-alike problem by focusing on an early stage, an intermediate stage or a late stage. The terms often used for stages that progress from one end to another are upstream and downstream.

If we were to try to solve the look-alike problem all the way upstream, we would focus on Unicode. This character encoding gives distinct codes to textual symbols that look similar or are even identical, in both form and function. Michel even admitted that many of the Latin letters in the Cyrillic block were added only to keep people from having to switch scripts midstring. So these characters are identical in appearance and use. Ignoring for the moment whether this is realistic, one way to solve the look-alike problem is to have Unicode get rid of these look-alikes. In Unicode parlance, it would then become a glyph encoding, rather than a character encoding.
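The Latin/Cyrillic case makes this concrete: the two letters render identically in most fonts, yet Unicode treats them as entirely distinct characters.

```python
import unicodedata

# Latin 'a' and Cyrillic 'а' look identical in most fonts,
# yet Unicode assigns them distinct code points.
latin = "a"          # U+0061
cyrillic = "\u0430"  # U+0430

print(unicodedata.name(latin))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A
print(latin == cyrillic)           # False: different characters to software
```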

Going further downstream, we have nameprep. This is another possible place to solve the look-alike problem. We could have a new version of nameprep that maps homographs to base characters. I.e. the look-alikes are folded. After this kind of nameprepping, we have characters that all look different. Again, no statement about whether such a change to the spec is realistic. This is just an analysis.
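Such a folding step is not part of the actual nameprep spec; the table and function below are purely a hypothetical sketch of what it could look like:

```python
# Hypothetical homograph-folding step (NOT part of the real nameprep spec):
# map look-alike code points to a single base character before encoding.
HOMOGRAPH_FOLD = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A  -> LATIN SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O  -> LATIN SMALL LETTER O
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
}

def fold_homographs(label: str) -> str:
    return "".join(HOMOGRAPH_FOLD.get(ch, ch) for ch in label)

# After folding, the look-alike spelling collides with the original,
# so the two names can no longer be registered as distinct:
print(fold_homographs("p\u0430ypal"))  # 'paypal'
```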

Downstream some more, we have the rules applied by the registry. The .jp registry seems to be well aware of the look-alike problem, and they have strict rules, with a well-defined table of characters to filter the name through. If the name contains characters outside that table, that name is rejected.
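In code, this kind of registry rule is just a membership test against a per-language table. The table below is illustrative, not the actual .jp table:

```python
# Sketch of a registry-style character table check. The allowed set here
# is illustrative only (an ASCII-ish table, not any real registry's).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def registry_accepts(label: str) -> bool:
    # Reject any label containing a character outside the table.
    return all(ch in ALLOWED for ch in label)

print(registry_accepts("example"))       # True
print(registry_accepts("ex\u0430mple"))  # False: Cyrillic 'а' not in table
```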

Going all the way downstream at lookup time, we have the string that is displayed to the user. Various people have been suggesting various ways to display the look-alikes so that the end-user will notice when something phishy is going on.

This concludes my analysis. It is rough, of course, but this is intended to be the big picture, and I hope it's complete. If this big picture is not complete, please point out what's missing.

Having concluded my analysis, I will now discuss which solutions are realistic. As far as I'm concerned, Unicode is an immovable object. There is probably zero chance that anybody could talk them into getting rid of look-alikes.

So what about nameprep? Should nameprep have chosen Unicode or some other character encoding? This part of nameprep is probably unchangeable. I doubt we could get consensus on switching nameprep from Unicode to some other encoding.

But what about the other parts of nameprep? Would it be possible to add another kind of mapping to it, namely from homographs to base characters? This would be a rather large change, and might even require a new prefix (i.e. something other than xn-- to allow migration). I don't really know whether this kind of change is realistic.

Next we have the registries. Kudos to .jp and all the others that carefully avoid look-alikes. But what about .com? As it turns out, in their current set-up, when VeriSign does not have a table for the requested language, the registry allows *any* Unicode to be registered. This is particularly egregious. They ought to fix this part, at least. Is VeriSign an immovable object? I don't know.

However, even if VeriSign were to have tables for all the languages, we would still have a look-alike problem. Consider a Russian name that consists entirely of Latin homographs. One solution is to make the language tag available to the application, possibly via a new DNS record. This is just an example, but the application could then put an icon of the Russian flag next to the domain name or URI. An American user *might* then realize that something phishy is going on. (Or they might not notice the icon, or ignore it.) The app could put up a dialog with a warning. It could do that the first time. But this is not a seamless user experience. Most app vendors aim for seamless.

Or the app could have different colors for different Unicode scripts. Or different colors for different language tags. They would do something special for the color-blind.

But, to me, all this just seems like we are foisting the problem on the end-user. Why oh why should they see any of this?

Or we could use heuristics, as Adam suggested, to try to detect phishiness. We could simply display the raw Punycode when the name is determined to be phishy. Maybe it's just me, but this is not very satisfying.
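One such heuristic (my own illustration, not necessarily Adam's exact proposal) is to flag labels that mix scripts and fall back to showing the raw Punycode for anything flagged:

```python
import unicodedata

def script_of(ch: str) -> str:
    # Crude script detection via the first word of the Unicode name
    # (e.g. "LATIN SMALL LETTER A" -> "LATIN").
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def display_form(label: str) -> str:
    scripts = {script_of(ch) for ch in label if ch.isalpha()}
    if len(scripts) > 1:
        # Mixed scripts look phishy: show the raw Punycode instead.
        return label.encode("idna").decode("ascii")
    return label

print(display_form("example"))       # 'example'
print(display_form("ex\u0430mple"))  # the xn-- Punycode form is shown
```

A real heuristic would need to be subtler (whole-script confusables like the all-Cyrillic case above would slip through this one), which is part of why the approach feels unsatisfying.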

Can't we solve the problem upstream?

Erik