[idn] upstream and downstream
Regarding the IDN spoofing issue, I'd like to offer the following
analysis, and hope that people will point out anything that's missing.
For IDN names, there are two relevant points in time:
1. registration time
2. lookup time
At each of these times, we start with textual symbols, assign them
Unicode code points, apply nameprep and encode the result as Punycode.
After a lookup,
we might also display the name for the user. There are other details and
various cases to consider, but this is the big picture. These are the
basic stages that a name progresses through.
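
As a rough illustration of these stages, here is a minimal Python sketch
using the standard library's encodings.idna module, which implements the
IDNA 2003 operations. The function name stages() and the sample label
are my own choices for illustration, not part of any spec.

```python
import encodings.idna  # stdlib implementation of IDNA 2003 (nameprep + ToASCII)


def stages(label: str) -> dict:
    """Walk a single label through the stages described above."""
    # Textual symbols -> Unicode code points
    code_points = [f"U+{ord(ch):04X}" for ch in label]
    # nameprep: case folding, NFKC normalization, prohibited-character checks
    prepared = encodings.idna.nameprep(label)
    # ToASCII: nameprep, then Punycode, then the "xn--" prefix
    ace = encodings.idna.ToASCII(label).decode("ascii")
    return {"code_points": code_points, "nameprep": prepared, "ace": ace}


result = stages("Bücher")
print(result["code_points"])
print(result["nameprep"])
print(result["ace"])
```

For the label "Bücher", nameprep lowercases the name and the encoded
form is xn--bcher-kva.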
Now, the IDN spoofing issue is one of look-alikes. We can attack this
look-alike problem by focussing on an early stage, an intermediate stage
or a late stage. The terms often used for stages that progress from one
end to another are upstream and downstream.
If we were to try to solve the look-alike problem all the way upstream,
we would focus on Unicode. This character encoding gives distinct codes
to textual symbols that look similar or are even identical, in both form
and function. Michel even admitted that many of the Latin-looking
letters in the Cyrillic block were added only to keep people from having
to switch scripts mid-string. So these characters are identical in
appearance and
use. Ignoring for the moment whether this is realistic, one way to solve
the look-alike problem is to have Unicode get rid of these look-alikes.
In Unicode parlance, it would then become a glyph encoding, rather than
a character encoding.
Going further downstream, we have nameprep. This is another possible
place to solve the look-alike problem. We could have a new version of
nameprep that maps homographs to base characters. That is, the
look-alikes are folded into a single character. After this kind of
nameprepping, we have characters that all
look different. Again, no statement about whether such a change to the
spec is realistic. This is just an analysis.
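
To make the folding idea concrete, here is a small sketch. The table
HOMOGRAPH_FOLD and the function fold_homographs() are hypothetical; no
such mapping exists in nameprep today, and a real table would be much
larger and much harder to agree on.

```python
# Hypothetical folding table: a few Cyrillic letters mapped to the Latin
# letters they resemble. Illustrative only; not taken from any spec.
HOMOGRAPH_FOLD = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
})


def fold_homographs(label: str) -> str:
    """Map each look-alike to its (assumed) base character."""
    return label.translate(HOMOGRAPH_FOLD)


# Cyrillic er, a, u, er, a followed by a Latin "l"
spoof = "\u0440\u0430\u0443\u0440\u0430l"
print(spoof == "paypal")                   # the raw strings differ
print(fold_homographs(spoof) == "paypal")  # after folding they collide
```

With such a mapping in place, the spoofed name and the genuine one would
end up as the same registered string, so only one party could hold it.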
Downstream some more, we have the rules applied by the registry. The .jp
registry seems to be well aware of the look-alike problem, and they have
strict rules, with a well-defined table of characters to filter the name
through. If the name contains characters outside that table, that name
is rejected.
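
A registry-side filter of this kind might look roughly like the sketch
below. The ranges in ALLOWED_RANGES are a stand-in I made up for
illustration (ASCII letters, digits and hyphen, plus hiragana and
katakana); they are not the actual .jp table.

```python
# Hypothetical permitted-character table, given as inclusive code point
# ranges. This is NOT the real .jp table, only a stand-in.
ALLOWED_RANGES = [
    (0x002D, 0x002D),  # hyphen
    (0x0030, 0x0039),  # digits 0-9
    (0x0061, 0x007A),  # Latin small letters a-z
    (0x3041, 0x3096),  # hiragana letters
    (0x30A1, 0x30FA),  # katakana letters
]


def in_table(ch: str) -> bool:
    """True if the character falls inside one of the permitted ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)


def accept_registration(label: str) -> bool:
    """Reject the name if any character lies outside the table."""
    return len(label) > 0 and all(in_table(ch) for ch in label)


print(accept_registration("\u306b\u307b\u3093"))            # hiragana only
print(accept_registration("\u0440\u0430\u0443\u0440\u0430l"))  # contains Cyrillic
```

The point of the design is that anything not explicitly listed is
refused, rather than anything not explicitly banned being allowed.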
Going all the way downstream at lookup time, we have the string that is
displayed to the user. Various people have been suggesting various ways
to display the look-alikes so that the end-user will notice when
something phishy is going on.
This concludes my analysis. It is rough, of course, but this is intended
to be the big picture, and I hope it's complete. If this big picture is
not complete, please point out what's missing.
Having concluded my analysis, I will now discuss which solutions are
realistic. As far as I'm concerned, Unicode is an immovable object.
There is probably zero chance that anybody could talk them into getting
rid of look-alikes.
So what about nameprep? Should nameprep have chosen Unicode or some
other character encoding? This part of nameprep is probably
unchangeable. I doubt we could get consensus on switching nameprep from
Unicode to some other encoding.
But what about the other parts of nameprep? Would it be possible to add
another kind of mapping to it, namely from homographs to base
characters? This would be a rather large change, and might even require
a new prefix (i.e. something other than xn-- to allow migration). I
don't really know whether this kind of change is realistic.
Next we have the registries. Kudos to .jp and all the others that
carefully avoid look-alikes. But what about .com? As it turns out, in
their current set-up, when VeriSign does not have a table for the
requested language, the registry allows *any* Unicode character to be
registered.
This is particularly egregious. They ought to fix this part, at least.
Is VeriSign an immovable object? I don't know.
However, even if VeriSign were to have tables for all the languages, we
would still have a look-alike problem. Consider a Russian name that
consists entirely of Latin homographs. One solution is to make the
language tag available to the application, possibly via a new DNS
record. This is just an example, but the application could then put an
icon of the Russian flag next to the domain name or URI. An American
user *might* then realize that something phishy is going on. (Or they
might not notice the icon, or ignore it.) The app could put up a dialog
with a warning. It could do that the first time. But this is not a
seamless user experience. Most app vendors aim for seamless.
Or the app could have different colors for different Unicode scripts. Or
different colors for different language tags. They would do something
special for the color-blind.
But, to me, all this just seems like we are foisting the problem on the
end-user. Why oh why should they see any of this?
Or we could use heuristics, as Adam suggested, to try to detect
phishiness. We could simply display the raw Punycode when the name is
determined to be phishy. Maybe it's just me, but this is not very
satisfying.
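
For what it's worth, one such heuristic could be sketched as follows:
show the raw encoded form whenever a label mixes scripts. This is my own
guess at a heuristic, not necessarily what Adam had in mind. Python's
standard library does not expose the Unicode Script property, so the
sketch approximates it with the first word of each character's name;
scripts_in() and display_form() are hypothetical helper names.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Rough script detection via the first word of each character name,
    e.g. "LATIN" or "CYRILLIC". An approximation, not the real Script
    property."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(label: str) -> str:
    """Return the xn-- form for mixed-script labels, else the label itself."""
    if len(scripts_in(label)) > 1:
        return label.encode("idna").decode("ascii")
    return label


mixed = "\u0440\u0430\u0443\u0440\u0430l"   # Cyrillic letters plus a Latin "l"
all_cyrillic = "\u0440\u0430\u0441\u0435"   # Cyrillic er, a, es, ie
print(display_form(mixed))
print(display_form(all_cyrillic))
```

Note that the second label passes straight through: a name made entirely
of Cyrillic homographs is single-script, so this check does not catch
the all-homograph case mentioned earlier.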
Can't we solve the problem upstream?
Erik