[idn] Unicode security issues (fwd)

To: idn@ops.ietf.org
Subject: [idn] Unicode security issues (fwd)
From: Bill Manning <bmanning@ISI.EDU>
Date: Fri, 25 Aug 2000 13:10:21 -0700 (PDT)
Delivery-date: Fri, 25 Aug 2000 13:10:47 -0700
Envelope-to: idn-data@psg.com
 A break from our discussions of registrars' experimentation with 
 one form or another of multilingual support.

 It seems that use of Unicode itself will inject new integrity/security
 concerns if/as it is deployed in the DNS.

 From the Intrusion Detection wg...  

% Borrowed (with permission) from Bruce Schneier's Crypto-Gram newsletter,
% and relevant to our proposed use of UTF-8.
% 
% Stuart.
% 
% From: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
% Subject: Re: Security Risks of Unicode
% 
%  > I don't know if anyone has considered the security implications of this.
% [...]
%  > - Somebody uses UTF-8 or UTF-16 to encode a conventional character in a
%  > novel way to bypass validation checks?
% 
% Thanks for reminding your readers about the security issues surrounding the 
% UTF-8 encoding of Unicode and ISO 10646 (UCS).
% 
% For some time, this and related issues have been of considerable concern to 
% us folks on the linux-utf8 at nl.linux.org mailing list, who try to guide 
% and accelerate the eventually inevitable migration of the Unix world from 
% ASCII and ISO 8859 to UTF-8 (a migration that the Plan 9 operating system 
% demonstrated successfully almost a decade ago). New UTF-8 decoders 
% deployed in, for instance, GNU glibc 2.2, XFree86 4.0 xterm, and various 
% other standard tools have been carefully designed to reject so-called 
% overlong UTF-8 sequences as malformed, in order to prevent attackers 
% from abusing these decoders to bypass critical ASCII substring tests 
% that are applied earlier in the processing pipeline.
% 
% It is still very unfortunate that even the latest Unicode 3.0 standard 
% (ISBN 0-201-61633-5) contains at the end of section 3.8 on page 47 the 
% following paragraph: "When converting from UTF-8 to a Unicode scalar value, 
% implementations do not need to check that the shortest encoding is being 
% used. This simplifies the conversion algorithm."
% 
% This paragraph encourages the fielding of sloppy and dangerous UTF-8 
% decoders that will for example convert all of the following five UTF-8 
% sequences into a U+000A line-feed control character:
% 
%    0xc0 0x8A
%    0xe0 0x80 0x8A
%    0xf0 0x80 0x80 0x8A
%    0xf8 0x80 0x80 0x80 0x8A
%    0xfc 0x80 0x80 0x80 0x80 0x8A
% 
% A "safe UTF-8 decoder" should reject them just like malformed sequences for 
% two reasons: (1) It helps to debug applications if overlong sequences are 
% not treated as valid representations of characters, because this helps to 
% spot problems more quickly. (2) Overlong sequences provide alternative 
% representations of characters that could be used maliciously to bypass 
% prior ASCII filters. For instance, a 2-byte encoded line feed (LF) would 
% not be caught by a line counter that counts only 0x0A bytes, but it would 
% still be processed as a line feed by an unsafe UTF-8 decoder later in the 
% pipeline.
% 
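The line-feed bypass described above can be reproduced with a small sketch. The function name `lenient_utf8_decode` and the sample `payload` are illustrative assumptions, not code from any of the decoders mentioned. The decoder below deliberately omits the shortest-form check, in the spirit of the Unicode 3.0 paragraph quoted above, and accepts the old 5- and 6-byte forms.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Decode UTF-8 WITHOUT a shortest-form check (deliberately unsafe)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Work out the payload bits of the lead byte and the number of
        # continuation bytes that must follow it.
        if b < 0x80:
            cp, extra = b, 0
        elif b >> 5 == 0b110:
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:
            cp, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:
            cp, extra = b & 0x07, 3
        elif b >> 2 == 0b111110:
            cp, extra = b & 0x03, 4
        elif b >> 1 == 0b1111110:
            cp, extra = b & 0x01, 5
        else:
            raise ValueError(f"invalid lead byte 0x{b:02x}")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any(t >> 6 != 0b10 for t in tail):
            raise ValueError("truncated or malformed sequence")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(chr(cp))  # no check that a shorter encoding exists
        i += 1 + extra
    return "".join(out)


# Hypothetical input: a 2-byte overlong line feed hidden between two words.
payload = b"line one\xc0\x8aline two"

# An earlier filter that counts raw 0x0A bytes sees a single line ...
raw_line_feeds = payload.count(b"\n")

# ... but the lenient decoder later in the pipeline produces a real U+000A.
decoded_line_feeds = lenient_utf8_decode(payload).count("\n")
```

Here `raw_line_feeds` is 0 while `decoded_line_feeds` is 1, which is exactly the disagreement between filter and decoder that the letter warns about. Python's built-in decoder, by contrast, rejects this payload with a `UnicodeDecodeError`.
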
% UTF-8 is known to be ASCII compatible, because every existing ASCII file is 
% already a correct UTF-8 file and non-ASCII characters do not introduce 
% additional occurrences of ASCII bytes. But from a security point of view, 
% ASCII compatibility of UTF-8 sequences must also mean that ASCII characters 
% are *only* allowed to be represented by ASCII bytes in the range 0x00-0x7F 
% and not by any other byte combination. To ensure this often neglected 
% aspect of ASCII compatibility, use only "safe UTF-8 decoders" that reject 
% overlong UTF-8 sequences for which a shorter encoding exists, for example 
% by substituting it with the U+FFFD replacement character.
% 
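As one illustration of the "substitute U+FFFD" policy, Python's built-in UTF-8 decoder already treats overlong forms as malformed, and its `errors="replace"` mode substitutes the replacement character. The wrapper name `safe_decode` and the list `OVERLONG_LF` are assumptions made for this sketch; other languages and libraries may behave differently.

```python
# The five overlong encodings of U+000A listed in the letter.
OVERLONG_LF = [
    bytes([0xC0, 0x8A]),
    bytes([0xE0, 0x80, 0x8A]),
    bytes([0xF0, 0x80, 0x80, 0x8A]),
    bytes([0xF8, 0x80, 0x80, 0x80, 0x8A]),
    bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x8A]),
]


def safe_decode(data: bytes) -> str:
    """Decode UTF-8, replacing malformed input (overlong forms included)
    with U+FFFD instead of raising an exception."""
    return data.decode("utf-8", errors="replace")


# None of the overlong forms yields a line feed after decoding.
results = [safe_decode(seq) for seq in OVERLONG_LF]
```

Every entry in `results` consists only of U+FFFD characters, so a later stage never sees a line feed that the earlier ASCII filter missed. Well-formed input such as `b"abc\n"` passes through unchanged.
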
% It is not true that the check for overlong UTF-8 sequences would add any 
% significant speed penalty or complexity to the UTF-8 decoder, as for 
% example my implementation of the decoder found in the XFree86 4.0 xterm 
% version illustrates. The key to understanding how to implement a safe UTF-8 
% decoder both simply and efficiently lies in realizing that a UTF-8 
% sequence is overlong if and only if it contains one of the following 
% one- or two-byte bit patterns:
% 
%    1100000x (10xxxxxx)
%    11100000 100xxxxx (10xxxxxx)
%    11110000 1000xxxx (10xxxxxx 10xxxxxx)
%    11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
%    11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
% 
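The five bit patterns translate directly into mask-and-compare tests on the first one or two bytes of a sequence. The following is a minimal sketch, not the xterm implementation; the names `OVERLONG_PATTERNS` and `starts_overlong` are assumptions for illustration.

```python
# Each entry is (mask1, value1, mask2, value2): the sequence is overlong
# when (first & mask1) == value1 and (second & mask2) == value2.
OVERLONG_PATTERNS = [
    (0xFE, 0xC0, 0x00, 0x00),  # 1100000x            (second byte irrelevant)
    (0xFF, 0xE0, 0xE0, 0x80),  # 11100000 100xxxxx
    (0xFF, 0xF0, 0xF0, 0x80),  # 11110000 1000xxxx
    (0xFF, 0xF8, 0xF8, 0x80),  # 11111000 10000xxx
    (0xFF, 0xFC, 0xFC, 0x80),  # 11111100 100000xx
]


def starts_overlong(seq: bytes) -> bool:
    """Return True if the first one or two bytes of a single UTF-8
    sequence match one of the overlong bit patterns."""
    if not seq:
        return False
    first = seq[0]
    # A missing second byte is treated as 0, which matches no 10xxxxxx test.
    second = seq[1] if len(seq) > 1 else 0
    return any(
        (first & m1) == v1 and (second & m2) == v2
        for m1, v1, m2, v2 in OVERLONG_PATTERNS
    )
```

The check needs at most five comparisons per sequence and no extra state, which is consistent with the claim that it adds no significant cost. For example, `starts_overlong(b"\xc0\x8a")` is true, while the shortest-form encoding of U+0800, `b"\xe0\xa0\x80"`, is not flagged.
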
% A UTF-8 decoder robustness test file that lets developers quickly check 
% a UTF-8 decoder for safety is available at
% 
%    <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
% 
% For instance, major Web browsers still fail the test in section 4.1.1.
% 
% More information on UTF-8 under Unix is available at
% 
%    <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
% 
% 
% From: Curt Sampson <cjs@cynic.net>
% Subject: Re: Security Risks of Unicode
% 
% I have to say I'm rather appalled by your "Security Risks of Unicode" 
% article.  You have identified a type of security vulnerability in some 
% systems, and pointed out that Unicode may increase the incidence of this 
% type of vulnerability, but completely missed the source of the
% vulnerability.
% 
% As we've seen from your examples of non-Unicode systems that have 
% experienced security failures, these problems do not stem from using any 
% particular character set or character set interpretation.  They stem from 
% doing what I like to call "validity guessing," rather than true validity 
% checking.
% 
% The key factor in all of these cases is that we have two separate programs 
% (the validity checker and the application itself) using two separate 
% algorithms to interpret data.  This is what introduces the potential for a 
% security breach: if ever the two programs do not interpret a data stream in 
% exactly the same way (and this can easily happen if the two programs are 
% not maintained by the same person or group), it may become possible to 
% convince the application to do something the validator does not want to
% allow.
% 
% When it comes to security, guessing just isn't good enough.  This is why, 
% when we have parameters from external sources, we use the exec() system 
% call to run programs under Unix rather than the system() library 
% function.  We don't pass random data to the shell for interpretation 
% because we can never be sure how a particular implementation of a 
% particular shell on a particular system will interpret it.  (We can't even 
% be sure of what shell we're using -- /bin/sh may be any of a number of 
% different programs.)
% 
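The exec()-versus-system() distinction carries over to higher-level languages. In Python, a rough analogue is passing an argument list to `subprocess.run` (no shell involved) instead of building a command string for `os.system` or `shell=True`. The helper name `echo_argument` and the sample input are assumptions for this sketch; the child process is simply the current Python interpreter printing its first argument.

```python
import subprocess
import sys


def echo_argument(untrusted: str) -> str:
    """Run a child process with `untrusted` as one literal argument.

    The argument list is handed to the child directly, so no shell ever
    parses the data and metacharacters such as ';' have no special meaning.
    """
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.rstrip("\n")


# Hypothetical hostile input: a shell would treat ';' as a command separator.
hostile = "hello; echo INJECTED"
echoed = echo_argument(hostile)
```

Here `echoed` equals the hostile string character for character, because the child received it as data. Had the same string been interpolated into a shell command line, its interpretation would depend on whichever shell happened to be installed.
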
% As long as we shift the blame for badly designed security systems to 
% external standards that are not the source of the problem, we will have 
% insecure systems.  Security is something that needs to be built in to 
% systems from the beginning, not tacked on with separate programs at the
% end.
% 
% 
% From: Henry Spencer <henry@spsystems.net>
% Subject: Re: Security Risks of Unicode
% 
% You have a point about potential input-validation attacks in Unicode, given 
% the much greater complexity of the character set... but I think you have 
% missed a couple of more important points.
% 
% Trying to analyze the input string for metacharacters, odd delimiters, etc. 
% is basically a mistake.  I speak as someone who's written code to do this, 
% by the way -- it always smelled like a kludge to me, and now I understand
% why.
% 
% First, prepending an input validator to a complex interpreter is a 
% fundamentally insecure approach.  Unless you are prepared to impose truly 
% severe restrictions on which features of the interpreter are available -- 
% in which case, why bother with the interpreter at all? -- the validator 
% becomes an attempt to reinvent the interpreter's parser and some of its 
% semantic analysis.  This is an inherently error-prone approach, as shown by 
% various successful input-validation attacks.  The validator is a complex 
% piece of software which must achieve and maintain an exact relationship 
% with the interpreter, which is all the more difficult if the interpreter is 
% ill-documented (as most complex interpreters are) and constantly changing 
% (ditto).
% 
% The right way -- the *only* right way -- to deal with this problem is to 
% insist that such interpreters include a show-only mode ("process this input 
% and tell me what it would make you do BUT DON'T DO IT").  This can be 
% awkward for interpreters with complex programmability and interactions with 
% their environment; it may amount to actually running the interpreter, but 
% in a controlled and monitored environment with dummy resources.  There can 
% still be bugs -- unintended differences between the show-only mode and the 
% real mode -- but if the interpreter is well organized, almost all of the 
% show-only work is being done by the real code rather than a cheap 
% independently-maintained fake, and there is at least a fighting chance that 
% the behaviors will match.
% 
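A toy version of the show-only idea, under heavy simplifying assumptions: the two-command language (`set`, `del`), the function name `run`, and the in-memory `store` are all invented for illustration. The point of the sketch is that the same parsing code serves both modes, so the plan reported in show-only mode cannot drift from what the real mode would do.

```python
def run(script: str, store: dict, show_only: bool = False) -> list:
    """Interpret `script` against `store`.

    Returns the list of actions. With show_only=True the actions are
    reported but not applied; parsing is shared between both modes.
    """
    actions = []
    for lineno, line in enumerate(script.splitlines(), 1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "set" and len(parts) == 3:
            action = ("set", parts[1], parts[2])
        elif parts[0] == "del" and len(parts) == 2:
            action = ("del", parts[1])
        else:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
        actions.append(action)
        if show_only:
            continue  # report only, leave the store untouched
        if action[0] == "set":
            store[action[1]] = action[2]
        else:
            store.pop(action[1], None)
    return actions


store = {"a": "1"}
script = "set b 2\ndel a"

# Ask the interpreter what it WOULD do, then decide whether to allow it.
plan = run(script, store, show_only=True)
allowed = all(action[0] != "del" for action in plan)
```

In this example `plan` lists a `set` and a `del`, `store` is unchanged, and a caller that forbids deletions would set `allowed` to false and never run the script for real.
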
% (A do-only-safe-things mode is also of interest, but not as satisfactory. 
% Definitions of safety may not match, and interpreter bugs are arguably more 
% likely to affect the outcome.)
% 
% Second, less confidently, I have to wonder whether elaborate parsing isn't 
% a mistake anyway.  When the context is program talking to program, it would 
% be better to define the simplest format possible, so that parsing becomes 
% trivial and there is no room for misunderstandings.  This need not imply 
% either binary data formats or simple semantics; for example, one can send a 
% complex tree structure in prefix or postfix notation, one node per (text) 
% line.  Of course, all too often the option isn't available because the 
% format is predefined by a 700-page standard, but the possibility is worth 
% bearing in mind.
% 
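The "tree in prefix notation, one node per line" suggestion might look like the sketch below. The line format (`<child count> <label>`) and the names `Node`, `dump`, and `load` are assumptions chosen for illustration; labels are assumed not to contain newlines.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def dump(node: Node) -> list:
    """Serialize a tree in prefix order, one '<child count> <label>' line per node."""
    lines = [f"{len(node.children)} {node.label}"]
    for child in node.children:
        lines.extend(dump(child))
    return lines


def load(lines: list) -> Node:
    """Rebuild a tree from the lines produced by dump()."""
    it = iter(lines)

    def read() -> Node:
        count, label = next(it).split(" ", 1)
        # The child count tells the parser exactly how many subtrees follow,
        # so no delimiters, escaping, or lookahead are needed.
        return Node(label, [read() for _ in range(int(count))])

    root = read()
    if next(it, None) is not None:
        raise ValueError("trailing lines after the root tree")
    return root


# The expression 1 + (2 * 3) as a tree.
tree = Node("+", [Node("1"), Node("*", [Node("2"), Node("3")])])
wire = dump(tree)
```

For this tree, `wire` is `["2 +", "0 1", "2 *", "0 2", "0 3"]`, and `load(wire)` returns an equal tree. The parser is a single split per line, which leaves little room for sender and receiver to disagree.
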
% 
% From: Michael Smith <smithmb@usa.net>
% Subject: Re: Security Risks of Unicode
% 
% Speak of the devil...
% 
% Apparently, the dangers of Unicode you discussed in the latest Crypto-Gram 
% are not far off.  It's already going into use for domain names: 
% "Asian-language domain names now available," at 
% <http://www.cnn.com/2000/TECH/computing/07/17/asian.domains.idg/index.html>.
% 
% 
% -- 
% Stuart Staniford  ---  President  ---  Silicon Defense
%                    stuart@silicondefense.com
% (707) 445-4355                     (707) 445-4222 (FAX)
% 
% 


-- 
--bill