[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Fwd: Re: [idn] Requirements I-D]



I want to point out that the algorithm to case-fold, while not free, is pretty fast. The two issues are:

- using a large, fairly sparse, mapping table
- sometimes mapping one character to multiples*

The common structure for the first problem is a trie. The following is a 2-level, which compresses reasonably but retains a fast lookup. For the second problem, the easiest is to reserve a range of values that cannot be the result of a case-fold, and use them as indexes into an exception table of null-terminated strings.

[The code sample is off the top of my head, so I haven't checked for compilation, and it could probably be optimized further.]

while (true) {
  uint16 ch = *p++;
  if (ch == 0) break;
  uint16 entry = data[index[ch>>8] + (ch & 0xFF)];
  if (entry == 0) {
    *q++ = ch;
  } else if (entry < EXCEPTION_LOW || entry > EXCEPTION_HIGH) {
    *q++ = entry;
  } else {
    r = exceptionData[entry-EXCEPTION_LOW];
    while (ch = *r++) *q++ = ch;
  }
}

Handling codepoints above FFFF (there are none in Unicode 3.0) takes a fairly simple modification of the above. The number of multiple mappings is fairly small: see http://www.unicode.org/unicode/reports/tr21/charts/

Mark

James Seng wrote:

> RJ Atkinson wrote:
> > Its already out of control that folks feel that they have to register
> > foo.ORG, foo.COM, and foo.NET just so they can be found by normal users.
>
> [Assuming foo contains a non-ASCII character somewhere]
>
> Well, now the difference is you have to register FOO.COM, Foo.COM, FoO.COM etc
> . And of course, to manage this in the zone and config file somehow.
>
> For example,
>
> zone "foo.com" in {
>     type primary;
>     file "db.foo.com";
>     variant "FOO.com" "FOo.com" "FoO.com" "fOO.com" "fOo.com" "foO.com";
> }
>
> [named will then fold all variant into foo.com automatically]
>
> And on the parent zone, in this case, .com zonefile, you do
>
> foo    NS     some.name.server.
> FOO    DNAME  foo
> FOo    DNAME  foo
> FoO    DNAME  foo
> ...
>
> Is it nice? No. In fact, hell for operations. (I want to see the face of the
> people running large zone :-)
>
> But heck, it works.
>
> It can also be optimized if they can reduce it to some algorithm, which make
> life less painful for memory usage, by taking up more cpu cycle,
>
> For example,
>
> foo    NS    some.name.server.
>        FOLD  { s/I/*/g; tr/[A-Z]/[a-z]/; # * = dotless i }
> ; In this case, FOLD is not a RR, but some algo use for matching
>
> Therefore, essentially, we can have "per domain" folding algorthim.
>
> Or perhaps more simple,
>
> foo    NS    some.name.server.
>        FOLD  tolower+utr21.
>
> bar    NS    some.other.server.
>        FOLD  tolower+simplifiedtraditional.
>
> ie, with pre-defined algo.
>
> How fun :-)
>
> -James "playing the devil advocate" Seng