
Re: psamp vocabulary


Rae McLellan wrote:
> >> Invariance frees psamp from specifying the order and allows different
> >> vendors to implement the selectors in different ways w/o affecting the
> >> results.
> >
> > that's exactly the point: it seems that "the results" for you means
> > "the resulting selected sample".  With my comment above, I was saying
> > that "the results" should, on the contrary, be
> > "1) the size of the selected sample" and
> > "2) the results you derive from the analysis of the selected sample".
> I wish the results as you define them were all that mattered.  But in
> a world of multi-vendor interoperability, it is so much easier to verify
> that the report streams coming from two different vendors' boxes are
> identical than that the final analyses are similar.  Indeed, vendor
> differentiation may well be in the report analysis.
> > Note that it's not only a terminology issue. If we require that varying
> > the selector ordering (which is something that we may desire to ease
> > implementations) we get the same selected sample, then we have to
> > exclude the whole "third group" of samplers you outlined in your
> > previous e-mail, i.e. random samplers and samplers based on packet
> > position. To this last category belongs e.g. the simple 1 out of N
> > sampler implemented by decrementing a counter, which is the simplest
> > we can think of. Do we want to exclude it?
> I realize it's a radical approach.  But, *if* the functionality (by your
> definition of results) of the "third group" of random selectors can be
> provided by deterministic hash functions on the packet header...
> then yes, I'm suggesting the psamp standard exclude the "third group"
> of random selectors.  I'm transferring effort in the standardization
> process from specifying the order of selectors and worrying about
> their grouping syntax/semantics, to a few short paragraphs explaining
> how the deterministic hash functions can provide similar results.

Having only deterministic selectors is an intriguing idea, but I have
two misgivings:

1) we need to allow the simplest implementations in order for PSAMP
to be ubiquitous. Decrementing a counter is very simple: if we exclude
it then some devices might find it difficult to do PSAMP sampling.

2) if all selection operations are deterministic on packet content, it
would be easier to construct packets to evade selection (although having
a strong hash function with an obscure selection criterion makes this
more difficult). Or even without malice, with a weak hash function you 
might have an unlucky traffic mix where you entirely miss a large 
bunch of traffic. Having the option of random selection guards against
this.
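
To make point 1 concrete, here is a minimal sketch of the 1-out-of-N
counter sampler in Python. The class and method names are hypothetical,
not taken from any PSAMP draft. Note that the decision depends only on
packet position, never on packet content, so packets cannot be crafted
to avoid it.

```python
class CountdownSampler:
    """Select one packet out of every n by decrementing a counter."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be >= 1")
        self.n = n
        self.remaining = n

    def select(self, packet=None):
        # The packet argument is ignored: only arrival order matters.
        self.remaining -= 1
        if self.remaining == 0:
            self.remaining = self.n  # reload the counter
            return True
        return False


sampler = CountdownSampler(4)
decisions = [sampler.select() for _ in range(12)]
print(decisions)  # every 4th packet is selected
```

The per-packet state is a single integer, which is about as cheap as
a selector can get.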

I have some comments on the hash functions that you mentioned in a
previous e-mail:

>    - IP ID & <mask> == <value>
>    - IP Checksum & <mask> == <value>
>    - Checksum(IP header w/o TTL) & <mask> == <value>
>    (note that these last 3 can be used to generate an almost uniform
>     sample of the IP packets, yet they're still based on IP header)

Have you done any experiments on the statistical quality of these as
hash functions for packet selection? As hash functions go these are very
weak. Having good statistical properties of selection would rely on
a tame distribution of the field contents of the packets. (This can't be
relied upon: we looked at traces, and there are gotchas there for the ID 
field in particular due, it seems, to bad implementations). I'm also
concerned that they would be easy to evade.
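
A small Python sketch of the "IP ID & <mask> == <value>" selector
illustrates both worries. The mask and value, the evading sender, and
the sender that always writes a constant ID are assumptions chosen for
illustration, not observations from our traces.

```python
def select_by_mask(field, mask, value):
    """Selector of the form  field & mask == value  (field: 16-bit IP ID)."""
    return (field & mask) == value


MASK, VALUE = 0x000F, 0x0003  # assumed example: nominally selects 1 in 16


def evading_ip_id(counter):
    """Hypothetical sender that knows MASK/VALUE and avoids matching IDs."""
    ip_id = counter & 0xFFFF
    if select_by_mask(ip_id, MASK, VALUE):
        ip_id ^= 0x0001  # flip one masked bit so the ID no longer matches
    return ip_id


def count_selected(ip_ids):
    return sum(select_by_mask(i, MASK, VALUE) for i in ip_ids)


well_behaved = count_selected(range(65536))                      # evenly spread IDs
evader = count_selected(evading_ip_id(c) for c in range(65536))  # crafted IDs
constant = count_selected([0] * 65536)                           # constant ID field
print(well_behaved, evader, constant)  # 4096 0 0
```

The nominal 1-in-16 rate holds only when the IDs are evenly spread; the
other two senders are never sampled at all.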

And even with a tame distribution of packet fields, having a uniform 
selection distribution is not the only desirable property. We also
want small correlations between selection decisions of successive 
packets, including selection of packets from the same IP-level flow (i.e.
packets with the same IP src/dst address). The input of these hashes doesn't
change much from packet to packet of the flow, and the hash function is 
weak, so there will be a lot of correlation.
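
The correlation can be made visible with a rough Python sketch. It
assumes a sender whose IP ID simply increments, an example mask on the
high bits, and SHA-256 as a stand-in for "a strong hash" (chosen only
because it is in the standard library, not because it is a proposal
for PSAMP).

```python
import hashlib


def repeat_rate(decisions):
    """Fraction of selected packets whose immediate successor is also selected."""
    pairs = list(zip(decisions, decisions[1:]))
    selected = sum(1 for a, _ in pairs if a)
    both = sum(1 for a, b in pairs if a and b)
    return both / selected if selected else 0.0


def strong16(ip_id):
    """16 bits of a strong hash of the ID (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(ip_id.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")


MASK, VALUE = 0xF000, 0x3000  # assumed example: nominally 1 in 16
ip_ids = range(65536)         # assumed sender: ID increments per packet

weak = [(i & MASK) == VALUE for i in ip_ids]
strong = [(strong16(i) & MASK) == VALUE for i in ip_ids]

weak_rate = repeat_rate(weak)      # close to 1: one long burst of selections
strong_rate = repeat_rate(strong)  # close to the nominal rate of 1/16
print(weak_rate, strong_rate)
```

Both selectors pick roughly 1 packet in 16 overall, but the masked raw
field selects 4096 consecutive packets and then nothing, while the
hashed version spreads its selections out.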

A strong hash function should have the property, roughly speaking, that 
flipping a bit of the input gives a big change in the hash function.
This gives the statistical properties of selection some robustness
against correlations in the packet contents. The IP checksum does not
have this property. IP ID increments, so there's not much variation in
it from one packet to the next.
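
The bit-flipping property can be checked with a short Python sketch.
The 20-byte header below is an arbitrary example, the checksum is the
usual 16-bit ones' complement sum, and SHA-256 truncated to 16 bits is
again only an assumed stand-in for a strong hash.

```python
import hashlib


def inet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def strong16(data):
    """First 16 bits of SHA-256 (assumed stand-in for a strong hash)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")


def avg_bits_changed(hash_fn, data):
    """Average number of output bits that change when one input bit flips."""
    base = hash_fn(data)
    nbits = len(data) * 8
    total = 0
    for bit in range(nbits):
        flipped = bytearray(data)
        flipped[bit // 8] ^= 1 << (bit % 8)
        total += bin(base ^ hash_fn(bytes(flipped))).count("1")
    return total / nbits


# Example IPv4 header with the checksum field set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")

checksum_avg = avg_bits_changed(inet_checksum, header)
strong_avg = avg_bits_changed(strong16, header)
print(checksum_avg, strong_avg)
```

For a strong hash one expects about half of the 16 output bits to
change per flipped input bit; the checksum changes far fewer, because
a single flipped bit only adds or subtracts a power of two.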


> Is this possible?  I dunno.  I was just pointing out that this might
> be a path for psamp to pursue.  Is there some type of sampling
> results (your definition) that this approach is precluding?
> or perhaps those few short paragraphs I mentioned aren't possible?
>                                 Rae McLellan
> --
> to unsubscribe send a message to psamp-request@ops.ietf.org with
> the word 'unsubscribe' in a single line as the message text body.
> archive: <http://ops.ietf.org/lists/psamp/>
