
Re: Comments on reliable accounting draft (was RE: Strawman RADIUSEXTWG charter - Take Two)



Thank you for your review!

> In the specific case of server overload due to
> network-wide reboot (which is the best argument I've heard for tweaking
> the RADIUS retransmission method), the primary problem isn't that
> packets are being dropped by intermediate nodes due to network
> congestion (although in severe cases congestion might be seen).

It is possible, with a large number of NASes, to encounter what is
sometimes called "convergent congestion".  That is, the RADIUS traffic
sent by each NAS is small, but when it is aggregated near the RADIUS
server, it is large enough to congest one or more links.  This would be
most likely to happen in an ISP network.  In corporate network situations
(where I've seen the problem occur), the backbone is typically 100 Mbps
Ethernet or greater, so congestion of the backbone is relatively unlikely
unless there are a large number of NASes.
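
To put rough numbers on the aggregation effect (these are illustrative
assumptions of mine -- the NAS count, packet size, and retransmission
rate are not measurements from any network):

    # Back-of-the-envelope aggregate load near the RADIUS server.
    # All numbers below are assumptions for illustration only.
    num_nases = 2000        # hypothetical NAS population
    pkt_bytes = 300         # assumed average Access-Request size
    pkts_per_second = 1     # each NAS retransmitting once a second

    aggregate_mbps = num_nases * pkt_bytes * 8 * pkts_per_second / 1e6
    print(aggregate_mbps)   # ~4.8 Mbps: negligible on 100 Mbps
                            # Ethernet, crushing on a T1 access link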

> Instead, the problem is that too many RADIUS messages are getting
> through the network simultaneously; in this case, network congestion
> might be the RADIUS server's friend, giving it a chance to catch up!

Or the messages might be dropped in the RADIUS server itself.

In general, the principle of "conservation of packets" (where additional
packets are not injected into the system until an existing packet is
determined to have left it) is designed to ensure that the response of the
system to packet loss is a gain of less than 1 -- that is, that dropped
packets do not result in even more packets being sent, but rather, in the
rate decreasing to the point where the RADIUS server (and network) can
handle the load.
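
A minimal sketch of what that principle looks like in code
(illustrative Python; the class name and the window size are my own
inventions, not anything RFC 2865 specifies):

    # "Conservation of packets": hold a bounded number of requests
    # outstanding, and inject a new packet only when an existing one
    # is known to have left the system -- either a response arrived
    # or the packet was definitively declared lost.
    from collections import deque

    class ConservingSender:
        def __init__(self, max_outstanding=4):
            self.max_outstanding = max_outstanding
            self.outstanding = 0
            self.queue = deque()

        def submit(self, packet, send):
            self.queue.append(packet)
            self._pump(send)

        def on_departure(self, send):
            # Called on a response OR on a confirmed loss: either
            # way, exactly one packet has left the system, so at
            # most one new packet may enter it.
            self.outstanding -= 1
            self._pump(send)

        def _pump(self, send):
            while self.queue and self.outstanding < self.max_outstanding:
                send(self.queue.popleft())
                self.outstanding += 1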

Unless the system is designed to be conservative in this way, packet loss
won't necessarily help the RADIUS server catch up -- because it could
result in even more packets being sent, not fewer.

That's the function of Additive Increase, Multiplicative Decrease (AIMD)
rate control.
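
For illustration, an AIMD rate controller on a RADIUS client might look
something like this (a sketch with assumed constants; RADIUS itself
specifies no such mechanism):

    # Hypothetical AIMD rate controller for a RADIUS client.
    class AimdRate:
        def __init__(self, increase=1.0, decrease=0.5, floor=1.0):
            self.rate = floor          # requests per second
            self.increase = increase   # additive step on success
            self.decrease = decrease   # multiplicative factor on loss
            self.floor = floor

        def on_response(self):
            # Additive increase: a response is evidence a packet
            # left the system, so inject a little more.
            self.rate += self.increase

        def on_timeout(self):
            # Multiplicative decrease: back off sharply on loss so
            # the system's response to loss has gain less than 1.
            self.rate = max(self.floor, self.rate * self.decrease)

The multiplicative decrease is what keeps the gain below 1: each loss
cuts the rate faster than successes rebuild it.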

> Rather than an exponential back-off, the introduction of transmission
> jitter might be a more effective strategy.

Jittering is indeed important in the power loss case.  Spreading requests
out over a minute (as opposed to concentrating requests within the first
10 seconds) would decrease the initial rate by a factor of 6.
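
A sketch of the idea (the 60-second window and the uniform distribution
are my assumptions, not anything the draft mandates):

    # Jittered start-up after a network-wide reboot: instead of
    # every NAS sending within the first 10 seconds, each picks a
    # random offset across a 60-second window, so the server sees
    # roughly one sixth of the initial arrival rate.
    import random

    def initial_send_delay(window_seconds=60.0):
        return random.uniform(0.0, window_seconds)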

> For "how many times to
> retry", the answer is also unspecified.  The rest of the questions are
> also answered in simplistic and/or unhelpful ways.  For example,
> "failback" occurs after the expiration of an apparently static timer.

This is a violation of "Conservation of Packets", because it could result
in additional packets being sent into the network when there is no
evidence that the original ones had left it.

> Since the failback operation is not based upon any indication of the
> failed server's health, this could very easily result in the abandonment
> of a functional server in favor of a server that is down.  It seems like
> some type of metric of server responsiveness could be developed instead
> so that the most responsive server would always be used; a simpler
> method yet might be a timer-based RADIUS "ping" to discover whether a
> given server is alive.

One problem with a "ping" based on existing RADIUS functionality is that
it too could be proxied.  So if the goal is to get an idea of the health
of the next hop, then this doesn't help.  Heartbeat appears to me to
be one of those things that is intrinsically done better in Diameter --
with its TCP/SCTP transport, non-proxiable heartbeats, etc.

> The draft also makes a couple of assumptions that are novel, at least to
> me.  Is it true that RADIUS clients regularly choose proxies based upon
> NAI or some other piece of authentication data?

I've never seen this, but perhaps someone else has.

> The last time I
> checked, routing of RADIUS packets was the job of proxies, not clients,
> but I won't claim to be familiar with the state of the art.  In
> addition, I was not aware that RADIUS proxies regularly implemented the
> timeout and retry algorithm.  If so, this seems like it would
> _increase_, rather than decrease network traffic.

Timeout and retry is pretty dangerous on a proxy, because NASes already do
retransmission.  Thus, RADIUS transport dynamics are typically end-to-end
between the NAS and the ultimate RADIUS server -- which makes it very
difficult for the NAS to know why a server isn't responding.  It's much
the same as a conversation between two Internet hosts -- did a router
somewhere in the network die, or was the server itself down?

Like routers, RADIUS proxies only make a forwarding decision; they do not
do timeout and retry, because doing so would not only change the
end-to-end dynamics, but would also add gain to the system -- violating
conservation of packets.
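
To make the division of labor concrete, here is a rough sketch
(hypothetical Python, not from any RFC): retransmission state lives only
in the NAS, while the proxy, like a router, only forwards.

    import socket

    def nas_send(sock, packet, server, retries=3, timeout=2.0):
        # The NAS owns the retransmission timer, with exponential
        # backoff so that loss reduces the offered load (gain < 1).
        for attempt in range(retries + 1):
            sock.sendto(packet, server)
            sock.settimeout(timeout * (2 ** attempt))
            try:
                return sock.recvfrom(4096)
            except socket.timeout:
                continue
        return None    # give up; fail over, don't blast harder

    def proxy_forward(sock, packet, next_hop):
        # The proxy makes a forwarding decision and nothing more:
        # no timer, no retry, no added gain.
        sock.sendto(packet, next_hop)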
