Re: Chat server : Threading in select call
- From: David Schwartz <davids@xxxxxxxxxxxxx>
- Date: Tue, 12 Feb 2008 18:55:42 -0800 (PST)
On Feb 12, 5:16 pm, William Ahern <will...@xxxxxxxxxxxxxxxxxxxxxxxxx>
wrote:
> David Schwartz <dav...@xxxxxxxxxxxxx> wrote:
> > No, that's not true. The page fault will cause another thread to run
> > on the same CPU. In fact, you will have the problem (at least in
> > principle) unless you have *more* threads than CPUs.
> So when does the kernel get around to handling the fault?
Typically, when the disk interrupt arrives.
> Say, for instance,
> if the process hits the end of its stack, triggering a page fault, which the
> kernel then has to attend to. That's _additional_ CPU instructions above and
> beyond those which were executing when the process was in a presumptively
> steady state. (You previously mentioned faulting in error handling code,
> IIRC.)
The cost of the CPU handling the page fault is negligible compared to
the time actual I/O takes. Even so, that's another argument for multi-
threading because the CPU cost of the page fault will be borne on one
CPU while another can continue doing useful work for your process.
> I'm failing to see how one sort of process could so easily fall behind in
> this case, while in another sort of process (or set of processes) the
> introduction of additional CPU instructions or I/O operations magically fits
> within the existing resource caps (assuming you're operating near peak
> throughput).
In one case, *nothing* happens until the physical I/O can take place.
In the other case, the server continues to make forward progress on
everything not dependent on that I/O. That means in one case,
typically one client is stalled while in the other, typically, 10,000
are stalled.
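A toy illustration of that difference, in Python for brevity. Nothing here comes from a real server: `progress_before_io`, the request tuples, and the `threading.Event` standing in for the disk interrupt are all assumptions made up for this sketch.

```python
import queue
import threading
import time


def progress_before_io(requests, workers, settle=0.2):
    """Run `requests` on `workers` threads while one simulated disk read is pending.

    Each request is a (name, needs_disk) tuple. Returns the names that
    finished before the simulated I/O completed.
    """
    disk_ready = threading.Event()   # hypothetical stand-in for the disk interrupt
    work = queue.Queue()
    for request in requests:
        work.put(request)
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            try:
                name, needs_disk = work.get_nowait()
            except queue.Empty:
                return
            if needs_disk:
                disk_ready.wait()    # this thread stalls, like a page fault
            with done_lock:
                done.append(name)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    time.sleep(settle)               # give any unblocked threads time to run
    with done_lock:
        finished_early = list(done)
    disk_ready.set()                 # the simulated I/O completes
    for thread in threads:
        thread.join()
    return finished_early


requests = [("faulting", True), ("a", False), ("b", False), ("c", False)]
print(progress_before_io(requests, workers=1))  # nothing finishes early
print(progress_before_io(requests, workers=2))  # a, b and c finish early
```

With one worker, every request queues up behind the stalled one. With two, only the request that actually needs the I/O waits for it.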
> I can only presume you're alluding to the way modern kernels map in shared
> libraries, and otherwise manage a shared disk buffer cache and virtual
> memory. But if we're talking about long lived daemon processes servicing
> tremendous amounts of network traffic, and operating at peak instructions
> and I/O throughput, in many cases it's not particularly hard to arrange it
> (if necessary at all) so that parts of the process aren't evicted.
Right, unfortunately, eviction is not the only problem.
> And unless you're on a real-time kernel, and have at your disposal all
> manner of other neat guarantees, if your peering architecture is so
> susceptible to latency that quality perceptibly suffers there are other
> issues.
I don't think anyone has found a perfect general solution to the
problem of latency spirals. They are one of the most difficult and
important performance problems. They are, however, orders of magnitude
worse on single-threaded servers compared to multi-threaded.
> Likewise, I can't imagine, for instance, a user would or should perceive
> anything odd about receiving an IM message out of order by a few
> milliseconds. (What does that even mean?)
That's not the problem. The problem is that a burst of latency results
in the server taking longer to process requests (usually due to
caching effects). As a result, the server can fall further behind. The
typical solution is to limit the server to a load many times lower than
the actual peak load it can service.
For example, in the typical case, the server handles each request as
it comes in. It receives a request from the network card, then the
program receives the request from the kernel, then the program parses
the request and then it sends a response. The response is put in the
kernel's queue, and in most cases, sent over the network immediately
(because most clients have an empty send queue). In this immediate
case, the data tends to stay in caches, and so processing over the
bulk of the data is very fast.
If there is a burst of latency, the data will tend to have fallen out of
the cache (where the checksum calculation left it) by the time you call
'read'. And because you
are handling a lot of outbound connections in a burst, it will be
impossible to transfer the outbound data to the network card (because
the card's send queue will be full) until later. As a result, each
operation is more expensive. Data has to be fetched from main memory
more times. More data has to be sent in a slow path (when the queue
empties) than in a fast path (inline with the main program).
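A minimal sketch of the two send paths, in Python rather than C to keep it short. It assumes a non-blocking socket and one pending queue per connection. `send_or_queue` and `flush_pending` are hypothetical names invented for this example, not functions from any library.

```python
import collections
import socket


def send_or_queue(sock, pending, data):
    """Fast path: try to send inline. Returns True if all of `data` was sent.

    Whatever the kernel will not take right now is appended to `pending`
    (a collections.deque) for the slow path.
    """
    if pending:
        # Earlier data is still waiting, so queue behind it to keep ordering.
        pending.append(data)
        return False
    try:
        sent = sock.send(data)
    except BlockingIOError:
        sent = 0                     # the send queue is full
    if sent < len(data):
        pending.append(data[sent:])
        return False
    return True


def flush_pending(sock, pending):
    """Slow path: call when select() reports `sock` writable."""
    while pending:
        try:
            sent = sock.send(pending[0])
        except BlockingIOError:
            return                   # still full, try again later
        if sent < len(pending[0]):
            pending[0] = pending[0][sent:]
            return
        pending.popleft()
```

In the steady state nearly every response leaves through `send_or_queue` while its data is still warm in the cache. After a burst of latency, most of them end up waiting in `pending` and go out later through `flush_pending`, when the data has to be fetched again.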
Typically, these slow paths are about 3-5 times slower than the fast
paths. As a result, the server has to be limited to about 1/5th the
load it could otherwise carry. Failure to limit the load in this way
leads to lag bubbles and lag cascades.
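The arithmetic can be seen in a toy model (Python; the tick-based queue, the 5x slowdown, and the name `simulate` are assumptions chosen for illustration, not measurements). Load and backlog are expressed as fractions of what the fast path can handle per tick.

```python
def simulate(load, stall_ticks=5, slowdown=5.0, ticks=200):
    """Return the backlog left after `ticks` ticks.

    load        -- arrivals per tick, as a fraction of fast-path peak capacity
    stall_ticks -- initial ticks during which the server does no work at all
    slowdown    -- how many times slower the slow path is than the fast path
    """
    backlog = 0.0
    for tick in range(ticks):
        backlog += load                      # this tick's arrivals
        if tick < stall_ticks:
            continue                         # burst of latency: nothing is served
        # If anything is left over from earlier ticks, assume the caches
        # are cold and everything takes the slow path.
        behind = backlog > load
        capacity = 1.0 / slowdown if behind else 1.0
        backlog = max(0.0, backlog - capacity)
    return backlog


print(simulate(0.5, stall_ticks=0))  # no stall: half of peak is sustainable
print(simulate(0.5))                 # same load after a stall: backlog keeps growing
print(simulate(0.15))                # below 1/5th of peak: the server recovers
```

In this model a server at half its peak load runs indefinitely as long as it never falls behind, but a five-tick stall is enough to put it permanently on the slow path. Only a load below 1/slowdown of peak lets it catch up again.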
This problem is fairly specific to chat servers, though it can happen in
any server where there is tight integration between connections (one
connection frequently sends to another). It is especially serious when a
single command from a single connection can often require "spinning
up" a large number of previously idle connections. Another source of
problems is when a network link close to the server (netwise) fails
and then comes back. The large number of reconnecting clients can
start a latency spiral.
But really, that's just one major reason. The other major reason is
that it's just ridiculously difficult to make sure that every
single line of code can't ever lag, no matter what, on pain of death.
80% of the lines of code aren't critical and a design choice that
makes them critical is just bone-headed.
DS