Re: Fwd: 5-STABLE kernel build with icc broken

From: Peter Jeremy (PeterJeremy_at_optushome.com.au)
Date: 03/31/05

    Date: Thu, 31 Mar 2005 20:46:35 +1000
    To: Bruce Evans <bde@zeta.org.au>
    
    

    On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
    >>>On the i386 (and probably most other CPUs), you can place the FPU into
    >>>an "unavailable" state. This means that any attempt to use it will
    >>>trigger a trap. The kernel will then restore FPU state and return.
    >>>On a normal system call, if the FPU hasn't been used, the kernel will
    >>>see that it's still in an "unavailable" state and can avoid saving the
    >>>state. (On an i386, "unavailable" state is achieved by either setting
    >>>CR0_TS or CR0_EM). This means you avoid having to always restore FPU
    >>>state at the expense of an additional trap if the process actually
    >>>uses the FPU.
    >
    >I remember that you (Peter) did extensive benchmarks of this.

    That was a long time ago and I don't recall them being that extensive.
    I suspect the results are in my archives at work - I can't quickly
    find them here. From memory the tests were on 2.2 and just counted
    the number of context switches, FP saves and restores.

    > I still
    >think fully lazy switching (c2) is the best general method.

    I think it depends on the FP workload. It's a definite win if there
    is exactly one FP thread - in this case the FPU state never needs to
    be saved (and you could even optimise away the DNA trap by clearing
    the TS and EM bits if the switched-to curthread is fputhread).

    The worst case is two (or more) FP-intensive threads - in this case,
    lazy switching is of no benefit. The DNA trap overheads mean that
    the performance is worse than just saving/restoring the FP state
    during a context switch.
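
    To make the two cases concrete, here is a minimal counting sketch. It
    is not the old 2.2 benchmark and not kernel code: the round-robin
    scheduler, the single CPU, the function names and the assumption that
    an FP thread touches the FPU in every time slice are all made up for
    illustration. The skip_trap_for_owner flag models the optional
    optimisation of leaving TS clear when the switched-to thread already
    owns the FPU.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    saves: int = 0
    restores: int = 0
    traps: int = 0


def simulate_lazy(uses_fpu, switches, skip_trap_for_owner=False):
    """Fully lazy switching, round-robin over len(uses_fpu) threads on one CPU.

    uses_fpu[i] is True if (hypothetical) thread i uses the FPU in every slice.
    """
    c = Counts()
    owner = None  # thread whose state is currently live in the FPU
    for s in range(switches):
        cur = s % len(uses_fpu)
        # Switch-in: nothing is saved and the FPU is marked unavailable,
        # unless the optimisation leaves it available for the current owner.
        fpu_available = skip_trap_for_owner and owner == cur
        if not uses_fpu[cur]:
            continue
        if not fpu_available:
            c.traps += 1  # first FP instruction raises a DNA trap
        if owner != cur:
            if owner is not None:
                c.saves += 1  # save the previous owner's state
            c.restores += 1  # load (or initialise) the current thread's state
            owner = cur
    return c


def simulate_eager(uses_fpu, switches):
    """Save/restore FP state at every context switch; no DNA traps.

    Assumes the kernel already knows which threads are FP users.
    """
    c = Counts()
    prev = None
    for s in range(switches):
        cur = s % len(uses_fpu)
        if prev is not None and uses_fpu[prev]:
            c.saves += 1  # save the outgoing FP thread's state
        if uses_fpu[cur]:
            c.restores += 1  # pre-load the incoming FP thread's state
        prev = cur
    return c
```

    In this model, one FP thread alternating with a non-FP thread never
    needs a save under lazy switching, while two FP threads do the same
    number of saves and restores under either policy and lazy switching
    adds one trap per switch on top.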

    My guess is that the current generation workstation is closer to the
    second case - current generation graphical bloatware uses a lot of
    FP for rendering, not to mention that the idle task has a reasonable
    chance of being an FP-intensive distributed computing task (setiathome
    or similar). It's probably time to do some more measuring (I'm not
    offering just now, I have lots of other things on my TODO list).

    SMP adds a whole new can of worms. (I originally suspected that lazy
    switching had been lost during the SMP transition). Given CPU (FPU)
    affinity, you can just add "per CPU" to the above but I'm not sure
    that changes my conclusion.

    > Maybe FP state should be loaded in advance based on FPU affinity.

    Pre-loading the FPU state is an advantage for FP-intensive threads -
    if the thread will definitely use the FPU before the next context
    switch, you save the cost of a DNA trap by pre-loading the FPU state.
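
    If the thread only uses the FPU with some probability, the trade-off
    can be sketched as an expected-cost comparison. This is an assumed
    model, not a measurement: p_use, trap_cost and restore_cost are
    hypothetical parameters in arbitrary units, and only the restore side
    is considered.

```python
def preload_wins(p_use, trap_cost, restore_cost):
    """Return True if pre-loading FPU state has the lower expected cost.

    p_use is the assumed probability that the thread uses the FPU before
    the next context switch; the costs are hypothetical, in any one unit.
    """
    preload = restore_cost  # always paid at switch-in
    lazy = p_use * (trap_cost + restore_cost)  # paid only if the FPU is used
    return preload < lazy
```

    Under this model pre-loading wins once p_use exceeds
    restore_cost / (trap_cost + restore_cost). At p_use = 1 the saving is
    exactly one DNA trap.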

    > It might be
    >good for CPU affinity to depend on FPU use (prefer not to switch
    >threads away from a CPU if they own that CPU via its FPU).

    FPU affinity is only an advantage if full lazy switching is implemented.
    (And I thought we didn't even have CPU affinity working well). The
    first step is probably gathering some data on whether lazy switching
    is any benefit.

    >BTW, David and I recently found a bug in the context switching in the
    >fxsr case, at least on Athlon-XP's and AMD64's.

    I gather this is not noticeable unless the application is doing its
    own FPU save/restore. Is there a solution or work-around?

    -- 
    Peter Jeremy
