Re: Possible instruction pipelining problem between HT's on the same die ? (fwd)

From: Keir Fraser (Keir.Fraser_at_cl.cam.ac.uk)
Date: 06/04/05

    Date: Sat, 4 Jun 2005 09:17:57 +0100
    To: Kip Macy <kmacy@netapp.com>, dillon@apollo.backplane.com
    
    

    Hi,

    I did a fair amount of lock-free programming during my PhD and for Xen,
    so I may be able to shed some light on this situation. OTOH I may also
    be confused: the x86 memory model is poorly specified and the reference
    manuals are often badly written and misleading. I'll address the points
    and questions out of order....

    > But I'm beginning to think that it isn't working as advertised. I've
    > read the manuals over and over again and they seem to only guarantee
    > write ordering between physical cpus, not between logical HT cpus, and
    > even then it appears that a cpu can do a speculative read and
    > thus get an old value for A even after getting a new value for B.

    The ordering guarantees between HTs are identical to those between
    physical cpus. I'm referring to Section 7.6.19 of IARM (Intel IA-32
    Reference Manual) Vol 3. It's slightly confusing that it says "can
    further be defined as 'write-ordered with store buffer forwarding'" but
    this forwarding only occurs separately *within* each logical cpu (the
    store buffer is statically partitioned between the two HTs), and this
    phrase is identical to the one describing physical cpu behaviour in
    Section 7.2.2 (i.e. it is redundant to repeat it in this later
    section).

    Reads can be speculatively executed out-of-order, but this property
    isn't unique to HTs: the same race could in theory happen between
    physical cpus.

    > Now I was depending on the presumed write ordering, so if a foreign
    > cpu sees that B is updated it can assume that A has also been
    > updated.

    You *can* depend on write ordering. But this ordering is no help if
    CPU#1 has already executed, and is retiring, the read from A by the
    time it executes the read from B. It's CPU#1 that is screwing up, not
    CPU#0.

    > I looked at the various SFENCE/LFENCE/MFENCE instructions and they
    > do not seem to guarantee ordering for speculative accesses at all.
    > They all say that they do not protect against speculative reads.
    > Bus-locked instructions don't seem to avoid speculative reads either.

    I think the reference manual is being almost wilfully misleading by
    referring to the speculative prefetch mechanism and its total
    independence from the fence instructions: "data could be speculatively
    loaded into the cache just before, during, or after the execution of an
    MFENCE instruction". It is important to realise that speculative
    execution of a memory-reading instruction is quite different from
    speculative prefetch into a cache. The latter should not matter to the
    programmer: the cache coherency protocol hides it. Consider the code
    example in the original email:

    > cpu #0          write A
    >                 write B
    >
    > (HT)cpu #1      read B
    >                 if (B)
    >                     read A  <---- gets OLD data in A, not new data

    If CPU#1 prefetches A into its cache before it reads B, it may indeed
    see the old value of A; *but* when CPU#0 writes A it will invalidate
    that cacheline in all remote caches; *furthermore* CPU#0 cannot commit
    its update of B until after it has committed its update of A (x86
    guarantees write order). So, if CPU#1 reads the new value of B, then
    any stale value of A in its cache has been invalidated by that point.
    All you need to ensure is that CPU#1 hasn't speculatively executed the
    read from A: precisely the purpose of MFENCE and LFENCE.
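
    The pattern above can be sketched in portable C11 rather than with
    raw fence instructions. This is only an illustration under stated
    assumptions: the names A, B, publish() and try_consume() are
    hypothetical stand-ins for the variables and the two code paths in
    the example, and the acquire fence is placed where the LFENCE or
    MFENCE would go, leaving it to the compiler to emit whatever
    instruction (if any) the target needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared variables standing in for A and B in the example. */
static int A;           /* payload: plain, non-atomic data          */
static atomic_int B;    /* flag: nonzero once A holds the new value */

/* CPU#0 side: write A, then write B. The release store keeps the
 * write to A ordered before the write to B. */
static void publish(int value)
{
    A = value;
    atomic_store_explicit(&B, 1, memory_order_release);
}

/* CPU#1 side: read B and, only if it is set, read A. The acquire
 * fence keeps the read of A from being satisfied ahead of the read
 * of B, which is the job the email assigns to LFENCE/MFENCE. */
static bool try_consume(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
        return false;                       /* B not updated yet */
    atomic_thread_fence(memory_order_acquire);
    *out = A;                               /* sees the new A */
    return true;
}
```

    In real use publish() would run on one CPU and try_consume() would
    be polled on the other; if try_consume() returns true, the value it
    stores through out is the one passed to publish().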

    The picture is more complicated if both CPUs share their memory
    hierarchy. However, either cache lines are tagged with an HT identifier
    and so the cache logically operates as two separate variable-sized
    caches (in which case normal cache coherency rules apply as described
    above), or there is true cacheline sharing (in which case there is no
    stale data to worry about, as CPU#0 will directly update the cache data
    that CPU#1 will read from). Either way, there's no weakening of the
    memory model.

      -- Keir

    _______________________________________________
    freebsd-hackers@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
    To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"

