Re: bug in calcru() in kernel: integer overflow computing user time

From: Doug Ambrisko (ambrisko_at_ambrisko.com)
Date: 01/26/05

  • Next message: ctodd_at_chrismiller.com: "RE: Problem with FastTrak S150 SX4-M"
    To: Chris Landauer <cal@rushg.aero.org>
    Date: Wed, 26 Jan 2005 11:11:09 -0800 (PST)
    
    

    Chris Landauer writes:
    |
    | hihi, all -
    |
    | well, i have an "almost" fix for the problem - read on, ...
    | (this is for discussin before send-pr submission)
    |
    | description
    |
    | in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R),
    | in file /usr/src/sys/kern/kern_resource.c,
    | lines 657-750 define a function calcru(),
    | and there is a bug in that function -
    |
    | line 713 in that file is
    |
    | uu = (tu * ut) / tt,
    |
    | which is trying to compute a proportion (since ut<=tt) - the problem
    | is that for my application, the values of tt and ut are very large and
    | the value of the intermediate product overflows, leading to incorrect
    | values (i reported the symptoms to freebsd-hackers and a very helpful
    | description and localization of the problem to calcru() was provided
    | by peter jeremy, who also wrote that it is a very old, but only partly
    | known, problem)
    |
    | repetition
    |
    | use time in csh or /usr/bin/time or getrusage() to time any program
    | that takes more than a week to run
    |
    | fix (almost)
    |
    | it turns out that this problem can be (partly) fixed by replacing that
    | one line above with the following lines:
    |
    | /* we know 0 <= ut <= tt and 1 <= tt */
    | if (tu >= tt)
    | {
    | **whatever type they need to be** q, r;
    | q = tu / tt;
    | r = tu % tt;
    | uu = (q * ut) + (r * ut) / tt;
    | }
    | else uu = (tu * ut) / tt
    |
    | this change does not solve all the arithmetic problems (especially
    | when ut is very large), and a similar change is needed for system
    | time, computing su in line 714 of the file, but it suffices for me -
    |
    | analysis (yup, proof that it should work 8-)) -
    |
    | i expect that all of these counters are increasing throughout the life
    | of the process -
    | tu is total time in microseconds
    | ut is user 128hz ticks
    | tt is total 128hz ticks
    | i assume therefore that
    | tu ~ tt * 10^6/128
    | strictly because of the clock rates
    | (machines with other kinds of clock rate ratios will need a different
    | analysis, but the same idea should work there, too) -
    |
    | in my case ut ~ tt, so we see overflow in the old computation when
    | tu * ut >= 2^64
    | tt^2 * 10^6/128 >= 2^64
    | tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9
    | tt >= 32*sqrt(2)*10^6,
    | which is about
    | sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds,
    | or
    | ~ 98 hours
    | (this is the phenomenon i observed)
    |
    | in the new computation offered above, since we know that
    | ut <= tt,
    | we also know that
    | uu <= tu,
    | and
    | (q * ut) <= uu,
    | so there can be no overflow in the (q * ut) term -
    | therefore, this changed code will overflow only when r is large
    | r ~ tt,
    | and then only when
    | r * ut >= 2^64
    | tt^2 >= 2^64
    | tt >= 2^32,
    | which is about
    | ~ 2^25 seconds
    | ~ 9300 hours
    | ~ 388 days
    | (i can live with that for now)
    |
    | for everyone else, it adds the one test in the if statement to every
    | call to calcru() (or two, if system time is similarly fixed) -
    | <philosophy> instrumentation is costly, and correct instrumentation is
    | even more costly, but it's worth it, every time, to get the right
    | answer </philosophy>
    |
    | i'm about to try it out this week - i will report the results when i
    | get some data (which will be a few weeks)
    |
    | i'm thinking about how to solve the problem correctly for really
    | long-running programs, but it's all pretty special case ugly so far

    This is my fix:

    Index: kern_resource.c
    ===================================================================
    RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/kern_resource.c,v
    retrieving revision 1.55.2.5
    diff -u -p -r1.55.2.5 kern_resource.c
    --- kern_resource.c 3 Nov 2001 01:41:08 -0000 1.55.2.5
    +++ kern_resource.c 26 Jan 2005 19:03:27 -0000
    @@ -552,10 +552,10 @@ calcru(p, up, sp, ip)
                     tu = ptu;
             }
     
    - /* Subdivide tu. */
    - uu = (tu * ut) / tt;
    - su = (tu * st) / tt;
    - iu = tu - uu - su;
    + /* Subdivide tu. try to becareful of overflow */
    + su = tu * (st * 1024 / tt) / 1024;
    + iu = tu * (it * 1024 / tt) / 1024;
    + uu = tu - (su + iu);
     
             /* Enforce monotonicity. */
             if (uu < p->p_uu || su < p->p_su || iu < p->p_iu) {

    If people like it I can commit it. It works for us. We had problems with
    this as well. It's pretty simple fix. I used 1k since usage of this
    tends to be % so rounding should effect that much.

    Doug A.
    _______________________________________________
    freebsd-hackers@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
    To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"


  • Next message: ctodd_at_chrismiller.com: "RE: Problem with FastTrak S150 SX4-M"