SUMMARY: "kernel stack not valid halt" / CDROM device name corruption problem

From: Iain Barker (ibarker_at_aastra.com)
Date: 05/27/04

  • Next message: Mary Hunt: "Problems with ArcServe Client on Tru64"
    Date: Thu, 27 May 2004 16:33:44 -0400
    To: tru64-unix-managers@ornl.gov
    
    

    Helpful replies came from Dr. Tom and Michael Polnick.

    It seems that the boot linker was corrupting SRM state held in RAM, and that state persists across 'init' commands.

    Dr. Tom suggested that this system might need NHD7, and indeed the problem went away when I upgraded to 5.1b with the latest patch kit (patch kit 3) plus NHD7.

    -----Original Message-----

    I have a problem on a DS10L while attempting to boot the Tru64 5.1b install CDROM.
    If I boot from the hard disk, the CDROM drive works fine when mounted under Tru64.

    When I power on the system, 'show dev' at the SRM console reports the CD drive correctly:

            dqb0.0.1.13.0 DQB0 CD-224E 9.5B

    If I try to boot the GENERIC 5.1b boot-linked kernel from the CDROM, I get the error "kernel stack not valid halt":

    >>>boot dqb0 -fl a -fi GENERIC
            (boot dqb0.0.1.13.0 -file GENERIC -flags a)
            block 0 of dqb0.0.1.13.0 is a valid boot block
            reading 15 blocks from dqb0.0.1.13.0
            bootstrap code read in
            base = 2c0000, image_start = 0, image_bytes = 1e00(7680)
            initializing HWRPB at 2000
            initializing page table at 1ffee000
            initializing machine state
            setting affinity to the primary CPU
            jumping to bootstrap code
                                                                                            
            UNIX boot - Wednesday October 16, 2002
                                                                                            
            Loading GENERIC ...
            Loading at fffffc0000310000
            Linking 205 objects: 205
            halted CPU 0
                                                                                            
            halt code = 2
            kernel stack not valid halt
            PC = 0

    If I then run 'show dev' again, the device name is strangely corrupted:

            dqb0.0.1.13.0 DQB0 CD/224G " " " " " " " " ;.7B" "

    If I do an 'init' after the problem has occurred, the subsequent self-test fails:

            Testing the Disks (read only)
                                                                                    
            *** Hard Error - Error #8 -
            Diagnostic Name ID Device Pass Test Hard/Soft 1-JAN-2000
            exer_kid 00000317 dqb0.0.1.13.0 0 0 1 0 12:00:01
            Buffer counts differ - buf1:0, buf2:512, location:2a00
                                                                                    
            *** End of Error ***
                                                                                    

    I thought this might be a problem with the CDROM drive, so I swapped it for a known-good drive from another system, but the problem remains. I also tried replacing the drive ribbon cable, but that didn't help either.

    The only way to get back a 'valid' device name is to power-cycle the system.

    Any ideas? I'm wondering whether this is a motherboard fault.

