Summary: File Corruption causing many problems

From: Ron Bramblett (bramblet_at_fuller.com)
Date: 11/10/03

    Date: Mon, 10 Nov 2003 10:15:26 -0600
    To: Unix Managers <tru64-unix-managers@ornl.gov>
    
    

    I asked:
    I have an AS2000 running 4.0G PK3, 512 MB memory, two 300 MHz CPUs.

    In short, I had a system that would not boot, and the / filesystem
    was corrupt.

    Thanks for the fine answers from:
    Dr. Tom
    Ian Baker
    Allan Rollow

    Dr. Tom's answer

    It sounds like the initial tape drive problem (was it "rmt0"?) led
    to a sequence of mis-steps. I doubt you've been hacked. You need to
    just work through getting the system stable again. Any time you make
    almost ANY change in a production environment, you have the risk of
    having something down-stream "break" because of a dependency on how
    things were working before that wasn't fully understood.

    There is probably a relatively simple explanation for each of the
    symptoms you've hit. For instance, it's possible to have "osf_boot"
    be missing because it never got restored from a backup. Or you can
    hit other problems (like your bad /etc/fstab which probably happened
    as you were re-building your boot disk from the prior problems). It
    is just a messy process and you just have to keep finding things and
    fixing them until things stabilize again.

    Ian said to recreate the disklabel. I plan on doing that this weekend
    but there is more involved.

    Allan's comments deserve reading also. Very good.

            Regarding the SCSI adapter, where it was a wonder it
            worked at all... Actually, it looks like it wasn't
            working, or at least working only enough to cause
            problems with the devices it was presenting.

            Regarding reformatting the page/swap space... Page/swap
            space isn't formatted, other than the low-level format
            of the underlying disk that makes the disk usable.
            Page/swap space is just blocks, with no file system.
            Don't bother making one, since it will just get
            overwritten as the space is used.

            A missing osf_boot usually means either the file really
            is not there, or something has happened to the boot
            blocks of the disk. Someone in the last week or
            so was changing partition tables. My vague recollection
            (the list gets lots of questions) is that it might have
            been you. If so, the disk may not have a boot block.

            If the disk is failing or the SCSI adapter to which it
            is connected is going insane, then that could cause the
            content of the boot blocks to be overwritten or quietly
            fail to read.

            Unless a special device file has become corrupted, or
            its major and/or minor number has changed, recreating
            it will have no effect on whether the underlying device
            works. The special file merely encodes the major/minor
            device numbers and provides access control.
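
A minimal sketch of checking what a special file encodes, so it can be compared against a known-good system before anything is recreated. It assumes a Python interpreter is available on the machine doing the checking (not a given on a V4.0G system); the function name is hypothetical.

```python
import os
import stat


def describe_special_file(path):
    """Return (kind, major, minor) for a device special file, else None."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        # Regular files, directories, etc. carry no device numbers.
        return None
    # st_rdev holds the device numbers the special file points at.
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)
```

If the kind and numbers match those on a working system of the same version and hardware, recreating the special file is unlikely to change anything.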

            I would track down a CDROM distribution of V4.0G and
            boot it. To the extent possible, non-destructively
            exercise all the devices on the system to verify they
            seem to work. For disks with unused or page/swap
            partitions, a read/write test is safe, if you can
            manage not to touch other partitions. Check the
            partition tables before doing anything that writes
            to ensure they address the parts of the disk they're
            supposed to.
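
A non-destructive read pass over a device can be sketched as below. This is only an illustration under assumptions: the function name and block size are made up, Python may not be present on the standalone system, and the device path to pass in depends on the system. It opens the path read-only and never writes.

```python
def read_test(path, block_size=64 * 1024, max_blocks=None):
    """Read path sequentially without writing; return (bytes_read, error)."""
    bytes_read = 0
    blocks = 0
    try:
        # Unbuffered binary read-only open, so I/O errors surface per read.
        with open(path, "rb", buffering=0) as dev:
            while max_blocks is None or blocks < max_blocks:
                chunk = dev.read(block_size)
                if not chunk:
                    break  # end of device or file
                bytes_read += len(chunk)
                blocks += 1
    except OSError as exc:
        # Report how far the read got before the failure.
        return bytes_read, exc
    return bytes_read, None
```

An error part way through, or a byte count that does not match the expected partition size, points at the disk or the adapter rather than at the file system.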

            For devices with removable media (tapes), do write/read
            testing on those to ensure they're working correctly.
            Fix any hardware problems you encounter before going
            further.

            Mount the root file system with the standalone system
            and verify it looks intact. Compare the top of the
            root with that of the CDROM and a file listing of the
            backup. For minor damage see if you can copy the missing
            files from the CDROM. For anything else, restore from the
            last known good backup.
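
The comparison of the top of the root against the CDROM can be sketched as a simple set difference of directory listings. The function name and the example mount points are hypothetical, and this assumes Python is available wherever the comparison is run.

```python
import os


def missing_entries(reference_dir, target_dir):
    """Names at the top of reference_dir that are absent from target_dir."""
    reference = set(os.listdir(reference_dir))
    target = set(os.listdir(target_dir))
    return sorted(reference - target)
```

For example, `missing_entries("/cdrom", "/mnt")` would list top-level names present on the distribution but not on the mounted root, assuming those are the mount points in use. It compares names only, so a file that exists but is damaged will not show up.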

            If you have removable disks, you might also consider
            a clean installation on a spare disk. Use that to help
            check the rest of the system.

            Be methodical.

    So to summarize the whole thing:
            Basically the SCSI controller failed / came loose from the box,
    causing corruption on the / file system.
            Everything else that happened was on me (not building the
    disklabel correctly, moving osf_boot off of the main partition, etc.).

    -- 
    Ron Bramblett
    Sys Admin
    Fuller Brush Company
    
