SUMMARY: AdvFS Read Errors

From: Chris Knorr (cknorr_at_trapsystems.com)
Date: 07/21/05

    Date: Thu, 21 Jul 2005 13:17:25 -0400
    To: tru64-unix-managers@ornl.gov
    
    

    Many thanks to Dr. Tom Blinn and Roberto Mackun for their responses.

    Original Question:
    What type of corrective action should be taken if you are seeing AdvFS READ
    errors?

    Dr. Tom:

    By the time you see the read error, the OS has already exhausted the retries
    and given up. You should issue the command given in the messages file:
            /sbin/advfs/tag2name /stripe1/.tags/53146
    which will show you the path to the affected file. Unless more than one
    file is involved, simply copy that ONE file to a new file (if this is a
    striped fileset, follow the procedures in the AdvFS documentation for
    creating and populating a striped file). You should probably use a tool
    like "dd" that can copy the file in appropriately sized chunks and recover
    from the read error you are likely to get partway through the file. The
    impact of the read error depends on the application that experienced it.
    Most applications exit abnormally on disk read errors because in most
    cases there is no way to recover, but depending on the file and the
    application reading it, it may have kept going; that's why you need to
    figure out which file is affected.
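
    As a rough illustration of the "dd" approach described above (not taken
    from the original replies): the sketch below wraps dd in a small shell
    function. The function name salvage_copy, the 512-byte default block size,
    and the paths in the usage comment are hypothetical. It assumes your dd
    accepts the POSIX conversions "noerror" and "sync"; check dd(1) on your
    system first. It does not cover the striped-file procedure.

```shell
#!/bin/sh
# salvage_copy SRC DST [BLOCKSIZE]   (hypothetical helper, illustration only)
#
# Copy SRC to DST in fixed-size chunks, carrying on past read errors.
#   conv=noerror  keep going after a read error instead of stopping
#   conv=sync     pad every short or failed input block with NULs up to
#                 BLOCKSIZE, so data after a bad block keeps its offset
# Caveat: "sync" also pads the final partial block, so DST can come out
# slightly longer than SRC when the size is not a multiple of BLOCKSIZE.
salvage_copy() {
    src=$1
    dst=$2
    bs=${3:-512}    # assumed default; a smaller block loses less per error
    dd if="$src" of="$dst" bs="$bs" conv=noerror,sync
}

# Example use (paths are placeholders, not from the original post):
#   /sbin/advfs/tag2name /stripe1/.tags/53146   # prints the affected path
#   salvage_copy /path/printed/above /other/disk/recovered.copy 512
```

    Any block that could not be read comes out as NULs in the copy, so the
    result still has to be checked against a backup or by the application
    that owns the file.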

    If you are getting hard read errors, it might be a good idea to just replace
    the disk that's failing; typically, if the data can be read from the disk
    when it's first starting to fail, the disk itself will relocate the data
    (transparently) and report a soft failure to the OS for logging purposes.
    When enough errors happen on the same disk,
    you can get to the point where the disk can no longer recover from "soft"
    errors and starts to report hard errors. Once this happens, you may need to
    replace the disk itself (which in a multi-disk AdvFS domain can be a
    challenge, but AdvFS has ways to cope once you get them into your knowledge
    base).

    Bobby Mackun:

    I would check the binary.errlog to see the type and number of errors logged
    for the RAID5 disk. 90% of AdvFS I/O failures indicate that AdvFS is unable
    to communicate with the underlying storage. If this is indeed a H/W problem
    then you'll need to check the HSG80 logs to find out what may be causing
    this. If it's a RAID5 set, you may have reason to suspect that 2 disks are
    bad, since RAID5 can ride out the failure of only one.

    Unless an AdvFS domain panic occurs as a result of the I/O error, then yes,
    the OS will retry a READ or WRITE operation.

