Re: TIMEOUT - WRITE_DMA and smart questions

From: Eduard Martinescu (martines_at_rochester.rr.com)
Date: 10/11/04

    To: Ion-Mihai Tetcu <itetcu@apropo.ro>
    Date: Mon, 11 Oct 2004 08:19:07 -0400
    
    

    Ion-Mihai,

    For more information on smartmontools (smartctl, smartd), check out the
    SourceForge site, http://smartmontools.sourceforge.net

    If you have specific questions, you can email the support list (link on
    the page above).

    Ed

    On Mon, 2004-10-11 at 07:09, Ion-Mihai Tetcu wrote:
    > [ please reply only on questions@ if this is not appropriate for current@ ]
    >
    > Hi,
    >
    > While doing nothing special, the system started printing TIMEOUT -
    > WRITE_DMA errors and eventually, after an atacontrol mode 0 PIO4 PIO4,
    > hung completely at 04:20.
    >
    > After a restart I got a few more TIMEOUTs but no hang; however, the
    > machine is idle.
    >
    > SMART was enabled, as seen below, but smartd wasn't running (stupid, huh
    > :-/ ).
    >
    > Obvious question: is the hdd dying ?
    >
    > Second question, as I'm not familiar with SMART: how much can one trust
    > SMART reports ?
    >
    > Third question: could you suggest some settings for smartd? I'm asking
    > this because I don't fully understand the man pages for smartctl and
    > smartd; a link explaining more about SMART would also be appreciated.
    >
    >
    > System details:
    >
    > Local system status (last daily mail):
    > 3:01AM up 2 days, 11:56, 2 users, load averages: 1.04, 1.07, 0.95
    >
    > % uname -a
    > FreeBSD it.buh.cameradicommercio.ro 5.3-BETA7 FreeBSD 5.3-BETA7 #3: Mon Oct 4 21:57:25 EEST 2004 root@it.buh.tecnik93.com:/usr/obj/usr/src/sys/IT53_d i386
    >
    > Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
    > Oct 11 04:07:02 it kernel: ata0: reiniting channel ..
    > Oct 11 04:07:02 it kernel: ata0: reset tp1 mask=03 ostat0=d0 ostat1=d0
    > Oct 11 04:07:02 it kernel: ad0: stat=0xd0 err=0xd0 lsb=0xd0 msb=0xd0
    > Oct 11 04:07:02 it last message repeated 95 times
    > Oct 11 04:07:02 it kernel: ad0: stat=0x50 err=0x01 lsb=0x00 msb=0x00
    > Oct 11 04:07:02 it kernel: ata0-slave: stat=0x00 err=0x01 lsb=0x00 msb=0x00
    > Oct 11 04:07:02 it kernel: ata0: reset tp2 stat0=50 stat1=00 devices=0x1<ATA_MASTER>
    > Oct 11 04:07:02 it kernel: ata0: resetting done ..
    > Oct 11 04:07:02 it kernel: ad0: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
    > Oct 11 04:07:02 it kernel: ad0: setting PIO4 on VIA 8235 chip
    > Oct 11 04:07:02 it kernel: ad0: setting UDMA100 on VIA 8235 chip
    > Oct 11 04:07:02 it kernel: ata0: device config done ..
    > Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): error 22
    > Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): Unretryable Error
    > Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): error 22
    > Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): Unretryable Error
    > .........
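The stat=/err= bytes in the reset messages above are raw ATA register values. A minimal Python sketch for decoding the status byte, assuming the classic ATA status-register bit layout (BSY 0x80, DRDY 0x40, DF 0x20, DSC 0x10, DRQ 0x08, CORR 0x04, IDX 0x02, ERR 0x01). The table and function names are hypothetical, chosen to echo the kernel's own "status=51<READY,DSC,ERROR>" notation on the acd0 line further down; this is an illustration, not the driver's code.

```python
# Sketch: decode an ATA status-register byte as printed by the ata(4) driver
# (e.g. "stat=0xd0"). Bit layout assumed from the classic ATA status register;
# the short names are hypothetical labels chosen to echo the kernel's
# "status=51<READY,DSC,ERROR>" notation.
ATA_STATUS_BITS = [
    (0x80, "BUSY"),   # BSY: device busy; other bits are not valid while set
    (0x40, "READY"),  # DRDY: device ready
    (0x20, "DF"),     # device fault
    (0x10, "DSC"),    # device seek complete
    (0x08, "DRQ"),    # data request
    (0x04, "CORR"),   # corrected data
    (0x02, "IDX"),    # index
    (0x01, "ERROR"),  # error; details are in the error register
]


def decode_status(value: int) -> list[str]:
    """Return the names of the bits set in a status byte, highest bit first."""
    return [name for mask, name in ATA_STATUS_BITS if value & mask]


if __name__ == "__main__":
    for raw in ("0xd0", "0x50", "0x51"):
        print(raw, decode_status(int(raw, 16)))
```

With this table 0xd0 decodes to BUSY, READY, DSC and 0x50 to READY, DSC, which would fit the drive staying busy through the repeated polls and then coming back ready; 0x51 reproduces the kernel's own decode on the acd0 line.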
    >
    > # grep LBA /var/log/messages
    > Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
    > Oct 11 04:07:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165839908
    > Oct 11 04:08:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165849220
    > Oct 11 04:09:12 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165851556
    > Oct 11 04:09:32 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165859748
    > Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
    > Oct 11 04:11:23 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210916
    > Oct 11 04:11:36 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186211044
    > Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=0
    > Oct 11 04:13:21 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309294340
    > Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156
    > Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156
    > Oct 11 04:15:04 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421796
    > Oct 11 04:15:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=130261540
    > Oct 11 04:16:10 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421892
    > Oct 11 04:16:53 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=173918724
    > Oct 11 04:18:50 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309924420
    > Oct 11 04:19:14 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4920283
    > Oct 11 04:40:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4918975
    > Oct 11 04:40:56 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6067199
    > Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
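To see whether the timeouts cluster on a few sectors or are spread over the disk, the grep output above can be tallied by LBA. A minimal Python sketch, assuming the message format shown; the names tally_timeouts and SAMPLE are hypothetical, and SAMPLE is just a handful of the lines above (plus the acd0 line, to show it is skipped).

```python
import re
from collections import Counter

# Matches the ata(4) timeout messages shown above, e.g.
# "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020"
TIMEOUT_RE = re.compile(r"TIMEOUT - WRITE_DMA retrying .*LBA=(\d+)")

# A few lines copied from the grep output above.
SAMPLE = [
    "Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103",
    "Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=0",
    "Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156",
    "Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156",
    "Oct 11 04:15:04 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421796",
    "Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103",
]


def tally_timeouts(lines) -> Counter:
    """Count WRITE_DMA timeouts per LBA; non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


if __name__ == "__main__":
    counts = tally_timeouts(SAMPLE)
    repeated = {lba: n for lba, n in counts.items() if n > 1}
    print("distinct LBAs:", len(counts))
    print("repeated:", repeated)
```

Applied to all twenty TIMEOUT lines above, only two LBAs repeat: 175421156 (04:14:00 and 04:14:24, the second with one retry left) and 6343103 (04:10:44, and again at 10:46:52 after the restart). The others are all distinct and range from about LBA 4.9 million to 309.9 million. On the live system the function can be given an open file instead of the list, e.g. tally_timeouts(open("/var/log/messages")).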
    >
    > # grep sw /var/log/messages
    > Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s1e, blkno: 14841, size: 4096
    > Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 14381, size: 4096
    > Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 60732, size: 4096
    > Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33481, size: 4096
    > Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33488, size: 4096
    >
    >
    >
    > The disk is:
    > # atacontrol cap 0 0
    > ATA channel 0, Master, device ad0:
    >
    > Protocol ATA/ATAPI revision 6
    > device model WDC WD1600JB-00EVA0
    > serial number WD-WCAEK1298992
    > firmware revision 15.05R15
    > cylinders 16383
    > heads 16
    > sectors/track 63
    > lba supported 268435455 sectors
    > lba48 supported 312579695 sectors
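The two "lba" lines above are related by simple arithmetic. A minimal Python sketch, assuming 512-byte sectors: 268435455 is 2^28 - 1, the most a 28-bit sector count can report, while the lba48 figure is the drive's real size. The helper names are hypothetical. It also shows that some timeouts in the grep output (LBA=309294340, LBA=309924420) lie beyond the 28-bit range.

```python
# Sketch: capacity arithmetic for the "lba supported" / "lba48 supported"
# lines, assuming 512-byte sectors. Helper names are hypothetical.
SECTOR_BYTES = 512
LBA28_SECTORS = 268435455   # "lba supported"; equals 2**28 - 1
LBA48_SECTORS = 312579695   # "lba48 supported"


def capacity_gb(sectors: int) -> float:
    """Capacity in decimal gigabytes (10**9 bytes), as drive vendors count."""
    return sectors * SECTOR_BYTES / 10**9


def needs_lba48(lba: int) -> bool:
    """True if the LBA lies beyond the reported 28-bit capacity.

    LBAs start at 0, so the last one inside that capacity is
    LBA28_SECTORS - 1.
    """
    return lba >= LBA28_SECTORS


if __name__ == "__main__":
    print(f"whole disk:   {capacity_gb(LBA48_SECTORS):.1f} GB")
    print(f"28-bit limit: {capacity_gb(LBA28_SECTORS):.1f} GB")
    for lba in (186210020, 309294340, 309924420):
        print(lba, "beyond 28-bit range" if needs_lba48(lba) else "within 28-bit range")
```

That gives about 160.0 GB for the whole disk (matching the WD1600 model number) and about 137.4 GB for the 28-bit limit.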
    > dma supported
    > overlap not supported
    >
    > Feature                        Support  Enable   Value           Vendor
    > write cache                    yes      no
    > read ahead                     yes      yes
    > dma queued                     no       no       0/0x00
    > SMART                          yes      yes
    > microcode download             yes      yes
    > security                       yes      no
    > power management               yes      yes
    > advanced power management      no       no       0/0x00
    > automatic acoustic management  yes      yes      254/0xFE        128/0x80
    >
    > # smartctl -a /dev/ad0
    > smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
    > Home page is http://smartmontools.sourceforge.net/
    >
    > === START OF INFORMATION SECTION ===
    > Device Model: WDC WD1600JB-00EVA0
    > Serial Number: WD-WCAEK1298992
    > Firmware Version: 15.05R15
    > Device is: In smartctl database [for details use: -P show]
    > ATA Version is: 6
    > ATA Standard is: Exact ATA specification draft version not indicated
    > Local Time is: Mon Oct 11 12:37:32 2004 EEST
    > SMART support is: Available - device has SMART capability.
    > SMART support is: Enabled
    >
    > The SMART RETURN STATUS return value (smartmontools -H option/Directive)
    > can not be retrieved with this version of ATAng, please do not rely on this value
    > === START OF READ SMART DATA SECTION ===
    > SMART overall-health self-assessment test result: PASSED
    >
    > General SMART Values:
    > Offline data collection status:    (0x05) Offline data collection activity
    >                                           was aborted by an interrupting command from host.
    >                                           Auto Offline Data Collection: Disabled.
    > Self-test execution status:         ( 40) The self-test routine was interrupted
    >                                           by the host with a hard or soft reset.
    > Total time to complete Offline
    > data collection:                   (5061) seconds.
    > Offline data collection
    > capabilities:                      (0x79) SMART execute Offline immediate.
    >                                           No Auto Offline data collection support.
    >                                           Suspend Offline collection upon new
    >                                           command.
    >                                           Offline surface scan supported.
    >                                           Self-test supported.
    >                                           Conveyance Self-test supported.
    >                                           Selective Self-test supported.
    > SMART capabilities:              (0x0003) Saves SMART data before entering
    >                                           power-saving mode.
    >                                           Supports SMART auto save timer.
    > Error logging capability:          (0x01) Error logging supported.
    >                                           No General Purpose Logging support.
    > Short self-test routine
    > recommended polling time:            ( 2) minutes.
    > Extended self-test routine
    > recommended polling time:           ( 67) minutes.
    > Conveyance self-test routine
    > recommended polling time:            ( 5) minutes.
    >
    > SMART Attributes Data Structure revision number: 16
    > Vendor Specific SMART Attributes with Thresholds:
    > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    >   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
    >   3 Spin_Up_Time            0x0007   155   147   021    Pre-fail  Always       -       2775
    >   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       464
    >   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
    >   7 Seek_Error_Rate         0x000b   200   199   051    Pre-fail  Always       -       0
    >   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3360
    >  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
    >  11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
    >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       462
    > 194 Temperature_Celsius     0x0022   124   253   000    Old_age   Always       -       26
    > 196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
    > 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    > 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
    > 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
    > 200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
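On how far SMART can be trusted: the attribute table says more than the one-line PASSED verdict, especially since smartctl itself warns above that the overall health status cannot be retrieved with this version of ATAng. A minimal Python sketch that parses attribute rows and flags two things. Both rules are heuristics of this sketch, not smartmontools' own code: a normalized VALUE at or below a nonzero THRESH is treated as failing, and a nonzero raw count on a few sector or CRC attributes is treated as worth watching. The names WATCH_RAW, parse_attr and review are hypothetical.

```python
# Sketch: flag SMART attribute rows from "smartctl -a" output. The two rules
# below are assumptions of this sketch, not smartmontools' implementation:
#   1. VALUE at or below a nonzero THRESH counts as failing;
#   2. a nonzero raw count on these attributes is worth watching.
WATCH_RAW = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# The fifteen attribute rows from the table above.
SAMPLE = """\
1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 155 147 021 Pre-fail Always - 2775
4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 464
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 8
7 Seek_Error_Rate 0x000b 200 199 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3360
10 Spin_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 462
194 Temperature_Celsius 0x0022 124 253 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0009 200 155 051 Pre-fail Offline - 0
"""


def parse_attr(line: str) -> dict:
    """Split one attribute row; assumes RAW_VALUE is a plain integer."""
    f = line.split()
    # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return {"name": f[1], "value": int(f[3]), "thresh": int(f[5]), "raw": int(f[9])}


def review(lines) -> list[str]:
    """Return one note per attribute that trips either rule."""
    notes = []
    for line in lines:
        if not line.strip():
            continue
        a = parse_attr(line)
        if a["thresh"] and a["value"] <= a["thresh"]:
            notes.append(f"{a['name']}: FAILING (VALUE {a['value']} <= THRESH {a['thresh']})")
        elif a["name"] in WATCH_RAW and a["raw"] > 0:
            notes.append(f"{a['name']}: raw count {a['raw']}")
    return notes


if __name__ == "__main__":
    for note in review(SAMPLE.splitlines()):
        print(note)
```

On the rows above it reports no failing attribute (every VALUE stays well above its THRESH, consistent with PASSED), but flags Reallocated_Sector_Ct = 8, Reallocated_Event_Count = 6 and UDMA_CRC_Error_Count = 2.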
    >
    > SMART Error Log Version: 1
    > No Errors Logged
    >
    > SMART Self-test log structure revision number 1
    > Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
    > # 1  Extended captive    Interrupted (host reset)        80%        77         -
    > # 2  Extended offline    Aborted by host                 90%        77         -
    > # 3  Conveyance offline  Completed without error         00%        76         -
    > # 4  Short offline       Completed without error         00%        76         -
    > # 5  Conveyance offline  Completed without error         00%       233         -
    > # 6  Short captive       Interrupted (host reset)        90%       233         -
    >
    > SMART Selective self-test log data structure revision number 1
    >  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    >     1        0        0  Not_testing
    >     2        0        0  Not_testing
    >     3        0        0  Not_testing
    >     4        0        0  Not_testing
    >     5        0        0  Not_testing
    >
    > Selective self-test flags (0x0):
    > After scanning selected spans, do NOT read-scan remainder of disk.
    > If Selective self-test is pending on power-up, resume after 0 minute delay.
    >
    >
    > Thanks,

    -- 
    Eduard Martinescu <martines@rochester.rr.com>
    _______________________________________________
    freebsd-questions@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-questions
    To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"
    

