Re: ATA_DMA errors - [ workaround for me ]

From: Johny Mattsson (lonewolf-freebsd_at_earthmagic.org)
Date: 06/25/05

  • Next message: Daniel O'Connor: "Re: 5.4 Installer + Promise FT100TX2 = Loader crash"
    Date: Sat, 25 Jun 2005 17:35:02 +1000
    To: freebsd-stable@freebsd.org
    
    

    Hi all,

    Today I've taken a fresh stab at the problem (I'm never at my best at
    5am after working through the night), and I have managed to come up
    with what appears to be a successful workaround. It would be good if
    someone else could confirm my observations.

    Basically, the problem seems to be related to using more than one
    channel on the IDE controller. Data points for this are:

    [ SiI 0680 ]
      Channel 1: 40 GB Seagate
      Channel 2: 60 GB Seagate + 160 GB Western Digital
    Result: 200k worth of "DMA_READ timed out" and "DMA_WRITE UDMA ICRC
    error" messages, inability to obtain SMART info from the WD drive, WD
    drive info garbled, and WD drive being removed/detached from the config.
    The errors only appeared after a few hours of operation, but once they
    were there, no number of reboots would get rid of them or improve the
    situation.

    To attempt to save the data on the WD disk before the FS got completely
    hammered, I pulled it out, and observed the following:

    [ SiI 0680 ]
      Channel 1: 40 GB Seagate
      Channel 2: 60 GB Seagate
    Result: DMA_READ timed out errors for both drives, and "DMA_WRITE UDMA
    ICRC error" messages for the 60 GB Seagate.

    Since I had an older ATA-100 controller available, I tried with it (it
    can't handle >120GB drives though, so I couldn't try as many
    combinations as I would have liked):

    [ CMD 649 ]
      Channel 1: 40 GB Seagate
      Channel 2: 60 GB Seagate
    Result: DMA_READ timed out errors, but only when both drives are in use
    at the same time. Running fsck on a slice on each drive in parallel
    reliably reproduced the DMA_READ errors. Whenever an error was reported
    for one drive, another error for the other drive always followed right
    after.
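
The parallel fsck reproduction described above could be scripted roughly as follows. This is only a minimal sketch: the helper name `run_in_parallel` and the slice names in the usage comment are hypothetical, and `fsck -n` is used so that nothing is written to the file systems.

```python
import subprocess


def run_in_parallel(commands):
    """Start every command at once, wait for all of them,
    and return their exit codes in the same order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Usage sketch (hypothetical slices, one per drive, on different channels):
#
#   codes = run_in_parallel([
#       ["fsck", "-n", "/dev/ad8s1e"],
#       ["fsck", "-n", "/dev/ad10s1e"],
#   ])
#
# While this runs, watch the console or /var/log/messages for DMA errors.
```

Starting both processes before waiting on either is the point: running the checks one after the other would not load both channels at the same time.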

    [ CMD 649 ]
      Channel 1:
      Channel 2: 40 GB Seagate + 60 GB Seagate
    Result: No error messages.

    [ CMD 649 ]
      Channel 1: 40 GB Seagate + 60 GB Seagate
      Channel 2:
    Result: No error messages.

    Encouraged by these findings, I swapped back to the SiI controller to
    test the 160 GB drive:

    [ SiI 0680 ]
      Channel 1:
      Channel 2: 160 GB WD
    Result: No error messages

    [ SiI 0680 ]
      Channel 1: 160 GB WD
      Channel 2:
    Result: No error messages

    Finally, I tried everything together:

    [ SiI 0680 ]
      Channel 1: 160 GB WD
      Channel 2:
    [ CMD 649 ]
      Channel 1: 40 GB Seagate + 60 GB Seagate
      Channel 2:
    Result: No error messages.

    What I haven't mentioned in the above is that I also tried some
    combinations with different cables, and also at reduced speed (UDMA66
    vs UDMA100). Neither change had any effect on the behaviour.

    With the WD drive alone on the SiI 0680, I was also able to retrieve
    SMART information from it, and it's showing no errors for the drive at
    all. The same goes for the 60 GB Seagate drive. All drives pass their
    self-tests without any errors.

    As mentioned in my previous email, my system drive is hanging off the
    built-in PIIX4 controller, as a single drive with only one channel on
    the controller in use. I never saw any errors for that drive throughout
    my testing.

    My conclusion is thus that something has crept in that affects
    stability when multiple channels are used on the same controller. I'm
    not versed enough in driver internals to know if it's IRQ, DMA, ISR or
    anything-else related though. Below are my latest dmesg and pciconf
    listings - hopefully this will help someone locate the culprit.
    (Soren?)
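
Since IRQ is one of the suspects, one quick thing to look at is which devices report the same interrupt line in dmesg. The following is a minimal sketch, not a diagnosis tool: the function names are made up for illustration, and it assumes each device's probe line sits on a single line (mail-wrapped dmesg output has to be rejoined first).

```python
import re
from collections import defaultdict

# Matches probe lines such as "rl0: <...> port ... irq 11 at device 13.0 on pci0".
IRQ_RE = re.compile(r"^\s*(\w+): .*\birq (\d+)\b")


def devices_by_irq(dmesg_text):
    """Return {irq: [device, ...]} for every probe line that names an IRQ."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = IRQ_RE.match(line)
        if match:
            groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


def shared_irqs(dmesg_text):
    """Keep only the IRQs that more than one device reports."""
    return {irq: devs
            for irq, devs in devices_by_irq(dmesg_text).items()
            if len(devs) > 1}
```

In the dmesg in this message, uhci0, atapci1 and rl0 all report irq 11, while atapci2 is alone on irq 9. The CMD 649 (atapci2) showed errors as well, so a shared line by itself does not explain the problem.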

    So, now I'm stuck with a system with three IDE controllers and one SCSI
    controller, and a motherboard that is utterly confused when I ask it to
    boot off an external controller... (i.e. I can only boot off the
    built-in controller now).

    Please let me know if there's some other info I can get for you; I'll
    have limited ability to move drives around since this is the file server
    and people get annoyed when it's unavailable, but do ask if you think it
    will help you! :)

    Cheers,
    /Johny

    ======= dmesg ========
    Copyright (c) 1992-2005 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
             The Regents of the University of California. All rights reserved.
    FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
         root@harlow.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
    Timecounter "i8254" frequency 1193182 Hz quality 0
    CPU: Pentium II/Pentium II Xeon/Celeron (467.73-MHz 686-class CPU)
       Origin = "GenuineIntel" Id = 0x665 Stepping = 5
       Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
    real memory = 805240832 (767 MB)
    avail memory = 778231808 (742 MB)
    npx0: <math processor> on motherboard
    npx0: INT 16 interface
    acpi0: <AWARD AWRDACPI> on motherboard
    acpi0: Power Button (fixed)
    Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
    acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
    cpu0: <ACPI CPU (3 Cx states)> on acpi0
    acpi_throttle0: <ACPI CPU Throttling> on cpu0
    acpi_button0: <Power Button> on acpi0
    pcib0: <ACPI Host-PCI bridge> port 0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0
    pci0: <ACPI PCI bus> on pcib0
    agp0: <Intel 82443BX (440 BX) host to PCI bridge> mem 0xe0000000-0xe3ffffff at device 0.0 on pci0
    pcib1: <PCI-PCI bridge> at device 1.0 on pci0
    pci1: <PCI bus> on pcib1
    isab0: <PCI-ISA bridge> at device 7.0 on pci0
    isa0: <ISA bus> on isab0
    atapci0: <Intel PIIX4 UDMA33 controller> port 0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
    ata0: channel #0 on atapci0
    ata1: channel #1 on atapci0
    uhci0: <Intel 82371AB/EB (PIIX4) USB controller> port 0x9000-0x901f irq 11 at device 7.2 on pci0
    usb0: <Intel 82371AB/EB (PIIX4) USB controller> on uhci0
    usb0: USB revision 1.0
    uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub0: 2 ports with 2 removable, self powered
    pci0: <bridge> at device 7.3 (no driver attached)
    atapci1: <SiI 0680 UDMA133 controller> port 0xa400-0xa40f,0xa000-0xa003,0x9c00-0x9c07,0x9800-0x9803,0x9400-0x9407 mem 0xea001000-0xea0010ff irq 11 at device 9.0 on pci0
    ata2: channel #0 on atapci1
    ata3: channel #1 on atapci1
    atapci2: <CMD 649 UDMA100 controller> port 0xb800-0xb80f,0xb400-0xb403,0xb000-0xb007,0xac00-0xac03,0xa800-0xa807 irq 9 at device 10.0 on pci0
    ata4: channel #0 on atapci2
    ata5: channel #1 on atapci2
    pci0: <display, VGA> at device 11.0 (no driver attached)
    ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0xbc00-0xbcff mem 0xea000000-0xea000fff irq 10 at device 12.0 on pci0
    aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
    rl0: <RealTek 8139 10/100BaseTX> port 0xc000-0xc0ff mem 0xea002000-0xea0020ff irq 11 at device 13.0 on pci0
    miibus0: <MII bus> on rl0
    rlphy0: <RealTek internal media interface> on miibus0
    rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
    rl0: Ethernet address: 00:40:f4:28:9d:20
    sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
    sio0: type 16550A
    sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
    sio1: type 16550A
    ppc0: <ECP parallel printer port> port 0x778-0x77b,0x378-0x37b irq 7 drq 3 on acpi0
    ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
    ppc0: FIFO with 16/16/16 bytes threshold
    ppbus0: <Parallel port bus> on ppc0
    plip0: <PLIP network interface> on ppbus0
    lpt0: <Printer> on ppbus0
    lpt0: Interrupt-driven port
    ppi0: <Parallel I/O> on ppbus0
    atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
    atkbd0: <AT Keyboard> irq 1 on atkbdc0
    kbd0 at atkbd0
    psm0: <PS/2 Mouse> irq 12 on atkbdc0
    psm0: model IntelliMouse Explorer, device ID 4
    orm0: <ISA Option ROMs> at iomem 0xd0000-0xd07ff,0xc0000-0xc7fff on isa0
    pmtimer0 on isa0
    fdc0: cannot allocate I/O port (6 ports)
    sc0: <System console> at flags 0x100 on isa0
    sc0: VGA <16 virtual consoles, flags=0x300>
    Timecounter "TSC" frequency 467728754 Hz quality 800
    Timecounters tick every 10.000 msec
    ad0: 8207MB <ST38641A/3.29> [16676/16/63] at ata0-master UDMA33
    ad4: 152627MB <WDC WD1600JB-00DUA3/75.13B75> [310101/16/63] at ata2-master UDMA100
    ad8: 57241MB <ST360021A/3.05> [116301/16/63] at ata4-master UDMA100
    ad9: 76319MB <ST380021A/3.19> [155061/16/63] at ata4-slave UDMA100
    Waiting 15 seconds for SCSI devices to settle
    sa0 at ahc0 bus 0 target 4 lun 0
    sa0: <HP HP35470A 1009> Removable Sequential Access SCSI-2 device
    sa0: 5.000MB/s transfers (5.000MHz, offset 8)
    sa1 at ahc0 bus 0 target 6 lun 0
    sa1: <SUN DLT4000 CC2E> Removable Sequential Access SCSI-2 device
    sa1: 10.000MB/s transfers (10.000MHz, offset 15)
    cd0 at ahc0 bus 0 target 5 lun 0
    cd0: <TEAC CD-ROM CD-532S 1.0A> Removable CD-ROM SCSI-2 device
    cd0: 20.000MB/s transfers (20.000MHz, offset 15)
    cd0: Attempt to query device size failed: NOT READY, Medium not present
    Mounting root from ufs:/dev/ad0s1a
    -----------------------
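
The channel and disk lines in a dmesg like the one above are enough to check whether any controller has disks on more than one channel, which is the layout that produced errors in the tests described earlier. A minimal sketch, assuming the `ataN: channel #M on atapciK` and `adN: ... at ataN-master` line shapes shown in this listing and unwrapped lines; the function names are hypothetical.

```python
import re
from collections import defaultdict

# "ata4: channel #0 on atapci2"
CHANNEL_RE = re.compile(r"^\s*(ata\d+): channel #\d+ on (atapci\d+)")
# "ad8: 57241MB <ST360021A/3.05> [116301/16/63] at ata4-master UDMA100"
DISK_RE = re.compile(r"^\s*(ad\d+): .* at (ata\d+)-(?:master|slave)\b")


def channels_in_use(dmesg_text):
    """Return {controller: {channel: [disk, ...]}} for channels with a disk."""
    controller_of = {}
    disks = []
    for line in dmesg_text.splitlines():
        match = CHANNEL_RE.match(line)
        if match:
            controller_of[match.group(1)] = match.group(2)
            continue
        match = DISK_RE.match(line)
        if match:
            disks.append((match.group(1), match.group(2)))
    usage = defaultdict(lambda: defaultdict(list))
    for disk, channel in disks:
        controller = controller_of.get(channel)
        if controller is not None:
            usage[controller][channel].append(disk)
    return {ctrl: dict(chans) for ctrl, chans in usage.items()}


def multi_channel_controllers(dmesg_text):
    """List controllers that have disks on more than one channel."""
    return sorted(ctrl for ctrl, chans in channels_in_use(dmesg_text).items()
                  if len(chans) > 1)
```

For the workaround layout in this dmesg (ad0 on ata0, ad4 on ata2, ad8 and ad9 on ata4) no controller has two populated channels, so `multi_channel_controllers` returns an empty list.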

    ======= pciconf -lv ==========
    # pciconf -lv
    agp0@pci0:0:0: class=0x060000 card=0x00000000 chip=0x71908086 rev=0x02 hdr=0x00
         vendor = 'Intel Corporation'
         device = '82443BX/ZX 440BX/ZX CPU to PCI Bridge (AGP Implemented)'
         class = bridge
         subclass = HOST-PCI
    pcib1@pci0:1:0: class=0x060400 card=0x00000000 chip=0x71918086 rev=0x02 hdr=0x01
         vendor = 'Intel Corporation'
         device = '82443BX/ZX 440BX/ZX AGPset PCI-to-PCI bridge'
         class = bridge
         subclass = PCI-PCI
    isab0@pci0:7:0: class=0x060100 card=0x00000000 chip=0x71108086 rev=0x02 hdr=0x00
         vendor = 'Intel Corporation'
         device = '82371AB/EB/MB PIIX4/4E/4M ISA Bridge'
         class = bridge
         subclass = PCI-ISA
    atapci0@pci0:7:1: class=0x010180 card=0x00000000 chip=0x71118086 rev=0x01 hdr=0x00
         vendor = 'Intel Corporation'
         device = '82371AB/EB/MB PIIX4/4E/4M IDE Controller'
         class = mass storage
         subclass = ATA
    uhci0@pci0:7:2: class=0x0c0300 card=0x00000000 chip=0x71128086 rev=0x01 hdr=0x00
         vendor = 'Intel Corporation'
         device = '82371AB/EB/MB PIIX4/4E/4M USB Interface'
         class = serial bus
         subclass = USB
    none0@pci0:7:3: class=0x068000 card=0x00000000 chip=0x71138086 rev=0x02 hdr=0x00
         vendor = 'Intel Corporation'
         device = '82371AB/EB/MB PIIX4/4E/4M Power Management Controller'
         class = bridge
    atapci1@pci0:9:0: class=0x010400 card=0x36801095 chip=0x06801095 rev=0x02 hdr=0x00
         vendor = 'Silicon Image Inc (Was: CMD Technology Inc)'
         device = 'SiI 0680 (Was: PCI-0680) Ultra ATA133 EIDE Controller'
         class = mass storage
         subclass = RAID
    atapci2@pci0:10:0: class=0x010400 card=0xf5ffffff chip=0x06491095 rev=0x02 hdr=0x00
         vendor = 'Silicon Image Inc (Was: CMD Technology Inc)'
         device = 'PCI-649 Ultra ATA/100 PCI to IDE/ATA Controller'
         class = mass storage
         subclass = RAID
    none1@pci0:11:0: class=0x030000 card=0x00000000 chip=0x0519102b rev=0x01 hdr=0x00
         vendor = 'Matrox Electronic Systems Ltd.'
         device = 'MGA-2064W Storm (Millennium board)'
         class = display
         subclass = VGA
    ahc0@pci0:12:0: class=0x010000 card=0x00000000 chip=0x81789004 rev=0x00 hdr=0x00
         vendor = 'Adaptec Inc'
         device = 'AHA-2940U/UW/2940D Ultra/Ultra Wide/Dual SCSI Host Adapter'
         class = mass storage
         subclass = SCSI
    rl0@pci0:13:0: class=0x020000 card=0x813910ec chip=0x813910ec rev=0x10 hdr=0x00
         vendor = 'Realtek Semiconductor'
         device = 'RT8139 (A/B/C/810x/813x/C+) Fast Ethernet Adapter'
         class = network
         subclass = ethernet
    --------------------------

    -- 
    Johny Mattsson - Making IT work  ,-.   ,-.   ,-.  When all else fails,
    http://www.earthmagic.org     _.'  `-'   `-'   Murphy's Law still works!
    _______________________________________________
    freebsd-stable@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-stable
    To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
    
