Spontaneous reboots

From: Erik Norgaard (norgaard_at_math.ku.dk)
Date: 05/06/05

    Date: Fri, 6 May 2005 11:08:32 +0200 (CEST)
    To: questions@freebsd.org
    
    

    Hi,

    I am having tremendous problems keeping my FBSD 5 box up and
    happy: I keep experiencing spontaneous reboots and crashes.

    This is a looong story; I have been trying to figure out what's
    causing the problem for two weeks now. I really appreciate
    your patience and response if you make it all the way to the end :-)

    The setup:

      FBSD---DSL---Internet

    The DSL is a Thomson 510 ADSL router doing 1-1 NAT, no firewall.
    The FBSD is configured with IPFilter firewall and running named,
    postfix, cyrus-imap22 with virtual domains and apache with
    virtual hosts, also to serve the local net (behind the DSL) it
    runs dhcpd, ntpd and mysql.

    Postfix, Cyrus-Imap and Apache are all configured with TLS
    support and I have generated certificates using OpenSSL. This
    system was installed in November and upgraded at the beginning of
    January. I have had no problems for months.

    Then - from the beginning:

    On April 15, running FreeBSD 5.3-p5, I had two more or less
    simultaneous events:

    1) A huge number of incoming mail delivery attempts to addresses
       of the type randomchars@mydomain.com
    2) Kernel panic, fatal trap 12

    I had done no prior system tuning or changes.

    Since then, uptime has been anywhere between 0 and >3 days - the
    latter obtained by stopping all services and disconnecting the
    machine from the network.

    1) By huge, I mean enough to suck up a 512kbps DSL connection,
    but this should be far from enough to make FBSD cough or even
    panic. Also, system load is always close to 0.00.

    I have postfix handling mail and use cyrus-imap with virtual
    domains as backend. Since postfix didn't know the hosted
    addresses, cyrus had to reject the mail. I created a list of
    existing addresses so postfix could reject the mail faster.

    The illicit mail delivery attempts persist.

    2) I followed the handbook and the kernel panic FAQ to
    investigate the panic:

    Fatal trap 12: Page fault while in kernel mode
    Fault virtual address = 0xc
    Fault code = supervisor read, page not present
    instruction pointer = 0x8:0xc053d638
    stack pointer = 0x10:0xcb4ddaec
    frame pointer = 0x10:0xcb4ddaf8
    code segment = base 0x0, limit 0xffff, type 0x1b
                             = DPL 0, pres 1, def32 1, gran 1
    processor eflags = interrupt enabled, resume, IOPL=0
    current process = 28 (swi1:net)
    trap number = 12
    panic: page fault

    # nm -n /boot/kernel/kernel | grep c053d6
    c053d610 T m_copydata
    c053d670 T m_dup
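
The lookup above can be reproduced without grep. The following is a minimal sketch, assuming the `nm -n` output is available as text; the helper names `parse_symbols` and `resolve` are made up for illustration. It finds the last symbol that starts at or below the faulting instruction pointer and reports the offset into it.

```python
import bisect

# The two lines copied from the `nm -n` output quoted above.
NM_OUTPUT = """\
c053d610 T m_copydata
c053d670 T m_dup
"""


def parse_symbols(nm_text):
    """Return a sorted list of (address, name) pairs from `nm -n` text."""
    symbols = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # address, type letter, symbol name
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) of the symbol containing address, or None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return name, address - start


symbols = parse_symbols(NM_OUTPUT)
# Instruction pointer from the panic message: 0x8:0xc053d638
name, offset = resolve(symbols, 0xc053d638)
print(f"{name}+{offset:#x}")  # prints "m_copydata+0x28"
```

With the full symbol table as input, the same function would work for any address in the kernel text; here it only confirms that the fault happened inside m_copydata.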

    I no longer get this panic, however my system does not deserve the
    predicate -STABLE. In a way I preferred the panic; at least it
    gave some info for debugging. But now it reboots without a blip.

    Disk errors:

    The crashes _always_ cause disk errors that cannot be recovered
    by the background fsck, particularly on /var where mail resides.
    This may result in new reboots.

    To solve this I have tried mounting drives read-only, unless
    write permission was necessary. It turns out that postfix requires
    write access to /, /usr and /var - the first two appear to be
    related to TLS(?).

    Also, I have set fsck_y_enable="yes" in rc.conf, so the disk is
    thoroughly checked on boot after a crash.

    I had dumpon set in my rc.conf but this just filled the
    partition, making things even worse. I have removed all kernel
    dumps and also unnecessary data, as I understood disk performance
    may drop when free disk space is below 15%.

    The kernel:

    The first kernel was a 5.3-p5 custom kernel. To make it easier to
    debug I updated to -p8, GENERIC. No change. Following suggestions
    by Kris K. I upgraded to 5.4-RC2.

    This solved the panic - but the system still crashes, also after
    updating to RC3 and RC4.

    The system:

    When upgrading to 5.4-RC2, I also built world. I then realized
    that some ports may have been built against the old base, causing
    new problems.

    I have now deinstalled all ports. The system has been completely
    updated, kernel and base, to 5.4-RC4. I have reinstalled the
    minimal set of ports needed to serve my needs, at the versions
    current in the ports tree as of May 3.

    I still experience crashes.

    Postfix:

    I tried to limit the number of simultaneous deliveries handled.
    No change.

    When a connection is made postfix sends a lot of DNS queries to
    verify that the sender hostname resolves to the IP, that the
    sender domain exists, and that it is not in a block list.

    IPFilter:

    I have restricted access to port 25, now only a handful of
    servers are permitted by the firewall. This has helped, uptime is
    now hours rather than minutes, but I still have crashes.

    I have reduced all timeouts to prevent the state table from
    saturating, but no change.

    If I open up for incoming mail, for a (any) /8 segment, the number
    of connections explodes. Due to the limitation of simultaneous
    postfix threads, many time out. No change.

    I am working on a black list based on the maillog, but this is
    another project.
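
One way such a black list could be built is sketched below. This is only an illustration, not the approach actually used here: the sample log lines are invented (with documentation-range IP addresses), the helper name `rejected_clients` and the threshold are hypothetical, and the regular expression assumes the usual Postfix smtpd "reject: RCPT from host[ip]" log wording. It counts rejected delivery attempts per client address and lists the addresses seen at least `threshold` times.

```python
import re
from collections import Counter

# Invented sample lines; a real run would read the maillog file instead.
SAMPLE_MAILLOG = """\
May  5 01:02:03 top postfix/smtpd[411]: NOQUEUE: reject: RCPT from unknown[192.0.2.10]: 550 <qxzvb@mydomain.com>: Recipient address rejected: User unknown; from=<a@example.net> to=<qxzvb@mydomain.com> proto=SMTP helo=<example.net>
May  5 01:02:09 top postfix/smtpd[412]: NOQUEUE: reject: RCPT from unknown[192.0.2.10]: 550 <kjhgf@mydomain.com>: Recipient address rejected: User unknown; from=<b@example.net> to=<kjhgf@mydomain.com> proto=SMTP helo=<example.net>
May  5 01:03:15 top postfix/smtpd[413]: NOQUEUE: reject: RCPT from mail.example.org[198.51.100.7]: 550 <zzzzz@mydomain.com>: Recipient address rejected: User unknown; from=<c@example.org> to=<zzzzz@mydomain.com> proto=ESMTP helo=<mail.example.org>
May  5 01:04:00 top postfix/smtpd[414]: connect from mail.example.org[198.51.100.7]
"""

# Assumed log wording: "reject: RCPT from <hostname>[<IPv4 address>]".
REJECT_RE = re.compile(r"reject: RCPT from \S+?\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def rejected_clients(log_text, threshold=2):
    """Return sorted client IPs with at least `threshold` rejected RCPTs."""
    counts = Counter(match.group(1) for match in REJECT_RE.finditer(log_text))
    return sorted(ip for ip, seen in counts.items() if seen >= threshold)


for ip in rejected_clients(SAMPLE_MAILLOG):
    print(ip)  # prints "192.0.2.10"
```

The resulting addresses would still have to be turned into firewall rules by hand or by a further script, and the threshold needs care so that legitimate servers with an occasional bounce are not blocked.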

    DNS:

    Since mail to mydomain.com is currently useless I have decided to
    set the MX record to 127.0.0.1. This has stopped the illicit mail,
    but also all legitimate mail to that domain - mostly this
    gives me peace and bandwidth.

    Hardware: (dmesg below)

    I have tried to change the disk cable, I have a 2.5" disk with a
    converter cable to standard IDE.

    Also, I have tried the disk in my laptop and it appears stable,
    but the testing period was limited.

    I have tried both IDE connectors on the MB and both NICs. No
    change.

    Summary:

    Despite all my attempts to solve the problem, my system is far
    from STABLE. I still experience spontaneous crashes, although
    less often.

    It is my personal belief that there may be a hardware problem,
    or persistent disk errors.

    The reason is that although the traffic load saturates the
    connection, it should not be enough to crash even limited
    hardware. I have no more ideas on how to debug this.

    Questions:

    * Is there a disk tool for analysing the disk, marking sectors bad
      etc?
    * How do I find the file if I know the inode number (as reported
      by fsck)?
    * Can malformed packets cause FBSD to crash? Could the Thomson 510
      be accountable for such packets?
    * Did I miss the obvious?
    * Any ideas where to go now?

    All help is highly appreciated.

    Thanks, Erik

    Disk space: df
    Filesystem   1K-blocks      Used     Avail  Capacity  Mounted on
    /dev/ad0s1a     507630     76966    390054       16%  /
    devfs                1         1         0      100%  /dev
    /dev/ad0s1g   30859916  14228272  14162852       50%  /home
    /dev/ad0s1f     507630        42    466978        0%  /tmp
    /dev/ad0s1d   12186190   2134420   9076876       19%  /usr
    /dev/ad0s1e   12186190   7689462   3521834       69%  /var
    devfs                1         1         0      100%  /var/named/dev
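
As a quick check of the 15% free-space concern mentioned earlier, the df listing can be scanned mechanically. This is a rough sketch that takes the Capacity column at face value and skips the devfs pseudo-filesystems; the function names and the way the threshold is applied are assumptions made for illustration.

```python
# The df output from the listing above, one filesystem per line.
DF_OUTPUT = """\
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 507630 76966 390054 16% /
devfs 1 1 0 100% /dev
/dev/ad0s1g 30859916 14228272 14162852 50% /home
/dev/ad0s1f 507630 42 466978 0% /tmp
/dev/ad0s1d 12186190 2134420 9076876 19% /usr
/dev/ad0s1e 12186190 7689462 3521834 69% /var
devfs 1 1 0 100% /var/named/dev
"""


def parse_df(text):
    """Return (mount point, capacity percent) for each real filesystem."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        filesystem, _blocks, _used, _avail, capacity, mount = line.split(None, 5)
        if filesystem == "devfs":
            continue  # pseudo-filesystem, always reported as 100% full
        rows.append((mount, int(capacity.rstrip("%"))))
    return rows


def below_free_threshold(rows, min_free_percent=15):
    """Return mount points whose free space is below min_free_percent."""
    return [mount for mount, capacity in rows if 100 - capacity < min_free_percent]


rows = parse_df(DF_OUTPUT)
print(below_free_threshold(rows))       # prints "[]"
print(max(rows, key=lambda r: r[1]))    # prints "('/var', 69)"
```

By this measure no filesystem is currently below the threshold; /var is the fullest at 69%.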

    last (24h):
    norgaard ttyp0 x.x.x.x Fri 6 May 10:09 still logged in
    norgaard ttyp0 x.x.x.x Fri 6 May 09:22 - 09:25 (00:03)
    norgaard ttyp0 charm Fri 6 May 08:28 - 08:42 (00:13)
    norgaard ttyp0 charm Fri 6 May 07:48 - 08:00 (00:11)
    reboot ~ Fri 6 May 04:16
    norgaard ttyp1 charm Thu 5 May 22:45 - 23:18 (00:32)
    norgaard ttyp0 charm Thu 5 May 22:09 - crash (06:07)
    reboot ~ Thu 5 May 22:05
    norgaard ttyp0 charm Thu 5 May 21:45 - crash (00:20)
    reboot ~ Thu 5 May 21:20
    norgaard ttyp1 charm Thu 5 May 21:11 - crash (00:09)
    norgaard ttyp0 charm Thu 5 May 20:45 - crash (00:35)
    reboot ~ Thu 5 May 18:57
    norgaard ttyp0 x.x.x.x Thu 5 May 18:23 - 18:23 (00:00)
    reboot ~ Thu 5 May 18:22
    norgaard ttyp0 x.x.x.x Thu 5 May 16:44 - crash (01:37)
    norgaard ttyp0 x.x.x.x Thu 5 May 15:44 - 16:13 (00:28)
    norgaard ttyp0 x.x.x.x Thu 5 May 13:57 - 13:58 (00:00)
    norgaard ttyp0 x.x.x.x Thu 5 May 13:38 - 13:51 (00:12)
    norgaard ttyp0 x.x.x.x Thu 5 May 13:06 - 13:27 (00:21)
    norgaard ttyp0 x.x.x.x Thu 5 May 10:53 - 11:00 (00:06)
    reboot ~ Thu 5 May 10:43
    norgaard ttyp0 x.x.x.x Thu 5 May 10:37 - crash (00:06)
    norgaard ttyp0 x.x.x.x Thu 5 May 10:14 - 10:22 (00:08)
    reboot ~ Thu 5 May 10:06
    norgaard ttyp0 charm Thu 5 May 08:38 - crash (01:27)
    reboot ~ Thu 5 May 08:38
    norgaard ttyp0 charm Thu 5 May 07:53 - 07:54 (00:00)
    norgaard ttyp0 charm Thu 5 May 07:52 - 07:52 (00:00)
    reboot ~ Thu 5 May 07:17
    reboot ~ Thu 5 May 04:59
    norgaard ttyp0 charm Thu 5 May 04:17 - crash (00:41)
    reboot ~ Thu 5 May 04:16
    shutdown ~ Thu 5 May 04:14
    norgaard ttyp0 charm Thu 5 May 03:45 - shutdown (00:28)
    reboot ~ Thu 5 May 03:42
    reboot ~ Thu 5 May 03:40
    norgaard ttyp0 charm Thu 5 May 03:40 - crash (00:00)
    reboot ~ Thu 5 May 03:31
    reboot ~ Thu 5 May 03:27
    reboot ~ Thu 5 May 03:13
    reboot ~ Thu 5 May 03:03
    reboot ~ Thu 5 May 02:58
    reboot ~ Thu 5 May 02:51
    reboot ~ Thu 5 May 02:47
    reboot ~ Thu 5 May 02:41
    reboot ~ Thu 5 May 02:35
    reboot ~ Thu 5 May 02:29
    reboot ~ Thu 5 May 02:25
    reboot ~ Thu 5 May 02:20
    reboot ~ Thu 5 May 02:09
    reboot ~ Thu 5 May 01:58
    reboot ~ Thu 5 May 01:53
    reboot ~ Thu 5 May 01:50
    reboot ~ Thu 5 May 01:46
    reboot ~ Thu 5 May 01:42
    reboot ~ Thu 5 May 01:33
    reboot ~ Thu 5 May 01:30
    reboot ~ Thu 5 May 01:27
    reboot ~ Thu 5 May 01:13
    reboot ~ Thu 5 May 01:08
    reboot ~ Thu 5 May 01:05
    reboot ~ Thu 5 May 00:58
    reboot ~ Thu 5 May 00:53
    reboot ~ Thu 5 May 00:44
    reboot ~ Thu 5 May 00:34
    reboot ~ Thu 5 May 00:24
    reboot ~ Thu 5 May 00:20
    reboot ~ Thu 5 May 00:13
    reboot ~ Wed 4 May 23:58
    reboot ~ Wed 4 May 23:43
    reboot ~ Wed 4 May 23:40
    reboot ~ Wed 4 May 23:36
    norgaard ttyp0 charm Wed 4 May 20:57 - 23:29 (02:31)

    Note the reboots from Wed 4, 23:36 to Thu 5, 07:52 appeared to be
    caused by postfix throttling due to a read-only mounted /usr.
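
The listing above can also be summarised mechanically. The sketch below is only an illustration: it uses a short excerpt of the listing, assumes the year 2005 from the message date (the `last` output does not print it), and uses made-up helper names. It extracts the reboot timestamps, the intervals between consecutive reboots, and the number of sessions that ended in a crash.

```python
from datetime import datetime

# Excerpt copied from the `last` listing above (newest entries first).
LAST_EXCERPT = """\
norgaard ttyp0 charm Thu 5 May 22:09 - crash (06:07)
reboot ~ Thu 5 May 22:05
norgaard ttyp0 charm Thu 5 May 21:45 - crash (00:20)
reboot ~ Thu 5 May 21:20
norgaard ttyp1 charm Thu 5 May 21:11 - crash (00:09)
norgaard ttyp0 charm Thu 5 May 20:45 - crash (00:35)
reboot ~ Thu 5 May 18:57
"""


def reboot_times(last_text, year=2005):
    """Return reboot timestamps in chronological order."""
    times = []
    for line in last_text.splitlines():
        fields = line.split()
        if fields[:2] == ["reboot", "~"]:
            day, month, clock = fields[3:6]  # e.g. "5", "May", "22:05"
            stamp = f"{day} {month} {clock} {year}"
            times.append(datetime.strptime(stamp, "%d %b %H:%M %Y"))
    return sorted(times)


def minutes_between_reboots(times):
    """Return the gap in minutes between each pair of consecutive reboots."""
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(times, times[1:])]


def crashed_sessions(last_text):
    """Count login sessions that `last` marks as ended by a crash."""
    return sum(1 for line in last_text.splitlines() if "- crash" in line)


times = reboot_times(LAST_EXCERPT)
print(len(times))                       # prints "3"
print(minutes_between_reboots(times))   # prints "[143, 45]"
print(crashed_sessions(LAST_EXCERPT))   # prints "4"
```

The interval between two reboots is only an upper bound on the uptime, since it includes the time spent booting and running fsck.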

    dmesg.today:

    Copyright (c) 1992-2005 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
            The Regents of the University of California. All rights reserved.
    FreeBSD 5.4-RC4 #0: Tue May 3 14:07:30 CEST 2005
        root@top.daemonsecurity.com:/usr/obj/usr/src/sys/GENERIC
    Timecounter "i8254" frequency 1193182 Hz quality 0
    CPU: VIA C3 Nehemiah+RNG (1002.28-MHz 686-class CPU)
      Origin = "CentaurHauls" Id = 0x694 Stepping = 4

    Features=0x380b03d<FPU,DE,PSE,TSC,MSR,MTRR,PGE,CMOV,MMX,FXSR,SSE>
    real memory = 251592704 (239 MB)
    avail memory = 236548096 (225 MB)
    npx0: <math processor> on motherboard
    npx0: INT 16 interface
    acpi0: <VT9174 AWRDACPI> on motherboard
    acpi0: Power Button (fixed)
    Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
    acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
    cpu0: <ACPI CPU (3 Cx states)> on acpi0
    acpi_throttle0: <ACPI CPU Throttling> on cpu0
    acpi_button0: <Power Button> on acpi0
    pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
    pci0: <ACPI PCI bus> on pcib0
    agp0: <VIA 862x (CLE266) host to PCI bridge> mem 0xd0000000-0xd7ffffff at device 0.0 on pci0
    pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
    pci1: <ACPI PCI bus> on pcib1
    pci1: <display, VGA> at device 0.0 (no driver attached)
    vr0: <VIA VT6105 Rhine III 10/100BaseTX> port 0xd000-0xd0ff mem 0xde000000-0xde0000ff irq 12 at device 15.0 on pci0
    miibus0: <MII bus> on vr0
    ukphy0: <Generic IEEE 802.3u media interface> on miibus0
    ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
    vr0: Ethernet address: 00:40:63:d4:89:72
    uhci0: <VIA 83C572 USB controller> port 0xd400-0xd41f irq 11 at device 16.0 on pci0
    usb0: <VIA 83C572 USB controller> on uhci0
    usb0: USB revision 1.0
    uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub0: 2 ports with 2 removable, self powered
    uhci1: <VIA 83C572 USB controller> port 0xd800-0xd81f irq 11 at device 16.1 on pci0
    usb1: <VIA 83C572 USB controller> on uhci1
    usb1: USB revision 1.0
    uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub1: 2 ports with 2 removable, self powered
    uhci2: <VIA 83C572 USB controller> port 0xdc00-0xdc1f irq 9 at device 16.2 on pci0
    usb2: <VIA 83C572 USB controller> on uhci2
    usb2: USB revision 1.0
    uhub2: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub2: 2 ports with 2 removable, self powered
    pci0: <serial bus, USB> at device 16.3 (no driver attached)
    isab0: <PCI-ISA bridge> at device 17.0 on pci0
    isa0: <ISA bus> on isab0
    atapci0: <VIA 8235 UDMA133 controller> port 0xe000-0xe00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0
    ata0: channel #0 on atapci0
    ata1: channel #1 on atapci0
    pci0: <multimedia, audio> at device 17.5 (no driver attached)
    vr1: <VIA VT6102 Rhine II 10/100BaseTX> port 0xe800-0xe8ff mem 0xde002000-0xde0020ff irq 11 at device 18.0 on pci0
    miibus1: <MII bus> on vr1
    ukphy1: <Generic IEEE 802.3u media interface> on miibus1
    ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
    vr1: Ethernet address: 00:40:63:d4:89:71
    fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
    fd0: <1440-KB 3.5" drive> on fdc0 drive 0
    sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
    sio0: type 16550A
    sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
    sio1: type 16550A
    ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
    ppc0: Generic chipset (EPP/NIBBLE) in COMPATIBLE mode
    ppbus0: <Parallel port bus> on ppc0
    plip0: <PLIP network interface> on ppbus0
    lpt0: <Printer> on ppbus0
    lpt0: Interrupt-driven port
    ppi0: <Parallel I/O> on ppbus0
    sio2: <16550A-compatible COM port> port 0x3e8-0x3ef irq 5 on acpi0
    sio2: type 16550A
    sio3: <16550A-compatible COM port> port 0x2e8-0x2ef irq 10 on acpi0
    sio3: type 16550A
    orm0: <ISA Option ROM> at iomem 0xc0000-0xcdfff on isa0
    pmtimer0 on isa0
    atkbdc0: <Keyboard controller (i8042)> at port 0x64,0x60 on isa0
    atkbd0: <AT Keyboard> irq 1 on atkbdc0
    kbd0 at atkbd0
    sc0: <System console> at flags 0x100 on isa0
    sc0: VGA <16 virtual consoles, flags=0x300>
    vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
    Timecounter "TSC" frequency 1002278507 Hz quality 800
    Timecounters tick every 10.000 msec
    ad0: 57231MB <IC25N060ATMR04-0/MO3OAD4A> [116280/16/63] at ata0-master UDMA100
    Mounting root from ufs:/dev/ad0s1a
    WARNING: /home was not properly dismounted
    WARNING: /tmp was not properly dismounted
    WARNING: /usr was not properly dismounted
    WARNING: /var was not properly dismounted
    IP Filter: v3.4.35 initialized. Default = pass all, Logging = enabled
    Accounting enabled

    GnuPG: http://www.locolomo.org/home/norgaard/norgaard.gpg.asc
    pub 1024D/11D11F9E 2003-08-15 Erik Norgaard <norgaard@locolomo.org>
         Key fingerprint = C394 81C4 D137 EEE5 39BE 82D5 3E6B FB3E 11D1 1F9E

    _______________________________________________
    freebsd-questions@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-questions
    To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"
