Re: [SOLARIS 10][X86] metadb problem
From: Juergen Keil (jk_at_tools.de)
Date: 12/21/04
- Next message: Dave Uhring: "Re: "Torn between two OS" - Solaris vs Linux"
- Previous message: ps: "Re: [SOLARIS 10][X86] metadb problem"
- In reply to: Guillaume Clauzon: "[SOLARIS 10][X86] metadb problem"
- Next in thread: Guillaume Clauzon: "Re: [SOLARIS 10][X86] metadb problem"
- Reply: Guillaume Clauzon: "Re: [SOLARIS 10][X86] metadb problem"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 21 Dec 2004 16:37:30 +0100
Guillaume Clauzon <hotplug@pornbsd.org> writes:
> When i try a metadb under my solaris 10 x86 , i get the
> following error :
>
> (solaris 702)# metadb -a -f /dev/dsk/c3d0s7
> metadb: solaris: /dev/rdsk/c3d0s7: Invalid argument
On the console and in /var/adm/messages, do you get something like
this?
rootnex: WARNING: Please file a ***BUG*** for driver ata, dma_*_bind()
will return a failure to ata. sgllen = 17, cookie count = 32, and
DDI_DMA_PARTIAL is *not* set.
If that is the case, read the "FAQ on the rootnex bind warning
message" (Copied without permission from
<URL:http://groups.yahoo.com/group/solarisx86/message/16887>)
========================================================================
o description of the problem - what's wrong
The x86 root nexus's ddi_dma_*_bind_handle() implementation
can wrongly return more cookies than a driver specifies is
the maximum that it can handle (when DDI_DMA_PARTIAL is not
specified). This can be a very serious problem, causing silent
data corruption. There can be a lot of corner cases depending
on memory fragmentation, making it difficult to test for.
Unfortunately, the fix for these bugs also can/has exposed
minor bugs in existing drivers, which could be hard to identify
and debug. An example is the elxl driver which specified that
it could only handle 1 cookie for the tx data buffers, but
really could handle 2 (and was getting 2 occasionally).
o does the problem happen on sparc? why not?
This problem does *not* happen on SPARC. This is a bug
in the x86 rootnex driver. This part of the code is not
shared with any SPARC nexus driver code.
o what rootnex was doing wrong
The rootnex driver would return success from a DMA bind
operation when the cookie count was larger than the maximum
the driver specified that it could handle, and the driver
specified that it couldn't handle partial mappings.
o what rootnex is now doing right
The rootnex driver fails a DMA bind operation when
the cookie count would be larger than the maximum
that a driver specifies that it can handle, and the
driver specifies that it cannot handle partial mappings.
o what drivers were doing wrong
The vast majority of drivers aren't doing anything wrong.
There may be a few which are not handling the DMA bind
operation correctly. For example, if a driver cannot
handle partial DMAs, it should be able to handle the
following # of cookies ((max possible bind size / page size) + 1).
If it cannot handle that number of cookies, it must be
able to expect and correctly handle a failed bind
operation.
o what they need to do right.
If a driver cannot handle partial DMAs, it should be able
to handle the following # of cookies
((max possible bind size / page size) + 1).
If it cannot handle that number of cookies, it must be
able to expect and correctly handle a failed bind
operation.
o does this mean a driver which was working ok before
could now broken, because a driver may have been
written to work around the rootnex bug?
No. A driver didn't need to workaround the bug. It does
mean that a driver which was working or was appearing to
work before, could now be broken. This could be because the
code path for the failing bind operation has never been
tested or the driver never expected the bind operation
to fail (i.e. the driver has a bug which they never
debugged since it never failed before unless they noticed
memory corruption).
o so you've got a driver binary. what can you do
to test that the driver is correctly written and
works properly with the corrected rootnex?
o instructions for using flags
There are two patchables rootnex_bind_fail & rootnex_bind_warn. The
following table explains the behavior of the new bind operation
failure. The current default behavior is to fails the bind and
print one warning message per major number.
rootnex_bind_fail | rootnex_bind_warn | Results if sgllen < *ccount
| | && !DDI_DMA_PARTIAL
---------------------------------------------------------------------
0 | 0 | behaves like code today (bind succeeds, no warning message)
0 | 1 | bind succeeds, print one warning/major#
1 | 0 | fails the bind, no warning message
1 | 1 | fails the bind, print one warning/major#
To revert to the previous behavior, which is incorrectly returning
success from these ddi_dma_*_bind_handle() operations, put the
following in /etc/system then reboot.
set rootnex:rootnex_bind_fail = 0
To disable the warning message, put the following in /etc/system
then reboot.
set rootnex:rootnex_bind_warn = 0
Anyone fixing bugs found by the warning should make sure they test
their fix *without* rootnex_bind_fail = 0 and set kmem_flags = 3f to
ensure they clean up correctly after the failure (since it's likely
that code path hasn't been tested). e.g. put the following in
/etc/system
set rootnex:rootnex_bind_fail = 1
set kmem_flags = 0x3f
o how to stress a system to reproduce it
In order to hit some of the corner case, the memory your driver
binds must be maximally fragmented (i.e. no contiguous physical
pages). Running a memory stress test during testing is recommended.
o so you've got a driver's source code. how can you
inspect the code to check for possible errors
and how can you fix them?
First you must understand what is the maximum possible bind
size that your driver will see. This may not be trivial.
For example, the ata driver handles ~64K maximum buffers
for normal operation, but newfs goes through /dev/rdsk
which doesn't break buffers down to ~64K buffers initially.
Once you understand maximum possible bind size, make sure
sgllen is set to ((max possible bind size / page size) + 1)
and that you actually handle that many cookies. Or make
sure you handle the fail case correctly.
o can you run your fixed driver on earlier solaris releases?
Yes. A correctly written driver will run fine on both S10
and earlier solaris releases. This bug fix only fixes a
problem of not correctly identifying a failure case. If a
driver doesn't generate a failure case, or correctly
handles the failure case, there will not be any problems.
o What can I do if my system doesn't boot anymore
In the unlikely event your system doesn't boot anymore,
you can recover the system by setting a defered breakpoint
in rootnex_attach, clear the rootnex_bind_fail patchable,
then, once you've booted, add set rootnex:rootnex_bind_fail = 0
to /etc/system
An example is provide below...
Boot args:
Type b [file-name] [boot-flags] <ENTER> to boot with options
or i <ENTER> to enter boot
interpreter
or <ENTER> to boot with defaults
<<< timeout in 5 seconds >>>
Select (b)oot or (i)nterpreter: b kmdb -d
Loading kmdb...
Welcome to kmdb
kmdb: Unable to determine terminal type: assuming `vt100'
[0]> ::bp rootnex`rootnex_attach
[0]> :c
SunOS Release 5.10 Version gate:2004-10-18 32-bit
Copyright 1983-2004 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Loaded modules: [ ufs unix krtld genunix specfs ]
kmdb: stop at rootnex`rootnex_attach
kmdb: target stopped at:
rootnex`rootnex_attach: pushl %ebp
[0]> rootnex`rootnex_bind_fail?W 0
rootnex`rootnex_bind_fail: 0x1 = 0x0
[0]> :c
Hostname: ...
[hostname] console login: root
Password:
[CUT]
bfu'ed from /ws/on10-gate/archives/i386/nightly-nd on 2004-10-18
Sun Microsystems Inc. SunOS 5.10 s10_68 December 2004
# echo "set rootnex:rootnex_bind_fail = 0" >> /etc/system
#
========================================================================
- Next message: Dave Uhring: "Re: "Torn between two OS" - Solaris vs Linux"
- Previous message: ps: "Re: [SOLARIS 10][X86] metadb problem"
- In reply to: Guillaume Clauzon: "[SOLARIS 10][X86] metadb problem"
- Next in thread: Guillaume Clauzon: "Re: [SOLARIS 10][X86] metadb problem"
- Reply: Guillaume Clauzon: "Re: [SOLARIS 10][X86] metadb problem"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|