Re: DBX on AIX 5.3 with Threads



Paul Pluzhnikov wrote:
Larry Smith <lsmith@xxxxxxxxxx> writes:

Have you checked the application with Valgrind (Linux), or Purify
(Solaris and AIX)? No point in spending time guessing/debugging,
unless you know that the application is "squeaky clean".
...
Valgrind & Purify - yes.

Yes what?

We've run the code under VG and Purify for large tests and they
detected no issues? (That's very hard to believe).

Or, yes we've run under VG and ignored all the bugs it found?


See my reply to Jose's message for additional info.

VG shows no issues.

The plugins run on the client, and build the
Wintel binary data sent to the server.

The traps occur on the AIX server.

The Wintel binary data blocks sent
by the client can be up to 200K in size;
they are memory images of "packed" C structs
concatenated together and sent from the client to
the server. The Windows version of the server
uses these binary images as is.

The Unix server picks apart this Wintel binary
data, placing it into matching structs
(in the server's native alignment) and correcting
endianness as it goes. The server then uses
this data to send transaction requests
to one or more mainframes. The response
data is used to modify the binary data, which
is then put back into the Wintel format
expected by the client and sent back to the
client.
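
As an illustration of the unpack/repack step described above, here is a minimal sketch in C. The record layout (txn_record, its three fields, and the 7-byte wire size) is purely hypothetical, since the real struct definitions are not given in this thread. The idea is to read each field byte by byte from the packed little-endian image, so the code depends neither on the host's byte order nor on its alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: the real layouts are not shown in the thread. */
typedef struct {
    uint16_t type;     /* wire offset 0, 2 bytes, little-endian */
    uint32_t account;  /* wire offset 2, 4 bytes, unaligned in the packed image */
    uint8_t  flag;     /* wire offset 6, 1 byte */
} txn_record;

enum { TXN_WIRE_SIZE = 7 };  /* packed size: 2 + 4 + 1, no padding */

/* Byte-wise loads and stores work on any host endianness and alignment. */
static uint16_t load_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void store_le16(unsigned char *p, uint16_t v) {
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}

static void store_le32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)(v >> 24);
}

/* Wire image -> native struct. Returns 0 on success, -1 if buf is too short. */
int txn_unpack(const unsigned char *buf, size_t len, txn_record *out) {
    if (len < TXN_WIRE_SIZE) return -1;
    out->type    = load_le16(buf);
    out->account = load_le32(buf + 2);
    out->flag    = buf[6];
    return 0;
}

/* Native struct -> wire image. Returns 0 on success, -1 if buf is too short. */
int txn_pack(const txn_record *in, unsigned char *buf, size_t len) {
    if (len < TXN_WIRE_SIZE) return -1;
    store_le16(buf, in->type);
    store_le32(buf + 2, in->account);
    buf[6] = in->flag;
    return 0;
}
```

The length checks matter here: with blocks of up to 200K built from concatenated records, an offset or size mistake in this kind of code overruns a buffer silently, and the damage may only show up much later.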

No, they "restart" the thread.

Too bad. Essentially what you've described is a sure-fire recipe for
irreproducible crashes which are extremely hard to catch and debug.

Do you at least log the fact that such a "restart" has occurred? (You
should.) Do the crashes follow such "restarts"? (I expect so).


"restarts" are logged.

No, since we're working with the baseline test
data used to test all new releases of the app,
the trap/restart code is never invoked.
Testing with "bad" data will come later - after
the basic functional testing.

The threads "restart" because much of the
data is generated by customer-written
plugins & may have issues.

This is a brain-dead design: if your customer-written plugin corrupts
heap, and a later call to free() crashes (possibly in some other
thread), while holding heap lock, and you longjmp out of free(),
what hope do you have of making any further progress?

*None whatsoever*.
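
The point about the heap lock can be shown with a small, deterministic sketch. Everything in it is a stand-in: fake_heap_lock is an atomic flag playing the role of the allocator's internal lock, and fake_free_that_traps() imitates a free() that takes the lock and is then abandoned by a longjmp. The real application presumably jumps out of a trap handler, which adds further problems that this sketch does not attempt to model.

```c
#include <setjmp.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the allocator's internal lock (an assumption for illustration). */
static atomic_flag fake_heap_lock = ATOMIC_FLAG_INIT;
static jmp_buf restart_point;

/* Returns true if the lock was free and is now held by the caller. */
static bool try_lock(void) {
    return !atomic_flag_test_and_set(&fake_heap_lock);
}

static void unlock(void) {
    atomic_flag_clear(&fake_heap_lock);
}

/* Imitates free() hitting corrupted heap while holding the heap lock. */
static void fake_free_that_traps(void) {
    while (!try_lock()) {
        /* spin until the lock is acquired */
    }
    /* The "restart": control leaves here, so the lock is never released. */
    longjmp(restart_point, 1);
}

/* Runs one trap-and-restart cycle, then reports whether the lock is usable. */
bool heap_lock_usable_after_restart(void) {
    if (setjmp(restart_point) == 0) {
        fake_free_that_traps();
    }
    /* Execution resumes here after the longjmp. */
    if (try_lock()) {
        unlock();
        return true;
    }
    return false;  /* the lock leaked: any later lock attempt would block forever */
}
```

In this sketch heap_lock_usable_after_restart() returns false: the "restarted" code continues to run, but the next caller that needs the lock, in any thread, can never acquire it.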

They've used the setjmp/longjmp "restart" logic on
Windows for over 15 years...

They've either got extremely lucky, or they didn't tell you the
whole story.

Also, on Windows customer-written DLLs may be statically linked
against LIBC{MT}.LIB, in which case they will not share malloc()
with the rest of the code.

But on UNIX there is only one "global" malloc, so the issue of buggy
"plugins" will be exacerbated.

This is a huge financial app (hundreds of DLLs,
100+ EXEs) written for Windows.

So keep it on Windows, and don't touch it (for your sanity's sake).

Surprisingly, this all works well on Linux
and Solaris.

Since you (apparently) have little (if any) AIX-specific code, I
expect the same bug(s) that cause AIX crashes are also present in
Solaris and Linux versions, and these versions don't "work well",
you just haven't observed the problems yet.


Yes, I DO suspect the latent bug is everywhere,
but only exposes itself on AIX.

Cheers,

Dozens of Windows developers & the System
Designer have worked on this app since
1989. We, the two new "Unix guys", do not
get to make design changes; we're charged
with making it work "as is" by writing
Windows emulation APIs for Unix.

Thanks for your comments.
