Re: HALT, CONTINUE ==> CRASH. HUH?



On Jul 16, 1:47 pm, Stephen Hoffman <H...@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
wrote:
AEF wrote:
This is what happens on some of my MicroVAX 3100 systems. So far, the
best I can find that's common is that my model 95's work fine but my
model 80's crash.
...
My systems are running VMS 6.1, 6.2, and 6.2 with all relevant ECO's.
I've found no correlation between VMS version and crashing, but I
didn't get to try VMS V6.1 running on a Model 80.

This is of concern because power-cycling the terminal server which I
use for console sessions causes any connected systems to HALT...

Attempting a CONTINUE after a HALT has always been somewhat unreliable.

Really? I didn't know that.


If the universe -- the entire system state -- isn't exactly as it was
when the operating system left the building for a powder, the system can
and does tip over; the run-time environment can become unstable.

So do you have any ideas as to why it always tips over with the Model
80's but never with the Model 95's (so far, anyway)?

With newer systems using USB consoles, the universe simply isn't as it
was left when the operating system ceased processing, and there is no
mechanism for CONTINUE.

Here, I'd pursue the goal of not having an unplanned Break arrive at the
console. This through a UPS for the terminal server, or replacing the
current terminal server box with a terminal server that doesn't jabber
on the line during its power up, and/or a combination of these two
approaches.

OK. Our current terminal server occasionally hangs and must be
power-cycled. That would require entry to the room (horrors!) and
unplugging all the cables while doing the bounce. I'll ask my boss
about getting a better TS.


There are some vendors of terminal servers posted over at the web site,
and there's DNPG -- the classic DECserver devices didn't blather on the
ports, in my experience.

Far less desirably, disable Break on the console serial line.

Yeah, I really don't want to do that because that would ruin the
"lights-out" bit.


And there's always emulation, or migration.

You're getting ahead of me. That's for my next post!


The other approach to check here -- in the short term -- is to see if
there is a console upgrade or such, on the off chance that this is due
to a console-level fault. This could also easily be due to differences
in the I/O stack, or the software running in the specific boxes. Weird
errors can arrive back up the software stack, for instance -- one VAX I
managed classically encountered a pile of hardware restart and recovery
errors as a result of overly-long-term high-IPL activities, and that
wasn't even a halt restart. The I/O widgets got really cranky when the
VAX went walkabout, and then later returned.

A console upgrade? There's a part in the system for that? Can you
please clarify just what this entails?


In general, do look to eliminate the trigger -- the framing error -- the
Break signal -- from arriving on the console serial line. Terminal
servers are around US$100 or so these days, for the cheapest ones
around. Hardly worth the effort of debugging the host's indigestion.
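
For readers wondering why a Break shows up as a framing error: a Break is
the line held at space (logic 0) for longer than one character time, so
the receiver sees a start bit, all-zero data bits, and no stop bit. The
sketch below is a toy model of an 8N1 receiver, assuming 10 sampled bits
per frame. The function name and return strings are made up for
illustration; this is not the MicroVAX console firmware.

```python
# Illustrative model of an 8N1 UART receiver; not DEC console firmware.
def classify_frame(bits):
    """Classify one 10-bit 8N1 frame sampled from the serial line.

    bits: 10 ints (1 = mark/idle, 0 = space), in the order
          start bit, 8 data bits (LSB first), stop bit.
    """
    if len(bits) != 10:
        raise ValueError("an 8N1 frame is exactly 10 bit times")
    start, data, stop = bits[0], bits[1:9], bits[9]
    if start != 0:
        return "idle"  # no start bit, so no character began
    if stop == 1:
        value = sum(bit << i for i, bit in enumerate(data))
        return f"char 0x{value:02X}"
    # Stop bit missing: the receiver flags a framing error.
    if not any(data):
        return "break"  # line held at space for the whole frame
    return "framing error"


print(classify_frame([0, 1, 0, 0, 0, 0, 0, 1, 0, 1]))  # a valid character
print(classify_frame([0] * 10))                          # line held low
```

A terminal server that drives the line low or emits garbage while it
powers up can produce exactly the second case, which is what the console
treats as a request to halt.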

Yep, a better TS looks like the way to go. I'm still curious why the
Model 95's always recover fine but the 80's don't.


...This is
also of concern because we are moving to a new data center sometime
next year and the boss wants it to be "lights-out" -- which leads to a
bonus question: What's the deal with lights-out? Is this just to
minimize entries into the data center? What about when you need to
swap hardware? Add servers? And isn't it a good idea to at least
inspect the data center once in a while to make sure something like a
leak from the ceiling isn't developing?

Lights-Out targets lower hardware service costs. The goal is to
eliminate the need to maintain systems and servers; to through
(consistent and replicated) hardware at staffing costs.

"to through"? Sorry, I can't figure out what you meant here by that.

In some ways, it's similar to server co-location, or application or
ASP-like (Application Service Provider) out-sourcing. But unlike
classic co-lo or ASP-based approaches, your organization retains
full control and ownership.

Alternatively, think RAIS -- a Reliable Array of Independent Servers --
as a (potential, evolutionary) solution toward the cost and time spent
fixing stuff.

One of the organizations leading the curve in this arena simply
replicates racks and racks and racks of identical and inexpensive
industry-standard servers and shelves of storage, and -- when something
fails -- just leaves it dead right in the rack, and vectors another
spare server into the array to take up the load. Think "bad disk
sector" on a grand scale. In this model, fixing dead hardware is viewed
as more expensive. The servers are tailored to the RAIS design model,
with such things as consistent set-up, remote consoles, and -- though I
don't know this -- probably also with a replication-tailored console
(BIOS or mayhap EFI) intended to reduce the configuration and management
overhead.
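
The "leave it dead, vector in a spare" idea can be sketched in a few
lines. This is a minimal illustration only, assuming a simple pool of
identical servers; the class and method names are hypothetical and do
not describe any particular organization's tooling.

```python
from collections import deque


class ServerArray:
    """Toy model of a RAIS-style pool: failed servers are never repaired."""

    def __init__(self, active, spares):
        self.active = list(active)    # servers currently carrying load
        self.spares = deque(spares)   # identical, pre-configured spares
        self.dead = []                # failed units left in the rack

    def fail(self, name):
        """Retire a failed server and vector in a spare, if one is left.

        Returns the name of the replacement, or None when the spare
        pool is exhausted and the array simply runs with less capacity.
        """
        self.active.remove(name)
        self.dead.append(name)        # no repair visit is scheduled
        if self.spares:
            replacement = self.spares.popleft()
            self.active.append(replacement)
            return replacement
        return None


array = ServerArray(active=["s1", "s2"], spares=["s3"])
print(array.fail("s1"), array.active, array.dead)
```

The analogy with bad-sector handling is direct: the dead list plays the
role of the bad-block table, and the spares are the revectoring pool.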

Either blades or rack-and-stack is the best solution, depending on
current prices, and you end up with industrial-scale replication.
Demonstrations are little more than one or more standard cargo
shipping containers, filled with the servers, storage and supporting
infrastructure.

And if the widgets are sufficiently identical, you can also use this to
lower hardware support and associated sparing costs, if you're not
organizationally ready to go to a completely Lights Out model; if you're
still going to visit the computer room for cases other than stuff like
fires or floods.

A corollary to the Lights-Out model is to reduce the numbers and models
of servers; to consolidate to fewer boxes and fewer different models of
widgets (disks, servers, etc) and to fewer software versions. To reduce
complexity, and speed what repairs are deemed necessary. Here, I'd be
looking to move to the same models of servers, or to ProLiant or other
servers running SIMH or CHARON-VAX, or to Integrity servers or blades.

Well, my boss says we can now spend some money for stuff like this. I
actually got a DLT drive recently (TZ88N-VA) for my archive system to
help with my severe reliability issues with the DDS tapes. (Another
thing I did was to switch from DDS-2 tapes to DDS-1 tapes. So far this
has made the tapes completely portable except for tapes made by one
bad drive. Tapes made on this bad drive were unMOUNTable on all of my
eight other online DDS drives! So I replaced the bad drive.)

He wants to consolidate. So I'm left to consider exactly what you've
suggested with SIMH and CHARON-VAX. I can do some consolidation with
just the VAX systems as the number of users has gotten smaller so I
can reduce the number of "update nodes". But I wonder how much I could
put on just one modern server running one of these VAX-emulation
programs!

Thanks for your help.

AEF


I'd be happy to chat about this stuff off-line. And this posting will
probably become the basis for a posting over at the new HL website.

--www.HoffmanLabs.com
Services for OpenVMS

