Re: HALT, CONTINUE ==> CRASH. HUH?



AEF wrote:
This is what happens on some of my MicroVAX 3100 systems. So far, the
best I can find that's common is that my model 95's work fine but my
model 80's crash.
....
My systems are running VMS 6.1, 6.2, and 6.2 with all relevant ECO's.
I've found no correlation between VMS version and crashing, but I
didn't get to try VMS V6.1 running on a Model 80.

This is of concern because power-cycling the terminal server which I
use for console sessions causes any connected systems to HALT...

Attempting a CONTINUE after a HALT has always been somewhat unreliable.

If the universe -- the entire system state -- isn't exactly as it was when the operating system stopped, the system can and does tip over; the run-time environment can become unstable.

With newer systems using USB consoles, the universe simply isn't as it was left when the operating system ceased processing, and there is no mechanism for CONTINUE.

Here, I'd pursue the goal of not having an unplanned Break arrive at the console: put the terminal server on a UPS, replace the current terminal server box with one that doesn't jabber on the line during its power-up, or combine the two approaches.

There are some vendors of terminal servers posted over at the web site, and there's DNPG -- the classic DECserver devices didn't blather on the ports, in my experience.

Far less desirably, disable Break on the console serial line.

And there's always emulation, or migration.

The other approach to check here -- in the short term -- is to see if there is a console upgrade or such, on the off chance that this is due to a console-level fault. This could also easily be due to differences in the I/O stack, or the software running in the specific boxes. Weird errors can arrive back up the software stack, for instance -- one VAX I managed classically encountered a pile of hardware restart and recovery errors as a result of overly-long-term high-IPL activities, and that wasn't even a halt restart. The I/O widgets got really cranky when the VAX went walkabout, and then later returned.

In general, do look to eliminate the trigger -- the framing error -- the Break signal -- from arriving on the console serial line. Terminal servers are around US$100 or so these days, for the cheapest ones around. Hardly worth the effort of debugging the host's indigestion.
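
As a rough illustration of why a glitching line reads as a Break: a UART reports a framing error when the expected stop bit is missing, and a Break is the line held in the spacing state for longer than one full character frame. The Python sketch below only does that arithmetic. The 9600 baud, 8 data bits, no parity, one stop bit settings and the function names are assumptions for illustration, not anything read off these particular consoles.

```python
def frame_time_seconds(baud, data_bits=8, parity_bits=0, stop_bits=1):
    """Time to send one character frame: start bit + data + parity + stop."""
    bits_per_frame = 1 + data_bits + parity_bits + stop_bits
    return bits_per_frame / baud


def looks_like_break(low_seconds, baud, **framing):
    """True when the line sat in the spacing (low) state longer than one frame."""
    return low_seconds > frame_time_seconds(baud, **framing)


# Assumed console settings: 9600 baud, 8N1 (10 bits per frame, about 1.04 ms).
frame = frame_time_seconds(9600)
# A hypothetical 250 ms power-up glitch is far longer than one frame.
glitch_is_break = looks_like_break(0.25, 9600)
```

With those assumed settings, anything that holds the line low for more than about a millisecond qualifies, which is why a terminal server powering up can so easily halt the host.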

...This is
also of concern because we are moving to a new data center sometime
next year and the boss wants it to be "lights-out" -- which leads to a
bonus question: What's the deal with lights-out? Is this just to
minimize entries into the data center? What about when you need to
swap hardware? Add servers? And isn't it a good idea to at least
inspect the data center once in a while to make sure something like a
leak from the ceiling isn't developing?

Lights-Out targets lower hardware service costs. The goal is to eliminate the need to maintain systems and servers; to throw (consistent and replicated) hardware at staffing costs.

In some ways, it's similar to server co-location, or application or ASP-like (Application Service Provider) out-sourcing. But unlike classic co-lo or ASP-based approaches, your organization retains full control and ownership.

Alternatively, think RAIS -- a Reliable Array of Independent Servers -- as a (potential, evolutionary) solution toward the cost and time spent fixing stuff.

One of the organizations leading the curve in this arena simply replicates racks and racks and racks of identical and inexpensive industry-standard servers and shelves of storage, and -- when something fails -- just leaves it dead right in the rack, and vectors another spare server into the array to take up the load. Think "bad disk sector" on a grand scale. In this model, fixing dead hardware is viewed as more expensive. The servers are tailored to the RAIS design model, with such things as consistent set-up, remote consoles, and -- though I don't know this -- probably also with a replication-tailored console (BIOS or mayhap EFI) intended to reduce the configuration and management overhead.
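
The "bad disk sector" analogy can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how any particular organization runs its racks: the `ServerArray` class, its fields, and the server names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServerArray:
    """Hypothetical RAIS pool: active servers, idle spares, and dead units."""
    active: list = field(default_factory=list)
    spares: list = field(default_factory=list)
    dead: list = field(default_factory=list)

    def fail(self, name):
        """Retire a failed server in place and vector in a spare, if one remains.

        Returns the replacement's name, or None when the spare pool is empty.
        """
        if name not in self.active:
            raise ValueError(f"{name} is not an active server")
        self.active.remove(name)
        self.dead.append(name)          # left dead in the rack; no repair visit
        if not self.spares:
            return None                 # capacity shrinks until more racks arrive
        replacement = self.spares.pop(0)
        self.active.append(replacement)
        return replacement


# Two active servers and one spare; the names are made up.
array = ServerArray(active=["s1", "s2"], spares=["s3"])
replacement = array.fail("s1")
```

The point of the model is that a failure costs a list operation rather than a service call; the dead unit is simply never revisited, much as a disk remaps a bad sector and moves on.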

Depending on current prices, either blades or rack-and-stack servers can be the better fit, and either way you end up with industrial-scale replication. Demonstrations are little more than one or more standard cargo shipping containers, filled with the servers, storage and supporting infrastructure.

And if the widgets are sufficiently identical, you can also use this to lower hardware support and associated sparing costs, if you're not organizationally ready to go to a completely Lights Out model; if you're still going to visit the computer room for cases other than stuff like fires or floods.

A corollary to the Lights-Out model is to reduce the numbers and models of servers; to consolidate to fewer boxes and fewer different models of widgets (disks, servers, etc) and to fewer software versions. The aim is to reduce complexity, and to speed what repairs are deemed necessary. Here, I'd be looking to move to the same models of servers, or to ProLiant or other servers running SIMH or CHARON-VAX, or to Integrity servers or blades.

I'd be happy to chat about this stuff off-line. And this posting will probably become the basis for a posting over at the new HL website.

--
www.HoffmanLabs.com
Services for OpenVMS