Re: HALT, CONTINUE ==> CRASH. HUH?



AEF wrote:
On Jul 16, 1:47 pm, Stephen Hoffman <H...@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
wrote:
AEF wrote:
This is what happens on some of my MicroVAX 3100 systems. So far, the
best I can find that's common is that my model 95's work fine but my
model 80's crash.
...
My systems are running VMS 6.1, 6.2, and 6.2 with all relevant ECO's.
I've found no correlation between VMS version and crashing, but I
didn't get to try VMS V6.1 running on a Model 80.
This is of concern because power-cycling the terminal server which I
use for console sessions causes any connected systems to HALT...
Attempting a CONTINUE after a HALT has always been somewhat unreliable.

Really? I didn't know that.

You're trying to restart a large and very complex beast, up to and including I/O controllers and other external hardware.

If the universe -- the entire system state -- isn't exactly as it was
when the operating system left the building for a powder, the system can
and does tip over; the run-time environment can become unstable.

So do you have any ideas as to why it always tips over with the Model
80's but never with the Model 95's (so far, anyway)?

It could be any number of factors. Something the console squats on, something with an I/O controller, software running on these boxes, a specific disk, something that sits on a memory lock state, etc. What is specifically happening here, I don't know. A look at the crashdump might show something.

... A console upgrade? There's a part in the system for that? Can you
please clarify just what this entails?

Some of the later MicroVAX boxes -- haven't looked at those details in eons -- had firmware-based consoles, and could be upgraded. There were classically console upgrades made across the lines over time, and some were box swaps, some were ROM swaps, and some of the oldest and some of the newest VAX boxes had soft-loaded console upgrades. I haven't looked at the details of the MicroVAX 3100 model 80 console here.

It's probably not worth investigating this, if you can address it through consolidation or through migration or through not generating the framing error (break) itself.

Yep, a better TS looks like the way to go. I'm still curious why the
Model 95's always recover fine but the 80's don't.

Hard to say, based on what I can see from here.


...This is
also of concern because we are moving to a new data center sometime
next year and the boss wants it to be "lights-out" -- which leads to a
bonus question: What's the deal with lights-out? Is this just to
minimize entries into the data center? What about when you need to
swap hardware? Add servers? And isn't it a good idea to at least
inspect the data center once in a while to make sure something like a
leak from the ceiling isn't developing?
Lights-Out targets lower hardware service costs. The goal is to
eliminate the need to maintain systems and servers; to through
(consistent and replicated) hardware at staffing costs.

"to through"? Sorry, I can't figure out what you meant here by that.

....

Ugh. That doesn't read at all, does it? Try: "The goal here is to reduce the costs of or to entirely eliminate the need to maintain systems and servers. This is done by utilizing consistent and replicated hardware, with the associated reduced staffing and support costs."

The theory is that you don't need to have staff working inside the data center (DC), hence you don't need to keep the lights on. You can also work and manage the DC remotely, which has its advantages in various situations.

Each newer generation of iron has trended toward fewer and larger field replaceable units (FRUs) and toward simpler or no service costs, and we're at the level of replacing the whole box at the low-end, and in configurations such as grids and clusters.

With identical servers and server-level FRUs, you have the option to swap the whole box, and you can have site-level or can negotiate for depot-based spares stocks. When there are enough duds to warrant the effort, you either yank the dead servers and dead disks and dead shelves out of the racks and replace them, or you roll in a whole 'nother generation of DC iron and design, and replace the whole thing. RAIS, in other words.

I've posted a set of write-ups and links associated with current trends in disk reliability and disk lifetimes, and around RAID (and particularly RAID 5), too.

Start here: http://64.223.189.234/node/353

Well, my boss says we can now spend some money for stuff like this. I
actually got a DLT drive recently (TZ88N-VA) for my archive system to
help with my severe reliability issues with the DDS tapes. (Another
thing I did was to switch from DDS-2 tapes to DDS-1 tapes. So far this
has made the tapes completely portable except for tapes made by one
bad drive. Tapes made on this bad drive were unMOUNTable on all of my
eight other online DDS drives! So I replaced the bad drive.)

DAT/DDS is not quite write-once and not write-only media, based on what I've been seeing of late. Local usage shows somewhere around five to fifteen uses of each piece of DAT/DDS media, and periodic replacement of the drives themselves was to be expected -- this based on the local usage models. YMMV. DLT was significantly more reliable. On the low-end, acquiring a used DLT or SDLT drive on the used-equipment market can be a very viable strategy for archival processing, and for saving costs on media and drive swaps. Racks of disks can be a strategy, too -- going rate can be as little as US$500 for a terabyte of Compaq SCSI disks and a 4315R or 4354R series StorageWorks shelf. (These suckers are loud, however.)
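
If you want to act on that use-count rule of thumb, one way is to log each use of a cartridge and pull it once it hits a limit. What follows is only a rough sketch under stated assumptions: the `Tape` class, the `tapes_to_retire` helper, the tape labels and the `MAX_USES` value are all made up for illustration, and the limit of five is simply the conservative end of the five-to-fifteen range seen locally. Pick your own number based on your own drives and media.

```python
from dataclasses import dataclass

# Hypothetical retirement limit: the conservative end of the
# five-to-fifteen uses per DAT/DDS cartridge observed locally.
MAX_USES = 5


@dataclass
class Tape:
    """One cartridge, identified by its label, with a running use count."""
    label: str
    uses: int = 0

    def record_use(self) -> None:
        # Call once per backup (or restore) pass made with this cartridge.
        self.uses += 1

    def should_retire(self, max_uses: int = MAX_USES) -> bool:
        # True once the cartridge has reached the configured limit.
        return self.uses >= max_uses


def tapes_to_retire(tapes, max_uses=MAX_USES):
    """Return the labels of cartridges that have reached the use limit."""
    return [t.label for t in tapes if t.should_retire(max_uses)]
```

A nightly job would call `record_use()` on whichever cartridge it wrote, and `tapes_to_retire()` gives the list to pull from rotation. It counts uses only; it says nothing about a bad drive writing unreadable tapes, which still needs a read-back check on a second drive.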

He wants to consolidate. So I'm left to consider exactly what you've
suggested with SIMH and CHARON-VAX. I can do some consolidation with
just the VAX systems as the number of users has gotten smaller so I
can reduce the number of "update nodes". But I wonder how much I could
put on just one modern server running one of these VAX-emulation
programs!

Emulation can be pretty speedy, though that's more because the Intel and AMD x86 hardware is silly-fast compared with ancient VAX hardware.

Translation is another option, particularly if you have user-mode code.

And there's always a direct port, if you have or can reconstitute the source code.

Porting-related topics include http://64.223.189.234/node/225 and http://64.223.189.234/node/226 -- depending on your particular end-state goal, and how much you want to spend getting there.

This sort of stuff is what I do these days...


--
www.HoffmanLabs.com
Services for OpenVMS