Re: How do you manage 1000+ UNIX systems ?

aryzhov_at_spasu.net
Date: 06/27/05


Date: 27 Jun 2005 06:39:38 -0700

IMHO, a large environment cannot survive without
a formal change management system. I.e., if someone wants
to make a change, she has to request it and get it approved
by all affected parties (e.g., if you want to grow
the Oracle filesystem, make sure the DB team signs
off on the change request, even though you could probably
grow the FS on the fly and are 100% sure nothing will
go wrong).
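
To make the sign-off rule concrete, here is a minimal Python sketch of a change request that only counts as approved once every affected party has signed. The class, its field names, and the team names are made up for illustration; a real change management tool (Remedy or otherwise) will model this differently.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ChangeRequest:
    """Hypothetical change request; all names here are illustrative."""
    summary: str
    required_signoffs: Set[str]
    approvals: Set[str] = field(default_factory=set)

    def approve(self, team: str) -> None:
        # Only teams on the sign-off list may approve.
        if team not in self.required_signoffs:
            raise ValueError(f"{team} is not on the sign-off list")
        self.approvals.add(team)

    def missing_signoffs(self) -> Set[str]:
        # Teams that still have to sign.
        return self.required_signoffs - self.approvals

    def is_approved(self) -> bool:
        # Approved only when nobody is missing.
        return not self.missing_signoffs()


cr = ChangeRequest("Grow the Oracle filesystem", {"unix-admins", "dba"})
cr.approve("unix-admins")
print(cr.is_approved(), sorted(cr.missing_signoffs()))
cr.approve("dba")
print(cr.is_approved())
```

The point is that the admin's own approval is never enough: the request stays blocked until the DB team has signed too.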

Ticketing system. You never make a change
because you feel like it. You only make one when you get a ticket.
Some tickets are generated by monitoring systems, some
by business units, some by your colleagues. Occasionally,
you may open a ticket for yourself.
After all, this shows your boss how busy you are :-)
Some solutions (Remedy, for instance) combine both
Change Management and Trouble Ticketing.

Knowledge database. As previous posters mentioned,
communication within the admin team must be logged, and the logs
must be searchable. The brightest solution I've seen so far
was an impersonal mail alias inside the sysadmin mail group.
Whenever mail is sent to the mail group, a copy is stored
in this archived mailbox. Every group member has read
access to this mailbox and can search by keyword or hostname.
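
As a rough illustration of the keyword/hostname search, the sketch below scans a list of messages with Python's standard email module. The function name, the sample messages, and the hostnames are invented; how the archive is actually stored (mbox, IMAP folder, something else) is an assumption you would have to adapt.

```python
import email
from email.message import Message
from typing import Iterable, List


def search_archive(messages: Iterable[Message], term: str) -> List[str]:
    """Return subjects of messages whose subject or plain body mentions term."""
    needle = term.lower()
    hits = []
    for msg in messages:
        subject = msg.get("Subject", "")
        body = ""
        # Keep it simple: only look inside non-multipart (plain) messages.
        if not msg.is_multipart():
            payload = msg.get_payload()
            body = payload if isinstance(payload, str) else ""
        if needle in subject.lower() or needle in body.lower():
            hits.append(subject)
    return hits


# Two made-up archive entries standing in for the shared mailbox.
archive = [
    email.message_from_string("Subject: db07 disk full\n\nCleaned /var on db07.\n"),
    email.message_from_string("Subject: patch window\n\nweb03 gets patched on Friday.\n"),
]
print(search_archive(archive, "web03"))
```

If the archive happens to be an mbox file, the standard mailbox.mbox class yields message objects that could be passed to the same function.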

Sudo logging. A good practice is not only to log all
superuser logins, but also to trace all commands run by root
interactively. Policy may deny direct root logins
and su root, allowing sudo only, and sudo sessions
can be logged to a central log server. Of course
there are many ways to intentionally break the policies,
but such violations can also be logged in most cases.
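
Once the sudo entries land on the log server, pulling out who ran what as whom is straightforward. The Python sketch below parses one line in the layout sudo commonly writes to syslog; the exact format depends on the sudo version and configuration, so treat the regular expression as an assumption, and the sample line (user, host, command) as invented.

```python
import re
from typing import Dict, Optional

# Assumed layout: "... sudo:   user : TTY=... ; PWD=... ; USER=... ; COMMAND=..."
SUDO_LINE = re.compile(
    r"sudo:\s+(?P<user>\S+) : .*?USER=(?P<runas>\S+) ; COMMAND=(?P<command>.+)$"
)


def parse_sudo_line(line: str) -> Optional[Dict[str, str]]:
    """Return user, runas and command for a sudo log line, or None if no match."""
    m = SUDO_LINE.search(line)
    return m.groupdict() if m else None


sample = (
    "Jun 27 06:39:38 db07 sudo:   alice : TTY=pts/1 ; "
    "PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/df -k"
)
print(parse_sudo_line(sample))
```

Lines that do not match (for example, ordinary login messages) simply return None, so the function can be run over the whole log.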

A "hostinfo" database (can be part of Remedy) must be
carefully maintained. Good things to keep there are
the contact person for each app running on the host, plus
any exotic specifics.
Every trouble ticket or change request must
have a checkbox indicating whether the ticket/change requires
a manual update of the hostinfo db entry.
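
One possible shape for such records, sketched in Python. The field names, the ticket numbers, and the hosts are all hypothetical; the only idea taken from above is that each ticket carries a flag saying whether the hostinfo entry needs a manual update.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HostInfo:
    """Hypothetical hostinfo entry."""
    hostname: str
    app_contacts: Dict[str, str] = field(default_factory=dict)  # app -> contact person
    notes: List[str] = field(default_factory=list)              # exotic specifics


@dataclass
class Ticket:
    """Hypothetical ticket or change request."""
    ticket_id: int
    hostname: str
    summary: str
    needs_hostinfo_update: bool = False  # the "checkbox"


def hosts_needing_update(tickets: List[Ticket]) -> List[str]:
    """Hosts whose hostinfo entry must be updated by hand, sorted by name."""
    return sorted({t.hostname for t in tickets if t.needs_hostinfo_update})


tickets = [
    Ticket(101, "db07", "Grow the Oracle filesystem", needs_hostinfo_update=True),
    Ticket(102, "web03", "Restart httpd"),
    Ticket(103, "db07", "New DBA contact", needs_hostinfo_update=True),
]
print(hosts_needing_update(tickets))
```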

Some change requests on live hosts may require
Jumpstart updates (especially when Jumpstart is used
as an emergency restore mechanism); thus, the people responsible
for host staging must be on the sign-off list.

regards,
Andrei


