4 steps to preventing server downtime

Eliminating potential single points of failure is a time-tested strategy for reducing the
risk of downtime and data loss. Typically, network administrators or consultants do this by
introducing redundancy into the application delivery infrastructure and by automating fault
monitoring and correction so that problems are addressed quickly as they arise. Most leading
companies that adopt best practices for protecting critical applications and data also
consider the potential failure of an entire site, establishing redundant systems at an
alternate site to protect against site-wide disasters.

STEP #1 – PROTECT AGAINST SERVER FAILURES WITH
HARDWARE AND COMPONENT REDUNDANCY

Don't cut corners on your own business: use high-quality, name-brand server and network
hardware rather than the cheapest equipment you can find.

Unplanned downtime can be caused by a number of different events, including:
• Catastrophic server failures caused by memory, processor or motherboard failures
• Server component failures, including power supplies, fans, internal disks, disk
controllers, host bus adapters and network adapters

Server core components include power supplies, fans, memory, CPUs and main logic
boards. Purchasing robust, name-brand servers, performing recommended preventive
maintenance, and monitoring server error logs for early signs of trouble can all help
reduce the chance of downtime due to catastrophic server failure.

You can reduce downtime caused by server component failures by adding redundancy at
the component level: redundant power supplies and cooling fans, ECC memory that can
correct single-bit errors, teamed Ethernet cards, and RAID for internal disks.
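
As a concrete illustration of monitoring server errors for early warning signs, here is a
minimal sketch, assuming a Linux server with the EDAC memory-error driver loaded; the
sysfs paths are standard for EDAC, but the alert threshold and the reporting format are
assumptions for illustration, not part of the original recommendations.

#!/usr/bin/env python3
"""Report corrected/uncorrected ECC memory errors via the Linux EDAC sysfs interface."""
import glob
import sys

WARN_THRESHOLD = 10  # corrected errors before flagging the DIMM for replacement (assumed value)

def read_count(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def main():
    controllers = glob.glob("/sys/devices/system/edac/mc/mc*")
    if not controllers:
        print("EDAC driver not loaded or no memory controllers found")
        return 1
    status = 0
    for mc in sorted(controllers):
        ce = read_count(mc + "/ce_count")  # corrected (single-bit) errors
        ue = read_count(mc + "/ue_count")  # uncorrected (multi-bit) errors
        print(f"{mc}: corrected={ce} uncorrected={ue}")
        if ue > 0 or ce > WARN_THRESHOLD:
            print(f"  WARNING: schedule maintenance for {mc}")
            status = 2
    return status

if __name__ == "__main__":
    sys.exit(main())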

STEP #2 – PROTECT AGAINST STORAGE FAILURES WITH
STORAGE DEVICE REDUNDANCY AND RAID

Storage protection relies on device redundancy combined with RAID storage
algorithms to protect data access and data integrity from hardware failures. There are
distinct issues for both local disk storage and for shared, network storage.

For local storage, it is quite easy to add extra disks configured with RAID protection.
A second disk controller is also required to prevent the controller itself from being a
single point of failure.
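
RAID only helps if someone notices when an array loses a member, so pair the redundancy
with monitoring. Below is a minimal sketch, assuming Linux software RAID (md) and its
/proc/mdstat status file; hardware RAID controllers expose health through vendor tools
instead, and the alerting behaviour here is an assumption for illustration.

#!/usr/bin/env python3
"""Flag degraded Linux software-RAID (md) arrays by parsing /proc/mdstat."""
import re
import sys

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member status (e.g. [UU] vs [U_]) shows a failed disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # member status appears as "[2/2] [UU]" when healthy or "[2/1] [U_]" when degraded
        status = re.search(r"\[([U_]+)\]", line)
        if status and current and "_" in status.group(1):
            degraded.append(current)
    return degraded

def main():
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        print("no /proc/mdstat; md RAID not in use on this host")
        return 0
    bad = degraded_arrays(text)
    if bad:
        print("DEGRADED arrays:", ", ".join(bad))
        return 2
    print("all md arrays healthy")
    return 0

if __name__ == "__main__":
    sys.exit(main())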

Access to shared storage relies on either a fibre channel or Ethernet storage network.
To assure uninterrupted access to shared storage, these networks must be designed
to eliminate all single points of failure. This requires redundancy of network paths,
network switches, and network connections to each storage array.
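
The same principle applies to shared storage: redundant paths are only useful if you
notice when one drops. The rough sketch below assumes Linux dm-multipath and the
multipath command; the expected path count and the way the output is parsed are
assumptions, so treat it as an outline rather than a finished check.

#!/usr/bin/env python3
"""Warn when fewer storage paths are active than expected (Linux dm-multipath)."""
import subprocess
import sys

EXPECTED_PATHS = 2  # assumed: the design calls for at least two independent paths

def main():
    try:
        out = subprocess.run(["multipath", "-ll"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not query multipath: {exc}")
        return 1
    # each healthy path is typically reported as "active ready running" in the listing
    active = sum(1 for line in out.splitlines() if "active ready" in line)
    print(f"active paths reported: {active}")
    if active < EXPECTED_PATHS:
        print("WARNING: storage path redundancy lost; investigate before the remaining path fails")
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())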

STEP #3 – PROTECT AGAINST NETWORK FAILURES WITH
REDUNDANT NETWORK PATHS, SWITCHES AND ROUTERS

The network infrastructure itself must be fault-tolerant, consisting of redundant
network paths, switches, routers and other network elements. Server connections can
also be duplicated to eliminate fail-overs caused by the failure of a single server or
network component.

Take care to ensure that the physical network hardware does not share common
components. For example, dual-ported network cards share common hardware logic,
and a single card failure can disable both ports. Full redundancy requires either two separate adapters or the combination of a built-in network port along with a separate network adapter.
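
One practical way to catch a loss of network redundancy is to watch the link state of each
physical port. The minimal sketch below assumes a Linux server whose two redundant ports
are named eth0 and eth1 (hypothetical names); in practice the ports would usually be
bonded or teamed, and the bonding driver would handle failover itself.

#!/usr/bin/env python3
"""Alert when fewer than two redundant network links are up (reads /sys/class/net)."""
import sys

REDUNDANT_PORTS = ["eth0", "eth1"]  # hypothetical interface names; adjust to the real ports

def link_is_up(iface):
    try:
        with open(f"/sys/class/net/{iface}/operstate") as f:
            return f.read().strip() == "up"
    except OSError:
        return False

def main():
    up = [i for i in REDUNDANT_PORTS if link_is_up(i)]
    print(f"links up: {', '.join(up) if up else 'none'}")
    if len(up) == 0:
        print("CRITICAL: server has no network connectivity")
        return 2
    if len(up) < len(REDUNDANT_PORTS):
        print("WARNING: network redundancy lost; one more failure will cause an outage")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())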

STEP #4 – PROTECT AGAINST SITE FAILURES WITH DATA
REPLICATION TO ANOTHER SITE

Site failures can range from an air conditioning failure or a leaking roof that affects a
single building, to a power failure that affects a limited local area, to a major hurricane
that affects a large geographic region. Site disruptions can last anywhere from a few
hours to days or even weeks.

There are two methods for dealing with site disasters. One method is to tightly couple
redundant servers across high-speed, low-latency links to provide zero data loss and
zero downtime. The other is to loosely couple redundant servers over medium-speed,
higher-latency, longer-distance lines to provide a disaster recovery (DR) capability,
where a remote server can be restarted with a copy of the application database missing
only the last few updates. In the latter case, asynchronous data replication keeps a
backup copy of the data.

Combining data replication with error detection and failover tools can help bring a
disaster recovery site up and running in minutes or hours, rather than days.
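
As a simplified illustration of the loosely coupled approach, here is a sketch of an
asynchronous replication loop, assuming rsync over SSH is available; the directory paths,
the remote host dr-site.example.com and the replication interval are hypothetical, and a
production deployment would normally rely on the replication features of the database or
storage layer instead.

#!/usr/bin/env python3
"""Periodically push application data to a DR site with rsync (asynchronous replication sketch)."""
import subprocess
import sys
import time

SOURCE_DIR = "/var/lib/app/data/"                      # hypothetical local data directory
DR_TARGET = "dr-site.example.com:/var/lib/app/data/"   # hypothetical DR host and path
INTERVAL_SECONDS = 300                                 # assumed interval; bounds data loss to ~5 minutes

def replicate_once():
    # -a preserves permissions/ownership/timestamps, -z compresses over the WAN link,
    # --delete keeps the DR copy from accumulating files removed at the primary site
    result = subprocess.run(["rsync", "-az", "--delete", SOURCE_DIR, DR_TARGET])
    return result.returncode == 0

def main():
    while True:
        started = time.time()
        if replicate_once():
            print(f"replication completed in {time.time() - started:.1f}s")
        else:
            print("WARNING: replication failed; DR copy is falling behind", file=sys.stderr)
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()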
