OS-Multiprocessors and Fault Tolerance

A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. It is controlled by a single operating system that coordinates the processors, and all components of the system cooperate in solving a problem.

One appeal of multiprocessing systems is that if a processor fails, the remaining processors can normally continue operating. For this to work, a failing processor must somehow inform the other processors so that they can take over, and the functioning processors must be able to detect a processor that has failed. The operating system must then note that the processor has failed and is no longer available for allocation.
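One common way to detect a failed processor is a periodic "heartbeat" with a timeout. The following is a minimal sketch of that idea, not a description of any particular operating system: the class name ProcessorTable, its methods, and the 2-second timeout are all hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class ProcessorTable:
    """Tracks which processors are available for allocation (hypothetical helper)."""

    def __init__(self, processor_ids, timeout):
        # Maximum silence, in seconds, before a processor is presumed failed.
        self.timeout = timeout
        self.last_seen = {pid: 0.0 for pid in processor_ids}
        self.available = set(processor_ids)

    def heartbeat(self, pid, now):
        """Record that processor `pid` reported in at time `now`."""
        if pid in self.available:
            self.last_seen[pid] = now

    def detect_failures(self, now):
        """Remove processors that have been silent for longer than the timeout."""
        failed = {pid for pid in self.available
                  if now - self.last_seen[pid] > self.timeout}
        self.available -= failed  # no longer eligible for allocation
        return failed


table = ProcessorTable(processor_ids=[0, 1, 2], timeout=2.0)
for pid in (0, 1, 2):
    table.heartbeat(pid, now=1.0)
table.heartbeat(0, now=3.0)
table.heartbeat(1, now=3.0)      # processor 2 stays silent
failed = table.detect_failures(now=4.0)
```

Here processor 2 has been silent for 3 seconds at time 4.0, so it is reported as failed and dropped from the set of processors available for allocation, while processors 0 and 1 remain.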

Multiprocessing can improve performance in two ways: a single program can be decomposed into tasks that execute in parallel, or multiple independent jobs can be run in parallel. With decreasing hardware costs, it has become common to connect a large number of microprocessors to form a multiprocessor. In this way, large-scale computing power can be achieved without the use of costly ultra-high-speed processors.
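The decomposition of one program into parallel tasks can be sketched as follows. This is only an illustration under stated assumptions: the function names partial_sum and parallel_sum_of_squares are invented for the example, and a thread pool is used to keep the sketch simple. In CPython, threads show the structure of the decomposition but do not speed up CPU-bound work; a process pool would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """One independent task: sum the squares of a slice of the input."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Split `values` into at most `workers` chunks and combine the partial results."""
    size = max(1, (len(values) + workers - 1) // workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handed to a worker; results come back in chunk order.
        return sum(pool.map(partial_sum, chunks))


total = parallel_sum_of_squares(list(range(10)), workers=4)
```

The tasks are independent because each one reads only its own chunk, which is what allows them to be assigned to different processors.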

One of the most important capabilities of multiprocessor operating systems is their ability to withstand equipment failures in individual processors and to continue operation; this ability is referred to as fault tolerance.

Fault-tolerant systems can continue operating even when portions of the system fail. This kind of operation is especially important in so-called mission-critical systems. Fault tolerance is appropriate for systems in which it may not be possible for humans to intervene and repair the problem, such as deep-space probes, aircraft, and the like. It is also appropriate for systems in which the consequences of a failure could develop so quickly that humans could not intervene in time.

Many techniques are commonly used to facilitate fault tolerance. These include the following:
  •  Critical data for the system and the various processes should be maintained in multiple copies. These copies should reside in separate storage banks so that failures in individual components will not completely destroy the data.
  •  The operating system must be designed so that it can run the maximal configuration of hardware effectively, but it must also be able to run subsets of the hardware effectively in case of failures.
  •  Hardware error detection and correction capability should be implemented so that extensive validation is performed without interfering with the efficient operation of the system.
  •  Idle processor capacity should be used to attempt to detect potential failures before they occur.
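The first technique in the list, keeping critical data in multiple copies in separate storage banks, can be sketched as follows. This is a simplified illustration, not a real storage design: the class ReplicatedStore is hypothetical, each "bank" is just an in-memory dictionary, and a read accepts a value only if a strict majority of banks agree on it. Values are assumed to be hashable.

```python
from collections import Counter


class ReplicatedStore:
    """Keeps each value in several independent banks (plain dicts in this sketch)."""

    def __init__(self, bank_count=3):
        self.banks = [{} for _ in range(bank_count)]

    def write(self, key, value):
        """Store a copy of the value in every bank."""
        for bank in self.banks:
            bank[key] = value

    def read(self, key):
        """Return the value held by a strict majority of banks, else raise."""
        copies = [bank[key] for bank in self.banks if key in bank]
        if copies:
            value, votes = Counter(copies).most_common(1)[0]
            if votes > len(self.banks) // 2:
                return value
        raise LookupError(f"no majority copy for {key!r}")


store = ReplicatedStore(bank_count=3)
store.write("process_state", "running")
store.banks[1].clear()                   # simulate the loss of one bank
recovered = store.read("process_state")  # two of three banks still agree
```

With three banks, the data survives the loss or corruption of any single bank. If a second bank also fails, no majority remains and the read raises an error instead of returning a possibly wrong value.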