What is Markov Analysis?
Markov analysis is a method of analysis that can be applied to both repairable and non-repairable types of system. The basic output of a Markov analysis is the average time spent by the system in each of its distinct states before the system moves (or makes a transition) into some other distinct state. For example, such a transition or change of state will occur if the system suffers a component failure or if a repair has been carried out. A distinct change in the state of the system will have taken place in both of these cases.
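As a sketch of what this basic output looks like mathematically: for a model with constant transition rates, the average time spent in a state before leaving it is simply the reciprocal of the total transition rate out of that state (a standard result for such models; the notation q_ij is introduced here purely for illustration).

```latex
% Mean time spent in state i before the next transition, where q_{ij} denotes
% the (assumed constant) transition rate from state i to state j.
\mathbb{E}[T_i] \;=\; \frac{1}{\sum_{j \neq i} q_{ij}}
```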
The output from the Markov analysis enables a complete description of the system to be obtained in terms of its reliability, availability and resource utilisation (e.g. use of maintenance teams, spares holdings, buffers, etc.). Different system designs can also be explored by comparing their reliability and availability performance, as can the effect of small modifications to a given design under consideration. Results produced by a Markov analysis can then be used within a cost-benefit analysis to help identify the optimal design choice.
When can it be applied?
Like all quantitative methods in reliability engineering, Markov analysis requires component failure rates to be assumed for non-repairable systems and, in addition, repair rates for repairable systems. Markov analysis uses these rates within a mathematical model that includes all of the possible states of a system. Within this model the failure and repair rates then represent the rates at which the system makes transitions from one system state to another. Failure and repair rates therefore become transition rates within the model, which is called a Markov Chain. The “chain” is in fact more accurately described as a “network” of interconnected states, an interconnection being present between two states in the network if a transition (i.e. a component failure or repair) leads from one of the states to the other.
For example, in any given small time interval, a system that is in some specified state X at the start of the interval will move to some other state if a component of the system fails during the interval, or if a component that had already failed in state X is repaired and returned to service during the interval.
Fundamentally what Markov analysis does is to look at the network of all possible states of a system (called the State Space of the system) where transition rates between these states are determined by the failure and repair rates of the components. The simplest model or network assumes that components can be in just one of two possible states, failed or working, and a system with n components can therefore be in any one of 2^n possible states. The simplest network therefore contains 2^n states between which transitions take place according to the failure and repair rates of the relevant components. An example state-space diagram for a two component system is shown below.
State-space diagram for a two component system
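A minimal numerical sketch of this two component state space is given below. The failure rates, repair rates and the assumption of a parallel (1-out-of-2) arrangement are purely illustrative and are not taken from any particular system.

```python
# Minimal sketch of a two-component repairable system as a Markov model.
# All rates are illustrative assumptions, not data from a real system.
import numpy as np

lam1, lam2 = 1e-3, 2e-3   # assumed failure rates (per hour)
mu1, mu2 = 0.1, 0.1       # assumed repair rates (per hour)

# States: 0 = both working, 1 = component 1 failed, 2 = component 2 failed,
#         3 = both failed.  Q[i, j] is the transition rate from state i to j.
Q = np.zeros((4, 4))
Q[0, 1], Q[0, 2] = lam1, lam2   # failures from the fully working state
Q[1, 0], Q[1, 3] = mu1, lam2    # repair of 1, or failure of 2 as well
Q[2, 0], Q[2, 3] = mu2, lam1    # repair of 2, or failure of 1 as well
Q[3, 1], Q[3, 2] = mu2, mu1     # repair of one unit from the "both failed" state
np.fill_diagonal(Q, -Q.sum(axis=1))   # diagonal makes each row sum to zero

# Steady-state probabilities: solve pi Q = 0 with the probabilities summing to 1.
A = np.vstack([Q.T, np.ones(4)])
b = np.append(np.zeros(4), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Example output: availability of a parallel (1-out-of-2) arrangement,
# i.e. the fraction of time at least one component is working.
print("steady-state probabilities:", pi)
print("availability (at least one working):", pi[:3].sum())
```

The steady-state probabilities give the long-run fraction of time spent in each state, from which figures such as availability, repair team utilisation or demand on spares can be read off directly.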
There is one further assumption within the model on which Markov analysis depends. This assumption is that the transition rates between the states of a system do not depend on the states that the system has previously been in; usually it is assumed that the transition rates are constant and do not change over time. This means that the time for any given transition to take place has an Exponential distribution. The unique feature of an Exponential distribution is that it is “memoryless” in the sense that a given transition does not become more or less likely as time passes. As a consequence of this constant transition rate feature, the future of a Markov system depends purely on its current state and not on the historical transitions by which it has reached its current state. It is this feature that gives Markov systems their “memoryless” characteristic.
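A short derivation of this memoryless property, assuming a constant transition rate λ so that the transition time T has survival function e^(−λt):

```latex
% Memoryless property of the Exponential distribution with rate \lambda:
% given that the transition has not yet occurred by time s, the probability of
% surviving a further time t is the same as it was at time zero.
P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t)
```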
Mathematically this characteristic provides huge benefits in terms of simplifying the analysis of a system and enables its precise reliability or availability to be calculated. This has particular application when the effects of small design changes need to be quantified or where design changes for high integrity systems are investigated. Monte Carlo simulation can often be ineffective in such situations whereas Markov analysis is ideally suited provided that the underlying assumptions for a Markov system described above are satisfied. Even if the Exponential distribution does not accurately represent times to transition, the results of a Markov analysis can be surprisingly accurate, especially when assessing the effects of small design changes.
How is it applied?
Markov analysis is usually provided as a module within integrated reliability software suites such as Isograph’s Reliability Workbench and Item’s ToolKit. These provide a graphical user interface to facilitate the definition of system states and the possible transitions between them, and the failure and repair rates are usually then imported from the suite’s database of values.
Clearly a transition can only occur from the current state of the system, so transition rates are effective only from whichever state is current at any given time, i.e. they are conditional on the state from which they emanate being the current state. This means that dormant as well as active standby systems can be modelled in a Markov system. More difficult, but nevertheless still possible to model, are buffers such as storage tanks which gradually empty when certain types of failure occur and refill when the fault is repaired.
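As an illustration of how these conditional transition rates allow standby arrangements to be modelled, the sketch below sets up a duty/standby pair in which the idle standby fails at a reduced dormant rate. All rates, and the single repair team assumption, are illustrative only.

```python
# Sketch of a duty/standby pair in which the standby can fail while dormant
# at a reduced rate.  All rates are illustrative assumptions.
import numpy as np

lam_duty = 1e-3        # assumed failure rate of the running (duty) unit
lam_dormant = 1e-4     # assumed dormant failure rate of the idle standby
mu = 0.05              # assumed repair rate (single repair team)

# States: 0 = duty running, standby healthy
#         1 = duty failed, standby now running
#         2 = duty running, standby failed while dormant
#         3 = both failed (system down)
Q = np.zeros((4, 4))
Q[0, 1] = lam_duty      # duty fails, standby takes over
Q[0, 2] = lam_dormant   # standby fails while dormant (rate applies only here)
Q[1, 0] = mu            # duty repaired, back to the normal line-up
Q[1, 3] = lam_duty      # standby (now running) fails before the repair completes
Q[2, 0] = mu            # standby repaired
Q[2, 3] = lam_duty      # duty fails while the standby is unavailable
Q[3, 1] = mu            # one repair completes, system restored on a single unit
np.fill_diagonal(Q, -Q.sum(axis=1))

# Steady-state solution as before: unavailability is the time spent in state 3.
A = np.vstack([Q.T, np.ones(4)])
pi, *_ = np.linalg.lstsq(A, np.append(np.zeros(4), 1.0), rcond=None)
print("steady-state unavailability:", pi[3])
```

The key point is that the dormant failure rate appears only on transitions out of the state in which the standby is actually dormant, which is exactly the conditional behaviour described above.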
Common cause failures can also be modelled by suitably defining transitions from one system state to another that correspond to multiple failures. For example, if the current system state includes a duty and standby both working, a common cause failure would be represented by a transition from this system state directly to the state in which both the duty and standby have failed.
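A minimal sketch of this idea is given below, using a simple beta-factor split of the failure rate; the split and the rates are assumptions made purely for illustration.

```python
# Sketch: a common cause failure represented as a single transition that takes
# the system straight from "both working" to "both failed".
# The beta-factor split and all rates are illustrative assumptions.
import numpy as np

lam, mu = 1e-3, 0.1           # assumed failure and repair rates per unit
beta = 0.05                   # assumed common cause fraction (beta factor)
lam_ind = (1.0 - beta) * lam  # independent part of each unit's failure rate
lam_ccf = beta * lam          # common cause rate applied to the pair

# States: 0 = both working, 1 = one failed, 2 = both failed.
Q = np.zeros((3, 3))
Q[0, 1] = 2 * lam_ind         # either unit fails independently
Q[0, 2] = lam_ccf             # common cause failure: both units lost at once
Q[1, 0], Q[1, 2] = mu, lam_ind
Q[2, 1] = mu
np.fill_diagonal(Q, -Q.sum(axis=1))
# The matrix can then be solved for steady-state or time-dependent results
# exactly as in the earlier sketches.
```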
It can therefore be seen that there is great flexibility in what can be included in the model of a Markov system. Even degraded performance of components in a system can be modelled rather than simply the binary “working” and “failed” states that are normally considered in reliability analysis. Greater detail in the model however must be accompanied by more detailed knowledge of transition rates, so in practice the level of detail will be limited by the level of knowledge and available data on transition rates.
What does it produce?
Markov analysis can be viewed as the analytical equivalent of Monte Carlo Simulation (MCS). By “analytical” we mean a method that ultimately depends on formulas and their solution rather than the randomisation approach of MCS. However, Markov analysis and MCS are similar in the sense that they both represent a system in the form of a mathematical model that reflects the dynamics of the system. The Markov model, however, is more restrictive in that it assumes that transition times between system states have an Exponential distribution, whereas there are no such restrictions if the MCS approach is adopted.
Given this similarity, it is clear that a very wide range of output results can be obtained from a Markov analysis, such as reliability, availability, the fraction of system time spent in a given state, the average time spent in a given state, utilisation of repair teams, demand on spares holdings, etc. A complete picture of the system in terms of reliability parameters can therefore be obtained from a Markov analysis, and these parameters can be examined in terms of their sensitivity to the input assumptions. Being an analytical method, Markov analysis takes very little time to perform on modern microprocessors, in contrast with the generally much longer times needed for MCS runs. The analytical approach also means that practically exact answers are obtained from a Markov analysis, while MCS requires long runs to achieve reasonable accuracy, particularly when investigating small design changes or the performance of high integrity systems.
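As a sketch of how such results might be obtained in practice, the example below evaluates time-dependent state probabilities for a small illustrative model using the matrix exponential p(t) = p(0)e^(Qt); the “both failed” state is made absorbing (no repair out of it) so that the figure produced is a reliability rather than an availability. All rates are assumptions for illustration.

```python
# Sketch: time-dependent state probabilities of a small Markov model via the
# matrix exponential, p(t) = p(0) exp(Q t).  Rates are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

lam, mu = 1e-3, 0.1                      # assumed failure and repair rates
# States: 0 = both working, 1 = one failed, 2 = both failed (made absorbing,
# i.e. no repair out of it, so 1 - p(t)[2] is a reliability figure).
Q = np.array([
    [-2 * lam,       2 * lam,  0.0],
    [      mu, -(mu + lam),    lam],
    [     0.0,          0.0,   0.0],
])

p0 = np.array([1.0, 0.0, 0.0])           # start with both components working
for t in (100.0, 1000.0, 10000.0):       # hours
    p_t = p0 @ expm(Q * t)
    print(f"t = {t:>7.0f} h   reliability = {1.0 - p_t[2]:.6f}")
```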
Advantages and disadvantages
Markov analysis has the advantage of being an analytical method, which means that the reliability parameters for the system are in effect calculated by a formula. This brings the considerable advantages of speed and accuracy when producing results. Speed is especially useful when investigating many alternative design variations or exploring a range of sensitivities. Accuracy, in contrast, is vitally important when investigating small design changes or when the reliability or availability of high integrity systems is being quantified. Markov analysis has a clear advantage over MCS in respect of speed and accuracy, since MCS requires longer simulation runs to achieve higher accuracy and, unlike Markov analysis, does not produce an “exact” answer.
As in the case of applying MCS, Markov analysis requires great care during the model building phase since model accuracy is all-important in obtaining valid results. The assumptions implicit in Markov models, namely memorylessness and the Exponential distribution to represent times to failure and repair, impose constraints additional to those within MCS. Markov models can therefore become somewhat contrived if these implicit assumptions do not reflect sufficiently well the characteristics of a system and how it functions in practice. In order to gain the benefits of speed and accuracy that it can offer, Markov analysis therefore depends to a greater extent on the experience and judgement of the modeller than MCS does. Conversely, whilst MCS is a safer and more flexible approach, it does not always offer the speed and accuracy that may be required in particular system studies.