Before starting the actual analysis of a Safety or Reliability Study it is important to be sure that you
- have fully defined the aims and objectives of the Study
- are considering all of the design or operational features that will have a significant impact on the results of the analysis.
We have therefore put together a list of ten questions which we hope will help you to address each of these areas.
1. What is the question being asked?
Is it:
- What are all of the ways in which my system can fail?
- What is the probability that I will experience a particular consequence or outcome of concern?
- What are all of the potential scenarios that may result from a specified starting point?
- What is the average long term throughput that I can expect my system to achieve?
- Are we interested in repair and maintenance issues or only in the probability of failure?
It is only when we properly understand the problem (which means having agreed reliability or availability requirements) that we can identify the most appropriate tools which will help us to find an answer.
2. What needs to be included in the analysis?
What are the boundaries of the problem? For example:
- Are services such as power, service water, air conditioning or control included in the analysis or are they outside scope?
- Do resource constraints such as spares holdings and repair cover need to be considered?
- Do we need to consider software and human factors issues?
- Will sub-systems provided by suppliers be included? If so, will they be treated as black boxes or do they need to be analysed in detail?
It is important to make sure that everyone involved in the analysis agrees the boundaries of the system to ensure that no significant inputs are overlooked. Equally, for larger systems where different people may be analysing different sub-systems within the same system, clear identification of boundaries will reduce the risk that the same features are double counted by being included in more than one sub-system.
3. How detailed should the analysis be?
The level of detail required will depend on several factors including:
- The stage in the life-cycle that the product or system has reached
- The criticality of the system within the overall product / plant design
Analyses during the concept design stage will, by necessity, be less detailed than those carried out during the detailed design stage of a project or during the operating life, simply because there is less design information or operating experience available: decisions may not yet have been made regarding component selection, spares strategy or maintenance and testing requirements. As a result, analyses will be more “broad-brush” and there is little benefit in attempting an in-depth study which requires more than that level of information. By the detailed design stage, any analysis should go down to the level of Line Replaceable Units (LRUs) for any reliability, availability or safety critical items, although the precise number of maintenance staff of each skill type, the location of the stores and the number of spares held may still not have been finalised. These issues should, however, certainly be considered in any assessment carried out during the operating life of the system.
Reliability analyses take time and effort. When time and effort are limited, it is therefore best to concentrate on those parts of the product or plant which have a significant detrimental impact on safety or reliability when they fail, or which are expected to be unreliable. This does not mean that there is no benefit in analysing other parts of the product or plant, but it is unlikely that the benefits would be as great as for the high consequence / low reliability areas.
4. What do we mean by “failure”?
This may sound like a pointless question but it is crucial that everyone involved in the analysis agrees on a common definition of failure. Most reliability engineering techniques make the simplifying assumption that systems and equipment items are either working or failed, with no intermediate state. This assumption is reasonable for something like a light bulb, which usually is either working or not. However, for other types of equipment (e.g. a pump), performance may be on a continuous scale from 0-100% throughput. Alternatively, a system may be working at full capacity, in a degraded state (e.g. 50% capacity) or completely failed. The degraded state is obviously less desirable but may be sufficient in an emergency situation. It is important that everyone agrees what level of degradation would be classed as a failure within the situation being addressed.
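One way to make an agreed definition explicit is to record it as a threshold on delivered performance. The following is only a minimal sketch: the function name `is_failed`, the percentage scale and the two threshold values are assumptions chosen for illustration, not part of any particular method.

```python
def is_failed(throughput_pct: float, required_pct: float) -> bool:
    """Return True if delivered throughput is below the agreed requirement.

    Both arguments are percentages of full capacity (0-100).
    """
    for value in (throughput_pct, required_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("percentages must lie between 0 and 100")
    return throughput_pct < required_pct


# The same degraded pump (50% throughput) judged against two
# hypothetical, pre-agreed definitions of failure.
normal_duty_failed = is_failed(50.0, required_pct=100.0)    # counts as failed
emergency_duty_failed = is_failed(50.0, required_pct=50.0)  # counts as working
```

The point of the sketch is that the equipment state is identical in both calls; only the agreed requirement changes the classification.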
5. How will the system or product be used?
The probability of failure will be influenced by many factors. Two of the more important of these are:
- Environment. The more aggressive the environment (in terms of temperature, vibration, loading or exposure to aggressive materials for example), the greater the stresses on the components comprising the system or product and the greater the probability of failure.
- Operating Regime. Similarly, the operating regime may also introduce stresses on the components, e.g. through frequent starts or constant changes of state.
The more that is understood about the way in which the system or product will be used, the better the stresses to which it may be exposed can be taken into account. This can be done through the selection of appropriate components, the addition of spare capacity and attention to maintainability, including appropriate preventive maintenance activities together with the access, tools and resources to do the work. We can also select the failure or repair data that is most appropriate to the proposed environment and operating regime.
6. What needs to work for the system or product to work?
In order to understand the effect of failure of any component within the system or product, it is necessary first to understand the role of that component within the larger system. For example, is that component a single point of failure which will cause the entire system or product to fail or is there spare capacity installed which can take on the duty of the failed component? If the latter, does the spare capacity take over immediately and automatically or will there be a brief interruption to service either whilst the spare capacity comes online or to allow manual changeover between the failed and spare component? We need to be able to identify precisely which components need to work for the system or product to work, and in what combination.
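The difference between a single point of failure and installed spare capacity can be illustrated with the standard series, parallel and k-out-of-n reliability formulas. This is a sketch under stated assumptions: component failures are independent, changeover to the spare is perfect and instantaneous (so the brief interruptions mentioned above are ignored), and the reliability values 0.99 and 0.9 are invented for the example.

```python
from math import comb, prod


def series(reliabilities):
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one component must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def k_out_of_n(k, n, r):
    """At least k of n identical components (each with reliability r) must work."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


VALVE = 0.99  # hypothetical reliability over the period of interest
PUMP = 0.90   # hypothetical reliability over the period of interest

single_pump = series([VALVE, PUMP])                       # pump is a single point of failure
duplicated_pump = series([VALVE, parallel([PUMP, PUMP])])  # one duty pump, one spare
two_of_three_pumps = k_out_of_n(2, 3, PUMP)               # any two of three pumps suffice
```

With these illustrative numbers the single-pump arrangement gives about 0.891 and the duplicated arrangement about 0.980, which shows why the success logic has to be established before any figures are calculated.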
Where a product or system can operate in a range of configurations (e.g. an aircraft during taxi-ing, take-off, flight, landing), the question about what needs to work for the system to work may have different answers depending on the configuration. Therefore different combinations of components may be required in each configuration and this will need to be reflected within the resulting reliability or safety analysis.
7. What are the data requirements?
The actual equipment data requirements depend on the type of study being carried out. Reliability analyses require information on component start failure probabilities, failure rates and failure mode proportions. Maintainability analyses, on the other hand, require mean restoration times (waiting time plus repair time) together with failure rates and failure mode proportions, while availability studies use spurious trip rates in addition to the data required for both reliability and maintainability analyses.
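To show how these data items feed the different types of study, the sketch below applies two textbook relationships: mission reliability exp(-λt) and steady-state availability MTTF / (MTTF + mean restoration time). It assumes a constant failure rate and a single repairable item; the function names and all numerical values are illustrative assumptions.

```python
from math import exp


def mission_reliability(failure_rate_per_hour, mission_hours):
    """Probability of surviving the mission, assuming a constant failure rate."""
    return exp(-failure_rate_per_hour * mission_hours)


def mean_restoration_time(waiting_hours, repair_hours):
    """Mean restoration time is waiting time plus repair time."""
    return waiting_hours + repair_hours


def steady_state_availability(failure_rate_per_hour, restoration_hours):
    """Long-run fraction of time in the working state for one repairable item."""
    mean_time_to_failure = 1.0 / failure_rate_per_hour
    return mean_time_to_failure / (mean_time_to_failure + restoration_hours)


FAILURE_RATE = 1.0e-4  # failures per hour (hypothetical)

reliability_1000h = mission_reliability(FAILURE_RATE, 1000.0)
restoration = mean_restoration_time(waiting_hours=16.0, repair_hours=8.0)
availability = steady_state_availability(FAILURE_RATE, restoration)
```

Note that the reliability figure uses no repair data at all, whereas the availability figure depends on both the failure rate and the restoration time.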
Other data that may be important include equipment capacities (to help determine spare capacity), buffer storage (e.g. tanks or batteries) and maintenance requirements.
8. Can appropriate data be found?
Having established what data are required for the analysis, the next question is “where can the relevant information be found?” In the case of established systems or products, it is to be hoped that component failure and maintenance histories are already available, allowing the analyst to estimate the required information directly from relevant experience. However, if the analysis is for a new product or system where such experience is not available then it is necessary to refer to other sources.
Generic databases can be useful for obtaining failure rates for proven components which are already used in similar existing products or systems, or during the initial design phase when the precise makes and models of components have not been determined. However, even then it is important, where possible, to ensure that the generic data are appropriate to the planned operating philosophy and environment. Factors that should be considered when selecting data sources include:
- The life of the data source system compared with the system being designed or operated. For example, data collected for a component used in mobile phones which only have a useful life of 2-3 years may be inappropriate if that component were then to be used in a system which was being designed for a 25 year life.
- The tolerance of the design to component performance or variability in quality – the data source system may have been designed to have a greater tolerance than the design now being considered. As a result a degree of degradation in performance that was tolerable in the data source system may lead to failure in the proposed design (or vice versa).
It is therefore important not to simply take data from generic databases at face value but to consider how your system compares with the data source system and, if necessary, to modify the reliability figure using expert judgement to reflect differences in performance that may be expected between the data source system and the one that is being analysed.
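One simple way to keep such judgements visible is to record each adjustment alongside its justification rather than overwriting the generic figure. The sketch below uses multiplicative factors, which is a common convention but only one possible approach; the `Adjustment` class, the function name and every number are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Adjustment:
    reason: str    # the expert-judgement justification, kept for audit
    factor: float  # >1 raises the failure rate, <1 lowers it


def adjusted_failure_rate(generic_rate, adjustments):
    """Scale a generic failure rate by each recorded adjustment factor."""
    for adjustment in adjustments:
        if adjustment.factor <= 0.0:
            raise ValueError(f"factor must be positive: {adjustment.reason}")
    return generic_rate * prod(a.factor for a in adjustments)


GENERIC_RATE = 2.0e-6  # failures per hour from a generic source (hypothetical)

adjustments = [
    Adjustment("harsher vibration environment than the data source system", 1.5),
    Adjustment("design is less tolerant of performance drift", 2.0),
]

rate = adjusted_failure_rate(GENERIC_RATE, adjustments)
```

Keeping the reasons with the factors means the basis for the modified figure can be reviewed later when operating experience becomes available.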
9. What is the maintenance strategy?
The maintenance strategy is particularly important for availability and maintainability studies since it will have a significant impact on the waiting time before the repair of a failed component is started. It is therefore important to establish:
- What will be the level of maintenance cover (for example, is it 24 hours a day, seven days a week or only during normal working hours?)
- Is maintenance provided in-house or is it contracted out to third parties who may have specialist skills but require a longer mobilisation time?
- What equipment spares will be held and where?
- What will be the preventive maintenance schedule and when will it be carried out (e.g. during overhauls, at weekends, evenly over time)? Preventive maintenance can be a particularly important consideration for mechanical plant where it can account for a high proportion of an item’s unavailability.
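The effect of these choices on unavailability can be sketched with simple annual downtime arithmetic. The example assumes that preventive maintenance takes the item out of service, that planned and unplanned downtime do not overlap, and that downtime is small enough for the annual failure count to be treated as fixed; the function name and all figures are invented for illustration.

```python
HOURS_PER_YEAR = 8760.0


def unavailability_breakdown(failures_per_year, waiting_hours, repair_hours,
                             pm_hours_per_year):
    """Split annual downtime into corrective and preventive contributions."""
    corrective = failures_per_year * (waiting_hours + repair_hours)
    total = corrective + pm_hours_per_year
    return {
        "corrective_hours": corrective,
        "preventive_hours": pm_hours_per_year,
        "unavailability": total / HOURS_PER_YEAR,
        "pm_share": pm_hours_per_year / total if total > 0.0 else 0.0,
    }


# Same item and preventive maintenance schedule, two hypothetical levels
# of maintenance cover that differ only in mean waiting time.
round_the_clock = unavailability_breakdown(2.0, waiting_hours=2.0,
                                           repair_hours=8.0,
                                           pm_hours_per_year=48.0)
working_hours_only = unavailability_breakdown(2.0, waiting_hours=16.0,
                                              repair_hours=8.0,
                                              pm_hours_per_year=48.0)
```

In this illustration preventive maintenance accounts for half or more of the downtime in both cases, and the longer waiting time under working-hours-only cover more than doubles the corrective contribution.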
10. Are there any mitigating features?
Other features that can improve the reliability of the system include:
- Storage. Storage will typically take the form of storage tanks or reservoirs, which are particularly relevant in process industries where they are often an intrinsic part of the design. The primary role of these tanks may be simply to provide storage such that the system can operate in the presence of variable demand, but tanks will also allow the downstream process to continue to operate for a period of time in the event of a failure upstream. As such they can reduce the frequency with which the system output is lost or reduced. However, storage need not take the form of tanks; for example, it may be a battery in an Uninterruptible Power Supply (UPS).
- Monitoring equipment. Monitoring (as opposed to control) equipment is often not critical to the successful operation of a product or system. However, when working, monitors allow other equipment failures to be recognised more quickly, thereby reducing downtimes and hence improving availability and maintainability.
- By-passes and temporary lash-up facilities. There may be situations where a system does not need to be out of action until the failed component has been repaired or replaced. Rather the system can be re-configured or temporary arrangements put in place that will allow the system to continue to operate, albeit in a non-ideal way. Examples include the connection of a bowser or chemical supply in the event that a make-up system fails.
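The benefit of storage described in the first item above can be estimated with a simple calculation: output is lost only when an upstream outage lasts longer than the hold-up time of the buffer. The sketch assumes that the buffer is full when the failure occurs and that restoration times are exponentially distributed; both are simplifying assumptions, and the function names and values are hypothetical.

```python
from math import exp


def prob_outage_outlasts_buffer(buffer_hours, mean_restoration_hours):
    """P(restoration time > buffer hold-up), for exponential restoration times."""
    if buffer_hours < 0.0 or mean_restoration_hours <= 0.0:
        raise ValueError("buffer must be >= 0 and mean restoration time > 0")
    return exp(-buffer_hours / mean_restoration_hours)


def downstream_loss_frequency(upstream_failures_per_year, buffer_hours,
                              mean_restoration_hours):
    """Expected number of upstream failures per year that interrupt output."""
    return upstream_failures_per_year * prob_outage_outlasts_buffer(
        buffer_hours, mean_restoration_hours)


without_buffer = downstream_loss_frequency(4.0, buffer_hours=0.0,
                                           mean_restoration_hours=8.0)
with_buffer = downstream_loss_frequency(4.0, buffer_hours=8.0,
                                        mean_restoration_hours=8.0)
```

Under these assumptions, a buffer equal to the mean restoration time reduces the frequency of lost output from 4 per year to roughly 1.5 per year. A partially filled buffer or a different repair time distribution would change the result.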
Thinking about the issues associated with each of the ten questions above will not guarantee a complete and effective safety or reliability analysis but it should at least reduce the possibility that important considerations are overlooked. It will also minimise the risk of embarking on an analysis which cannot be completed due to a lack of sufficient knowledge of the system or equipment reliability data.
If you are about to start a safety or reliability analysis, we hope that some of the points raised here will help you prepare for it. If you need advice before starting, please feel free to contact us and we will be very pleased to help if we can.