On top of the hardware, ASTS employs SUSE Linux with the High-Availability (HA) Extension: a COTS OS shipped with two clustering software components, the Pacemaker Cluster Resource Manager (CRM) and Corosync, which are responsible for resource orchestration, failure diagnosis, node coordination, and fail-over management. The Pacemaker CRM monitors the health and status of cluster resources, manages their dependencies, and automatically stops and starts services. It relies on the Corosync messaging layer, which provides reliable messaging as well as the membership and quorum information the cluster needs for node orchestration. For simplicity, in the rest of this paper, Pacemaker and Corosync are treated as a single entity and referred to as the CRM.
The topmost layer of the system is the ASTS proprietary TMS application. It consists of several SW modules that communicate with the IXL through a CORBA message broker. Since the ASTS proprietary application software is beyond the scope of our work, the reliability of the broker itself is not addressed in our study (except for aspects related to its interaction with other software modules). The decision to use a CORBA broker was taken by the ASTS proprietary application software development team. In this study, we assume that the ASTS proprietary application software (which includes the broker) is the result of a rigorous development process and of thorough reliability testing. The application software uses FT-CORBA, which embeds fault tolerance mechanisms that help to ensure a resilient and highly available message broker service. FT-CORBA has been used, e.g., in an Air Traffic Control System, which is itself a safety-critical system.
Concepts and approach

The Safety Integrity Level (SIL) indicates the tolerable failure rate, interchangeably referred to as the Tolerable Hazard Rate (THR), of a particular Safety Instrumented Function (SIF). The TMS implements one main SIF: the continuous monitoring and control of a particular railway area, which must be kept in a safe state with respect to the functional or non-functional failures defined in 3. The correct provision of the SIF does not depend on the time at which it is performed, i.e., the TMS has no stringent real-time requirements.

This work assumes that functional or non-functional failures can be generated by (Fig. 2): (1) Hardware Faults, caused by a hard or soft malfunction within the electronic circuits or electromechanical components, which require a repair intervention or a reboot, respectively; (2) Software Faults, caused either by errors influenced by the total time the system has been running, i.e., Aging-Related Mandelbugs, or by any other SW fault not caused by aging errors, i.e., Non-Aging-Related Mandelbugs. Software aging is often a cause of Linux failures at both kernel and user level, as Cotroneo et al.  demonstrate. Aging-related faults, caused by the exhaustion of OS resources (e.g., a memory leak), can reduce performance or culminate in a system hang or crash.

If a fail-stop failure of a node occurs, that is, the server node crashes or hangs, then the CRM detects it, unless the CRM itself has failed. The TMS cluster stops providing services (it is down) when: the active node fails and the CRM fail-over operation is in progress, or the active node and the CRM fail, or the active and standby nodes fail together. The ratio between the number of manifested TMS failures and the operating time is the failure rate, which is the inverse of the Mean Time Between Failures (MTBF): the mean operating time (uptime) between failures of the cluster.
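The outage conditions and the failure-rate definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the assessment procedure used in this work: the names `ClusterState`, `cluster_is_down`, `failure_rate`, `mtbf`, and `within_sil2` are hypothetical, and the SIL2 band is an assumption taken from the usual IEC 61508 / EN 50129 tables for continuously operating functions (failures per hour).

```python
from dataclasses import dataclass

# Assumed SIL2 band in failures per hour (standard tables, not from this paper).
SIL2_LOWER = 1e-7
SIL2_UPPER = 1e-6


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of the two-node cluster and its CRM."""
    active_failed: bool
    standby_failed: bool
    crm_failed: bool
    failover_in_progress: bool


def cluster_is_down(state: ClusterState) -> bool:
    """True for any of the three outage conditions listed in the text."""
    return (
        (state.active_failed and state.failover_in_progress)
        or (state.active_failed and state.crm_failed)
        or (state.active_failed and state.standby_failed)
    )


def failure_rate(failures: int, operating_hours: float) -> float:
    """Manifested failures divided by operating time (failures per hour)."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def mtbf(rate: float) -> float:
    """Mean Time Between Failures, the inverse of the failure rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


def within_sil2(rate: float) -> bool:
    """Check the rate against the assumed SIL2 band."""
    return SIL2_LOWER <= rate < SIL2_UPPER
```

For instance, under these assumptions, two failures observed over 4,000,000 operating hours give a rate of 5e-7 failures per hour, which lies inside the assumed band, whereas a standby-only failure does not count as an outage because the active node keeps providing the service.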
The overall goal of the assessment process is to provide evidence demonstrating that the TMS failure rate falls within SIL2 bounds: $10^{-7} \le \mathrm{THR} < 10^{-6}$ failures per hour. The assessment would have been easier had it been possible to build the system from scratch using hardware and software components dedicated to mission-critical systems (e.g., VxWorks, PikeOS, QNX). However, this would have meant a drastic increase in costs for ASTS, which cannot be justified for a system without stringent real-time requirements. For this reason, the fulfillment of SIL2 is pursued within the context of the already defined TMS architecture, using the COTS components that have already been selected. The paper shows how the existing design can be configured so as to achieve the desired level of reliability. To do that, a hybrid approach is pursued (Fig. 3), composed of the following three main phases: