PREDICTIVE FAILURE DETECTION FOR FAULT TOLERANCE IN DISTRIBUTED OPERATING SYSTEMS

Authors

  • Talha Ilyas, Department of Computer Science, Institute of Distributed Computing and Systems Engineering, Lahore, Pakistan

DOI:

https://doi.org/10.64035/car.01.2026.27

Keywords:

Distributed Operating Systems, Fault Tolerance, Predictive Failure Detection, System Availability, Failure Recovery

Abstract

Distributed operating systems are widely used to support scalable, reliable, and high-performance computing environments; however, their dependence on multiple interconnected nodes makes them vulnerable to failures caused by hardware faults, network instability, workload imbalance, and software-level errors. Traditional fault-tolerance mechanisms usually respond after a failure has occurred, which may increase downtime, reduce service availability, and affect system performance. This paper presents a predictive failure detection approach for enhancing fault tolerance in distributed operating systems. The proposed approach monitors system-level indicators such as CPU usage, memory consumption, disk activity, network latency, error logs, heartbeat signals, and workload variation to identify early signs of node or process failure. By applying predictive analysis, the system can detect abnormal behavior before complete failure occurs and initiate preventive actions such as process migration, checkpoint recovery, task replication, resource reallocation, or node isolation. The results demonstrate that predictive failure detection improves system availability, reduces mean time to recovery, lowers failure-related service interruptions, and maintains stable throughput under fault-prone conditions. The findings suggest that integrating prediction-based monitoring with existing fault-tolerance strategies can significantly improve the resilience of distributed operating systems. This study highlights the importance of proactive failure management in modern distributed environments, particularly for cloud platforms, data centers, edge computing systems, and large-scale enterprise applications where continuous service delivery is essential.
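The abstract does not state which predictive model is used, so the following is only a minimal sketch of the monitor-then-act loop it describes, assuming a rolling z-score over each node's recent metric history as the predictor. The class name NodeMonitor, the metric key cpu_pct, the thresholds, and the action labels are all hypothetical and chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class NodeMonitor:
    """Per-node monitor: compares each new sample with its recent baseline.

    All defaults below are assumptions, not values taken from the paper.
    """
    window: int = 20          # samples kept per metric
    warmup: int = 5           # samples required before scoring starts
    z_threshold: float = 3.0  # upward deviation treated as an early warning
    max_missed: int = 3       # missed heartbeats before isolating the node
    history: dict = field(default_factory=dict)

    def observe(self, metrics: dict) -> dict:
        """Return a z-score per metric against its history, then record the sample."""
        scores = {}
        for name, value in metrics.items():
            past = self.history.setdefault(name, deque(maxlen=self.window))
            if len(past) >= self.warmup:
                mu, sigma = mean(past), pstdev(past)
                # Guard against a perfectly flat history (sigma == 0).
                scores[name] = (value - mu) / max(sigma, 1e-9)
            past.append(value)
        return scores

    def recommend(self, metrics: dict, missed_heartbeats: int = 0) -> str:
        """Map the current observation to one hypothetical preventive action."""
        scores = self.observe(metrics)
        if missed_heartbeats >= self.max_missed:
            return "isolate_node"
        if any(z > self.z_threshold for z in scores.values()):
            return "migrate_processes"
        return "none"


if __name__ == "__main__":
    monitor = NodeMonitor()
    for cpu in [40, 41, 40, 39, 40, 41, 40, 39, 40, 41, 95]:
        print(cpu, monitor.recommend({"cpu_pct": cpu}))
```

In this sketch, a sudden rise in a metric relative to its own recent baseline triggers migration before the node stops responding, while repeated missed heartbeats trigger isolation. A real deployment would likely combine several indicators and a trained model rather than a single threshold.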

Published

2026-06-30

How to Cite

Ilyas, T. (2026). PREDICTIVE FAILURE DETECTION FOR FAULT TOLERANCE IN DISTRIBUTED OPERATING SYSTEMS. Computing and Applications Reviews, 3(01), 36-55. https://doi.org/10.64035/car.01.2026.27