Continuous Improvement Using Comprehensive Root Cause Analysis

Presented at SREcon 2015 by

At the Argonne Leadership Computing Facility, we operate Mira, a 786K-core supercomputer built for scalable, tightly coupled scientific workloads. In operating this system and its predecessor, we have developed a process for continuous system improvement based on root cause analysis of every failed job.