CPL - Chalmers Publication Library

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

X. Fu ; R. Ren ; Sally A McKee (Department of Computer Science and Engineering, Computer Engineering (Chalmers)) ; J. Zhan ; N. Sun
2014 IEEE International Conference on Cluster Computing, CLUSTER 2014 p. 103-112. (2014)
[Conference paper, peer-reviewed]

© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become normal. System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. Many methods for failure prediction are based on analyzing event logs for large scale systems, but there is still neither a widely used one to predict failures based on both non-fatal and fatal events, nor a precise one that uses fine-grained information (such as failure type, node location, related application, and time of occurrence). A deeper and more precise log analysis technique is needed. We propose a three-step approach to draw out event dependencies and to identify failure-event generating processes. First, we cluster frequent event sequences into event groups based on common events. Then we infer causal dependencies between events in each event group. Finally, we extract failure rules based on the observation that events of the same event types, on the same nodes or from the same applications have similar operational behaviors. We use this rich information to improve failure prediction. Our approach semi-automates diagnosing the root causes of failure events, making it a valuable tool for system administrators.
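The three steps in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract does not specify how grouping, dependency inference, or rule extraction are implemented, so the sketch substitutes simple stand-ins (merging sequences that share an event, an ordering-based confidence score, and a filter on fatal consequents). All event names, function names, and the 0.6 threshold are hypothetical.

```python
from collections import defaultdict


def group_sequences(sequences):
    """Step 1 (stand-in): merge sequences that share at least one event type."""
    groups = []  # each group is a list of event sequences
    for seq in sequences:
        events = set(seq)
        merged, remaining = [seq], []
        for group in groups:
            if any(events & set(other) for other in group):
                merged.extend(group)  # overlaps with seq: fold into its group
            else:
                remaining.append(group)
        groups = remaining + [merged]
    return groups


def infer_dependencies(group, min_confidence=0.6):
    """Step 2 (stand-in): score a -> b by how often a precedes b.

    confidence(a -> b) = (#sequences where a occurs before b)
                         / (#sequences containing a)
    This is an ordering heuristic, not a real causal inference method.
    """
    contains = defaultdict(int)
    precedes = defaultdict(int)
    for seq in group:
        for event in set(seq):
            contains[event] += 1
        pairs = {(a, b) for i, a in enumerate(seq) for b in seq[i + 1:] if a != b}
        for pair in pairs:
            precedes[pair] += 1
    deps = {}
    for (a, b), count in precedes.items():
        confidence = count / contains[a]
        if confidence >= min_confidence:
            deps[(a, b)] = confidence
    return deps


def extract_failure_rules(deps, fatal_events):
    """Step 3 (stand-in): keep dependencies whose consequent is a fatal event."""
    return {pair: conf for pair, conf in deps.items() if pair[1] in fatal_events}


# Hypothetical frequent event sequences, e.g. mined from a system log.
sequences = [
    ["ECC_WARN", "MEM_ERR", "NODE_CRASH"],
    ["ECC_WARN", "NODE_CRASH"],
    ["ECC_WARN", "MEM_ERR"],
    ["NET_TIMEOUT", "JOB_ABORT"],
]
fatal_events = {"NODE_CRASH", "JOB_ABORT"}  # assumed labels for fatal events

groups = group_sequences(sequences)
rules = {}
for group in groups:
    rules.update(extract_failure_rules(infer_dependencies(group), fatal_events))

for (cause, failure), conf in sorted(rules.items()):
    print(f"{cause} -> {failure} (confidence {conf:.2f})")
```

The fine-grained attributes the abstract mentions (failure type, node location, related application, time of occurrence) are left out here; a fuller version would carry them on each event and condition the rules on them.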

Keywords: event causal dependency inference, failure prediction, large-scale cluster systems, root cause diagnosis

Article number 6968768

This record was created 2015-01-02. Last modified 2016-03-22.
CPL Pubid: 209323



Departments (Chalmers)

Department of Computer Science and Engineering, Computer Engineering (Chalmers)


