Data corruption is a significant obstacle across scientific computing domains because the computational pipeline is inherently susceptible to it. It can cause critical computations (e.g., hurricane path prediction, autopilot decisions) to fail, which may lead to catastrophic outcomes such as loss of human life or property damage. Eliminating data corruption outright is often infeasible in real-world conditions. Hence, accepting its persistent presence in daily operations and focusing on profiling, understanding, and mitigating its impact emerges as a practical alternative. However, profiling the reliability of computations under data corruption poses its own challenges: limited testing coverage and the opaque nature of computations make the task inherently difficult. Error propagation analysis is a valuable yet often overlooked technique that yields insights into the resiliency of applications, and it holds promise for assessing computation resiliency at scale. Nonetheless, the complexity of the computations, coupled with a multitude of intermediate states, makes effective propagation analysis challenging.
This dissertation first presents a novel visualization system that enables HPC researchers to track, explore, and understand error propagation dynamics within high-performance numerical kernels. It also introduces a novel error propagation analysis framework that efficiently assesses the impact of silent data corruption in high-performance numerical kernels. Profiling the impact of data corruption during machine learning prediction and training shares similarities with profiling data corruption in HPC: the intricate nonlinear operations and opaque training process make it difficult to quantify errors and interpret predictions. In response, this dissertation introduces visualization techniques designed to reveal the robustness of neural network predictions through the high-dimensional geometric properties of neural network feature representations.