Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks|
A. Bhatele, J. Yeom, N. Jain, C. J. Kuhlman, Y. Livnat, K. R. Bisset, L. V. Kale, M. V. Marathe. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May, 2017.
Controlling the spread of infectious diseases in large populations is an important societal challenge. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. We describe EpiSimdemics, a highly scalable, parallel code written in Charm++ that uses agent-based modeling to simulate disease spreads over large, realistic, co-evolving interaction networks. We present a new parallel implementation of EpiSimdemics that achieves unprecedented strong and weak scaling on different architectures — Blue Waters, Cori and Mira. EpiSimdemics achieves five times greater speedup than the second fastest parallel code in this field. This unprecedented scaling is an important step to support the long term vision of real-time epidemic science. Finally, we demonstrate the capabilities of EpiSimdemics by simulating the spread of influenza over a realistic synthetic social contact network spanning the continental United States (∼280 million nodes and 5.8 billion social contacts).
Progressive CPU Volume Rendering with Sample Accumulation|
W. Usher, J. Amstutz, C. Brownlee, A. Knoll, I. Wald. In Eurographics Symposium on Parallel Graphics and Visualization, Edited by Alexandru Telea and Janine Bennett, The Eurographics Association, 2017.
We present a new method for progressive volume rendering by accumulating object-space samples over successively rendered frames. Existing methods for progressive refinement either use image space methods or average pixels over frames, which can blur features or integrate incorrectly with respect to depth. Our approach stores samples along each ray, accumulates new samples each frame into a buffer, and progressively interleaves and integrates these samples. Though this process requires additional memory, it ensures interactivity and is well suited for CPU architectures with large memory and cache. This approach also extends well to distributed rendering in cluster environments. We implement this technique in Intel's open source OSPRay CPU ray tracing framework and demonstrate that it is particularly useful for rendering volumetric data with costly sampling functions.
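The accumulation idea in the abstract above can be illustrated for a single ray. This is a minimal sketch under stated assumptions, not the OSPRay implementation: the function names, the fixed per-frame depth offsets, the unit emission, and the `density` callback are all hypothetical choices made for illustration. Each frame contributes samples at shifted depths, the samples are kept in a per-ray buffer, and the buffer is re-integrated front to back in depth order after every frame.

```python
import math


def take_frame_samples(frame, num_frames, per_frame, ray_length, density):
    """Sample the density along one ray; each frame uses a different depth offset."""
    offset = frame / num_frames  # fraction of one sample spacing (assumed scheme)
    depths = [(i + offset) * ray_length / per_frame for i in range(per_frame)]
    return [(t, density(t)) for t in depths]


def integrate(samples, ray_length):
    """Front-to-back compositing of depth-sorted samples, assuming unit emission."""
    ordered = sorted(samples)  # interleave samples from all frames by depth
    color, transmittance = 0.0, 1.0
    for idx, (t, sigma) in enumerate(ordered):
        # Each sample represents the segment up to the next sample (or the ray end).
        t_next = ordered[idx + 1][0] if idx + 1 < len(ordered) else ray_length
        alpha = 1.0 - math.exp(-sigma * (t_next - t))
        color += transmittance * alpha
        transmittance *= 1.0 - alpha
    return color


def render_progressively(density, ray_length=1.0, per_frame=4, num_frames=4):
    """Accumulate samples frame by frame and re-integrate after each frame."""
    sample_buffer, results = [], []
    for frame in range(num_frames):
        sample_buffer.extend(
            take_frame_samples(frame, num_frames, per_frame, ray_length, density)
        )
        results.append(integrate(sample_buffer, ray_length))
    return results


if __name__ == "__main__":
    # Linearly increasing density; the exact result for unit emission is 1 - exp(-0.5).
    for frame, value in enumerate(render_progressively(lambda t: t)):
        print(frame, value)
```

Because the samples are integrated in depth order rather than averaged per pixel, each added frame refines the same integral instead of blending separate estimates. The memory cost is the per-ray sample buffer, which grows with every frame.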
Pathways for Theoretical Advances in Visualization|
M. Chen, G. Grinstein, C. R. Johnson, J. Kennedy, M. Tory. In IEEE Computer Graphics and Applications, IEEE, pp. 103--112. July, 2017.
More than a decade ago, Chris Johnson proposed the "Theory of Visualization" as one of the top research problems in visualization. Since then, there have been several theory-focused events, including three workshops and three panels at IEEE Visualization (VIS) Conferences. Together, these events have produced a set of convincing arguments.
OpenMP 4 Fortran Modernization of WSM6 for KNL|
T.A.J. Ouermi, A. Knoll, R.M. Kirby, M. Berzins. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC17, No. 12, ACM, pp. 12:1--12:8. 2017.
Parallel code portability in the petascale era requires modifying existing codes to support new architectures with large core counts and SIMD vector units. OpenMP is a well established and increasingly supported vehicle for portable parallelization. As architectures mature and compiler OpenMP implementations evolve, best practices for code modernization change as well. In this paper, we examine the impact of newer OpenMP features (in particular OMP SIMD) on the Intel Xeon Phi Knights Landing (KNL) architecture, applied in optimizing loops in the single moment 6-class microphysics module (WSM6) in the US Navy's NEPTUNE code. We find that with functioning OMP SIMD constructs, low thread invocation overhead on KNL and reduced penalty for unaligned access compared to previous architectures, one can leverage OpenMP 4 to achieve reasonable scalability with relatively minor reorganization of a production physics code.
Improving Uintah's Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks|
J. K. Holmen, A. Humphrey, D. Sunderland, M. Berzins. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, PEARC17, No. 27, pp. 27:1--27:8. 2017.
The University of Utah's Carbon Capture Multidisciplinary Simulation Center (CCMSC) is using the Uintah Computational Framework to predict performance of a 1000 MWe ultra-supercritical clean coal boiler. The center aims to utilize the Intel Xeon Phi-based DOE systems, Theta and Aurora, through the Aurora Early Science Program by using the Kokkos C++ library to enable node-level performance portability. This paper describes infrastructure advancements and portability improvements made possible by our integration of Kokkos within Uintah. Scalability results are presented that compare serial and data parallel task execution models for a challenging radiative heat transfer calculation, central to the center's predictive boiler simulations. These results demonstrate good strong-scaling characteristics to 256 Knights Landing (KNL) processors on the NSF Stampede system and show that the KNL-based calculation competes with prior GPU-based results for the same calculation.
Department of Energy, Advanced Scientific Computing Advisory Committee (ASCAC), Subcommittee on LDRD Review Final Report|
M. Berzins, D. A. Bonnell, J. Cizewski, K. M. Heeger, A.J.G. Hey, C. J. Keane, B. A. Ramsey, K. A. Remington, J.L. Rempe. May, 2017.
Using the material point method to model chemical/mechanical coupling in the deformation of a silicon anode|
C. Gritton, J. Guilkey, J. Hooper, D. Bedrov, R. M. Kirby, M. Berzins. In Modelling and Simulation in Materials Science and Engineering, Vol. 25, No. 4, pp. 045005. 2017.
The lithiation and delithiation of a silicon battery anode is modeled using the material point method (MPM). The main challenge in modeling this process using the MPM is to simulate stress dependent diffusion coupled with concentration dependent stress within a material that undergoes large deformations. MPM is chosen as the numerical method because of its ability to handle large deformations. A method for modeling diffusion within MPM is described. A stress dependent model for diffusivity and three different constitutive models that fully couple the equations for stress with the equations for diffusion are considered. Verification tests for the accuracy of the numerical implementations of the models and validation tests with experimental results show the accuracy of the approach. The fully coupled stress diffusion model implemented in MPM is applied to modeling the lithiation and delithiation of silicon nanopillars.
In Situ Exploration of Particle Simulations with CPU Ray Tracing|
W. Usher, I. Wald, A. Knoll, M. Papka, V. Pascucci. In Supercomputing Frontiers and Innovations, Vol. 3, No. 4, 2016.
We present a system for interactive in situ visualization of large particle simulations, suitable for general CPU-based HPC architectures. As simulations grow in scale, in situ methods are needed to alleviate IO bottlenecks and visualize data at full spatio-temporal resolution. We use a lightweight loosely-coupled layer serving distributed data from the simulation to a data-parallel renderer running in separate processes. Leveraging the OSPRay ray tracing framework for visualization and balanced P-k-d trees, we can render simulation data in real-time, as they arrive, with negligible memory overhead. This flexible solution allows users to perform exploratory in situ visualization on the same computational resources as the simulation code, on dedicated visualization clusters or remote workstations, via a standalone rendering client that can be connected or disconnected as needed. We evaluate this system on simulations with up to 227M particles in the LAMMPS and Uintah computational frameworks, and show that our approach provides many of the advantages of tightly-coupled systems, with the flexibility to render on a wide variety of remote and co-processing resources.
VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures|
K. Moreland, C. Sewell, W. Usher, L. Lo, J. Meredith, D. Pugmire, J. Kress, H. Schroots, K. Ma, H. Childs, M. Larsen, C. Chen, R. Maynard, B. Geveci. In IEEE Computer Graphics and Applications, Vol. 36, No. 3, pp. 48--58. May, 2016.
Traditional scientific visualization software approaches do not fare well in massively threaded environments. To address the needs of the high-performance computing community, the VTK-m framework fills the gaps in functionality by bringing together the most recent research.
Improving accuracy in the MPM method using a null space filter|
C. Gritton, M. Berzins. In Computational Particle Mechanics, pp. 1--12. 2016.
The material point method (MPM) has been very successful in providing solutions to many challenging problems involving large deformations. Nevertheless there are some important issues that remain to be resolved with regard to its analysis. One key challenge applies to both MPM and particle-in-cell (PIC) methods and arises from the difference between the number of particles and the number of the nodal grid points to which the particles are mapped. This difference between the number of particles and the number of grid points gives rise to a non-trivial null space of the linear operator that maps particle values onto nodal grid point values. In other words, there are non-zero particle values that when mapped to the grid point nodes result in a zero value there. Moreover, when the nodal values at the grid points are mapped back to particles, part of those particle values may be in that same null space. Given positive mapping weights from particles to nodes such null space values are oscillatory in nature. While this problem has been observed almost since the beginning of PIC methods there are still elements of it that are problematical today as well as methods that transcend it. The null space may be viewed as being connected to the ringing instability identified by Brackbill for PIC methods. It will be shown that it is possible to remove these null space values from the solution using a null space filter. This filter improves the accuracy of the MPM methods using an approach that is based upon a local singular value decomposition (SVD) calculation. This local SVD approach is compared against the global SVD approach previously considered by the authors and to a recent MPM method by Zhang and colleagues.
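The null space described in the abstract above can be shown on a toy particle-to-node map. This is a minimal sketch under stated assumptions, not the paper's local SVD filter: the two-node, four-particle setup, the equal weights of 0.5, and the hand-written orthonormal null space basis are hypothetical, and a real filter would compute the basis (for example by an SVD) rather than write it down.

```python
import math


def map_to_nodes(weights, particle_values):
    """Apply the linear particle-to-node map: node_i = sum_p weights[i][p] * value_p."""
    return [sum(w * v for w, v in zip(row, particle_values)) for row in weights]


def remove_null_space(particle_values, null_basis):
    """Subtract the projection of particle_values onto an orthonormal null space basis."""
    filtered = list(particle_values)
    for vec in null_basis:
        coeff = sum(n * v for n, v in zip(vec, particle_values))
        filtered = [f - coeff * n for f, n in zip(filtered, vec)]
    return filtered


# Hypothetical setup: 2 grid nodes, 4 particles, two particles per node,
# each mapped with a positive weight of 0.5.
weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

# With more particles than nodes the map has a non-trivial null space.
# For these weights it is spanned by the oscillatory (+, -) patterns below,
# written by hand here instead of being computed.
r = 1.0 / math.sqrt(2.0)
null_basis = [
    [r, -r, 0.0, 0.0],
    [0.0, 0.0, r, -r],
]

particles = [1.0, 3.0, 2.0, 2.5]
filtered = remove_null_space(particles, null_basis)

if __name__ == "__main__":
    print("oscillatory mode on nodes:", map_to_nodes(weights, null_basis[0]))
    print("nodes before filtering:   ", map_to_nodes(weights, particles))
    print("nodes after filtering:    ", map_to_nodes(weights, filtered))
    print("filtered particle values: ", filtered)
```

The nodal values are the same before and after filtering, because the removed component is invisible to the grid by construction. Only the oscillatory part of the particle data changes.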
An Overview of Performance Portability in the Uintah Runtime System Through the Use of Kokkos|
D. Sunderland, B. Peterson, J. Schmidt, A. Humphrey, J. Thornock, M. Berzins. In Proceedings of the Second International Workshop on Extreme Scale Programming Models and Middleware, Salt Lake City, Utah, ESPM2, IEEE Press, Piscataway, NJ, USA, pp. 44--47. 2016.
The current diversity in nodal parallel computer architectures is seen in machines based upon multicore CPUs, GPUs and the Intel Xeon Phi. A class of approaches for enabling scalability of complex applications on such architectures is based upon Asynchronous Many Task software architectures such as that in the Uintah framework used for the parallel solution of solid and fluid mechanics problems. Uintah has both an applications layer with its own programming model and a separate runtime system. While Uintah scales well today, it is necessary to address nodal performance portability in order for it to continue to do so. Incrementally modifying Uintah to use the Kokkos performance portability library through prototyping experiments results in improved kernel performance by more than a factor of two.
Packing Configurations of PBX-9501 Cylinders to Reduce the Probability of a Deflagration to Detonation Transition (DDT)|
J. Beckvermit, T. Harman, C. Wight, M. Berzins. In Propellants, Explosives, Pyrotechnics, 2016.
The detonation of hundreds of explosive devices from either a transportation or storage accident is an extremely dangerous event. This paper focuses on identifying ways of packing/storing arrays of explosive cylinders that will reduce the probability of a Deflagration to Detonation Transition (DDT). The Uintah Computational Framework was utilized to predict the conditions necessary for a large scale DDT to occur. The results showed that the arrangement of the explosive cylinders and the number of devices packed in a "box" greatly affect the probability of a detonation.
Special Section on Two Themes: CSE Software and Big Data in CSE|
H. De Sterck, C. Johnson, L. C. McInnes. In SIAM J. Sci. Comput, Vol. 38, No. 5, SIAM, pp. S1--S2. 2016.
The 2015 SIAM Conference on Computational Science and Engineering (CSE) was held March 14-18, 2015, in Salt Lake City, Utah. The SIAM Journal on Scientific Computing (SISC) created this special section in association with the CSE15 conference. The special section focuses on two topics that are of significant current interest to CSE researchers: CSE software and big data in CSE.
Research and Education in Computational Science and Engineering|
Subtitled "Report from a workshop sponsored by the Society for Industrial and Applied Mathematics (SIAM) and the European Exascale Software Initiative (EESI-2)," U. Rüde, K. Willcox, L. C. McInnes, H. De Sterck, G. Biros, H. Bungartz, J. Corones, E. Cramer, J. Crowley, O. Ghattas, M. Gunzburger, M. Hanke, R. Harrison, M. Heroux, J. Hesthaven, P. Jimack, C. Johnson, K. E. Jordan, D. E. Keyes, R. Krause, V. Kumar, S. Mayer, J. Meza, K. M. Mørken, J. T. Oden, L. Petzold, P. Raghavan, S. M. Shontz, A. Trefethen, P. Turner, V. Voevodin, B. Wohlmuth, C. S. Woodward. Aug, 2016.
Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.
Optimizing Multi-Image Sort-Last Parallel Rendering|
M. Larsen, K. Moreland, C.R. Johnson, H. Childs. In Symposium on Large Data Analysis and Visualization, IEEE, 2016.
Sort-last parallel rendering can be improved by considering the rendering of multiple images at a time. Most parallel rendering algorithms consider the generation of only a single image. This makes sense when performing interactive rendering where the parameters of each rendering are not known until the previous rendering completes. However, in situ visualization often generates multiple images that do not need to be created sequentially. In this paper we present a simple and effective approach to improving parallel image generation throughput by amortizing the load and overhead among multiple image renders. Additionally, we validate our approach by conducting a performance study exploring the achievable speed-ups in a variety of image-based in situ use cases and rendering workloads. On average, our approach shows a 1.5 to 3.7 fold improvement in performance, and in some cases, shows a 10 fold improvement.
Visualization for Understanding Uncertainty in Activation Volumes for Deep Brain Stimulation|
B. Hollister, G. Duffley, C. Butson, C.R. Johnson. In Eurographics Conference on Visualization, Edited by K.L. Ma, G. Santucci, and J. van Wijk, 2016.
We have created the Neurostimulation Uncertainty Viewer (nuView or nView) tool for exploring data arising from deep brain stimulation (DBS). Simulated volumes of tissue activated (VTA), using clinical electrode placements, are recorded along with patient outcomes in the Unified Parkinson's disease rating scale (UPDRS). The data is volumetric and sparse, with multi-value patient results for each activated voxel in the simulation. nView provides a collection of visual methods to explore the activated tissue to enhance understanding of electrode usage for improved therapy with DBS.
TOD-Tree: Task-Overlapped Direct send Tree Image Compositing for Hybrid MPI Parallelism and GPUs|
A. V. P. Grosset, M. Prasad, C. Christensen, A. Knoll, C. Hansen. In IEEE Transactions on Visualization and Computer Graphics, IEEE, pp. 1--1. 2016.
Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Lastly, we introduce a workflow in which both rendering and compositing are done on the GPU.
Dynamically Scheduled Region-Based Image Compositing|
A. V. P. Grosset, A. Knoll, C.D. Hansen. In Eurographics Symposium on Parallel Graphics and Visualization, June, 2016.
Algorithms for sort-last parallel volume rendering on large distributed memory machines usually divide a dataset equally across all nodes for rendering. Depending on the features that a user wants to see in a dataset, all the nodes will rarely finish rendering at the same time. Existing compositing algorithms do not often take this into consideration, which can lead to significant delays when nodes that are compositing wait for other nodes that are still rendering. In this paper, we present an image compositing algorithm that uses spatial and temporal awareness to dynamically schedule the exchange of regions in an image and progressively composite images as they become available. Running on the Edison supercomputer at NERSC, we show that a scheduler-based algorithm with awareness of the spatial contribution from each rendering node can outperform traditional image compositing algorithms.
Physical Mechanisms of DDT in an Array of PBX 9501 Cylinders: Initiation Mechanisms of DDT|
J. Beckvermit, T. Harman, C. Wight, M. Berzins. SCI Institute, April, 2016.
The Deflagration to Detonation Transition (DDT) in large arrays (100s) of explosive devices is investigated using large-scale computer simulations running the Uintah Computational Framework. Our particular interest is understanding the fundamental physical mechanisms by which convective deflagration of cylindrical PBX 9501 devices can transition to a fully-developed detonation in transportation accidents. The simulations reveal two dominant mechanisms, inertial confinement and Impact to Detonation Transition. In this study we examined the role of physical spacing of the cylinders and how it influenced the initiation of DDT.
Radiative Heat Transfer Calculation on 16384 GPUs Using a Reverse Monte Carlo Ray Tracing Approach with Adaptive Mesh Refinement|
A. Humphrey, D. Sunderland, T. Harman, M. Berzins. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1222--1231. May, 2016.
Modeling thermal radiation is computationally challenging in parallel due to its all-to-all physical and resulting computational connectivity, and is also the dominant mode of heat transfer in practical applications such as next-generation clean coal boilers, being modeled by the Uintah framework. However, a direct all-to-all treatment of radiation is prohibitively expensive on large computer systems whether homogeneous or heterogeneous. DOE Titan and the planned DOE Summit and Sierra machines are examples of current and emerging GPU-based heterogeneous systems where the increased processing capability of GPUs over CPUs exacerbates this problem. These systems require that computational frameworks like Uintah leverage an arbitrary number of on-node GPUs, while simultaneously utilizing thousands of GPUs within a single simulation. We show that radiative heat transfer problems can be made to scale within Uintah on heterogeneous systems through a combination of reverse Monte Carlo ray tracing (RMCRT) techniques and AMR, to reduce the amount of global communication. In particular, significant Uintah infrastructure changes, including a novel lock- and contention-free, thread-scalable data structure for managing MPI communication requests and improved memory allocation strategies, were necessary to achieve excellent strong scaling results to 16384 GPUs on Titan.