Strengthening the US Department of Energy's Recruitment Pipeline: The DOE/NNSA Predictive Science Academic Alliance Program (PSAAP) Experience|
J. K. Holmen, V. G. Vergara Larrea, E. W. Draeger, E. T. Phipps, P. J. Smith, M. Berzins, S. T. Smith, J. N. Thornock, S. Parete-Koon. In Practice and Experience in Advanced Research Computing, ACM, pp. 137--144. 2023.
The US Department of Energy (DOE) oversees a system of 17 national laboratories responsible for developing unique scientific capabilities beyond the scope of academic and industrial institutions. These labs strive to keep America at the forefront of discovery and are home to some of the Nation’s best minds and the world’s best scientific and research facilities. Collaborations between national laboratories and academic institutions are critical to develop and recruit talent for the DOE workforce. Academia’s cooperative education model poses challenges for DOE recruitment pipelines centered around traditional internships. This paper discusses a promising DOE recruitment pipeline, the National Nuclear Security Administration’s (NNSA) Predictive Science Academic Alliance Program (PSAAP) initiative. As a part of this, experiences capturing the successes and challenges faced by the University of Utah’s Carbon Capture Multidisciplinary Simulation Center (CCMSC) through their participation in the PSAAP-II initiative are shared. These experiences demonstrate the success of Utah’s PSAAP center as a recruitment pipeline with approximately 43% of CCMSC students going to a national laboratory after graduation. Potential opportunities to strengthen the DOE’s recruitment pipeline are also discussed.
reVISit: Supporting Scalable Evaluation of Interactive Visualizations|
Subtitled OSF Preprints, Y. Ding, J. Wilburn, H. Shrestha, A. Ndlovu, K. Gadhave, C. Nobre, A. Lex, L. Harrison. 2023.
reVISit is an open-source software toolkit and framework for creating, deploying, and monitoring empirical visualization studies. Running a quality empirical study in visualization can be demanding and resource-intensive, requiring substantial time, cost, and technical expertise from the research team. These challenges are amplified as research norms trend towards more complex and rigorous study methodologies, alongside a growing need to evaluate more complex interactive visualizations. reVISit aims to ameliorate these challenges by introducing a domain-specific language for study set-up, and a series of software components, such as UI elements, behavior provenance, and an experiment monitoring and management interface. Together with interactive or static stimuli provided by the experimenter, these are compiled to a ready-to-deploy web-based experiment. We demonstrate reVISit's functionality by re-implementing two studies – a graphical perception task and a more complex, interactive study. reVISit is an open-source community project, available at https://revisit.dev/
Studying Latency and Throughput Constraints for Geo-Distributed Data in the National Science Data Fabric|
J. Luettgau, H. Martinez, G. Tarcea, G. Scorzelli, V. Pascucci, M. Taufer. In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, ACM, pp. 325–326. 2023.
The National Science Data Fabric (NSDF) is our solution to the problem of addressing the data-sharing needs of the growing data science community. NSDF is designed to make sharing data across geographically distributed sites easier for users who lack technical expertise and infrastructure. By developing an easy-to-install software stack, we promote the FAIR data-sharing principles in NSDF while leveraging existing high-speed data transfer infrastructures such as Globus and XRootD. This work shows how we leverage latency and throughput information between geo-distributed NSDF sites with NSDF entry points to optimize the automatic coordination of data placement and transfer across the data fabric, which can further improve the efficiency of data sharing.
AI for Scientific Visualization|
C. R. Johnson, H. Shen. In Artificial Intelligence for Science, Edited by Alok Choudhary, Geoffrey Fox, and Tony Hey, World Scientific, pp. 535-552. 2023.
Fiber Uncertainty Visualization for Bivariate Data With Parametric and Nonparametric Noise Models|
T. M. Athawale, C.R. Johnson, S. Sane, D. Pugmire. In IEEE Transactions on Visualization and Computer Graphics, Vol. 29, No. 1, IEEE, pp. 613-23. 2023.
Visualization and analysis of multivariate data and their uncertainty are top research challenges in data visualization. Constructing fiber surfaces is a popular technique for multivariate data visualization that generalizes the idea of level-set visualization for univariate data to multivariate data. In this paper, we present a statistical framework to quantify positional probabilities of fibers extracted from uncertain bivariate fields. Specifically, we extend the state-of-the-art Gaussian models of uncertainty for bivariate data to other parametric distributions (e.g., uniform and Epanechnikov) and more general nonparametric probability distributions (e.g., histograms and kernel density estimation) and derive corresponding spatial probabilities of fibers. In our proposed framework, we leverage Green’s theorem for closed-form computation of fiber probabilities when bivariate data are assumed to have independent parametric and nonparametric noise. Additionally, we present a nonparametric approach combined with numerical integration to study the positional probability of fibers when bivariate data are assumed to have correlated noise. For uncertainty analysis, we visualize the derived probability volumes for fibers via volume rendering and extracting level sets based on probability thresholds. We present the utility of our proposed techniques via experiments on synthetic and simulation datasets.
FunMC2: A Filter for Uncertainty Visualization of Marching Cubes on Multi-Core Devices|
Z. Wang, T. M. Athawale, K. Moreland, J. Chen, C. R. Johnson, D. Pugmire. In Eurographics Symposium on Parallel Graphics and Visualization, 2023.
Visualization is an important tool for scientists to extract understanding from complex scientific data. Scientists need to understand the uncertainty inherent in all scientific data in order to interpret the data correctly. Uncertainty visualization has been an active and growing area of research to address this challenge. Algorithms for uncertainty visualization can be expensive, and research efforts have been focused mainly on structured grid types. Further, support for uncertainty visualization in production tools is limited. In this paper, we adapt an algorithm for computing key metrics for visualizing uncertainty in Marching Cubes (MC) to multi-core devices and present the design, implementation, and evaluation for a Filter for uncertainty visualization of Marching Cubes on Multi-Core devices (FunMC2). FunMC2 accelerates the uncertainty visualization of MC significantly, and it is portable across multi-core CPUs and GPUs. Evaluation results show that FunMC2 based on OpenMP runs around 11× to 41× faster on multi-core CPUs than the corresponding serial version using one CPU core. FunMC2 based on a single GPU is around 5× to 9× faster than FunMC2 running with OpenMP. Moreover, FunMC2 is flexible enough to process ensemble data with both structured and unstructured mesh types. Furthermore, we demonstrate that FunMC2 can be seamlessly integrated as a plugin into ParaView, a production visualization tool for post-processing.
A Visual Environment for Data Driven Protein Modeling and Validation|
M. Falk, V. Tobiasson, A. Bock, C. Hansen, A. Ynnerman. In IEEE Transactions on Visualization and Computer Graphics, IEEE, pp. 1-11. 2023.
In structural biology, validation and verification of new atomic models are crucial and necessary steps which limit the production of reliable molecular models for publications and databases. An atomic model is the result of meticulous modeling and matching and is evaluated using a variety of metrics that provide clues to improve and refine the model so it fits our understanding of molecules and physical constraints. In cryo electron microscopy (cryo-EM) the validation is also part of an iterative modeling process in which there is a need to judge the quality of the model during the creation phase. A shortcoming is that the process and results of the validation are rarely communicated using visual metaphors. This work presents a visual framework for molecular validation. The framework was developed in close collaboration with domain experts in a participatory design process. Its core is a novel visual representation based on 2D heatmaps that shows all available validation metrics in a linear fashion, presenting a global overview of the atomic model and providing domain experts with interactive analysis tools. Additional information stemming from the underlying data, such as a variety of local quality measures, is used to guide the user's attention toward regions of higher relevance. Linked with the heatmap is a three-dimensional molecular visualization providing the spatial context of the structures and chosen metrics. Additional views of statistical properties of the structure are included in the visual framework. We demonstrate the utility of the framework and its visual guidance with examples from cryo-EM.
Data Abstraction Elephants: The Initial Diversity of Data Representations and Mental Models|
K. Williams, A. Bigelow, K.E. Isaacs. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), ACM, 2023.
Two people looking at the same dataset will create different mental models, prioritize different attributes, and connect with different visualizations. We seek to understand the space of data abstractions associated with mental models and how well people communicate their mental models when sketching. Data abstractions have a profound influence on the visualization design, yet it’s unclear how universal they may be when not initially influenced by a representation. We conducted a study about how people create their mental models from a dataset. Rather than presenting tabular data, we presented each participant with one of three datasets in paragraph form, to avoid biasing the data abstraction and mental model. We observed various mental models, data abstractions, and depictions from the same dataset, and how these concepts are influenced by communication and purpose-seeking. Our results have implications for visualization design, especially during the discovery and data collection phase.
Orchestration of materials science workflows for heterogeneous resources at large scale|
N. Zhou, G. Scorzelli, J. Luettgau, R.R. Kancharla, J. Kane, R. Wheeler, B. Croom, B. Newell, V. Pascucci, M. Taufer. In The International Journal of High Performance Computing Applications, Sage, 2023.
In the era of big data, materials science workflows need to handle large-scale data distribution, storage, and computation. Any of these areas can become a performance bottleneck. We present a framework for analyzing internal material structures (e.g., cracks) to mitigate these bottlenecks. We demonstrate the effectiveness of our framework for a workflow performing synchrotron X-ray computed tomography reconstruction and segmentation of a silica-based structure. Our framework provides a cloud-based, cutting-edge solution to challenges such as growing intermediate and output data and heavy resource demands during image reconstruction and segmentation. Specifically, our framework efficiently manages data storage, scaling up compute resources on the cloud. The multi-layer software structure of our framework includes three layers. A top layer uses Jupyter notebooks and serves as the user interface. A middle layer uses Ansible for resource deployment and managing the execution environment. A low layer provides resource management and job scheduling on heterogeneous nodes (i.e., GPU and CPU). At the core of this layer, Kubernetes supports resource management, and Dask enables large-scale job scheduling for heterogeneous resources. The broader impact of our work is four-fold: through our framework, we hide the complexity of the cloud’s software stack from the user, who otherwise is required to have expertise in cloud technologies; we manage job scheduling efficiently and in a scalable manner; we enable resource elasticity and workflow orchestration at a large scale; and we facilitate moving the study of nonporous structures, which has wide applications in engineering and scientific fields, to the cloud. While we demonstrate the capability of our framework for a specific materials science application, it can be adapted for other applications and domains because of its modular, multi-layer architecture.
Here’s what you need to know about my data: Exploring Expert Knowledge’s Role in Data Analysis|
H. Lin, M. Lisnic, D. Akbaba, M. Meyer, A. Lex. 2023.
Data-driven decision making has become the gold standard in science, industry, and public policy. Yet data alone, as an imperfect and partial representation of reality, is often insufficient to make good analysis decisions. Knowledge about the context of a dataset, its strengths and weaknesses, and its applicability for certain tasks is essential. In this work, we present an interview study with analysts from a wide range of domains and with varied expertise and experience, inquiring about the role of contextual knowledge. We provide insights into how data is insufficient in analysts' workflows and how they incorporate other sources of knowledge into their analysis. We also suggest design opportunities to better and more robustly consider both knowledge and data in analysis processes.
Progressive Tree-Based Compression of Large-Scale Particle Data|
D. Hoang, H. Bhatia, P. Lindstrom, V. Pascucci. In IEEE Transactions on Visualization and Computer Graphics, IEEE, pp. 1--18. 2023.
Scientific simulations and observations using particles have been creating large datasets that require effective and efficient data reduction to store, transfer, and analyze. However, current approaches either compress only small data well while being inefficient for large data, or handle large data but with insufficient compression. Toward effective and scalable compression/decompression of particle positions, we introduce new kinds of particle hierarchies and corresponding traversal orders that quickly reduce reconstruction error while being fast and low in memory footprint. Our solution to compression of large-scale particle data is a flexible block-based hierarchy that supports progressive, random-access, and error-driven decoding, where error estimation heuristics can be supplied by the user. For low-level node encoding, we introduce new schemes that effectively compress both uniform and densely structured particle distributions.
Protein-metabolite interactomics of carbohydrate metabolism reveal regulation of lactate dehydrogenase|
K. G. Hicks, A. A. Cluntun, H. L. Schubert, S. R. Hackett, J. A. Berg, P. G. Leonard, M. A. Ajalla Aleixo, Y. Zhou, A. J. Bott, S. R. Salvatore, F. Chang, A. Blevins, P. Barta, S. Tilley, A. Leifer, A. Guzman, A. Arok, S. Fogarty, J. M. Winter, H. Ahn, K. N. Allen, S. Block, I. A. Cardoso, J. Ding, I. Dreveny, C. Gasper, Q. Ho, A. Matsuura, M. J. Palladino, S. Prajapati, P. Sun, K. Tittmann, D. R. Tolan, J. Unterlass, A. P. VanDemark, M. G. Vander Heiden, B. A. Webb, C. Yun, P. Zhap, B. Wang, F. J. Schopfer, C. P. Hill, M. C. Nonato, F. L. Muller, J. E. Cox, J. Rutter. In Science, Vol. 379, No. 6636, pp. 996-1003. 2023.
Metabolic networks are interconnected and influence diverse cellular processes. The protein-metabolite interactions that mediate these networks are frequently low affinity and challenging to systematically discover. We developed mass spectrometry integrated with equilibrium dialysis for the discovery of allostery systematically (MIDAS) to identify such interactions. Analysis of 33 enzymes from human carbohydrate metabolism identified 830 protein-metabolite interactions, including known regulators, substrates, and products as well as previously unreported interactions. We functionally validated a subset of interactions, including the isoform-specific inhibition of lactate dehydrogenase by long-chain acyl–coenzyme A. Cell treatment with fatty acids caused a loss of pyruvate-lactate interconversion dependent on lactate dehydrogenase isoform expression. These protein-metabolite interactions may contribute to the dynamic, tissue-specific metabolic flexibility that enables growth and survival in an ever-changing nutrient environment. Understanding how metabolic state influences cellular processes requires systematic analysis of low-affinity interactions of metabolites with proteins. Hicks et al. describe a method called MIDAS (mass spectrometry integrated with equilibrium dialysis for the discovery of allostery systematically), which allowed them to probe such interactions for 33 enzymes of human carbohydrate metabolism and more than 400 metabolites. The authors detected many known and many new interactions, including regulation of lactate dehydrogenase by ATP and long-chain acyl coenzyme A, which may help to explain known physiological relations between fat and carbohydrate metabolism in different tissues. —LBR A mass spectrometry and dialysis method detects metabolite-protein interactions that help to control physiology.
Exploring Classification of Topological Priors with Machine Learning for Feature Extraction|
S. Leventhal, A. Gyulassy, M. Heimann, V. Pascucci. In IEEE Transactions on Visualization and Computer Graphics, pp. 1--12. 2023.
In many scientific endeavors, increasingly abstract representations of data allow for new interpretive methodologies and conceptualization of phenomena. For example, moving from raw imaged pixels to segmented and reconstructed objects allows researchers new insights and means to direct their studies toward relevant areas. Thus, the development of new and improved methods for segmentation remains an active area of research. With advances in machine learning and neural networks, scientists have been focused on employing deep neural networks such as U-Net to obtain pixel-level segmentations, namely, defining associations between pixels and corresponding/referent objects and gathering those objects afterward. Topological analysis, such as the use of the Morse-Smale complex to encode regions of uniform gradient flow behavior, offers an alternative approach: first, create geometric priors, and then apply machine learning to classify. This approach is empirically motivated since phenomena of interest often appear as subsets of topological priors in many applications. Using topological elements not only reduces the learning space but also introduces the ability to use learnable geometries and connectivity to aid the classification of the segmentation target. In this paper, we describe an approach to creating learnable topological elements, explore the application of ML techniques to classification tasks in a number of areas, and demonstrate this approach as a viable alternative to pixel-level classification, with similar accuracy, improved execution time, and requiring marginal training data.
Troubling Collaboration: Matters of Care for Visualization Design Study|
D. Akbaba, D. Lange, M. Correll, A. Lex, M. Meyer. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), pp. 23--28. April, 2023.
A common research process in visualization is for visualization researchers to collaborate with domain experts to solve particular applied data problems. While there is existing guidance and expertise around how to structure collaborations to strengthen research contributions, there is comparatively little guidance on how to navigate the implications of, and power produced through, the socio-technical entanglements of collaborations. In this paper, we qualitatively analyze reflective interviews of past participants of collaborations from multiple perspectives: visualization graduate students, visualization professors, and domain collaborators. We juxtapose the perspectives of these individuals, revealing tensions about the tools that are built and the relationships that are formed: a complex web of competing motivations. Through the lens of matters of care, we interpret this web, concluding with considerations that both trouble and necessitate reformation of current patterns around collaborative work in visualization design studies to promote more equitable, useful, and care-ful outcomes.
Accelerated Probabilistic Marching Cubes by Deep Learning for Time-Varying Scalar Ensembles|
M. Han, T.M. Athawale, D. Pugmire, C.R. Johnson. In 2022 IEEE Visualization and Visual Analytics (VIS), IEEE, pp. 155-159. 2022.
Visualizing the uncertainty of ensemble simulations is challenging due to the large size and multivariate and temporal features of ensemble data sets. One popular approach to studying the uncertainty of ensembles is analyzing the positional uncertainty of the level sets. Probabilistic marching cubes is a technique that performs Monte Carlo sampling of multivariate Gaussian noise distributions for positional uncertainty visualization of level sets. However, the technique suffers from high computational time, making interactive visualization and analysis impossible to achieve. This paper introduces a deep-learning-based approach to learning the level-set uncertainty for two-dimensional ensemble data with a multivariate Gaussian noise assumption. We train the model using the first few time steps from time-varying ensemble data in our workflow. We demonstrate that our trained model accurately infers uncertainty in level sets for new time steps and is up to 170X faster than that of the original probabilistic model with serial computation and 10X faster than that of the original parallel computation.
Adaptive elasticity policies for staging-based in situ visualization|
Z. Wang, M. Dorier, P. Subedi, P.E. Davis, M. Parashar. In Future Generation Computer Systems, 2022.
In situ processing aims to alleviate the growing gap between computation and I/O capabilities by performing data processing close to the data source. In situ processing is widely used to process data generated by multiple data sources, including observation data from edge devices or scientific observational facilities and the simulation data generated by scientific computation on a high-performance computing (HPC) platform. For a scientific workflow that is run on an HPC platform and composed of a simulation program and an in situ data analytics or visualization (abbreviated as ana/vis) task, there is an implicit assumption that the computing resources assigned to the workflow keep static during the workflow execution. However, with the converging trend between the HPC and cloud computing platform, running the in situ ana/vis task in an elastic way is promising to decrease its overhead and improve its resource utilization rate. Resource elasticity represents the ability to change resource configurations such as the number of computing nodes/processes during workflow execution. An elastic job may dynamically adjust resource configurations; it may use a few resources at the beginning and more resources toward the end of the job when interesting data appear. However, it is hard to predict a priori how many computing nodes/processes need to be added/removed during the workflow execution to adapt to changing workflow needs. How to efficiently guide elasticity operations, such as growing or shrinking the number of processes used for in situ analysis during workflow execution, is an open-ended research question. In this article, we present adaptive elasticity policies that adopt workflow runtime information collected during workflow execution to predict how to trigger the addition/removal of processes in order to minimize in situ processing overhead. 
Taking in situ visualization tasks as an example, we integrate the presented elasticity policies into a staging-based elastic workflow and evaluate its efficiency in multiple elasticity scenarios. Compared with the situation without elasticity or with a static elasticity policy that uses a fixed number of processes for each rescaling operation, the adaptive elasticity policy can save overhead in finding a proper resource configuration and improve resource utilization efficiency. For example, one experiment illustrates that the adaptive elasticity policy saves 41% of core-hours compared with the situation without the resource elasticity.
A Visual Comparison of Silent Error Propagation|
Z. Li, H. Menon, K. Mohror, S. Liu, L. Guo, P.T. Bremer, V. Pascucci. In IEEE Transactions on Visualization and Computer Graphics, IEEE, 2022.
High-performance computing (HPC) systems play a critical role in facilitating scientific discoveries. Their scale and complexity (e.g., the number of computational units and software stack) continue to grow as new systems are expected to process increasingly more data and reduce computing time. However, with more processing elements, the probability that these systems will experience a random bit-flip error that corrupts a program's output also increases, which is often recognized as silent data corruption. Analyzing the resiliency of HPC applications in extreme-scale computing to silent data corruption is crucial but difficult. An HPC application often contains a large number of computation units that need to be tested, and error propagation caused by error corruption is complex and difficult to interpret. To accommodate this challenge, we propose an interactive visualization system that helps HPC researchers understand the resiliency of HPC applications and compare their error propagation. Our system models an application's error propagation to study a program's resiliency by constructing and visualizing its fault tolerance boundary. Coordinating with multiple interactive designs, our system enables domain experts to efficiently explore the complicated spatial and temporal correlation between error propagations. At the end, the system integrated a nonmonotonic error propagation analysis with an adjustable graph propagation visualization to help domain experts examine the details of error propagation and answer such questions as why an error is mitigated or amplified by program execution.
Interactive Visualization for Data Science Scripts|
R. Faust, C. Scheidegger, K. Isaacs, W.Z. Bernstein, M. Sharp, C. North. In 2022 IEEE Visualization in Data Science (VDS), IEEE, pp. 37-45. 2022.
As the field of data science continues to grow, so does the need for adequate tools to understand and debug data science scripts. Current debugging practices fall short when applied to a data science setting, due to the exploratory and iterative nature of analysis scripts. Additionally, computational notebooks, the preferred scripting environment of many data scientists, present additional challenges to understanding and debugging workflows, including the non-linear execution of code snippets. This paper presents Anteater, a trace-based visual debugging method for data science scripts. Anteater automatically traces and visualizes execution data with minimal analyst input. The visualizations illustrate execution and value behaviors that aid in understanding the results of analysis scripts. To maximize the number of workflows supported, we present prototype implementations in both Python and Jupyter. Last, to demonstrate Anteater’s support for analysis understanding tasks, we provide two usage scenarios on real world analysis scripts.
Ferret: Reviewing Tabular Datasets for Manipulation|
Subtitled OSF Preprint, D. Lange, S. Sahai, J.M. Phillips, A. Lex. 2022.
How do we ensure the veracity of science? The act of manipulating or fabricating scientific data has led to many high-profile fraud cases and retractions. Detecting manipulated data, however, is a challenging and time-consuming endeavor. Automated detection methods are limited due to the diversity of data types and manipulation techniques. Furthermore, patterns automatically flagged as suspicious can have reasonable explanations. Instead, we propose a nuanced approach where experts analyze tabular datasets, e.g., as part of the peer-review process, using a guided, interactive visualization approach. In this paper, we present an analysis of how manipulated datasets are created and the artifacts these techniques generate. Based on these findings, we propose a suite of visualization methods to surface potential irregularities. We have implemented these methods in Ferret, a visualization tool for data forensics work. Ferret makes potential data issues salient and provides guidance on spotting signs of tampering and differentiating them from truthful data.
The Materials Commons Data Repository|
G. Tarcea, B. Puchala, T. Berman, G. Scorzelli, V. Pascucci, M. Taufer, J. Allison. In 2022 IEEE 18th International Conference on e-Science (e-Science), pp. 405--406. 2022.
Repositories are increasingly used for publishing and sharing scientific data. The Materials Commons is a data repository that follows the FAIR (Findable, Accessible, Inter-operable, Reusable) principles. We demonstrate the challenges with FAIR and how Materials Commons solves them. We also discuss the National Science Data Fabric (NSDF), a project that is democratizing data access, and show how Materials Commons with the NSDF software stack accelerates data access and scientific research.