September 2016 Symposium ECSS

September 20, 2016

Integrating Scientific Tools and Web Portals

Presenter(s): Kevin (Feng) Chen (TACC)
Principal Investigator(s): Carol X. Song (Purdue), Ritu Arora (TACC)

Presentation Slides

Abstract: Diagrid is powered by the HUBzero® software developed at Purdue University. It is specifically designed to help a scientific community share resources and work together. The Diagrid Science as a Service platform provides easy web-based access to software applications used by thousands of researchers around the world. In today's ECSS symposium, Dr. Kevin Chen will discuss the development of scientific tools that leverage the Diagrid web portal and XSEDE HPC resources.

System-level Checkpoint-Restart with DMTCP

Presenter(s): Jerome Vienne (TACC)
Principal Investigator(s): Gene Cooperman (Northeastern University)

Presentation Slides

Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a software package used to checkpoint-restart applications. The primary purpose of checkpointing in HPC is fault tolerance: if a computation fails, whether because of a hardware failure or a temporary software failure, the user restarts the computation from a previous checkpoint. This presentation highlights work on an ECSS project carried out with the team that develops DMTCP. The initial purpose of the ECSS project was to provide support for extending the scalability of DMTCP, but it ended up being more than that. During the presentation, I will introduce DMTCP and explain how it can be used to checkpoint-restart and debug a batch session, and to checkpoint OpenSHMEM implementations and large-scale experiments running on InfiniBand clusters. Each of these raised different challenges that were solved during this ECSS project. This collaboration led to papers presented at XSEDE'16, OpenSHMEM 2016 and IEEE ICPADS 2016.