
Developing Scientific Computing Communities

Researchers present experiences from the Enzo, Cactus, and iPlant API development efforts

Decades of scholarship and billions of dollars have gone into the development of community software codes that are crucial not only to science, but to our everyday lives and future.

The General Circulation Model, used by the Intergovernmental Panel on Climate Change to model our future environment, and the Weather Research and Forecasting model, which helps predict extreme weather, are two key examples. Others, like CHARMM and NAMD, are used by researchers and pharmaceutical companies to find drug leads and to better understand disease.

Nearly every field of science has a community code (or several) that satisfies a large percentage of the discipline's scientific needs. Great minds — and thousands of hours of Ph.D. and post-doc labor — have gone into the creation of these codes. However, as new technologies emerge that are capable of delivering millions of times the power of previous systems, it is often necessary to rethink and rewrite these existing community codes, which is no small feat.

What to do about community codes has been an open question in the scientific computing community for many years. The problem is described in the final report of the National Science Foundation's Task Force on Software for Science and Engineering, published in March 2011.

"All software must evolve to keep up with changes in systems, usage, and to include new algorithms and techniques," the authors wrote. "The scientific community has an interest in ensuring that the software it needs will continue to be available, efficient, and employ state-of-the-art technology."

Several sessions at the TeraGrid '11 conference in Salt Lake City in July 2011 addressed this issue, pointing to successful examples of community and community-code development, and to their continuing evolution using the resources of the Extreme Science and Engineering Discovery Environment (XSEDE).

The talks described technologies and methods that interact with HPC hardware and software at very different levels of the architecture; nonetheless, each offers a possible path for other scientific computing communities to follow.

Open Source Astrophysics

Brian O'Shea, assistant professor of physics and astronomy at Michigan State University, began his talk with a question: How do you transform a closed scientific computing code into a community code that can address the needs and harness the skills of a wide variety of researchers?

His talk described the evolution of the astrophysics code Enzo: from a black-box system that only a few researchers understood or could access, to a free-for-all in which divergent strains of the code proliferated, to its current state of controlled chaos, in which several dozen developers experiment with and contribute to the code's development, spurring rapid advances.

The new development workflow is "transparent to the users and easy to use," O'Shea said. "The result is that we have a very enthusiastic and involved user community. And it's sustainable." 

Enzo is used by a relatively small number of scientists, yet they are among the most adept and proficient users of HPC resources. Approximately 60 Enzo users consumed 60 million computing hours on the TeraGrid in 2010, according to O'Shea, and many millions of hours on XSEDE will be dedicated to these projects in the coming years, leading to astrophysical discoveries such as a better understanding of cosmic reionization.

An API to Feed the World

World governments and private industry are investing trillions of dollars in the collection of data relating to plants in the hopes of continuing to feed the growing population on Earth. To date, however, these data collections have been scattered and difficult to connect.

To address this issue, the National Science Foundation funded a five-year, $50 million effort called "iPlant" to develop new tools, networks, and cyberinfrastructure that can connect plant biologists and bring their data together to spur insights and innovations.

Software developer Rion Dooley from the Texas Advanced Computing Center described the creation of a common application programming interface (API) for iPlant that allows researchers with little programming experience to add common functionality to their plant biology projects.

An API is a set of rules and specifications that software programs use to communicate with each other. It serves as an interface between different programs and facilitates their interaction, much as a user interface facilitates interaction between humans and computers.

Modeled after popular social and industry APIs like those of Yelp and PayPal, the tools are intuitive, easy to use, and scalable on XSEDE's very large high-performance computing systems. Among the most important API capabilities in iPlant are tools that allow any user to translate and integrate data in different file formats, allowing for far greater collaboration.
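
To make that concrete, here is a minimal sketch, in Python, of what calling such a format-translation service could look like from a researcher's script. The base URL, endpoint, token, and format names below are hypothetical placeholders for illustration, not the actual iPlant API.

    import requests

    # Hypothetical service details -- illustrative only, not the real iPlant API.
    API_BASE = "https://api.example.org/iplant/v1"
    TOKEN = "your-access-token"

    def translate_file(input_path, target_format):
        """Upload a data file and ask the service to return it in another format."""
        with open(input_path, "rb") as f:
            response = requests.post(
                f"{API_BASE}/transforms/{target_format}",
                headers={"Authorization": f"Bearer {TOKEN}"},
                files={"file": f},
            )
        response.raise_for_status()  # fail loudly on HTTP errors
        return response.content     # the file, translated to the target format

    # Example: ask for a FASTA sequence file back in GenBank format.
    # genbank_bytes = translate_file("sequences.fasta", "genbank")

The point of such a design is that the researcher writes a few lines against one consistent interface rather than learning the internals of every tool that produces or consumes each format.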

"The API serves as a Rosetta stone for our users," Dooley said. "It gives them a way to collaborate with any other user without having to be fluent in every piece of software used in the plant bio community. And that's really the goal: to keep scientists focused on science rather than semantics."

Modular Software for Community Growth

A third example of community code development was featured in a full-day tutorial at the conference centered on the Cactus computational framework, an open source problem-solving environment for scientists and engineers. Its modular structure enables parallel computation across different architectures and collaborative code development between different groups.

Cactus originated in the academic research community, where it was developed and used over many years by a large international collaboration of physicists and computational scientists. Applications, developed on standard workstations or laptops, can seamlessly run on clusters or supercomputers.

The Cactus user community has created and maintained toolkits for several research fields. The Einstein Toolkit, described at length by Ed Seidel in his keynote talk, is a powerful example of Cactus' capabilities. The Toolkit consists of an open set of more than 100 Cactus "thorns," or application modules, useful for computational relativity, along with associated tools for simulation management and visualization. The code has undergone tremendous growth over the last several years by virtue of this development model.
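
Cactus thorns themselves are written in C or Fortran and declared through CCL configuration files, but the underlying pattern, a thin framework core (the "flesh") scheduling independent modules against shared grid data, can be sketched in a few lines of Python. The names and toy physics below are illustrative only, not Cactus code.

    import numpy as np

    class Flesh:
        """A toy framework core: owns shared grid data, runs registered modules."""
        def __init__(self, grid):
            self.grid = grid      # shared simulation state
            self.thorns = []      # independently developed modules

        def register(self, thorn):
            self.thorns.append(thorn)

        def evolve(self, steps):
            for step in range(steps):
                for thorn in self.thorns:  # every thorn sees the same grid
                    thorn(self.grid, step)

    # Two independent "thorns": one evolves the data, one monitors it.
    def diffusion_thorn(grid, step):
        # Simple explicit diffusion step on a periodic 1-D grid.
        grid += 0.1 * (np.roll(grid, 1) - 2 * grid + np.roll(grid, -1))

    def monitor_thorn(grid, step):
        if step % 10 == 0:
            print(f"step {step}: max = {grid.max():.4f}")

    flesh = Flesh(np.random.rand(64))
    flesh.register(diffusion_thorn)
    flesh.register(monitor_thorn)
    flesh.evolve(50)

Because the core never changes when a module is added or swapped, separate groups can develop thorns in parallel, which is the property that lets a collaboration like the Einstein Toolkit grow to more than 100 modules.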

"Our aim is to provide the core computational tools than can enable new science, broaden our community, facilitate interdisciplinary research and take advantage of emerging petascale computers and advanced cyberinfrastructure," Allen said.

Whether through the controlled chaos of the Enzo evolution, the add-on extensibility of the iPlant API, or the parallel framework offered by Cactus, successful models of community code creation and evolution are critical to the continued growth of the scientific computing community.

 

CONTACT:

Aaron Dubrow, Texas Advanced Computing Center

aarondubrow@tacc.utexas.edu