The Data Observation Network for Earth (DataONE) is a virtual organization dedicated to providing open, persistent, robust, and secure access to biodiversity and environmental data, supported by the U.S. National Science Foundation. DataONE is pleased to announce the availability of summer research internships for undergraduates, graduate students and recent postgraduates.
Program Structure
Up to eight interns will be accepted in 2012, each paired with one primary mentor and, in some cases, secondary mentors. Interns need not necessarily be at the same location or institution as their mentor(s). Interns and mentors are expected to have a face-to-face meeting at the beginning of the summer, and interns are encouraged to attend the DataONE All-Hands Meeting in the fall to present the results of their work. DataONE will pay all necessary travel expenses.
Schedule
- February 24 - Application period opens
- March 14 - Deadline for receipt of applications at midnight Mountain time
- April 4 - Notification of acceptance. Scheduling of face-to-face kickoff meetings based on availability of interns and mentors
- May 23 - Program begins*
- June 27 - Midterm evaluations
- July 29 - Program concludes
- September 18-20 - DataONE All-Hands Meeting, New Mexico (attendance encouraged)
* Allowance will be made for students who are unavailable during these dates due to their school calendar.
Eligibility
The program is open to all undergraduate students, graduate students, and postgraduates who have received their master's or doctorate within the past five years. Given the broad range of projects, there are no restrictions on academic background or field of study. Interns must be at least 18 years of age by the program start date, must be currently enrolled or employed at a U.S. university or other research institution, and must currently reside in, and be eligible to work in, the United States. Interns are expected to be available approximately 40 hours/week during the internship period (noted above), with significant availability during normal business hours. Interns from previous years are eligible to participate.
Financial Support
Interns will receive a stipend of $4,500 for participation, paid in two installments (one at the midterm and one at the conclusion of the program). In addition, required travel expenses will be borne by DataONE. Participation in the program after the mid-term is contingent on satisfactory performance. The University of New Mexico will administer funds. Interns will need to supply their own computing equipment and Internet connection. For students who are not US citizens or permanent residents, complete visa information will be required, and it may be necessary for the funds to be paid through the student’s university or research institution. In such cases, the student will need to provide the necessary contact information for their organization.
Project Ideas
Projects cover a range of topic areas and vary in the extent and type of prior background required of the intern. The interests and expertise of the applicants will, in part, determine which projects will be selected for the program. The titles of this year’s projects (see below for more detailed descriptions) are:
- Publish (data) or Perish: Best Practices for Creating, Reviewing, and Publishing Data Products
- Enriching the Content of the DMPTool for the DataONE Community
- A Portable Web Application for Data and Metadata Submission
- Querying Scientific Workflow Provenance
- Data Usage and Citation Visualization
- Evaluating the Feasibility of Using Bottom-Up Text Mining Approaches to Complement Thesaurus and Ontology-based Approaches for Supporting Data Discovery
- Enhancing Semantic Search in ONEMercury
- An Information Model for Observational Data within DataONE
- Components of Successful Metadata Registry Frameworks
- Developing a DaaS (Data as a Service) view of DataONE
To Apply
To apply for the Summer Internship program, click on this link. Applications must be completed by 11:59 PM (Mountain time) on March 14th. You will be asked to upload a cover letter and resume, both in PDF format. Applicants should also provide a letter of reference. The letter of reference should be sent directly by its author to internship@dataone.org.
- The cover letter should address the following questions:
- What DataONE Summer Internship projects are you most interested in and why?
- What contributions do you expect to be able to make to the project(s)?
- What background do you have which is relevant to the project(s)?
- What do you expect to learn and/or achieve by participating?
- What are your thoughts and ideas about the project, including particular suggestions for ways of achieving the project objectives?
- How will participation in this program help you achieve your educational and career objectives?
- Are there any factors that would affect your ability to participate, including other summer employment, university schedules, and other commitments?
- The resume should include the applicant’s educational history, current position, any publications or honors, and full contact information (including phone number, e-mail address, and mailing address).
- The letter of reference should be sent directly to internship@dataone.org and should be from a professor, supervisor, or mentor.
Evaluation of applications
Applications will be judged by the following criteria:
- The academic and technical qualifications of the applicant.
- Evidence of strong written and oral communication skills.
- The extent to which the applicant can provide substantive contributions to one or more projects, including the applicant’s ideas for project implementation.
- The extent to which the internship would be of value to the career development of the applicant.
- The availability of the applicant during the period of the internship.
Intellectual Property
DataONE is predicated on openness and universal access. Software is developed under one of several open source licenses, and copyrightable content produced during the course of the project will be made available under a Creative Commons (CC-BY 3.0) license. Where appropriate, projects may result in published articles and conference presentations, to which the intern is expected to make a substantive contribution and for which the intern will receive credit.
Funding acknowledgement
The Summer Internships are supported by The National Science Foundation: "INTEROP: Creation of an International Virtual Data Center for the Biodiversity, Ecological and Environmental Sciences" (NSF Award 0753138) and "DataNet Full Proposal: DataNetONE (Observation Network for Earth)" (NSF Award 0830944).
For more information
If you have questions or problems about the application process or internship program in general, please send e-mail to internship@dataone.org.
Project Ideas
- Publish (data) or Perish: Best Practices for Creating, Reviewing, and Publishing Data Products
Description: Scientists and research and funding organizations are increasingly focusing their attention on creating high quality data products that are openly available and useful to the broader community. However, there are few sources of good information that provide recommendations for preparing well documented and high quality data to share. This internship involves several activities that will lead to one or more co-authored, peer-reviewed publications that provide sound recommendations for both creating high quality data products and establishing criteria for reviewing data products that are pending publication. In particular, the intern will: 1) review data papers published in the Ecological Society of America’s Data Papers as well as comments from peer reviewers in order to identify the most common mistakes encountered in creating data papers; 2) interview data managers associated with programs such as the NASA Distributed Active Archive Centers and the Long-Term Ecological Research Program to ascertain best practices based on their experiences; and 3) contribute to one or more publications geared toward prominent open-source journals. We envision two publications. The first publication will document how to most effectively publish research data, focusing on common mistakes such as inadequate QA/QC and metadata, as well as those data management practices used by successful archivists. The second publication will deal with how to review a data product, focusing on best practices and what we have learned from existing peer-review programs and experiences at National data centers and research networks.
Necessary Prerequisites: B.S. in environmental or earth sciences, computer science, statistics, or library or information sciences
Desirable skills/qualifications: Training and interest in one or more earth sciences; understanding of the research and peer-review process; interest in analyzing data products and metadata (including QA/QC approaches); strong communication and writing skills
Primary mentor: Bill Michener (University of New Mexico)
Secondary mentor: Bob Cook (Oak Ridge National Laboratory)
- Enriching the Content of the DMPTool for the DataONE Community
Description: This will primarily be a marketing project, and will focus on DMPTool* content review and expert guidance in a focused area or areas of interest to DataONE. It will begin with a market analysis to determine the best area(s) of focus by comparing the current DMPTool user base within DataONE against the groups that might have the highest need or that currently show notable gaps. This will include specific research domains, major research institutions, key funding agencies/programs, etc. The intern will set targets and work on increasing the user base in the growth areas they have identified. Throughout this process, they will draw upon existing DataONE and other community data management best practice resources to enhance and expand content in the DMPTool in related areas. Lastly, they will conduct structured user testing/interviews with researchers in selected areas to assess the quality of content/guidance and whether it meets users' needs (drawing on the UVa user experience/usability team for support). Throughout the entire internship, they will blog to promote the DataONE effort, DMPTool effort, needs of researchers, etc. within the scope of content and needs for the DMPTool.
* The DMPTool is a community resource developed by the UCLA Library, the UC3 from the California Digital Library, the Smithsonian, UVA Libraries, the UCSD Library, UIUC (library and CIO), DataONE, and the DCC. It has two goals: 1) assisting researchers with creation of complete and realistic data management plans quickly and easily, and 2) connecting researchers with institutional resources that support further data management efforts. Since going live in November 2011, it has had over 1,200 unique users, over 290 different institutions represented, and over 35 institutions providing institutional resource guidance. It includes data management guidance for funded research programs across the board, with particular emphasis on NSF programs.
Necessary Prerequisites: At least a rising undergraduate junior or senior with some knowledge of data management best practices within one of DataONE’s domains.
Desirable skills/qualifications: Some knowledge of data and data management best practices, strong research skills, marketing experience, ability to work independently, set goals, and meet deadlines. Appropriate academic areas might include: library/information (grad or undergrad), education (grad), marketing, social science, earth or life sciences.
Primary mentor: Andrew Sallans (University of Virginia, DUG Vice-Chair)
Secondary mentors: Carly Strasser (California Digital Library), Sherry Lake (University of Virginia)
- A Portable Web Application for Data and Metadata Submission
Description: Having a variety of mechanisms for adding content to Member Nodes participating in the DataONE federation is beneficial to content creators who may then select a tool most appropriate for their needs. The focus of this project would be to develop a user-oriented, streamlined web application that communicates with any Member Node to upload a dataset consisting of one or more data files plus associated metadata. Students will gain experience with designing and developing web applications for environmental data management, and will gain a detailed understanding of common environmental data and metadata standards. In the simplest case, users should be able to fill in principal metadata needed to create a basic EML/FGDC record, but should also be given the option to upload an existing metadata record rather than creating it in place. This project can build upon ideas from existing, similar tools, including the Metacat metadata registry (see http://knb.ecoinformatics.org/software/register.html). Ideally this application would have a responsive, client-side interface that uses AJAX to streamline interactions with Member Nodes and Coordinating Nodes, and should be built using only HTML5 and Javascript, or using a modern web framework like GWT or extJS (Sencha GXT, etc). The resultant web application should be deployable in a variety of web environments at multiple member nodes, with customizable look and feel through CSS substitutions to allow integration with existing web sites.
Necessary Prerequisites: Experience with HTML, CSS, and Javascript and one or more programming languages, preferably Java or Python.
Desirable skills/qualifications: Experience and understanding of environmental data formats and metadata standards such as Ecological Metadata Language and the Biological Data Profile.
Primary mentor: Chris Jones (National Center for Ecological Analysis and Synthesis - NCEAS)
Secondary mentor: Matt Jones (National Center for Ecological Analysis and Synthesis - NCEAS)
- Querying Scientific Workflow Provenance
Description: Scientific workflow systems are used to compose and automate complex computational pipelines from pre-existing software components. An important feature of scientific workflow systems is their ability to record provenance information. Provenance includes the processing history and lineage of data, and can be used, e.g., to validate data products, debug workflows, document authorship and attribution chains, etc., and thus facilitate “reproducible science”.
The DataONE Working Group on Provenance (ProvWG) has developed a provenance model D-OPM (DataONE-OPM) for scientific workflows, based on the general purpose Open Provenance Model (OPM), and extended with workflow specific features. The goal of this year's summer project is to implement a special-purpose query language for provenance and workflow graphs, based on prior work by the mentors and state-of-the-art languages and techniques known from graph-based and declarative query languages. In particular, the system will allow the user to express a provenance query as a path expression or "graph pattern", which is then translated to a lower-level representation, which in turn is executed on an existing database engine. The resulting prototype will form a starting point for the DataONE cyberinfrastructure to support provenance analytics.
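As a rough illustration of the kind of translation described above, the following minimal sketch compiles a simple path expression of the form (label1|label2|...)+ into a recursive SQL query and runs it on an existing database engine (SQLite, via Python's standard library). The edge table layout, the file and step names, and the helper functions are hypothetical and chosen only for illustration; they are not the D-OPM schema or the project's planned design. The edge labels "used" and "wasGeneratedBy" are borrowed from OPM.

```python
import sqlite3

# Hypothetical provenance graph: (src, label, dst) means "src --label--> dst".
EDGES = [
    ("plot.png", "wasGeneratedBy", "render"),
    ("render", "used", "clean.csv"),
    ("clean.csv", "wasGeneratedBy", "filter"),
    ("filter", "used", "raw.csv"),
]


def compile_closure(labels):
    """Translate the path expression (l1|l2|...)+ into recursive SQL text."""
    placeholders = ",".join("?" for _ in labels)
    return f"""
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edges
            WHERE src = ? AND label IN ({placeholders})
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
            WHERE e.label IN ({placeholders})
        )
        SELECT node FROM reach ORDER BY node
    """


def lineage(conn, start, labels):
    """Return every node reachable from `start` along the given edge labels."""
    sql = compile_closure(labels)
    # Parameter order matches the placeholders: start node, then the label
    # list once for the base case and once for the recursive step.
    params = [start, *labels, *labels]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, label TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", EDGES)

# Everything the plot depends on, directly or indirectly.
upstream = lineage(conn, "plot.png", ["wasGeneratedBy", "used"])
print(upstream)
```

Using UNION rather than UNION ALL in the recursive query removes duplicate nodes, which also keeps the query from looping forever if the graph happens to contain a cycle.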
Necessary Prerequisites: Proficiency in Java or Python
Desirable skills/qualifications: The ideal candidate will also have experience in (SQL) databases or even graph databases. Some experience in a declarative or functional language (Datalog, Prolog, ML, Haskell, …) or formal languages (compilers) is a plus, but not required.
Primary mentor: Bertram Ludaescher (University of California Davis)
Secondary mentors: Paolo Missier (Newcastle University), Shawn Bowers (Gonzaga University)
- Data Usage and Citation Visualization
Description: This project aims to provide usage information about scientific data hosted within the DataONE network of confederated data sources. Students will gain experience in analysis and visualization of data citation and usage metrics, as well as with the design and development of web applications for use in scientific data portals.
The goal of this project is to create a web application integrated with the DataONE site enabling generation of citation and usage reports for all of DataONE, particular Member Nodes, and for specific data sets. The application should also support reporting usage and citation across a particular user's datasets (for data they own and use) and may include visualization aids such as graphs of object use over time. Citation and reuse data would be drawn from various sources including PubMed Central, Scopus, PLoS, Twitter, and total-impact using application programming interfaces (APIs) supported by those services. Usage information within DataONE will be extracted through analysis of the DataONE aggregated log system, which records access to all objects managed by DataONE. A web application with data visualization and reporting capabilities will be built using the DataONE logging service APIs in conjunction with the APIs exposed by the various citation tracking services. Another important deliverable is useful documentation of the DataONE logging facility API to assist third-party reporting and usage tools.
Necessary Prerequisites: Proficiency in Java, Javascript, HTML, CSS, XML. Knowledge of REST web application architecture.
Desirable skills/qualifications: Experience in data visualization and analysis.
Primary mentor: Skye Roseboom (University of New Mexico)
Secondary mentor: Heather Piwowar (National Evolutionary Synthesis Center)
- Evaluating the Feasibility of Using Bottom-Up Text Mining Approaches to Complement Thesaurus and Ontology-based Approaches for Supporting Data Discovery
Description: This proof-of-concept project will focus on using a text mining approach such as Latent Dirichlet Allocation (LDA) to extract latent ‘topics’ for DataONE datasets from natural language metadata such as title, abstract, and author-supplied keywords, and on linking the latent topics with appropriate ontological terms to further enhance data discovery and interpretation within DataONE. Outputs of the LDA topic model include lists of the most probable terms (including multi-word phrases) for each latent ‘topic’ and probability values that describe the strength of the association between each ‘topic’ and each dataset. This project will provide outputs that can be used to evaluate the value of the topic modeling approach for two data discovery-related applications. First, the latent topics identified could be used to evaluate the adequacy of existing thesauri/glossaries and inform further development of such resources. Second, observed statistical relationships between latent ‘topics’ and thesaurus/glossary terms or ontological categories for datasets that have been annotated with such metadata could be used to impute (with probability estimates) these types of metadata to datasets for which such terms or categories are unknown. The intern working on this project will work closely with mentors to acquire a sample of metadata from DataONE-affiliated repositories, extract natural language metadata and prepare it in a format appropriate for LDA topic modeling, systematically vary LDA model parameters to create a set of topic model outputs using an existing implementation of the LDA model, consult with mentors to identify parameter values that produce outputs most suitable for each of the two possible applications described above, quantify statistical relationships between latent topics and metadata fields of interest, and prepare a brief report summarizing findings.
The intern will work in close collaboration with Stacy to complete the topic modeling portion of the project, and will consult with Mark as necessary for help with ontological and technical issues.
Necessary Prerequisites: Proficiency in subsetting and manipulating data using R or similar; facility with command-line interfaces and scripting languages such as Python or Perl; experience working with large datasets; familiarity with scientific metadata; knowledge of or interest in natural language processing; and basic understanding of probability and statistics.
Desirable skills/qualifications: Experience with scientific data management; familiarity with web ontology and semantic web (e.g., RDF, OWL); experience with or interest in algorithmic data visualization; familiarity with Matlab; and advanced statistical skills.
Primary mentor: Stacy Rebich Hespanha (National Center for Ecological Analysis and Synthesis - NCEAS)
Secondary mentor: Mark Schildhauer (National Center for Ecological Analysis and Synthesis - NCEAS)
- Enhancing Semantic Search in ONEMercury
Description: ONEMercury, DataONE’s primary online data discovery interface, has recently been enhanced with semantic search features aimed at improving Mercury’s faceted search and increasing recall and precision. ONEMercury is an adaptation of the Mercury federated metadata system that was developed at the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC). Several issues with these recent advances to ONEMercury still need to be addressed, including query selection, customization of query results display, and links from query results to actual datasets. With the breadth of scientific data that will eventually be housed within DataONE, these enhancements will become crucial to pinpointing datasets relevant to a particular query with high recall and specificity. Under the direction of project mentors, the student intern will select an appropriate aspect of ONEMercury to advance from those listed above and will work with the developers of ONEMercury to enhance DataONE’s data discovery services. This project will enable the successful candidate to learn about data discovery techniques, faceted search, and the use of ontologies in providing data discovery services for DataONE.
Necessary Prerequisites: Experience with ontologies, faceted search technologies, and web application development.
Desirable skills/qualifications: Experience and understanding of environmental data formats and metadata systems. Experience with the Solr and Lucene search libraries.
Primary mentor: Line Pouchard (Oak Ridge National Laboratory)
Secondary mentors: Giri Palanisamy (Oak Ridge National Laboratory), Natasha Noy (Stanford University), Jeff Horsburgh (Utah State University)
- An Information Model for Observational Data within DataONE
Description: Currently, integrated access and analysis of observational data from multiple scientific domains are hindered because common characteristics of observational data, including time, location, provenance, methods, and units are described using different constructs within different systems. Integration requires multiple syntactic and semantic translations that are, in many cases, manual, error-prone, and/or lossy. Standardizing such descriptions of common characteristics within a common observations information model would lead to more reliable data integration. In its first phase, DataONE is treating datasets as opaque objects and does not require a specific format or information model for submitted data. Integration and use of observational data retrieved from DataONE could be significantly improved if the format and semantics of data objects deposited into the system conformed to a well-specified observations information model. This internship provides the opportunity to advance one or more ongoing activities that seek to facilitate the semantic interpretation and broad interoperability of scientific data. More specifically, the intern may 1) extract, document, and develop data discovery and integration use cases that would be supported by a common information model; 2) examine the formats and information content of existing DataONE datasets to extract common data and information elements and requirements; 3) advance an observations information model for DataONE, building upon the results of existing information modeling and knowledge representation projects such as SONet (http://sonet.ecoinformatics.org).
Necessary Prerequisites: Experience with information/data modeling techniques and technologies.
Desirable skills/qualifications: Experience and understanding of environmental data formats and metadata systems.
Primary mentor: Mark Schildhauer (National Center for Ecological Analysis and Synthesis - NCEAS)
Secondary mentors: Deborah McGuinness (Rensselaer Polytechnic Institute), Carl Lagoze (Cornell University), Hilmar Lapp (National Evolutionary Synthesis Center – NESCent)
- Components of Successful Metadata Registry Frameworks
Description: The goal of the proposed summer intern project is to evaluate a sample of metadata registries and identify factors impacting registration success and limitations. Metadata registration (asserting element names, values, restrictions and other properties) is essential for interoperability among data repositories in the DataONE network. Advancing metadata registration processes (ease, transparency, discoverability of prior registrations) within DataONE would promote metadata use and understanding, hence discovery and interoperable re-use of data held at member nodes. Unfortunately, successful metadata registries are very few in number. For DataONE, other DataNet awardees, and any other community to move forward with registration requires an assessment of factors impacting the success and limitations of existing registries. The Metadata Registration summer intern would pursue an assessment of existing registries and analyze their functionalities, and then, via communication with registry developers and maintainers, produce a list of features and processes crucial to successful metadata registration. The focus will be on registries most relevant to DataONE. The intern will work closely with project mentors to present a prototype framework supporting such features. The framework will be illustrative during this initial phase, and will inform the DataONE Preservation and Metadata Working Group’s efforts to pursue this activity on an operational level.
Necessary Prerequisites: Knowledge of processes and procedures underlying metadata scheme development, including factors impacting scheme revision and modification; an understanding of semantic interoperability, and other levels of interoperability (syntactic, interchange protocols, metadata workflows, etc.); an understanding of the rationale and potential functionality of metadata registration; knowledge of at least several national or international metadata registries for data structures and vocabularies (e.g., Dublin Core, EML, FGDC, ISO 19115); good research and communication skills, an interest in research and metadata assessment, and a willingness to pursue correspondence with metadata registry developers and maintainers.
Desirable skills/qualifications: Background in information and library science or computer science; interest in or background in scientific topics covered in the DataONE domain; awareness of the ISO/IEC 11179 Metadata Registry (MDR) standard.
Primary mentor: Jane Greenberg (SILS Metadata Research Center)
Secondary mentors: John Kunze (University of California Curation Center, California Digital Library), Joan Boone (Metadata Research Center)
- Developing a DaaS (Data as a Service) view of DataONE
Description: DataONE is establishing an interoperable, extensible, sustainable national (and international) scale cyberinfrastructure for the collection, analysis, and synthesis of data from many projects, fields, and modes of science. The V1.0 software stack for DataONE will soon be in deployment. It represents an interoperable set of data collections of the DataONE member nodes. While this software stack is complete and allows exploration and re-use of the data collections, there are other ways in which the sum of these collections could be presented. Some alternate interfaces may be more (or less) suitable for various analysis and synthesis research. This short-term project would explore developing an additional, somewhat more universal key/value metadata store for the collected DataONE metadata (using an approach such as Google Big-Table, Cassandra, or similar tools as appropriate). The goal would be to enable an additional set of REST interfaces from which to access the DataONE collection along the lines of the DaaS (Data as a Service) cloud model. The specific sequential tasks in this project could be: 1) Develop familiarity with the DataONE software stack; 2) Design DaaS service definitions and make technology choices. Included in this design should be some anticipation of future query engines and science applications; 3) Implement a pilot DaaS service on a DataONE test node; 4) Import and convert DataONE metadata; 5) Test the implementation for correctness and performance; 6) Develop query tools to exercise the REST interface and provide useful higher level "toolbox items" for DataONE infrastructure and target science areas; 7) Pilot application to science problems of interest; 8) Produce appropriate documentation and scholarly communication throughout the project. These items could be modified in response to mentor/intern discussions early in the project, as well as in light of what can reasonably be achieved during the project's duration.
Necessary Prerequisites: Programming ability; BS in Computer Science or related field with computer science experience; experience in another discipline science (preferably in a DataONE-oriented field like biology, ecology, environmental science, …)
Desirable skills/qualifications: Experience with database design, large-scale data algorithms, Python proficiency, software design, RESTful web services, cyberinfrastructure, analytics proficiency (for example, R)
Primary mentor: John W. Cobb (Oak Ridge National Laboratory)
Secondary mentor: Nicholas Dexter (University of Tennessee)