Full article title Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators
Journal PLOS Computational Biology
Author(s) Barone, Lindsay; Williams, Jason; Micklos, David
Author affiliation(s) Cold Spring Harbor Laboratory
Primary contact Email: lbarone at cshl dot edu
Editors Ouellette, Francis
Year published 2017
Volume and issue 13(11)
Page(s) e1005755
DOI 10.1371/journal.pcbi.1005755
ISSN 1553-7358
Distribution license Creative Commons Attribution 4.0 International
Website http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005755
Download http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005755&type=printable (PDF)

Abstract

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

Introduction

Genotypic data based on DNA and RNA sequences have been the major driver of biology’s evolution into a data science. The current Illumina HiSeq X sequencing platform can generate 900 billion nucleotides of raw DNA sequence in under three days, four times the number of annotated nucleotides currently stored in GenBank, the United States “reference library” of DNA sequences.[1][2] In the last decade, a 50,000-fold reduction in the cost of DNA sequencing[3] has led to an accumulation of 9.3 quadrillion (million billion) nucleotides of raw sequence data in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The amount of sequence in the SRA doubled on average every six to eight months from 2007 to 2016.[4][5] It is estimated that by 2025, the storage of human genomes alone will require two to 40 exabytes[5] (an exabyte of storage would hold 100,000 times the printed materials of the U.S. Library of Congress[6]). Beyond genotypic data, big data are flooding biology from all quarters—phenotypic data from agricultural field trials, patient medical records, and clinical trials; image data from microscopy, medical scanning, and museum specimens; interaction data from biochemical, cellular, physiological, and ecological systems; as well as an influx of data from translational fields such as bioengineering, materials science, and biogeography.
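As a back-of-envelope check on that growth rate, compounding a 6- to 8-month doubling time over the 2007–2016 period (taken here as roughly 108 months, an assumption for illustration) implies cumulative growth factors ranging from the thousands to the hundreds of thousands:

```python
# Back-of-envelope: cumulative growth implied by a 6-8 month doubling
# time sustained over roughly 2007-2016 (~108 months, an assumption).
months = 9 * 12
for doubling_months in (6, 8):
    doublings = months / doubling_months
    factor = 2 ** doublings
    print(f"{doubling_months}-month doubling: ~2^{doublings:.1f} = {factor:,.0f}x growth")
```

Either assumption yields growth of four or more orders of magnitude, consistent with the accumulation of quadrillions of nucleotides in the SRA.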

A 2003 report of a National Science Foundation (NSF) blue-ribbon panel, headed by Daniel Atkins, popularized the term "cyberinfrastructure" to describe systems of data storage, software, high-performance computing (HPC), and people who can solve scientific problems of the size and scope presented by big data.[7] The report was the impetus for several cyberinfrastructure projects in the biological sciences, including the NSF’s CyVerse, the Department of Energy’s KBase, and the European Grid Infrastructure and the European Life Sciences Infrastructure for Biological Information (ELIXIR).[8] Atkins' report described cyberinfrastructure as the means to harness the data revolution and to develop a “knowledge economy.” Although people were acknowledged as active elements of cyberinfrastructure, few published studies since have assessed how well their computational and cyberinfrastructure needs are being met.

In 2006, EDUCAUSE surveyed 328 information technology (IT) professionals, primarily chief information officers, at institutions in the U.S. and Canada.[9] When asked about preferences for funding allocation, respondents rated training and consulting (20%) a distant second to infrastructure and storage (46%). This suggested that “training and consulting get short shrift when bumped against the realities of running an IT operation.”[9] Similarly, infrastructure and training emerged as important needs in a study done as part of the 2015 University of Illinois’s “Year of Cyberinfrastructure.”[10] Faculty and graduate students responding to a survey (n = 327) said they needed better access to data storage (36%), data visualization (29%), and HPC (19%). Training was not addressed in the initial survey, suggesting that it was not viewed as integral to discussions of cyberinfrastructure. However, it emerged as a major need in follow-up focus groups (n = 200).

Over the last four years, CyVerse has taken the computational pulse of the biological sciences by surveying attendees at major professional meetings. Consistently and across different conference audiences, 94% of students, faculty, and researchers said that they currently use large data sets in their research or think they will in the near future (n = 1,097). Even so, 47% rated their bioinformatics skill level as “beginner,” 35% rated themselves “intermediate,” and 6% said they have never used bioinformatics tools. Only 12% rated themselves “advanced” (n = 608), and 58% felt their institutions do not provide all the computational resources needed for their research (n = 1,024). These studies suggest a scenario of big data inundating unprepared biologists.

Results

In the summer of 2016, we expanded upon our previous studies with a purposeful needs assessment of 704 principal investigators (PIs) receiving grants from the NSF Directorate of Biological Sciences (BIO). The respondents were relatively evenly dispersed among four major BIO divisions: Division of Biological Infrastructure (DBI), Division of Environmental Biology (DEB), Division of Integrative Organismal Systems (IOS), and Division of Molecular and Cellular Biosciences (MCB). These BIO PIs worked with a variety of data, with sequence, image, phenotype, and ecological data predominating (Fig 1). The vast majority (87%) said they are currently using big data sets in their research or will within the next three years. This is slightly lower than in our previous studies of meeting attendees, a large proportion of whom had a genomics focus or were students or early career researchers.


Fig1 BaronePLOSCompBio2017 13-11.png

Figure 1. Major data types used by National Science Foundation (NSF) Biological Sciences Directorate (BIO) principal investigators (PIs)

We asked BIO PIs to rate the importance of 13 computational needs in data analysis; data storage, sharing, and discovery; and computational support and training. More than half of the PIs said that 11 of the 13 computational needs are currently important to their research. The proportions increased across all needs—82% to 97%—when PIs considered what would be important three years in the future (Fig 2). Significantly more PIs who identified themselves as bioinformaticians said nine of the current needs are important compared to PIs from all other disciplines. Significantly more PIs from larger research groups (greater than five people) said seven of the current needs are important compared to those from smaller groups. Most of the differences between bioinformaticians and larger research groups persisted in their predictions of future needs (Table 1).


Fig2 BaronePLOSCompBio2017 13-11.png

Figure 2. Current (grey) and future (blue) data analysis needs of National Science Foundation (NSF) Biological Sciences Directorate (BIO) principal investigators (PIs) (percent responding affirmatively, 387 ≤ n ≤ 551)

Tab1 BaronePLOSCompBio2017 13-11.png

Table 1. Current and future data analysis needs of National Science Foundation (NSF) Biological Sciences Directorate (BIO) principal investigators (PIs): Bioinformaticians versus others, large versus small research groups

Significantly more PIs funded by DEB said five of the current needs are important, compared to PIs funded through the other three NSF research divisions. However, differences between the four NSF divisions disappeared for predictions of future need, suggesting that computational needs will converge across all fields of biology in the future (Table 2).


Tab2 BaronePLOSCompBio2017 13-11.png

Table 2. Current and future data analysis needs of National Science Foundation (NSF) Biological Sciences Directorate (BIO) principal investigators (PIs) by the NSF BIO division

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Fig 3). Training on the integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HPC (71%) were the three greatest unmet needs. HPC was an unmet need for only 27% of PIs, with similar percentages across disciplines, different sized groups, and NSF programs.


Fig3 BaronePLOSCompBio2017 13-11.png

Figure 3. Unmet data analysis needs of National Science Foundation (NSF) Biological Sciences Directorate (BIO) principal investigators (PIs) (percent responding negatively, 318 ≤ n ≤ 510)

Discussion

This study fills a gap in the published literature on the computational needs of biological science researchers. Respondents had all been awarded at least one peer-reviewed grant from NSF BIO and thus represent competitive researchers across a range of biological disciplines. Even so, a majority of this diverse group of successful biologists did not feel that their institutions are meeting their needs for tackling large data sets.

This study stands in stark contrast to previous studies that identified infrastructure and data storage as the most pressing computational needs.[9][10] BIO PIs ranked availability of data storage and HPC lowest on their list of unmet needs. This provides strong evidence that the NSF and individual universities have succeeded in developing a broadly available infrastructure to support data-driven biology. Hardware is not the issue. The problem is the growing gap between the accumulation of many kinds of data and researchers’ knowledge about how to use them effectively. The biologists in this study see training as the most important factor limiting their ability to best use the big data generated by their research.

Closing this growing data knowledge gap in biology demands a concerted effort by individual biologists, by institutions, and by funding agencies. We need to be creative in scaling up computational training to reach large numbers of biologists at all phases of their education and careers and in measuring the impact of our educational investments. Metrics for a supercomputer are readily described in terms of petaflops and CPUs, and we can facilely measure training attendance and “satisfaction.” However, answering unmet training needs will require a better understanding of how institutions are attempting to meet these needs and how we can best assess their outcomes.[11] Some solutions already exist. For example, data sets available at the SRA provide almost unlimited entry points for course-based undergraduate research experiences (CUREs), which scale up discovery research in the context of for-credit courses. Participation in CUREs significantly improves student graduation rates and retention in science, effects that persist across racial and socioeconomic groups.[12][13] However, many biologists acquire skills for big data analysis on their own, in the midst of their careers. Software Carpentry and Data Carpentry[14] are volunteer-driven organizations that provide a cost-effective, disseminated model for reaching biologists outside of an academic classroom.

Reflected in the top two unmet needs of BIO PIs is the looming problem of integrating data from different kinds of experiments and computational platforms. This will be required for a deeper understanding of “the rules of life”[15][16], notably, genotype-environment-phenotype interactions that are essential to predicting how agricultural plants and animals can adapt to changing climates. Such integration demands new standards of data management and attention to metadata about how these data are collected. The BIO PIs in this study are anticipating a new world of pervasive data and the training they will need to become data scientists. Likewise, funding agencies need to recognize that significant new investments in training are now required to make the best use of the biological data infrastructures they have helped establish over the last decade.

Materials and methods

This study was conducted under IRB no. 12–018 from Cold Spring Harbor Laboratory. Working from a list of 5,197 active grant awards, we removed duplicate PIs and those without email addresses to produce a final list of 3,987 subjects. The survey was administered in Survey Monkey using established methods.[17] An initial email invitation with a link to the survey was sent to each subject in June 2016, with three follow-up emails sent at two-week intervals. Surveys were completed by 704 PIs, a response rate of 17.7%, which provided a ±3.35% margin of error at the 95% confidence level.
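The reported ±3.35% margin of error can be reproduced with the standard formula for a proportion at 95% confidence, applying the finite population correction to the survey's own numbers (704 respondents drawn from 3,987 invited PIs); a minimal sketch:

```python
import math

# Margin of error for a proportion at 95% confidence, with the finite
# population correction, using the survey's numbers: 704 respondents
# sampled from a population of 3,987 invited PIs.
def margin_of_error(n, population, z=1.96, p=0.5):
    fpc = math.sqrt((population - n) / (population - 1))  # finite population correction
    return z * math.sqrt(p * (1 - p) / n) * fpc

moe = margin_of_error(704, 3987)
print(f"margin of error: +/-{moe * 100:.2f}%")  # +/-3.35%, matching the paper
```

Using p = 0.5 gives the most conservative (largest) margin for a yes/no survey question.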

The respondents were asked to consider 13 computational elements of research, including data storage, discovery, analysis, and sharing. For each need, PIs were asked to reflect on their current use, their anticipated future requirements, and the institutional resources available to meet the need. Data were analyzed in IBM SPSS Statistics version 23. “I don’t know” responses were eliminated from the analysis of computational needs questions. Frequencies were calculated for each of the affirmative and negative responses in the computational needs matrix. Chi-square tests for independence were used to determine if there were significant differences in computational needs across the following three dimensions: (1) NSF BIO division, (2) research area (bioinformatics/computational biology versus all others), and (3) research group size (groups of fewer than five versus groups with five or more).
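The chi-square test of independence used here compares observed cell counts in a contingency table against the counts expected if the two dimensions were unrelated. A minimal from-scratch sketch for a 2×2 table (the counts below are hypothetical, not taken from the survey data):

```python
# Chi-square test of independence for a 2x2 contingency table,
# implemented from scratch for illustration (no SciPy required).
def chi_square_2x2(table):
    """Return the chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: "need is important" (yes/no) for large vs. small groups.
observed = [[60, 40],   # large groups: yes, no
            [30, 70]]   # small groups: yes, no
stat = chi_square_2x2(observed)
# With 1 degree of freedom, the 5% critical value is 3.841.
print(f"chi2 = {stat:.2f}; significant at p < .05: {stat > 3.841}")
```

In practice this is a one-liner with `scipy.stats.chi2_contingency`, which also returns the p-value and applies Yates' continuity correction for 2×2 tables by default.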

Data are available for download at https://figshare.com/articles/Survey_of_biologist_s_computational_needs/4643641

Author summary

Our computational needs assessment of 704 principal investigators (PIs) receiving grants from the National Science Foundation (NSF) Biological Sciences Directorate (BIO) confirmed that biology is awash with big data. Nearly 90% of BIO PIs said they are currently or will soon be analyzing large data sets. They considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. However, a majority of PIs—across bioinformatics and other disciplines, large and small research groups, and four NSF BIO programs—said their institutions are not meeting nine of 13 needs. Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HPC (71%) were the three greatest unmet needs. Hardware is not the problem; data storage and HPC ranked lowest on their list of unmet needs. The problem is the growing gap between the accumulation of big data and researchers’ knowledge about how to use it effectively.

Declarations

Acknowledgements

The authors wish to thank Bob Freeman and Christina Koch of the ACI-REF project for helpful discussions and references during the development of the survey.

Funding

This study is an Education, Outreach and Training (EOT) activity of CyVerse, an NSF-funded project to develop a “cyber universe” to support life sciences research (DBI-0735191 and DBI-1265383). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

The authors have declared that no competing interests exist.

References

  1. "GenBank and WGS Statistics". National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/genbank/statistics/. Retrieved 2017. 
  2. "HiSeq X Series of Sequencing Systems" (PDF). Illumina, Inc. 22 March 2016. https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet-hiseq-x-ten.pdf. Retrieved 2017. 
  3. Wetterstrand, K.. "DNA Sequencing Costs: Data". National Human Genome Research Institute. https://www.genome.gov/sequencingcostsdata/. Retrieved 2017. 
  4. "Sequence Read Archive". National Center for Biotechnology Information. 7 November 2017. https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement. Retrieved 2017. 
  5. Stephens, Z.D.; Lee, S.Y.; Faghri, F. et al. (2015). "Big Data: Astronomical or Genomical?". PLOS Biology 13 (7): e1002195. doi:10.1371/journal.pbio.1002195. PMC4494865. PMID 26151137. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494865. 
  6. Johnston, L. (25 April 2012). "A “Library of Congress” Worth of Data: It’s All In How You Define It". The Signal. Library of Congress. https://blogs.loc.gov/thesignal/2012/04/a-library-of-congress-worth-of-data-its-all-in-how-you-define-it/. Retrieved 08 March 2017. 
  7. Atkins, D.E.; Droegemeier, K.K.; Feldman, S.I. et al. (January 2003). "Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure" (PDF). National Science Foundation. https://www.nsf.gov/cise/sci/reports/atkins.pdf. Retrieved 2017. 
  8. Duarte, A.M.; Psomopoulos, F.E.; Blanchet, C. et al. (2015). "Future opportunities and trends for e-infrastructures and life sciences: Going beyond the grid to enable life science data analysis". Frontiers in Genetics 6: 197. doi:10.3389/fgene.2015.00197. PMC4477178. PMID 26157454. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4477178. 
  9. Blustain, H.; Katz, R.; Salaway, G. (28 August 2006). "IT Engagement in Research: A Baseline Study". EDUCAUSE. https://library.educause.edu/resources/2006/8/it-engagement-in-research-a-baseline-study. Retrieved 2017. 
  10. Towns, J.; Gerstenecker, D.; Herriott, L. et al. (12 October 2015). "University of Illinois Year of Cyberinfrastructure Final Report". IDEALS. University of Illinois. https://www.ideals.illinois.edu/handle/2142/88444. Retrieved 2017. 
  11. Williams, J.J.; Teal, T.K. (2017). "A vision for collaborative training infrastructure for bioinformatics". Annals of the New York Academy of Sciences 1387 (1): 54–60. doi:10.1111/nyas.13207. PMID 27603332. 
  12. Auchincloss, L.C.; Laursen, S.L.; Branchaw, J.L. et al. (2014). "Assessment of course-based undergraduate research experiences: A meeting report". CBE Life Sciences Education 13 (1): 29–40. doi:10.1187/cbe.14-01-0004. PMC3940459. PMID 24591501. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3940459. 
  13. Rodenbusch, S.E.; Hernandez, P.R.; Simmons, S.L. et al. (2016). "Early Engagement in Course-Based Research Increases Graduation Rates and Completion of Science, Engineering, and Mathematics Degrees". CBE Life Sciences Education 15 (2): ar20. doi:10.1187/cbe.16-03-0117. PMC4909342. PMID 27252296. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4909342. 
  14. Teal, T.K.; Cranston, K.A.; Lapp, H. et al. (2015). "Data Carpentry: Workshops to Increase Data Literacy for Researchers". International Journal of Digital Curation 10 (1). doi:10.2218/ijdc.v10i1.351. 
  15. Olds, J. (1 July 2015). "Understanding the Rules of Life: Examining the Role of Team Science". Bio Buzz. Directorate for Biological Sciences. https://oadblog.nsfbio.com/2015/07/01/team_science/. Retrieved 08 March 2017. 
  16. Mervis, J. (10 May 2016). "NSF director unveils big ideas, with an eye on the next president and Congress". Science. American Association for the Advancement of Science. http://www.sciencemag.org/news/2016/05/nsf-director-unveils-big-ideas-eye-next-president-and-congress. Retrieved 2017. 
  17. Dillman, D.A.; Smyth, J.D.; Christian, L.M. (2009). Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. Wiley. pp. 499. ISBN 9780471698685. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. In one case, the original URL was dead and was replaced with a functional one.