Full article title Data without software are just numbers
Journal Data Science Journal
Author(s) Davenport, James H.; Grant, James; Jones, Catherine M.
Author affiliation(s) University of Bath, Science and Technology Facilities Council
Primary contact Email: J dot H dot Davenport at bath dot ac dot uk
Year published 2020
Volume and issue 19(1)
Article # 3
DOI 10.5334/dsj-2020-003
ISSN 1683-1470
Distribution license Creative Commons Attribution 4.0 International
Website https://datascience.codata.org/articles/10.5334/dsj-2020-003/
Download https://datascience.codata.org/articles/10.5334/dsj-2020-003/galley/929/download/ (PDF)

Abstract

Great strides have been made to encourage researchers to archive data created by research and provide the necessary systems to support their storage. Additionally, it is recognized that data are meaningless unless their provenance is preserved, through appropriate metadata. Alongside this is a pressing need to ensure the quality and archiving of the software that generates data, through simulation and control of experiment or data collection, and that which analyzes, modifies, and draws value from raw data. In order to meet the aims of reproducibility, we argue that data management alone is insufficient: it must be accompanied by good software practices, the training to facilitate it, and the support of stakeholders, including appropriate recognition for software as a research output.

Keywords: software citation, software management, reproducibility, archiving, research software engineer

Introduction

In the last decade, there has been a drive towards improved research data management in academia, moving away from the model of "supplementary material" that did not fit in publications, to the requirement that all data supporting research be made available at the time of publication. In the U.K., for example, the Research Councils have a Concordat on Open Research Data[1], and the E.U.’s Horizon 2020 program incorporates similar policies on data availability.[2] The FAIR principles[3], which state that data should be findable, accessible, interoperable, and re-usable, embody the philosophy underlying this: data should be preserved through archiving with a persistent identifier, it should be well described with suitable metadata, and it should be done in a way that is relevant to the domain. Together with the open access movement, this has brought about a profound transformation in the availability of research and the data supporting it.

While this is a great stride towards transparency, it does not by itself improve the quality of research, and even what exactly transparency entails remains debated.[4] A common theme discussed in many disciplines is the need for a growing emphasis on "reproducibility."[5][6][7] This goes beyond data itself, requiring software and analysis pipelines to be published in a usable state alongside papers. In order to spread such good practices, a coordinated effort is needed towards training in professional programming methods in academia, recognizing the role of research software and the effort required to develop it, and storing the software instance itself as well as the data it creates and operates on.

In the next section of this article we discuss two cases where the use of spreadsheets highlights the need for programmatic approaches to analysis. In the subsequent section we review the research software engineer movement, which now has nascent organizations internationally. While some domains are adopting and at the forefront of developing good practices, the sector-wide approaches needed to support their uptake generally are lacking; we discuss this issue in the penultimate section. We close by summarizing how data librarians and research software engineers need to work with researchers to continue to improve the situation.

When analysis "goes wrong"

The movement towards reproducible research is driven by the belief that reviewers and readers should be able to verify and readily validate the analysis workflows supporting publications. Rather than being viewed as questioning academic rigor, this concept should be embraced as a vital part of the research cycle. Here we discuss two examples which illustrate how oversights can cause issues, which ultimately should be avoidable.

How not to Excel ... at economics

Reinhart and Rogoff’s now notorious 2010 paper showed a headline figure of a 0.1% contraction for economies with >90% debt.[8] A number of issues with their work are raised by Herndon, Ash, and Pollin[9], who were unable to reproduce the results—despite the raw data being published—since Reinhart & Rogoff's method was not fully described. Further, when the spreadsheet used for the calculation was analyzed it was found that five countries (Australia, Austria, Belgium, Canada, and Denmark) had been incorrectly omitted from the analysis. Together with methodological issues, the revised analysis showed a 2.2% growth.

The mistakes received particular attention, with numerous articles published on the topic (e.g., Borwein & Bailey's 2013 article[10]), since the original paper was used to justify austerity policies aimed at cutting debt in the U.S., U.K., and E.U., as well as within the International Monetary Fund (IMF). The reliance of the proponents of these policies (and of the resulting economic and geopolitical outcomes) on a flawed analysis should act as a stark warning that all researchers need to guard against error and embrace transparency.
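As a minimal sketch of how a programmatic workflow can guard against this kind of omission, the Python below computes an average only after checking that every expected country is present in the input. The function name, the country list, and the numeric values are illustrative placeholders chosen for this example; they are not data or methods from either paper.

```python
from statistics import mean


def mean_growth(growth_by_country, expected_countries):
    """Average growth across countries, refusing to run if any are missing."""
    missing = sorted(set(expected_countries) - set(growth_by_country))
    if missing:
        raise ValueError(f"countries missing from analysis: {missing}")
    return mean(growth_by_country[c] for c in expected_countries)


# Placeholder values for illustration only; not taken from any published dataset.
expected = ["Australia", "Austria", "Belgium", "Canada", "Denmark"]
complete = {"Australia": 1.0, "Austria": 2.0, "Belgium": 3.0,
            "Canada": 4.0, "Denmark": 5.0}
incomplete = {"Canada": 4.0, "Denmark": 5.0}

print(mean_growth(complete, expected))

try:
    mean_growth(incomplete, expected)
except ValueError as err:
    # The omission is reported instead of silently changing the result.
    print(err)
```

In a spreadsheet, a formula range that stops a few rows short gives no warning. Here the same mistake stops the analysis with an explicit message.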

How not to Excel ... with genes

When files are opened in Microsoft Excel, the default behavior is to infer data types, but while this may benefit general users, it is not always helpful. For example, two gene symbols, SEPT2 and MARCH1, are converted into dates, while certain identifiers (e.g., 2310009E13) are converted to floating point numbers. Although this has been known since 2004, a 2016 study by Ziemann, Eren, and El-Osta[11] found that the issue continues to affect papers, as identified through supplementary data. Numbers have typically increased year-on-year, with 20% of papers affected on average, rising to over 30% in Nature. The problem continues to occur even though it is sufficiently mature and pervasive that a service has been developed to identify affected spreadsheets.[12]
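A check of this kind can be sketched in a few lines of Python. The example below flags values in a gene-symbol column that look like the results of automatic conversion. The two patterns (short dates such as "2-Sep" and scientific notation such as "2.31E+13") and the function name are assumptions made for illustration; this is not the method used by Ziemann et al. or by the service cited above, and real spreadsheets may show other forms.

```python
import re

# Assumed patterns for illustration: Excel-style short dates ("2-Sep", "1-Mar")
# and scientific notation with a decimal point and plus sign ("2.31E+13").
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)
SCI_LIKE = re.compile(r"^\d\.\d+E\+\d+$", re.IGNORECASE)


def suspicious_symbols(values):
    """Return the values that match either of the assumed conversion patterns."""
    flagged = []
    for value in values:
        text = value.strip()
        if DATE_LIKE.match(text) or SCI_LIKE.match(text):
            flagged.append(value)
    return flagged


column = ["TP53", "2-Sep", "1-Mar", "BRCA1", "2.31E+13"]
print(suspicious_symbols(column))
```

The unconverted forms (SEPT2, MARCH1, 2310009E13) pass through unflagged. Reading the file with Python's `csv` module, which returns every field as a string by default, avoids the type inference in the first place.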

Research software

While we stress that non-programmatic approaches such as the use of spreadsheets do not of themselves cause errors, they do compromise the ability to test and reproduce analysis workflows. Further, the publication of software is part of a wider program of transparency and open access.[13] However, if these relatively simple issues occur, we must find ways of identifying and avoiding all problems with data analysis, data collection, and experiment operation. Publishing software also makes deliberately obfuscated methods easier to identify and discuss with authors at review.

Increasingly, research across disciplines depends upon software, used for experimental control or instrumentation, simulating models or analysis, and turning numbers into figures. It is vital that bespoke software is published alongside the journal article and the data it supports. While it doesn’t ensure that code is correct, it does enable the reproducibility of analysis and allows experimental workflows to be checked and validated against correct or "expected" behavior. Making code available and employing good practice in its development should be the default, whether it be a million lines of community code or a short analysis script.
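Even a short analysis script can be checked against expected behavior. The sketch below pairs a hypothetical analysis step with a test that encodes a known answer; the function `normalize` and the test values are invented for illustration and stand in for whatever a real script computes.

```python
import math


def normalize(values):
    """Scale values so they sum to 1; a stand-in for any small analysis step."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


def test_normalize():
    # Expected behavior written down once, then re-checked on every change.
    result = normalize([1, 1, 2])
    assert result == [0.25, 0.25, 0.5]
    assert math.isclose(sum(result), 1.0)


test_normalize()
print("all checks passed")
```

A test like this does not prove the analysis is correct, but it gives reviewers and later users a concrete statement of what the code is supposed to do.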

The Research Software Engineer movement grew out of a working group of the Software Sustainability Institute[14] (SSI), which has since been a strong supporter of the U.K. Research Software Engineer Association (UKRSEA), now known as the Society of Research Software Engineering (RSE).[15] The aim has been to improve the sustainability, quality, and recognition of research software by advocating good software practice (see, e.g., Wilson et al.[16]) and career progression for its developers. Its work has resulted in recognition of the role by funders and fellowship schemes, as well as growing recognition of software as a vital part of e-infrastructure. Its success has spawned sister organizations internationally in Germany, the Netherlands, Scandinavia, and the U.S.

A 2014 survey by the SSI showed that 92% of researchers used research software, and that 69% would not be able to conduct their research without it.[17] Research software was defined as that used to generate, process, or analyze results for publication. Furthermore, 56% of researchers developed software, of whom 21% had never received any form of software training. It is clear that software underpins modern research and that many researchers are involved in development, even if it is not their primary activity.

Programmatic approaches to analysis and plotting allow for greater transparency, deliver efficiencies for researchers in academia, and, with formal training, improve employability in industry. Their adoption is further motivated by the requirements of funders and journals, which increasingly require, or at least encourage (see, e.g., the Association for Computing Machinery[18]), publication of software. This evolving landscape requires a rapid and connected response from researchers, data managers, and research software engineers if institutions are to improve software development practices in a sustainable way.

Establishing cultural change

In spite of the vital role research software plays, it largely remains undervalued, with time spent in training or development seen as detracting from the "real research." The lack of recognition starts with funders’ level of investment in the development and maintenance of code, and extends to institutions and investigators. This is compounded by the U.K.’s Research Assessment Exercises, and similar evaluations elsewhere, which have prioritized papers over all else. This results in inefficient development of new capability or introduction to new users, wasting researcher time and funders’ investment. The lack of recognition also ingrains bad habits, with the result that the longer researchers spend in academia, the lower their employability as software developers in industry becomes. Three areas in particular are key to securing a change in culture that mirrors what has been achieved with research data management.

Training

In recent years, organizations such as Software Carpentry[19] have led the development of training material to improve the professional software development skills of researchers. Material is available under a Creative Commons license and introduces programming skills and methods such as working with Unix, using version control, understanding programming languages, and practicing automation with Make.

The need for such training is recognized in the recent Engineering and Physical Sciences Research Council (EPSRC) call for new Centres for Doctoral Training (CDTs, one of the principal streams of research postgraduate funding in the U.K.)[20]:

It is therefore a certainty that many of the students being trained through the CDTs will be using computational and data techniques in their projects ... It is essential that they are given appropriate training so that they can confidently undertake such research in a manner that is correct, reproducible and reusable such as data curation and management.

To achieve this, there is a need to increase the number of training sessions and range of courses. Introductory courses alone are not sufficient to generate reproducible research, manage analysis workflows, and improve paper writing.[21] This requires additional in-depth training and mentoring to develop programming skills, including the use of version control appropriately for data management and the automation of testing. Indeed, CarpentryCon events are focusing efforts to develop courses to address these and other recommendations of Jiménez et al.[22]

Recognition

One of the principal challenges to improving research software is the lack of recognition. In spite of the ubiquity of research software and its role in enabling research, there is no formal citation or credit in assessment exercises. At times developers are named on papers, but often they are not, and there is no standardized approach to allow contributions to specific functionality or versions of code to be highlighted. A number of working groups have been addressing this (e.g., FORCE11[23] and the group Working Towards Sustainable Software for Science: Practice and Experience [WSSSPE][24]), producing various papers and blog posts (see, e.g., Smith et al.[23], Jones et al.[25], and Martone[26]) that lay out a vision for what software citation could look like.

Services to support version control are plentiful, while versioning and persistent identification of research software exist alongside research data management tools and services (such as Zenodo[27] and Figshare[28]). Additionally, the Digital Curation Centre[29] has introduced a Software Management Plan template in collaboration with the SSI. Mechanisms are in place to support the development, publication, and citation of software if the benefits are recognized by funders. Related to this is the benefit appropriate recognition can give to developing career pathways for research software engineers.

A further aspect is the role of software in delivering a "digital" research experience. While computing has revolutionized research, the principal output—research papers, their look, and their process of publication—has barely changed. Some organizations are looking at how technology might deliver alternative experiences, and how the publication process itself might be modernized (e.g., F1000Research[30]).

Policy

Funders and journals are key drivers for changes in the way in which research software is valued, and in having it recorded with reproducible workflows. The clear training requirements in EPSRC’s CDT proposal are aligned with the requirements of funders such as U.K. Research and Innovation (UKRI) and the European Research Council (ERC), which mandate that research software, where possible, should be made freely available when the research that it enables is published. For a number of years, the Defense Advanced Research Projects Agency (DARPA) has published all of the software that it has supported in a single catalog, the DARPA Open Catalog.[31]

Similarly, the Nature Publishing Group has recently strengthened its requirements with respect to software to include its publication and usability at the time of submission to one of the group's journals.[32] This should be done "in a way that allows readers to repeat the published results."[32] Efficient workflow and recognition should be sufficient carrots to engage researchers, but if they aren't, then these changes in policy are the sticks. They will force institutions to develop policies to support research software, reproducibility, and the transparency it delivers, as is happening in research data management.

Albeit one of semantics, an issue that does need to be addressed is one of terminology; in particular, the terms "replicate" and "reproduce" can have different, indeed contradictory, definitions.[33][34] Once there is common language and understanding to complement the tools, the programmatic approach to data analysis and publication of software should be as ubiquitous as open data and open access are becoming.

Towards research software management

Software management in general is a much-studied subject, and many companies live or die by their software management in a way that comparatively few academic groups do. These companies may use large proprietary systems, but open-source solutions also exist, and the availability of open continuous integration tools to automate testing means that the resource barriers to software management are much lower than they used to be. Additionally, containerization offers potential for reproducibility since it allows the storage and re-use of the system environment when the software was originally executed. The real barriers these days are the lack of consistent application of good practice across the board and the stop-start funding models too common in academia.
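Containerization captures the whole system environment. A much lighter-weight complement, sketched below, is to record the interpreter, operating system, and installed package versions next to the results each time an analysis runs. This is not a substitute for a container image, and the function name and output layout are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect interpreter, OS, and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with incomplete metadata
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be archived alongside the outputs.
    print(json.dumps(capture_environment(), indent=2))
```

Archiving this record with the data gives a later reader a starting point for rebuilding the environment, even where no container image was kept.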

Encouraging the use of modern methods and professional training will improve the quality of research software, as well as the employability and value of researchers to industry if they leave academia. It is also important that institutions continue to invest in research software engineers to support this effort, as is being seen across the U.K. and through schemes such as the EPSRC’s RSE Fellowship programs.[35] Perhaps most importantly, reproducibility requires research software engineers and data librarians to work together with researchers rather than in isolation. A recent workshop at TU Delft provided an opportunity for this[36], but larger scale events are required to increase engagement, get us out of our silos, and ensure that the tools, services, and training are designed with and for the benefit of researchers.

Acknowledgements

Competing interests

The authors have no competing interests to declare.

References

  1. Higher Education Funding Council for England, Research Councils UK, Universities UK, Wellcome (28 July 2016). "Concordat on Open Research Data" (PDF). https://www.ukri.org/files/legacy/documents/concordatonopenresearchdata-pdf/. 
  2. Directorate-General for Research & Innovation (26 July 2016). "Guidelines on FAIR Data Management in Horizon 2020" (PDF). H2020 Programme. European Commission. https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf. Retrieved 12 August 2019. 
  3. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC PMC4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  4. Lyon, L.; Jeng, W.; Mattern, E. (2017). "Research Transparency: A Preliminary Study of Disciplinary Conceptualisation, Drivers, Tools and Support Services". International Journal of Digital Curation 12 (1): 46–64. doi:10.2218/ijdc.v12i1.530. 
  5. Chen, X.; Dallmeier-Tiessen, S.; Dasler, R. et al. (2019). "Open is not enough". Nature Physics 15: 113–19. doi:10.1038/s41567-018-0342-2. 
  6. Mesnard, O.; Barba, L.A. (2017). "Reproducible and Replicable Computational Fluid Dynamics: It’s Harder Than You Think". Computing in Science & Engineering 19 (4): 44–55. doi:10.1109/MCSE.2017.3151254. 
  7. Allison, D.B.; Shiffrin, R.M.; Stodden, V. (2018). "Reproducibility of research: Issues and proposed remedies". Proceedings of the National Academy of Sciences of the United States of America 115 (11): 2561–62. doi:10.1073/pnas.1802324115. PMC PMC5856570. PMID 29531033. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5856570. 
  8. Reinhart, C.M.; Rogoff, K.S. (2010). "Growth in a Time of Debt". American Economic Review 100 (2): 573–78. doi:10.1257/aer.100.2.573. 
  9. Herndon, T.; Ash, M.; Pollin, R. (2013). "Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff". Cambridge Journal of Economics 38 (2): 257–279. doi:10.1093/cje/bet075. 
  10. Borwein, J.; Bailey, D.H. (22 April 2013). "The Reinhart-Rogoff error – or how not to Excel at economics". The Conversation. https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646. 
  11. Ziemann, M.; Eren, Y.; El-Osta, A. (2016). "Gene name errors are widespread in the scientific literature". Genome Biology 17 (1): 177. doi:10.1186/s13059-016-1044-7. PMC PMC4994289. PMID 27552985. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4994289. 
  12. Mallona, I.; Peinado, M.A. (2017). "Truke, a web tool to check for and handle excel misidentified gene symbols". BMC Genomics 18 (1): 242. doi:10.1186/s12864-017-3631-8. PMC PMC5359807. PMID 28327106. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5359807. 
  13. Munafò, M.R.; Nosek, B.A.; Bishop, D.V.M. et al. (2017). "A manifesto for reproducible science". Nature Human Behaviour 1: 0021. doi:10.1038/s41562-016-0021. 
  14. Software Sustainability Institute. "Software Sustainability Institute". https://www.software.ac.uk/. 
  15. Society of Research Software Engineering. "RSE Society of Research Software Engineering". http://rse.ac.uk/. 
  16. Wilson, G.; Bryan, J.; Cranston, K. et al. (2017). "Good enough practices in scientific computing". PLoS Computational Biology 13 (6): e1005510. doi:10.1371/journal.pcbi.1005510. PMC PMC5480810. PMID 28640806. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480810. 
  17. Hettrick, S.; Antonioletti, M.; Carr, L. et al. (2014). "UK Research Software Survey 2014". Zenodo. doi:10.5281/zenodo.14809. 
  18. Association for Computing Machinery (2018). "Software and Data Artifacts in the ACM Digital Library". https://www.acm.org/publications/artifacts. 
  19. Software Carpentry. "Software Carpentry". https://www.software-carpentry.org/. 
  20. Engineering and Physical Sciences Research Council (February 2018). "EPSRC 2018 CDTs" (PDF). https://epsrc.ukri.org/files/funding/calls/2018/2018cdtsoutlinescall/. Retrieved 12 August 2019. 
  21. Mawdsley, D.; Haines, R.; Jay, C. (2017). "Reproducible Research is Software Engineering". RSE 2017 Conference. University of Manchester. http://idinteraction.cs.manchester.ac.uk/RSE2017Talk/ReproducibleResearchIsRSE.html#/. Retrieved 12 August 2019. 
  22. Jiménez, R.C.; Kuzak, M.; Alhamdoosh, M. et al. (2017). "Four simple recommendations to encourage best practices in research software". F1000Research. doi:10.12688/f1000research.11407.1. PMC PMC5490478. PMID 28751965. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490478. 
  23. Smith, A.M.; Katz, D.S.; Niemeyer, K.E. et al. (2016). "Software citation principles". PeerJ Computer Science 2: e86. doi:10.7717/peerj-cs.86. 
  24. Working Towards Sustainable Software for Science: Practice and Experience. "Proceedings". http://wssspe.researchcomputing.org.uk/proceedings/. 
  25. Jones, C.M.; Matthews, B.; Gent, I. et al. (2017). "Persistent Identification and Citation of Software". International Journal of Digital Curation 11 (2). doi:10.2218/ijdc.v11i2.422. 
  26. Martone, M., ed. (2014). "Data Citation Synthesis Group: Joint Declaration of Data Citation Principles". FORCE11. doi:10.25490/a97f-egyk. https://www.force11.org/datacitationprinciples. 
  27. Zenodo. "Zenodo". https://zenodo.org/. 
  28. Figshare. "Figshare". https://figshare.com/. 
  29. Digital Curation Centre. "Digital Curation Centre". https://www.dcc.ac.uk/. 
  30. F1000Research. "F1000Research". https://f1000research.com/. 
  31. Defense Advanced Research Projects Agency. "Open Catalog". https://www.darpa.mil/opencatalog. 
  32. Nature Research. "Reporting standards and availability of data, materials, code and protocols". Nature Research - Editorial Policies. https://www.nature.com/nature-research/editorial-policies/reporting-standards. 
  33. Barba, L.A. (2018). "Terminologies for Reproducible Research". arXiv. https://arxiv.org/abs/1802.03311. 
  34. Plesser, H.E. (2018). "Reproducibility vs. Replicability: A Brief History of a Confused Terminology". Frontiers in Neuroinformatics 11: 76. doi:10.3389/fninf.2017.00076. PMC PMC5778115. PMID 29403370. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5778115. 
  35. Engineering and Physical Sciences Research Council (21 April 2017). "Research Software Engineer Fellowships II". https://epsrc.ukri.org/funding/calls/research-software-engineer-fellowships-ii/. Retrieved 12 August 2019. 
  36. Cruz, M.; Kurapati, S.; Türkyilmaz-van der Velden, Y. (2018). "Software Reproducibility: How to put it into practice?". OSFPreprints. doi:10.31219/osf.io/z48cm. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; however, this version lists them in order of appearance, by design.