Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

dc.bibliographicCitation.firstPage: 861944
dc.bibliographicCitation.volume: 7
dc.contributor.author: Schubotz, Moritz
dc.contributor.author: Satpute, Ankit
dc.contributor.author: Greiner-Petter, André
dc.contributor.author: Aizawa, Akiko
dc.contributor.author: Gipp, Bela
dc.date.accessioned: 2022-09-19T11:07:12Z
dc.date.available: 2022-09-19T11:07:12Z
dc.date.issued: 2022
dc.description.abstract: Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, if the ad-hoc research software fails during the often long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as proprietary dependence, speed, etc. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans. We exhibit the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval. [eng]
dc.description.version: publishedVersion [eng]
dc.identifier.uri: https://oa.tib.eu/renate/handle/123456789/10218
dc.identifier.uri: http://dx.doi.org/10.34657/9253
dc.language.iso: eng
dc.publisher: Lausanne : Frontiers Media
dc.relation.doi: https://doi.org/10.3389/frma.2022.861944
dc.relation.essn: 2504-0537
dc.relation.ispartofseries: Frontiers in research metrics and analytics 7 (2022)
dc.rights.license: CC BY 4.0 Unported
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: caching [eng]
dc.subject: data science (DS) [eng]
dc.subject: reproducibility of results [eng]
dc.subject: open science [eng]
dc.subject: research software [eng]
dc.subject.ddc: 380
dc.subject.ddc: 070
dc.title: Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer [eng]
dc.type: article [eng]
dc.type: Text [eng]
dcterms.bibliographicCitation.journalTitle: Frontiers in research metrics and analytics
tib.accessRights: openAccess [eng]
wgl.contributor: FIZ KA
wgl.subject: Erziehung, Schul- und Bildungswesen [ger] (Education, schooling, and educational system)
wgl.subject: Informatik [ger] (Computer science)
wgl.type: Zeitschriftenartikel [ger] (Journal article)
Files
Original bundle
Name: Caching_and_Reproducibility.pdf
Size: 148.49 KB
Format: Adobe Portable Document Format