Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

Schubotz, Moritz; Satpute, Ankit; Greiner-Petter, André; Aizawa, Akiko; Gipp, Bela

doi:https://doi.org/10.34657/9253

dc.bibliographicCitation.firstPage	861944
dc.bibliographicCitation.journalTitle	Frontiers in research metrics and analytics	eng
dc.bibliographicCitation.volume	7
dc.contributor.author	Schubotz, Moritz
dc.contributor.author	Satpute, Ankit
dc.contributor.author	Greiner-Petter, André
dc.contributor.author	Aizawa, Akiko
dc.contributor.author	Gipp, Bela
dc.date.accessioned	2022-09-19T11:07:12Z
dc.date.available	2022-09-19T11:07:12Z
dc.date.issued	2022
dc.description.abstract	Small to medium-scale data science experiments often rely on research software developed ad hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment or reuse the findings for subsequent research. Second, if the ad hoc research software fails during long-running, computationally expensive experiments, the effort to iteratively improve the software and rerun the experiments puts significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as proprietary dependencies and slow execution. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that incorporating the proposed recommendations into research software development will make the data related to that software FAIRer for both machines and humans. We demonstrate the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.	eng
dc.description.version	publishedVersion	eng
dc.identifier.uri	https://oa.tib.eu/renate/handle/123456789/10218
dc.identifier.uri	https://doi.org/10.34657/9253
dc.language.iso	eng
dc.publisher	Lausanne : Frontiers Media
dc.relation.doi	https://doi.org/10.3389/frma.2022.861944
dc.relation.essn	2504-0537
dc.rights.license	CC BY 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject.ddc	380
dc.subject.ddc	070
dc.subject.other	caching	eng
dc.subject.other	data science (DS)	eng
dc.subject.other	reproducibility of results	eng
dc.subject.other	open science	eng
dc.subject.other	research software	eng
dc.title	Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer	eng
dc.type	Article	eng
tib.accessRights	openAccess	eng
wgl.contributor	FIZ KA
wgl.subject	Erziehung, Schul- und Bildungswesen	ger
wgl.subject	Informatik	ger
wgl.type	Zeitschriftenartikel	ger

Files

Original bundle

Name: Caching_and_Reproducibility.pdf
Size: 148.49 KB
Format: Adobe Portable Document Format

Collections

Informationswissenschaften