Open Data
Open data refers to the practice of making research data freely available to others to use and reuse, with minimal restrictions. In the context of academic research, open data is becoming increasingly important as a way to promote transparency, collaboration, and reproducibility in the scientific process. By sharing their data, researchers can increase the visibility and impact of their work, facilitate collaborations, and build trust in the scientific method. Regarding this latter point, Open Data enables other researchers to replicate and build upon previous studies, leading to a more robust body of knowledge. Unfortunately, the production of Open Data sets is not currently included in the metrics used for researcher evaluation at CNRS.
The bare necessities
The minimum first step should be to publish the data behind every graph included in a paper. In other words, if you include a figure in an article, the data points, histogram contents, etc. should be made available so that anybody can re-plot the figure themselves, or include your points in their own studies.
Many publishers actually offer this option [1], and referees are increasingly likely to request access to such material during the review process anyway.
We note that data published as Supplemental Material will only be as accessible as the paper itself (i.e. under the same license) and may not have a specific DOI (for example, the journal Physical Review C only provides a Supplemental Material URL). Such data is merely associated with the article and does not exist as a research product of its own.
True Open Data: the FAIR principles
A true Open Data publication follows the FAIR principles. FAIR stands for Findability, Accessibility, Interoperability, and Reusability [2][3]. Each of these items is important to ensure that the data can be found and used properly.
- Findability
The data should be easy to find. This means steps have to be taken so that it can be indexed and searched correctly, including the attribution of a DOI and the inclusion of descriptive, relevant metadata.
- Accessibility
Once identified, the data should be accessible via a standard communication protocol. While small datasets can simply be downloaded from a website or an equivalent service, this might not be the best option for very large datasets. In the latter case, specific solutions should be implemented.
Access to the data may require identification (the list of persons accessing the data is logged) as well as registration (not everyone is granted access to the data).
- Interoperability
To be used, the data will most likely need to be interpreted by specific software, or be compared to other data sets. For this purpose, the metadata should clearly indicate the compatible software, the data format, etc., so that the data can still be read correctly in the future.
- Reusability
For reuse, the data should be given a specific copyright license that indicates who can use it, for what purpose, and whether it can be redistributed as-is or in a modified form. Again, all this information is to be indicated in the metadata.
Note
As with many things, FAIR is more an ideal to aim for than a strict pass-or-fail stamp. Just because one cannot check all the boxes does not mean that the achievable steps should not be undertaken.
To help researchers with Open Data publication, several websites and services are available to deposit data, such as Zenodo (an open registry developed by the European OpenAIRE program and operated by CERN) or Recherche Data Gouv (a French government-backed open science repository). These repositories make the process of depositing data easier and help ensure compliance with the FAIR principles.
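As an illustration, such a deposit can even be prepared programmatically. The following Python sketch creates a draft deposition through Zenodo's REST API; the endpoint and field names follow Zenodo's public documentation and should be checked before use, and the access token, file name, and metadata values are placeholders.

# Minimal sketch: create a draft deposit on Zenodo and upload one file.
# Requires the 'requests' package and a personal access token from zenodo.org.
import requests

TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder
API = "https://zenodo.org/api/deposit/depositions"

# 1. Create an empty draft deposition.
r = requests.post(API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the data file into the deposition's file bucket.
bucket = deposition["links"]["bucket"]
with open("spectrum.dat", "rb") as f:  # placeholder file name
    requests.put(f"{bucket}/spectrum.dat", data=f,
                 params={"access_token": TOKEN}).raise_for_status()

# 3. Attach minimal metadata (title, type, description, authors).
metadata = {"metadata": {"title": "Example dataset",
                         "upload_type": "dataset",
                         "description": "Test spectrum for an Open Data example.",
                         "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()

# The draft can then be reviewed, completed, and published from the Zenodo web interface.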
Note
Although the general goal is to have truly Open Data sets, a published data set may still require some sort of registration, identification, or authorization before access, while still being considered open and following the FAIR principles.
Metadata, metadata, metadata
The three key components of a good Open Data deposit are: the presence of metadata, extended metadata and accurate metadata. Indeed, having a binary data file is good, but if you don’t know what’s in it or how to read it, it’s just a useless chunk of bits [4].
The term metadata means “all the information that is necessary to identify and make sense of Open Data files”. This will most likely include the authors’ information, the description of the data set, etc., but also references and documentation on the format of the files.
Depending on the platform where the Open Data set is published, the required information in the metadata will vary. Below are a few metadata fields that are most likely to be required in an Open Data deposit:
- Data information
A simple title and description should be given. Optionally, keywords, subject headings, and any other categorization scheme should be used to ensure proper indexing.
- Authorship
The authors of the data, as well as any specific funding sources, etc., should be clearly identified. If possible, and in addition to name and affiliation, the authors’ identifiers on platforms such as ORCID or HAL should be included.
- License
The copyright license under which the data is published has to be indicated.
- Description
Finally, the full description of the data (what format it is in, how it can be read, how it was obtained, …) should be given, so that anybody can access it and use it correctly.
Guidelines and standards for writing metadata exist [5][6]. When you create the metadata yourself (some deposit platforms help you by providing a web form to fill in), the file format is usually XML, JSON, or YAML, which are largely interchangeable.
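To illustrate how interchangeable these formats are, the short Python sketch below writes the same minimal metadata record as JSON and as YAML (it assumes the PyYAML package is available; the field values are placeholders):

# Minimal sketch: the same metadata record serialized as JSON and as YAML.
import json
import yaml  # PyYAML, assumed installed

record = {
    "title": "Example dataset",  # placeholder values throughout
    "creators": [{"name": "Doe, Jane", "orcid": "0000-0000-0000-0000"}],
    "license": "CC-BY-4.0",
    "description": "Gamma-ray spectrum recorded for detector tests.",
}

with open("metadata.json", "w") as f:
    json.dump(record, f, indent=2)

with open("metadata.yaml", "w") as f:
    yaml.safe_dump(record, f, sort_keys=False)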
The documentation of the data format is very important: it is an absolute requirement for being able to read the data again later. In the appendix Data format, we discuss the constraints and choices related to formatting data.
When many files are concerned, metadata can be gathered automatically with scripts. I developed one that collects file properties (type, size, checksum, …) and outputs them in YAML format, in order to prepare the metadata.yaml files in my repositories “Experimental (n, n’ gamma) cross sections for isotopes 182,184 and 186W” [7] and “Experimental gamma-ray data recorded with a LaBr3 and digital acquisition” [8]. The script can be found in a dedicated git repository: “file_metadata” [9].
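The sketch below gives the general idea of such a script; it is not the actual file_metadata code, only a minimal Python example (PyYAML assumed installed; the directory and output names are placeholders):

# Minimal sketch: collect basic file properties and dump them as YAML,
# as a starting point for a metadata.yaml file.
import hashlib
import mimetypes
from pathlib import Path
import yaml  # PyYAML, assumed installed

def describe(path: Path) -> dict:
    """Return name, MIME type, size, and sha256 checksum of a file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    mime, _ = mimetypes.guess_type(path.name)
    return {"name": path.name,
            "type": mime or "application/octet-stream",
            "size_bytes": path.stat().st_size,
            "sha256": digest}

# Describe every file in the 'data' directory (placeholder location).
files = [describe(p) for p in sorted(Path("data").iterdir()) if p.is_file()]
Path("metadata.yaml").write_text(yaml.safe_dump({"files": files}, sort_keys=False))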
Case Study: Metadata for this manuscript
The file metadata.xml, shown in Listing 9, contains metadata about this document in the DataCite format [5].
<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 https://schema.datacite.org/meta/kernel-4.4/metadata.xsd">
<identifier identifierType="other">HDR_ghenning_2024</identifier>
<titles>
<title xml:lang="en">
Applying Random Sampling methods to data analysis for uncertainty production, with an Open source and Open science outlook.
</title>
</titles>
<creators>
<creator>
<creatorName nameType="Personal">Henning, Greg</creatorName>
<givenName>Greg</givenName>
<familyName>Henning</familyName>
<nameIdentifier nameIdentifierScheme="orcid" schemeURI="https://orcid.org/">
https://orcid.org/0000-0003-3678-8728
</nameIdentifier>
<affiliation affiliationIdentifier="https://ror.org/01g3mb532" affiliationIdentifierScheme="ror" SchemeURI="https://ror.org/">
Institut Pluridisciplinaire Hubert Curien
</affiliation>
<affiliation affiliationIdentifier="https://ror.org/00pg6eq24" affiliationIdentifierScheme="ror" SchemeURI="https://ror.org/">
Université de Strasbourg
</affiliation>
</creator>
</creators>
<descriptions>
<description xml:lang="en" descriptionType="Abstract">
This HDR manuscript discusses the significance of nuclear data for energy applications, with an emphasis on the need for improved accuracy in reaction modeling.
It focuses on inelastic neutron scattering reactions and presents findings from experimental studies using the Grapheme setup at the Gelina facility.
Specifically, it presents measurements on 183W, with the end goal of constraining reaction models.
This study uses a full Monte Carlo analysis approach to produce uncertainties and correlation matrices, aiming for comprehensive documentation.
Embracing Open Science principles, the manuscript details the current research practice standards for better publication of research products.
</description>
</descriptions>
<Date dateType="Created">2024-06-27</Date>
<Date dateType="Submitted">2024-07-01</Date>
<publisher xml:lang="en">Université de Strasbourg</publisher>
<publicationYear>2024</publicationYear>
<subjects>
<subject xml:lang="en" subjectScheme="PhySH - Physics Subject Headings" schemeURI="https://physh.org/" valueURI="https://physh.org/concepts/ef098eda-722a-4e25-8996-aa19116b725a">
Inelastic scattering reactions
</subject>
<subject xml:lang="en" subjectScheme="PhySH - Physics Subject Headings" schemeURI="https://physh.org/" valueURI="https://physh.org/concepts/624a78a4-af11-4f01-8bb8-5a0a81905d3d">
Nucleon induced nuclear reactions
</subject>
<subject xml:lang="en" subjectScheme="PhySH - Physics Subject Headings" schemeURI="https://physh.org/" valueURI="https://physh.org/concepts/14fede99-f5aa-4e3e-94be-167116d8c322">
Neutron physics
</subject>
<subject xml:lang="en" subjectScheme="PhySH - Physics Subject Headings" schemeURI="https://physh.org/" valueURI="https://physh.org/concepts/cd1858f1-89e4-4b0e-864a-b5a4a73de5a8">
Nuclear data analysis & compilation
</subject>
<subject xml:lang="en" subjectScheme="PhySH - Physics Subject Headings" schemeURI="https://physh.org/" valueURI="https://physh.org/concepts/eb9bd2e1-eedd-4bd0-997d-58b44ffa3ebb">
Monte Carlo methods
</subject>
<subject xml:lang="en" subjectScheme="LCSH - Library of Congress Subject Headings" schemeURI="https://id.loc.gov/authorities/subjects.html" valueURI="http://id.loc.gov/authorities/subjects/sh85139563">
Uncertainty
</subject>
<subject xml:lang="en" subjectScheme="LCSH - Library of Congress Subject Headings" schemeURI="https://id.loc.gov/authorities/subjects.html" valueURI="http://id.loc.gov/authorities/subjects/sh2012002918">
Measurement uncertainty (Statistics)
</subject>
<subject xml:lang="en" subjectScheme="LCSH - Library of Congress Subject Headings" schemeURI="https://id.loc.gov/authorities/subjects.html" valueURI="http://id.loc.gov/authorities/subjects/sh85004781">
Analysis of covariance
</subject>
</subjects>
<contributors>
<contributor contributorType="ProjectLeader">
<contributorName nameType="Personal">Henning, Greg</contributorName>
<givenName>Greg</givenName>
<familyName>Henning</familyName>
<nameIdentifier nameIdentifierScheme="orcid" schemeURI="https://orcid.org/">
https://orcid.org/0000-0003-3678-8728
</nameIdentifier>
<affiliation affiliationIdentifier="https://ror.org/01g3mb532" affiliationIdentifierScheme="ror" SchemeURI="https://ror.org/">
Institut Pluridisciplinaire Hubert Curien
</affiliation>
</contributor>
</contributors>
<language>en</language>
<resourceType resourceTypeGeneral="Dissertation">
Habilitation à Diriger les Recherches / Manuscript
</resourceType>
<formats>
<format>application/pdf</format>
<format>text/html</format>
<format>text/x-rst</format>
</formats>
<rightsList>
<rights rightsURI="https://creativecommons.org/licenses/by/4.0" rightsIdentifier="CC BY 4.0">
Creative Commons Attribution 4.0
</rights>
</rightsList>
<version>1.0</version>
</resource>
Here, one can see that the subjects of the document refer to two different schemes (the Physics Subject Headings and the Library of Congress Subject Headings), as PhySH does not include subject headings for uncertainties or covariance.
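Such a file can be checked against the DataCite schema before deposit. Below is a minimal Python sketch using the lxml package (assumed installed), under the assumption that the schema files have been downloaded locally beforehand (e.g. from https://schema.datacite.org/meta/kernel-4.4/):

# Minimal sketch: validate metadata.xml against the DataCite 4.4 schema.
from lxml import etree  # lxml, assumed installed

# The schema files are assumed to have been downloaded into 'kernel-4.4/'.
schema = etree.XMLSchema(etree.parse("kernel-4.4/metadata.xsd"))
document = etree.parse("metadata.xml")

if schema.validate(document):
    print("metadata.xml is valid DataCite metadata")
else:
    for error in schema.error_log:
        print(error.message)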
Case Study: Test data recorded with an LaBr3 detector
As a case study, I present the deposit of a very simple dataset: a \(^{152}\text{Eu}\) source \(\gamma\)-ray spectrum recorded with an LaBr3 detector connected to a digital acquisition system, as shown in Figure 68.
The data [8], recorded when testing and characterizing the detector, were deposited on Recherche Data Gouv with the specific goal of serving as an example of Open Data publication. The data itself is very simple: one binary file where the raw data is recorded by the acquisition, and one additional metadata file created by the acquisition (this metadata relates only to the recorded data).
The deposit contains eleven files in total. Indeed, in addition to the raw data, documentation of the detector is given, so that any user can be sure of what was used to record the data. A description of the recording conditions is also given (schematic of the geometry, photographs). A README.md file (written in Markdown) describes the files and the data (how it was obtained, how it can be read, …). An example of the processed data is also given.
Finally, a metadata file is provided. Written in YAML format, it lists the deposit content and describes the data and the format of each file. For the file types, the MIME media type scheme is used. The important file (the actual data) has a checksum field that allows users to make sure the data they download is the actual file from the deposit.
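For instance, a user who downloads the data could verify it against the checksum stored in the metadata with a few lines of Python (a sketch only; the file names, checksum algorithm, and metadata layout are assumptions, not the exact layout of the deposit):

# Minimal sketch: check a downloaded file against the checksum listed in metadata.yaml.
import hashlib
import yaml  # PyYAML, assumed installed

with open("metadata.yaml") as f:
    metadata = yaml.safe_load(f)

# Assumed layout: a list of file entries, each with 'name' and 'sha256' fields.
expected = {entry["name"]: entry["sha256"] for entry in metadata["files"]}

name = "run_152Eu.dat"  # placeholder file name
with open(name, "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual == expected[name] else "Checksum mismatch!")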
Once deposited on the platform and validated by a curator, the dataset can be viewed, downloaded and cited with its own DOI [8].
DMP, or what we should have started with
DMP stands for Data Management Plan. As the name hints, it is the plan to manage the data. Ideally, the DMP is prepared in advance of data recording. Its goal is to cover the whole life cycle of the data: from its production to its long-term storage, sharing, and eventual permanent deletion. (Although the DMP is supposed to be written at the start of a data recording cycle, we mention it only at the end of this section because, to fully understand what goes into the management plan, one first needs to know the different aspects of Open Data presented above.)
The DMP helps identify who produces the data, where it will be stored, how it will be organized, for how long, who will have access to it (and how), as well as how, where, and whether the data will be made available as Open Data.
Writing the DMP is a great way to iron out many details that will be included in the metadata and to foresee possible challenges when sharing large amounts of data between different institutions. It is also a key element in maintaining research continuity, i.e. resilience to changes in team composition.
You are not left on your own to write your DMP: websites such as dmp.opidor.fr can help you create the DMP (together with collaborators) and maintain it.
Note
The DMP is not a fixed, once-written-never-touched-again file, but a living document that may evolve with the project. Of course, the earlier it is prepared, the better, but there is no shame in going back and changing some aspects of it.
Case Study: DMP for data recorded at NFS
As a case study, I present a DMP for experimental data recorded at NFS: the DMP of the project “(n, n’g) measurements at NFS” (an account is needed to access it online).
A PDF version can be accessed here.
Footnotes