Open Source
In computer programming, the principles of Open Source have been around for a long time [1]. In some ways, Open Science is just an extension of the concept to all research products (papers, data, …). In the context of physics research, Open Source means providing full access to the code of the software used for data analysis.
It’s important to realize that the analysis code is as much a part of the scientific process leading to the results as any theory, experimental setup, parameters, … However, many researchers (at least in the nuclear physics community) are self-taught in writing analysis code, and spend a significant amount of time creating their own, very application-specific software [2]. Publishing code as Open Source is a way to ensure code quality, strengthen the software's capabilities, and allow researchers to reuse parts of the code written by others, so that they spend less time and energy re-creating software components. More importantly, publishing analysis code is a way to demonstrate technique (and get ownership over it, just as mentioned before), show your skills (this is important for PhD students and postdocs) and build trust in your results.
Speaking for myself, if I receive, as a referee, a paper to review that claims to use a novel or original analysis code, I will tend to ask to see it, at least to validate that it is written in a way that avoids mistakes, to check that it does what it’s supposed to do, …
Writing Open Source code
One major thing to remember when writing Open Source code is that it’s very likely that someone else will run your code [3]. This will have consequences on how the code is written, documented and packaged [2].
In particular, when writing code that is intended to be published as Open Source, one has to take care of:
Using a versioning, sharing, … system such as git to keep track of the code, as well as providing remote hosting and a distribution page via a GitLab or GitHub instance [4].
Separating external libraries and dependencies (some of which may not be Open, or may be distributed under a different license), while giving clear instructions on how to install them so that your code functions correctly.
Writing the code in a readable way (no funny variable names like blabla) and commenting it, so people can understand its inner workings and might even change it [5].
Organizing the directory containing the code, with parameter files, test scripts, … in separate folders [6].
Testing! Test the code before publication, and also provide a test procedure to ensure the software works as intended once installed on a different computer.
Providing the information needed to install, test and run the code.
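The "test procedure" point above can be as small as a script that runs a function on known inputs and compares the results with reference values. Here is a minimal sketch in Python; the function calibrate_energy, its gain and offset, and the reference values are all hypothetical and only stand in for a real analysis step.

```python
import math


def calibrate_energy(channel, gain=0.5, offset=2.0):
    """Convert an ADC channel number to an energy in keV.

    Hypothetical linear calibration, used only to have something to test.
    """
    return gain * channel + offset


def run_self_test():
    """Compare the function output with reference values computed by hand."""
    reference = {0: 2.0, 100: 52.0, 2000: 1002.0}
    for channel, expected in reference.items():
        result = calibrate_energy(channel)
        if not math.isclose(result, expected, rel_tol=1e-9):
            raise AssertionError(
                f"channel {channel}: got {result}, expected {expected}"
            )
    return True


if __name__ == "__main__":
    run_self_test()
    print("self-test passed")
```

Shipping such a script with the code gives future users a one-command check that the installation on their computer behaves like the one on yours.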
Many classes (online or otherwise) can be found to learn the best practices in programming (for example [7]).
Source code publication
Different options exist for publishing source code:
- General repositories that accept codes
General repositories, such as HAL or Zenodo, can receive software and store it. As such repositories are maintained by institutions, they are the best option for publishing your Open Source code. However, they are not ideal for future users who want to retrieve the code and work with it.
- Git repositories
Git repositories are among the best options for people to explore your code, download it and run it on their own. There are several institution-supported servers (gitlab.in2p3.fr, git.unistra.fr) and some private / commercial platforms (the most famous being Github.com).
- Package manager repositories
For programming languages such as Python, go [8], rust [9], node.js [10], … public repositories exist that allow a speedy and clean installation of a library, with all the necessary dependency checks. The version of the code uploaded to these platforms is usually a specially packaged one, which may not include all the metadata files, test information, …
Thankfully, all these solutions are generally compatible with one another. One can develop a python package and host the code on gitlab.in2p3.fr, publish the code on HAL and make it available via Pypi.org [11].
(That’s what I have done, for example, for faster, a library to read Faster files with python, published via HAL with the code hosted on the IN2P3 gitlab and available on pypi.org).
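On the user side, the dependency information that comes with an installed package can be queried from Python itself. The following is a small sketch, assuming a hypothetical list of required distributions (REQUIRED), that reports which of them are missing from the current environment. It only uses importlib.metadata from the standard library (Python 3.8 or later).

```python
from importlib import metadata

# Hypothetical list of distributions an analysis code depends on.
REQUIRED = ["numpy", "matplotlib"]


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def missing_dependencies(names):
    """Return the names that are not installed in the current environment."""
    return [name for name in names if installed_version(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies(REQUIRED)
    if missing:
        print("Missing packages, install them with pip:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

A check like this can be run at the start of an analysis script so that a missing library is reported clearly instead of failing halfway through.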
Following best practices from software development methods is a very good way to get code that is clean and ready for publication.
Virtual Environment and Containers
To help the code run smoothly on different computers, platforms, … it is better to rely on standard libraries, virtual environments and/or containers to embed the software in a controlled environment that is frozen in place and will stay in a running state.
- CERN Linux images
CERN provides a series of standard Linux distributions on a dedicated webpage [12], which can be used as a base for developing and distributing your software.
- Python Virtual environment
Python provides a venv module to create a virtual environment, which, in conjunction with pip, makes it possible to have a standard execution setup for python code. Packaged Python distributions such as Anaconda [13] extend the concept of virtual environment further, but are less universal than the standard venv.
- Containers
Finally, running “containers” with Docker or Singularity is the best way to provide a controlled environment that can be ported almost anywhere. (The CERN Linux distributions mentioned earlier also come in container format.) For example, the current document is compiled with Sphinx using a Docker image. This ensures that it is compiled exactly the same way on a different computer [14]. The Open Container Initiative [15] format aims at making the container content available in a standard and open format. Docker can export built images as OCI tarballs. There is overall a great deal of compatibility between Docker and other container runners like Singularity/Apptainer.
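Coming back to the lightest of these options, the venv module mentioned above can also be driven from a script, which is handy in an installation or test procedure. The sketch below creates an environment in a temporary directory and returns the path of its pyvenv.cfg file. The helper name create_environment and the directory name env are hypothetical; venv.create and its with_pip argument are part of the standard library.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(target_dir):
    """Create a virtual environment in target_dir and return its config file path."""
    # with_pip=False keeps the sketch fast; set it to True to bootstrap pip as well.
    venv.create(target_dir, with_pip=False)
    return Path(target_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_environment(Path(tmp) / "env")
        # pyvenv.cfg records which Python installation the environment is based on.
        print(cfg.read_text())
```

In day-to-day use the same result is usually obtained from the command line with python -m venv, followed by pip to install the dependencies inside the environment.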
Footnotes