Scientists don't test their code

First published: September 16, 2020
Last updated: February 13, 2023

Seriously, we don't. Go ahead and look at your favorite scientist written code base, used by hundreds or submitted with a paper and never intended to be shared. Python, Matlab, or R, it doesn't matter. I can almost guarantee you that 95%+ of them will have no tests directory or automated/unit testing setup at all (let alone an option to automatically install it with a package manager like pip or conda for python).

I'm not a software engineer!

Increasingly, a lot of big, important, used-by-tons-of-people software systems are built by informally trained scientists in their spare time. I think this is a genuinely good thing. We are taking control of our research in ways that were not possible before, and creating tools that we actually need and will use rather than settling for expensive garbage.

However, there are serious downsides if we don't start emphasizing better programming practices from the beginning. In October 2019, a Python script used by 100s of published scientific papers was found to produce different results on different operating systems (see also: Ars, Nature Communications, ACS (original source) – but ACS sucks almost as much as Elsiver, so you won't be able to read it), calling all those results into question.

What was the error? Well it involved an untested assumption about file sorting/IO. In short, the authors expected files to be sorted when they were read by Python's glob method before processing, as they always were on their own machines. Unrelated OS updates since the publication of the script invalidated this assumption, leading to problems. Who could have seen that coming? Well, unfortunately, the documentation for the function itself said that you couldn't depend on or expect that behavior due to OS differences:

"…although results are returned in arbitrary order." - Python2 glob docs "Whether or not the results are sorted depends on the file system." - Python3 glob docs

While this story gained some (justified) attention, it is a totally understandable mistake for a non-professional (or a professional) to make. It would also be hard to write proper tests for (which is what I am advocating for here) if you didn't read that part of the docs. I have no doubt that countless more of these errors exist, with many incorporated into projects used by hundreds or more. I have no doubt I have made mistakes just like this in the past as well. Despite this, I still think it is good and important for scientists to write their own code. But how do we combat this issue and write better code?

Every software engineer knows that testing code is vital to creating good software. Software tests help you confirm (and discover) assumptions, they make things easier to change by providing a ground truth to prevent regressions or unintended consequences, and they (hopefully) help tease out errors to make your system more resilient and correct. To write a good test, engineers think like scientists:

What does this expects-a-non-empty-list function do if I provide it an empty list?
What happens to this do-some-math function if I pass it nothing at all?
What if I didn't verify input and I pass poorly formed HTML to a script that processes HTML?

Asking questions like this should sound familiar to researchers. We are in the business of developing a hypothesis and designing an experiment to test that hypothesis. We are also increasingly writing more and more code to do complex analysis and run our instruments, and we are sharing this code with others. You'd think we would flock to software testing, as that hypothesis-design-test cycle is extremely similar to what we do every day.

But for the most part, we don't. Why?

For most of us, programming is like that language you 'learned' over two years in high school and suddenly, you desperately to speak it every day. We are mostly self taught (Wilson et al. 2014), and we kind-of-know it, but not really. Many of us started programming out of necessity. Our new lab used homemade spike-counting/microscope running code so we took a few online classes, then tried our best in the lab. How were we supposed to know about 'best practices' for software development?

I have never seen testing mentioned in any tutorial aimed at scientists or beginners, so you can't really blame us. There are excellent articles documenting the rise of the scientist programmer (Freeman 2015, Wilson et al. 2014), with many even providing lists of "best practices" for the working scientist (see especially Wilson et al. 2014, Taschuk and Wilson 2017, and Baxter et al. 2006). But when it comes to actually implementing those best practices… nothing. I don't think scientists don't want to write tests, I just think we don't know how and no one has taken the time to show us. When I am writing a script to analyze patch clamp data, a "testing your python web app" article isn't going to help me.

How can we start to fix this situation?

I think adopting two practices will have the greatest impact:

Automated testing (from the beginning)
Automated deployment (can you install and try this on another machine)

I'll give examples for what I have learned about doing this for my own work in a few other posts I will link to here. But I would really like to see more software engineers and programming teachers prioritize this sort of lesson early and often. It isn't difficult, it just isn't covered.

Scientists don't test their code

I'm not a software engineer!

Testing and designing for sharing from the start