A simple, extensible organization structure for scientific data

First published: May 27, 2021
Last updated: February 13, 2023

I recently defended my PhD. Along the way, I generated a lot of data from many different kinds of experiments. The most common tool I used was microscopy, and microscope images are (relatively) big. I often work on a laptop and don't want to keep hundreds of gigabytes of images on my personal hard drive, but developing an image analysis pipeline by pulling images down from a remote server is often too slow. At the time, my university didn't have a great standard data storage solution: if you wanted to store images on their system, you had to download them again to analyze anything.

For this reason, most of us just used external hard drives to store and back up data while we were gathering it and performing analysis. We would typically back up those drives on an external server as well (when we thought of it). I also typically have many experiments for different projects going at the same time, and sometimes other students or research assistants will be gathering data for me as well. Having many external hard drives, many different people, and a bunch of different projects going at once is a recipe for organizational disaster. If your methods aren't standardized, you won't be able to find the data you need a month from now. Trust me, I've tried.

How do you keep data organized when it is collected by multiple people on different tools for different projects?

Simple organization for long-running projects ¶

I did my PhD in two very collaborative labs focusing on both cellular and systems neuroscience. Most of the data analysis work that I do involves combining data from different experimental modalities (electrophysiology, genetics, and microscopy/image analysis) so that I can analyze it and present it in a meaningful way. When I first wrote this post (Spring 2021), after ~5 years of work, I was getting ready to defend my thesis. Looking back, I'd describe most research projects as long-term, ever-changing, and poorly specified. Things change often, and sometimes you have to use data gathered months or years earlier, often by different people, when putting together a publication or a new analysis pipeline. How do people typically organize this data?

A basic directory structure ¶

Surprisingly, there are some nice articles and editorials on organizing research projects. The journal PLoS Computational Biology has published a few Ten Simple Rules editorials on the topic, including creating a good data management plan ¹ and digital data storage ². Greg Wilson (co-founder of Software Carpentry) and colleagues have written a few articles on good practices for scientific computing ³ ⁴ and building robust software ⁵. These articles give you some best practices and concrete plans, but the paper that influenced me the most when I was exploring how to manage my project was William Noble's piece on Organizing Computational Biology Projects ⁶. These approaches work very well when your data fits on your local computer or hard drive, but they become much harder to apply when it does not. Still, I like the idea of a standard, flexible directory structure, so I adopted that idea going forward.

Dates and a simple directory structure ¶

My project involves a number of different types of microscopes (confocal, widefield, multiphoton, stimulated emission depletion (STED), etc.), each with its own (often proprietary) file format. As I mentioned, I also work with professional research assistants and fellow students who will sometimes take images for me and whom I may need to contact with questions in the future.

How do you stay organized when you have multiple people taking data on different days on different microscopes for the same project?

Early on, I developed the following directory structure for microscopy experiments on my external hard drive (mirrored on the backup server):

mnc
├── cdb_nikon
├── core_olympus
├── macklin_leica
├── cdb_leica
...

The base directory is the name of the project that the data is being taken for. In this case it is my main project, called mnc. Under the main mnc directory is a separate directory for each microscope. So cdb_nikon and cdb_leica stand for the Department of Cell and Developmental Biology's Nikon and Leica microscopes, core_olympus means the Light Microscopy Core's Olympus microscope, and macklin_leica is the Macklin Lab's Leica microscope.

Under each microscope is a directory named with the ISO 8601 date on which the images were taken. Dates are the single most important part of any organization structure. Scientists keep lab notebooks with (hopefully) good notes for individual experiments. Using dates, you can reference your notebooks for context on the experiments and what was going on at the time.
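As a rough illustration, the project / microscope / date layout could be created with a few lines of Python. This is only a sketch, not the script I actually used: the function name `day_directory` is made up for this example, and it assumes the extended ISO 8601 form (YYYY-MM-DD) for the date directory.

```python
from datetime import date
from pathlib import Path


def day_directory(root: Path, project: str, microscope: str, day: date) -> Path:
    """Return <root>/<project>/<microscope>/<YYYY-MM-DD>, creating it if needed.

    Hypothetical helper for illustration; the names are not from the post.
    """
    # date.isoformat() gives the ISO 8601 calendar form (e.g. 2021-05-27),
    # so an alphabetical directory listing is also a chronological one.
    target = root / project / microscope / day.isoformat()
    # parents=True builds the project and microscope levels on first use;
    # exist_ok=True makes a second imaging session on the same day harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling something like `day_directory(Path("/path/to/drive"), "mnc", "cdb_nikon", date.today())` before an imaging session would give everyone collecting data the same place to save into, whichever drive they are using.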

Readable metadata with the data ¶

In a perfect world, imaging experiments are done by you soon after the sample preparation. However, eventually you have to write up your results for a publication, often months or years later. How do you quickly find the original data to share, or to dig up a good example image for your article? I've found the easiest way to do this is with a simple text file located in your daily experiment folder named: DATE_EXPERIMENT.txt. In this file, you want to keep notes as informative, simple, and standardized as possible. Here is an example:

Name: 
Date: 
Microscope:
Purpose:

< notes about imaging here >

Name: Who did the experiment? This is mostly so I can ask that person questions in the future.
Date: When was it done?
Microscope: On what microscope were these images taken?
Purpose: A brief description of what was being done that day and why you were taking these images. I do a lot of immunohistochemistry experiments, so I'll often write the names of the antibodies in this section.

The remainder is more free-form. I'll often take notes like "Image001 was a great example of axons projecting into myelin" or "Image004 was taken with higher zoom". When I am scrolling through to find specific experiments or images, it is enormously helpful to quickly open or preview a small text file rather than opening 10-100GB images manually.
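If you want every notes file to start from the same four fields, a small script can stamp out the template. The following is a sketch under a few assumptions: `write_notes` is a hypothetical helper, and I am reading DATE_EXPERIMENT.txt as an ISO date plus a short experiment label of your choosing.

```python
from datetime import date
from pathlib import Path

# The four header fields from the template above, then free-form notes.
TEMPLATE = """\
Name: {name}
Date: {day}
Microscope: {microscope}
Purpose: {purpose}

{notes}
"""


def write_notes(day_dir: Path, experiment: str, name: str, microscope: str,
                purpose: str, day: date, notes: str = "") -> Path:
    """Write <DATE>_<EXPERIMENT>.txt into day_dir and return its path.

    Hypothetical helper; the file naming is an assumption for this sketch.
    """
    notes_path = day_dir / f"{day.isoformat()}_{experiment}.txt"
    if notes_path.exists():
        # Never silently overwrite notes that someone has already written.
        raise FileExistsError(notes_path)
    notes_path.write_text(
        TEMPLATE.format(name=name, day=day.isoformat(), microscope=microscope,
                        purpose=purpose, notes=notes),
        encoding="utf-8",
    )
    return notes_path
```

Keeping the field names identical in every file is what makes the notes easy to search later, so it is probably worth generating the header rather than typing it by hand each time.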

Quickly finding target files with `grep` / `ripgrep` ¶

Another huge advantage of using this structure is that you can search it with common command line text search tools. This was especially useful when I was trying to find certain example images while writing my main paper and thesis. For example, I knew I took nice images of microglia at some point in the past, but I couldn't remember when I took them or even what microscope I used. I'm sure I could have dug through one of my three lab notebooks to narrow down a date, but that would be very slow and boring. I had a hard drive with all the images, and I knew that all the experiments imaging microglia were done with a protein called Iba1. So I downloaded ripgrep and ran the following command from the root of the hard drive:

rg -i iba1 -g '*.txt'

Now I had a list of all the times I mentioned Iba1, along with the context of surrounding notes in the text file (sometimes as helpful as "excellent Iba1 labeling, good example image"). This made it MUCH faster and nicer to search through terabytes' worth of imaging data and zero in on what I needed quickly, using only command line tools and text files.
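If ripgrep isn't available on a shared machine, roughly the same search can be sketched in plain Python. This is an approximation, not a replacement: `search_notes` is a hypothetical name, the term is matched literally rather than as a regular expression, and unlike ripgrep it does not skip hidden or ignored files.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def search_notes(root: Path, term: str) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for every notes line containing term.

    Case-insensitive, in the spirit of rg -i; hypothetical helper.
    """
    # re.escape treats the term as literal text, so characters such as
    # "." or "+" in an antibody name are not interpreted as regex syntax.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for notes_file in sorted(root.rglob("*.txt")):
        if not notes_file.is_file():
            continue  # skip the odd directory whose name ends in .txt
        # errors="replace" keeps one badly encoded file from ending the search.
        text = notes_file.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield notes_file, lineno, line.strip()
```

Because the date and microscope are part of each path, printing the matches from `search_notes(Path("."), "iba1")` already tells you which notebook pages and which image folder to open.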

Conclusions ¶

Managing multiple long-term projects is hard. Using a simple, flexible directory structure and text-based metadata can be enormously helpful for staying organized over the long term. Every project and problem is unique, but imposing some limited structure on the problem (without overdoing it and losing flexibility) leads to an enormous payoff over time.

¹ Michener 2015 PLoS Computational Biology https://pubmed.ncbi.nlm.nih.gov/26492633/

² Hart et al. 2016 PLoS Computational Biology https://pubmed.ncbi.nlm.nih.gov/27764088/

³ Wilson et al. 2014 PLoS Biology https://pubmed.ncbi.nlm.nih.gov/24415924/

⁴ Wilson et al. 2017 PLoS Computational Biology https://pubmed.ncbi.nlm.nih.gov/28640806/

⁵ Taschuk et al. 2017 PLoS Computational Biology https://pubmed.ncbi.nlm.nih.gov/28407023/

⁶ Noble 2009 PLoS Computational Biology https://pubmed.ncbi.nlm.nih.gov/19649301/