Nick George
Science Programming

Working with Ilastik HDF5 files using python and h5py

First published: May 21, 2020
Last updated: July 12, 2021

Ilastik is an excellent image analysis suite that makes pixel and object classification easy. Ilastik recommends HDF5 as the input/output file format for segmentation images, probability maps, and measurement results, but the documentation on how the results are stored in this format is somewhat lacking (I made a small change to the documentation to clear some of this up).

I will focus on reading and interacting with the output here, as Ilastik provides extensive tutorials on pixel and object classification.

UPDATE: I added a quick section on this subject to the Ilastik documentation. See the object classification workflow for more on this.

Stuff you need

  • Python 3 (3.6 or later, for the f-strings used below)

  • numpy (pip install numpy)

  • h5py (pip install h5py)

  • Ilastik segmentation results and table results as HDF5 files

Opening and inspecting HDF5 files

An HDF5 file is organized like a platform-agnostic filesystem: groups play the role of directories and datasets play the role of files. A file can contain many different types of data and metadata, but the organization is always the same. The top level is the root group, and you can query the HDF5 file to find what it contains like so:

  import h5py as h

  h5obj = h.File("/absolute/path/to/h5file/caspr.h5", "r")
  print(h5obj.keys())
  # <KeysViewHDF5 ['caspr']>

You can treat these h5py file objects much like Python dictionaries:

  for k,v in h5obj.items():
      print(f"\nkey: {k}\n value is {v}")

Methods such as keys(), values(), items(), and get() all work as they would with Python dictionaries.
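To make the filesystem analogy concrete, here is a minimal sketch that builds a small stand-in file and walks through it. The file name demo.h5, the group name measurements, and the toy arrays are made up for illustration; only the dataset name caspr comes from my own file. A real Ilastik export will be much larger and may be laid out differently.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file (hypothetical names, not a real Ilastik export).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("caspr", data=np.zeros((2, 4, 4, 1), dtype="uint8"))
    grp = f.create_group("measurements")
    grp.create_dataset("area", data=np.arange(3))

found = []

def record(name, obj):
    # visititems calls this once for every group and dataset below the root
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))

with h5py.File(demo_path, "r") as f:
    top_level = list(f.keys())   # dictionary-style listing of the root group
    has_caspr = "caspr" in f     # membership test, as with a dict
    f.visititems(record)         # recursive walk, like listing a directory tree

print(top_level)
print(found)
```

keys() only lists the direct children of a group, while visititems() descends into nested groups, which is handy when you do not know how deep a file goes.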

Ilastik will output the results of your segmentation (segmented binary image or probability maps) as an HDF5 dataset. You can access the dataset using the dataset name (in my case, caspr, based on h5obj.keys()):

  print(h5obj['caspr'])
  # <HDF5 dataset "caspr": shape (61, 1024, 1024, 1), type "|u1">
  # same as:
  print(h5obj.get('caspr'))
  # <HDF5 dataset "caspr": shape (61, 1024, 1024, 1), type "|u1">
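A dataset object also lets you check its metadata and read only part of the data, which helps with large stacks. The sketch below uses a small random stand-in array rather than real Ilastik output; the file name and array size are assumptions for illustration. The axis order of a real export depends on your export settings, so check the shape rather than assuming it.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for an Ilastik export: a small uint8 stack of random values.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
stack = np.random.randint(0, 2, size=(3, 8, 8, 1), dtype="uint8")
with h5py.File(path, "w") as f:
    f.create_dataset("caspr", data=stack)

with h5py.File(path, "r") as f:
    dset = f["caspr"]
    shape = dset.shape                     # available without reading pixel data
    dtype = dset.dtype
    attr_names = list(dset.attrs.keys())   # metadata attached to the dataset, if any
    first_plane = dset[0]                  # reads only the first slice along axis 0

print(shape, dtype, attr_names, first_plane.shape)
```

Indexing a dataset returns a numpy array holding just the requested slice, so first_plane is still usable after the file is closed.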

HDF5 file references and getting your data

It is important to note that h5py.File() creates a reference to the file; it does not read the whole file into memory.

Doing this:

  h5obj = h.File("path/to/h5file/caspr.h5", "r")
  data = h5obj['caspr']
  print(type(data))
  # <class 'h5py._hl.dataset.Dataset'>
  h5obj.close()
  print(data)
  # <Closed HDF5 dataset>

will close the HDF5 file and any attempts to access the contents of data will fail.

To work with the data, or iterate through a lot of h5 files, you could keep the reference open the whole time you are working with it, or you can copy the data you are interested in and close the reference (ideally using a context manager).
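As a rough sketch of the second approach applied to many files, the loop below opens each file with a context manager, copies one dataset, and lets the file close again. The folder, the file names image_0.h5 and so on, and the toy arrays are all made up so the example runs on its own; only the dataset name caspr comes from my own data.

```python
from pathlib import Path
import tempfile

import h5py
import numpy as np

# Create three small stand-in files so the loop has something to read.
folder = Path(tempfile.mkdtemp())
for i in range(3):
    with h5py.File(str(folder / f"image_{i}.h5"), "w") as f:
        f.create_dataset("caspr", data=np.full((2, 4, 4, 1), i, dtype="uint8"))

results = {}
for fp in sorted(folder.glob("*.h5")):
    with h5py.File(str(fp), "r") as f:     # the file is closed when this block exits
        results[fp.name] = f["caspr"][:]   # [:] copies the data into a numpy array

for name, arr in results.items():
    print(name, arr.shape, arr.max())
```

Copying everything into memory is fine for small files. For very large stacks you may prefer to keep the file open and read one slice at a time.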

The dataset that I am interested in behaves like a numpy ndarray (at this point it is still an h5py Dataset backed by the file) and has the following shape:

  h5obj = h.File("path/to/h5file/caspr.h5", "r")
  data = h5obj['caspr']
  print(data.shape)
  # (61, 1024, 1024, 1)

I can copy it into memory as a real numpy ndarray like so:

  dataset_data = data[:]
  print(type(dataset_data))
  # <class 'numpy.ndarray'>
  print(dataset_data.shape)
  # (61, 1024, 1024, 1)

After closing the file reference, the original dataset object no longer works, but you can still access the copied data:

  h5obj.close()
  print(data)
  # <Closed HDF5 dataset>
  data[:]
  # ValueError: Not a dataset (not a dataset)
  print(dataset_data)  # prints array...
  print(dataset_data.shape)
  # (61, 1024, 1024, 1)

A simple HDF5 dataset-getter function

Now we can write a simple function which will use a context manager to open and close the h5 file and return the dataset we are interested in:

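  def get_h5_dataset(fp, dset_name):
      """Return a copy of dataset `dset_name` from the HDF5 file at `fp` as a numpy array."""
      with h.File(fp, 'r') as f:
          # fail early with a helpful message if the name is wrong
          assert dset_name in f.keys(), f"dataset {dset_name} does not exist. Datasets are: {list(f.keys())}"
          # [:] copies the data into memory so it survives closing the file
          data = f.get(dset_name)[:]
      return data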

We can use it like so:

  dataset = get_h5_dataset("/absolute/path/to/h5file/caspr.h5", "caspr")

If you mistype a dataset name, or the dataset doesn't exist, the assert statement raises an AssertionError with a useful message that lists the datasets the file does contain.
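Here is a small self-contained sketch of that failure case. It repeats the getter (importing h5py under its full name instead of h) and runs it against a made-up stand-in file, first with the correct name and then with a deliberate typo. The file name demo.h5 and the toy array are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

def get_h5_dataset(fp, dset_name):
    with h5py.File(fp, "r") as f:
        assert dset_name in f.keys(), f"dataset {dset_name} does not exist. Datasets are: {list(f.keys())}"
        data = f.get(dset_name)[:]
    return data

# Stand-in file with a single dataset called caspr.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("caspr", data=np.ones((2, 4, 4, 1), dtype="uint8"))

dataset = get_h5_dataset(path, "caspr")   # correct name: returns a numpy array
print(dataset.shape)

message = None
try:
    get_h5_dataset(path, "casper")        # deliberate typo
except AssertionError as err:
    message = str(err)
print(message)
```

One caveat: Python skips assert statements when run with the -O flag, so if the check must always run, raising a KeyError explicitly would be a sturdier choice.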