This post describes a tool I wrote to solve an annoying problem I ran into in my PhD research when processing (relatively) large stacks of images in MATLAB.

To get the newmatic package or see an example, check out:

TLDR

newmatic provides a bit of extra control over data formatting, and the results are pretty nice. The example included in the package gives a whiff of the potential performance benefits. The table below compares timings for complete (i.e., whole-variable) and partial IO using the native matfile function and my newmatic replacement.

                        write time (s)   read time (s)   file size (MB)
  native, complete                4.83            1.18            95.53
  newmatic, complete              4.84            1.19            95.53
  native, partial                87.72            5.93           116.33
  newmatic, partial               4.87            0.24            95.82

Using newmatic makes partial access roughly the same speed as reading/writing whole variables, and does not have a significant impact on file size. For this specific example, setting a sane chunk size yields roughly an 18x speedup for partial writes and a 25x speedup for partial reads.
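
For the curious, the partial-IO timings boil down to loops like the following. This is only a sketch of the benchmark's shape, with stand-in sizes and file names, not the packaged example itself; the newmatic rows differ only in how the file is created:

% time per-page writes and reads through a matfile handle
nr = 2000; nc = 1000; np = 50;

m = matfile('native.mat', 'Writable', true);
m.images = zeros(nr, nc, np, 'uint8');  % preallocate; MATLAB picks the chunking

tic;
for k = 1:np
    m.images(:, :, k) = randi(255, nr, nc, 'uint8');  % partial write, one page
end
fprintf('partial write: %.2f s\n', toc);

tic;
for k = 1:np
    page = m.images(:, :, k);  % partial read, one page
end
fprintf('partial read: %.2f s\n', toc);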

A Chunky Problem

I had a directory full of images to analyze, and wanted to save the results to a single file with nicely defined coordinates. Given the size of the dataset, partial reading/writing was essential as well. NetCDF is a natural fit for this task, but my colleagues found it to be an unfriendly format, and strongly preferred MAT-files.

This seemed fine, since MAT-files are HDF5-formatted under the hood, and MATLAB supports partial reading/writing via the matfile function. Unfortunately, performance was truly terrible, and processing ground to a halt waiting for IO.

Peeking at the MAT-files with h5dump -H -p, it was clear that the problem was that MATLAB had chosen a pathologically bad chunk size for my use case. For the uninitiated, HDF5 files can store arrays in arbitrary chunks, so that parts of the (compressed) data can be accessed without reading in the whole array (see https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html).
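
If you would rather not leave MATLAB, h5info reports the same layout information; here is a quick check, with a stand-in path for your own MAT-file:

info = h5info('images.mat', '/images');
disp(info.ChunkSize)  % chunk dimensions, listed in HDF5 (row-major) order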

As an example, here is the h5dump -H -p output for a MAT-file created by MATLAB containing a 2000 x 1000 x 500 array (one “page” for each grayscale image):

HDF5 "/tmp/tp7e17633a_4bbd_4dd1_aa26_024b9dc7cc18.mat" {
GROUP "/" {
   DATASET "images" {
      DATATYPE  H5T_STD_U8LE
      DATASPACE  SIMPLE { ( 500, 1000, 2000 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 500, 125, 1 )
         SIZE 296 (3378378.378:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 3 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "MATLAB_class" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
   }
}
}

Note the array size ( 500, 1000, 2000 ) and the chunking CHUNKED ( 500, 125, 1 ). The dimensions are listed in reverse in the HDF5 file (HDF5 is row-major where MATLAB is column-major), so the dataspace reads pages, columns, rows. Each chunk therefore spans all pages, 125 columns, and a single row, which means that reading even a single page touches all 8 x 2000 = 16,000 chunks in the file. This is hard work for the ol’ laptop and shouldn’t have to be.
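
In matfile terms, the access pattern that hurts is just a single-page read; with the example array above:

m = matfile('images.mat');
page = m.images(:, :, 25);  % one page out of 500, but every chunk gets read and decompressed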

A newmatic Solution

Since MATLAB does not provide any control over chunk size, I wrote newmatic to do just that.

There are two key functions in the package:

  • newmatic_variable is used to define the name, data type, array size, and chunk size for variables created by newmatic
  • newmatic creates a MAT-file with the specified variables

Here is a small example creating a MAT-file with two chunked arrays defined through newmatic, plus a cell array added afterwards the usual way.

% declare the file and its chunked variables up front:
% newmatic_variable(name, type, array size, chunk size)
mat = newmatic(...
    'my-mat-file.mat', ...
    newmatic_variable('x', 'double', [1000, 1000, 20], [1000, 1000, 1]), ...
    newmatic_variable('y', 'double', [1000, 10, 10], [500, 10, 10]));
% the returned handle is a writable matfile object, so other variables
% can still be added the usual way
mat.z = {'a', 'cell', 'array'};

Under the hood, newmatic works in two steps:

  1. Use MATLAB itself to create the original file. This is a shortcut to getting all the ancillary data MATLAB expects in the HDF5 file, so that MATLAB can read the result when we are done.
  2. Use the low-level HDF5 library tools packaged with MATLAB to create a new file with the same formatting, but with the user-specified chunking (a rough sketch of this step follows).
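
For flavor, here is a minimal sketch of what step 2 looks like using MATLAB's low-level HDF5 interface. The file name, variable name, and sizes are stand-ins, and the MAT-file bookkeeping (header, MATLAB_class attribute, and friends) is omitted; newmatic carries those over from the file made in step 1:

% create a chunked, deflated HDF5 dataset with the low-level API
dims  = [2000, 1000, 50];  % array size, in MATLAB (column-major) order
chunk = [2000, 1000, 1];   % one page per chunk

fid = H5F.create('chunked.h5', 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');

% HDF5 lists dimensions slowest-varying first, so flip from MATLAB order
unlimited = H5ML.get_constant_value('H5S_UNLIMITED');
space = H5S.create_simple(3, fliplr(dims), [unlimited, unlimited, unlimited]);

dcpl = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, fliplr(chunk));  % the user-specified chunking
H5P.set_deflate(dcpl, 3);            % match MATLAB's deflate level 3

dtype = H5T.copy('H5T_NATIVE_UINT8');
dset = H5D.create(fid, 'images', dtype, space, dcpl);

H5D.close(dset); H5T.close(dtype); H5P.close(dcpl);
H5S.close(space); H5F.close(fid);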

The h5dump -H -p output for a similar file created with my newmatic tool (here a 2000 x 1000 x 50 array, with chunks specified by me) is:

HDF5 "/tmp/tpc80a91c1_a460_4b52_bb11_dafa58630ca9.mat" {
GROUP "/" {
   DATASET "images" {
      DATATYPE  H5T_STD_U8LE
      DATASPACE  SIMPLE { ( 50, 1000, 2000 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 1, 1000, 2000 )
         SIZE 437150 (228.754:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 3 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
      ATTRIBUTE "MATLAB_class" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
   }
}
}

Now our chunks are a much happier CHUNKED ( 1, 1000, 2000 ). Each page is a chunk, so the file is laid out to match the known access pattern.
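
Putting it together, writing the image stack page by page through a newmatic file is now cheap; read_image below is a hypothetical stand-in for whatever loads your data:

mat = newmatic('images.mat', ...
    newmatic_variable('images', 'uint8', [2000, 1000, 50], [2000, 1000, 1]));
for k = 1:50
    mat.images(:, :, k) = read_image(k);  % one write per page, one chunk per write
end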

This is nothing new; it is, in fact, the very reason HDF5 provides control over chunking in the first place. Hopefully, MathWorks will take notice and one fine day give us developers a bit more control over our data files.