newmatic: MAT-files optimized for partial reading and writing of large arrays
This post describes a tool I wrote to solve an annoying problem I ran into in my PhD research when processing (relatively) large stacks of images in MATLAB.
To get the `newmatic` package or see an example, check out:
TLDR
`newmatic` provides a bit of extra control over data formatting, and the results are pretty nice. The example included in the package provides a whiff of the potential performance benefits. The table below compares timings for complete (i.e., whole-variable) and partial IO for the native `matfile` and my `newmatic` functions.
| | write time (s) | read time (s) | file size (MB) |
|---|---|---|---|
| native, complete | 4.83 | 1.18 | 95.53 |
| newmatic, complete | 4.84 | 1.19 | 95.53 |
| native, partial | 87.72 | 5.93 | 116.33 |
| newmatic, partial | 4.87 | 0.24 | 95.82 |
Using `newmatic` makes partial access roughly the same speed as reading/writing whole variables, and does not have a significant impact on file size. For this specific example, setting a sane chunk size yields a ~20x speedup for partial IO.
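For context, here is a minimal sketch of the kind of benchmark behind these numbers. The file and variable names are made up, and the package's bundled example is more thorough; this just shows what "partial" (page-wise) IO means here:

```matlab
% sketch: time page-wise (partial) writes and reads with the native
% matfile API; swap in the commented newmatic version to compare
mat = matfile('bench-native.mat', 'Writable', true);
% mat = newmatic('bench-newmatic.mat', ...
%     newmatic_variable('images', 'uint8', [1000, 1000, 20], [1000, 1000, 1]));

images = randi(255, 1000, 1000, 20, 'uint8');

tic;
for k = 1:size(images, 3)
    mat.images(:, :, k) = images(:, :, k);  % partial write, one page at a time
end
fprintf('partial write: %.2f s\n', toc);

tic;
for k = 1:size(images, 3)
    page = mat.images(:, :, k);  % partial read, one page at a time
end
fprintf('partial read: %.2f s\n', toc);
```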
A Chunky Problem
I had a directory full of images to analyze, and wanted to save the results to a single file with nicely defined coordinates. Given the size of the dataset, partial reading/writing was essential as well. NetCDF is a natural fit for this task, but my colleagues found that to be an unfriendly format, and strongly preferred MAT-files.
This seemed fine, since MAT-files are HDF5-formatted under the hood, and MATLAB supports partial reading/writing. Unfortunately, performance was truly terrible, and processing ground to a halt waiting for IO.
Peeking at the MAT-files with `h5dump -H -p`, it was clear that the problem was that MATLAB had chosen a pathologically bad chunk size for my use case. For the uninitiated, HDF5 files can save arrays in arbitrary chunks, so that parts of the (compressed) data can be accessed without reading in the whole array (see the HDF Group's chunking documentation: https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html).
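To make the idea concrete, here is a small, illustrative example of controlling chunking directly with MATLAB's built-in high-level HDF5 functions (the file and dataset names are made up):

```matlab
% create a chunked, compressed HDF5 dataset with one 2000x1000 page per chunk
h5create('demo.h5', '/images', [2000, 1000, 50], ...
    'Datatype', 'uint8', 'ChunkSize', [2000, 1000, 1], 'Deflate', 3);
h5write('demo.h5', '/images', randi(255, 2000, 1000, 50, 'uint8'));

% reading a single page now only touches (and decompresses) one chunk
page = h5read('demo.h5', '/images', [1, 1, 25], [2000, 1000, 1]);
```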
As an example, here is the `h5dump -H -p` output for a MAT-file created by MATLAB containing a 2000 x 1000 x 500 array (one "page" for each grayscale image):
HDF5 "/tmp/tp7e17633a_4bbd_4dd1_aa26_024b9dc7cc18.mat" {
GROUP "/" {
DATASET "images" {
DATATYPE H5T_STD_U8LE
DATASPACE SIMPLE { ( 500, 1000, 2000 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
STORAGE_LAYOUT {
CHUNKED ( 500, 125, 1 )
SIZE 296 (3378378.378:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 3 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
ATTRIBUTE "MATLAB_class" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
}
}
}
Note the array size `( 500, 1000, 2000 )` and the chunking `CHUNKED ( 500, 125, 1 )`. The axes are reordered in the HDF5 file to be pages, columns, rows to keep the data contiguous, so each chunk spans all 500 pages, 125 columns, and a single row. That layout works out to 8 x 2000 = 16,000 chunks, and since every chunk spans all the pages, reading a single page means accessing (and decompressing) all 16,000 of them. This is hard work for the ol' laptop, and it shouldn't have to be.
A newmatic Solution
Since MATLAB does not provide any control over chunk size, I wrote `newmatic` to do just that.
There are two key functions in the package:

- `newmatic_variable` is used to define the name, data type, array size, and chunk size for variables created by `newmatic`
- `newmatic` creates a MAT-file with the specified variables
Here is a small example demonstrating how to create a MAT-file with two arrays (created by `newmatic`) and a cell array (not created by `newmatic`):
```matlab
mat = newmatic(...
    'my-mat-file.mat', ...
    newmatic_variable('x', 'double', [1000, 1000, 20], [1000, 1000, 1]), ...
    newmatic_variable('y', 'double', [1000, 10, 10], [5000, 10, 10]));
mat.z = {'a', 'cell', 'array'};
```
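The chunk sizes are chosen to match the expected access pattern: `x`, for example, is stored as one 1000 x 1000 page per chunk, so page-wise reads and writes each touch only a single chunk.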
The way it works is to:

- Use MATLAB to create the original file. This is a shortcut to getting all the ancillary data MATLAB expects in the HDF5 file, so that MATLAB can read it when we are done.
- Use the low-level HDF5 library tools packaged with MATLAB to create a new file with the same formatting, but with the user-specified chunking (sketched below).
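Here is a rough sketch of that second step using MATLAB's low-level HDF5 interface. To be clear, this is my illustration rather than newmatic's actual source: the file names and chunk shape are invented, and real code must also carry over MATLAB's ancillary metadata.

```matlab
% sketch: copy dataset '/images' out of a MATLAB-written MAT-file into a
% second file with user-chosen chunking, via the low-level HDF5 API
% shipped with MATLAB; assumes the destination was created by MATLAB in
% step 1 and does not yet contain '/images'
src = H5F.open('matlab-made.mat', 'H5F_ACC_RDONLY', 'H5P_DEFAULT');
dset = H5D.open(src, '/images');
space = H5D.get_space(dset);
dtype = H5D.get_type(dset);

% dataset creation property list with our chunking; the low-level API
% expects dims in C (row-major) order, hence the fliplr
dcpl = H5P.create('H5P_DATASET_CREATE');
chunks = [2000, 1000, 1];  % MATLAB order: all rows, all columns, one page
H5P.set_chunk(dcpl, fliplr(chunks));
H5P.set_deflate(dcpl, 3);  % keep MATLAB's default gzip level

dst = H5F.open('newmatic-made.mat', 'H5F_ACC_RDWR', 'H5P_DEFAULT');
out = H5D.create(dst, '/images', dtype, space, dcpl);

% for brevity, copy the data in one shot; a chunk-by-chunk copy would
% keep memory bounded for truly large arrays
H5D.write(out, 'H5ML_DEFAULT', 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', ...
    H5D.read(dset));

% real code must also reproduce MATLAB's ancillary attributes (e.g. the
% MATLAB_class attribute) so that MATLAB still recognizes the variable
H5T.close(dtype); H5S.close(space); H5P.close(dcpl);
H5D.close(out); H5D.close(dset); H5F.close(dst); H5F.close(src);
```

Starting from a MATLAB-created file is what keeps the result readable by `matfile` afterward.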
The `h5dump -H -p` output for a similar file created with my `newmatic` tool (with chunks specified by me) is:
HDF5 "/tmp/tpc80a91c1_a460_4b52_bb11_dafa58630ca9.mat" {
GROUP "/" {
DATASET "images" {
DATATYPE H5T_STD_U8LE
DATASPACE SIMPLE { ( 50, 1000, 2000 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
STORAGE_LAYOUT {
CHUNKED ( 1, 1000, 2000 )
SIZE 437150 (228.754:1 COMPRESSION)
}
FILTERS {
COMPRESSION DEFLATE { LEVEL 3 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
ATTRIBUTE "MATLAB_class" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
}
}
}
Now our chunks are a much happier `CHUNKED ( 1, 1000, 2000 )`. Each page is a chunk, so the file is laid out to match the known access pattern.
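To close the loop, page-wise access through `matfile` now lines up with the chunk layout (a sketch with invented file and variable names):

```matlab
% each page-wise operation now touches exactly one chunk
mat = matfile('images-rechunked.mat', 'Writable', true);
new_page = zeros(2000, 1000, 'uint8');
mat.images(:, :, 7) = new_page;  % rewrites a single chunk
page = mat.images(:, :, 7);      % decompresses a single chunk
```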
This is nothing new; it is, in fact, the very reason HDF5 provides control over chunking. Hopefully, MathWorks will take notice and one fine day give us developers a bit more control over our data files.