Overview

Dask arrays implement a subset of the NumPy interface on large arrays using blocked algorithms and task scheduling.

Scope

The dask.array library supports the following interface from numpy.

  • Arithmetic and scalar mathematics, +, *, exp, log, ...
  • Reductions along axes, sum(), mean(), std(), sum(axis=0), ...
  • Tensor contractions / dot products / matrix multiply, tensordot
  • Axis reordering / transpose, transpose
  • Slicing, x[:100, 500:100:-2]
  • Fancy indexing along single axes with lists or numpy arrays, x[:, [10, 1, 5]]
  • The array protocol __array__

These operations should match the NumPy interface precisely.

Construct

We can construct dask array objects from other array objects that support numpy-style slicing. Here we wrap a dask array around an HDF5 dataset, chunking that dataset into blocks of size (1000, 1000).

>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> dset = f['/data/path']

>>> import dask.array as da
>>> x = da.from_array(dset, chunks=(1000, 1000))

Often we have many such datasets. We can use the stack or concatenate functions to bind many dask arrays into one.

>>> dsets = [h5py.File(fn)['/data'] for fn in sorted(glob('myfiles.*.hdf5')]
>>> arrays = [da.from_array(dset, chunks=(1000, 1000))
                for dset in dsets]

>>> x = da.stack(arrays, axis=0)  # Stack along a new first axis

Interact

Dask copies the NumPy API for an important subset of operations, including arithmetic operators, ufuncs, slicing, dot products, and reductions.

>>> y = log(x + 1)[:5].sum(axis=1)

Store

In Memory

If your data is small you can call np.array on your dask array to turn it in to a normal NumPy array.

>>> x = da.arange(6, chunks=3)
>>> y = x**2
>>> np.array(y)
array([0, 1, 4, 9, 16, 25])

HDF5

Use the to_hdf5 function to store data into HDF5 using h5py.

>>> da.to_hdf5('myfile.hdf5', '/y', y)  # doctest: +SKIP

Store several arrays in one computation with the function da.to_hdf5 by passing in a dict.

>>> da.to_hdf5('myfile.hdf5', {'/x': x, '/y': y})  # doctest: +SKIP

Other On-Disk Storage

Alternatively you can store dask arrays in any object that supports numpy-style slice assignment like h5py.Dataset, or bcolz.carray.

>>> import bcolz  # doctest: +SKIP
>>> out = bcolz.zeros(shape=y.shape, rootdir='myfile.bcolz')  # doctest: +SKIP
>>> da.store(y, out)  # doctest: +SKIP

You can store several arrays in one computation by passing lists of sources and destinations.

>>> da.store([array1, array2], [output1, outpu2])  

On-Disk Storage

In the example above we used h5py but dask.array works equally well with pytables, bcolz, or any library that provides an array object from which we can slice out numpy arrays.

>>> x = dataset[1000:2000, :2000]  # pull out numpy array from on-disk object

This API has become a standard in Scientific Python. Dask works with any object that supports this operation and the equivalent assignment syntax.

>>> dataset[1000:2000, :2000] = x  # Store numpy array in on-disk object

Limitations

Dask.array does not implement the entire numpy interface. Users expecting this will be disappointed. Notably dask.array has the following failings:

  1. Dask does not implement all of np.linalg. This has been done by a number of excellent BLAS/LAPACK implementations and is the focus of numerous ongoing academic research projects.
  2. Dask.array does not support any operation where the resulting shape depends on the values of the array. In order to form the dask graph we must be able to infer the shape of the array before actually executing the operation. This precludes operations like indexing one dask array with another or operations like np.where.
  3. Dask.array does not attempt operations like sort which are notoriously difficult to do in parallel and are of somewhat diminished value on very large data (you rarely actually need a full sort). Often we include parallel-friendly alternatives like topk.
  4. Dask development is driven by immediate need, and so many lesser used functions have not been implemented. Community contributions are encouraged.