Specification¶
We represent a computation as a directed acyclic graph of tasks with data dependencies. Dask is a specification to encode such a graph using ordinary Python data structures, namely dicts, tuples, functions, and arbitrary Python values.
Definitions¶
A dask graph is a dictionary mapping data-keys to values or tasks.
{'x': 1,
'y': 2,
'z': (add, 'x', 'y'),
'w': (sum, ['x', 'y', 'z'])}
A key is any hashable value that is not a task.
'x'
('x', 2, 3)
A task is a tuple with a callable first element. Tasks represent atomic units of work meant to be run by a single worker.
(add, 'x', 'y')
We represent a task as a tuple such that the first element is a callable function (like add), and the succeeding elements are arguments for that function.
An argument may be one of the following:
- Any key present in the dask like 'x'
- Any other value like 1, to be interpreted literally
- Other tasks like (inc, 'x')
- List of arguments, like [1, 'x', (inc, 'x')]
So all of the following are valid tasks
(add, 1, 2)
(add, 'x', 2)
(add, (inc, 'x'), 2)
(sum, [1, 2])
(sum, ['x', (inc, 'x')])
(np.dot, np.array([...]), np.array([...]))
To encode keyword arguments we recommend the use of functools.partial or toolz.curry.
What functions should expect¶
In cases like (add, 'x', 'y') functions like add receive concrete values instead of keys. A dask scheduler replaces keys (like 'x' and 'y') with their computed values (like 1, and 2) before calling the add function.
If the argument is a list then a function should expect an Iterator of concrete values.
Entry Point - The get function¶
The get function serves as entry point to computation. This function gets the value associated to the given key. That key may refer to stored data as is the case with 'x' or a task as is the case with 'z'. In the latter case get should perform all necessary computation to retrieve the computed value.
>>> dsk = {'x': 1,
... 'y': 2,
... 'z': (add, 'x', 'y'),
... 'w': (sum, ['x', 'y', 'z'])}
>>> get(dsk, 'x')
1
>>> get(dsk, 'z')
3
>>> get(dsk, 'w')
6
Additionally if given a list get should simultaneously acquire values for multiple keys
>>> get(dsk, ['x', 'y', 'z'])
[1, 2, 3]
Because we accept lists of keys as keys we support nested lists.
>>> get(dsk, [['x', 'y'], ['z', 'w']])
[[1, 2], [3, 6]]
Internally get can be arbitrarily complex, calling out to distributed computing, using caches, etc..
Why use tuples¶
With (add, 'x', 'y') we wish to encode “the result of calling add on the values corresponding to the keys 'x' and 'y'.
We intend the following meaning:
add('x', 'y') # after x and y have been replaced
But this will err because Python executes the function immediately before we know values for 'x' and 'y'.
We delay the execution by moving the opening parenthesis one term to the left, creating a tuple.
Before: add( 'x', 'y')
After: (add, 'x', 'y')
This lets us store the desired computation as data that we can analyze using other Python code rather than cause immediate execution.
LISP users will identify this as an s-expression or as a rudimentary form of quoting.