Changelog¶
0.16.2 / 2018-MM-DD¶
Array¶
- Update error handling when len is called with empty chunks (GH#3058) Xander Johnson
- Fixes a metadata bug with store’s return_stored option (GH#3064) John A Kirkham
- Fix a bug in optimization.fuse_slice to properly handle when first input is None (GH#3076) James Bourbeau
DataFrame¶
Bag¶
Core¶
- Change default task ordering to prefer nodes with few dependents and then many downstream dependencies (GH#3056) Matthew Rocklin
- Add color= option to visualize to color by task order (GH#3057) Matthew Rocklin
- Deprecate dask.bytes.open_text_files (GH#3077) Jim Crist
- Remove short-circuit hdfs reads handling due to maintenance costs. May be re-added in a more robust manner later (GH#3079) Jim Crist
- Add dask.base.optimize for optimizing multiple collections without computing. (GH#3071) Jim Crist
- Rename dask.optimize module to dask.optimization (GH#3071) Jim Crist
0.16.1 / 2018-01-09¶
Array¶
- Fix handling of scalar percentile values in percentile (GH#3021) James Bourbeau
- Prevent bool() coercion from calling compute (GH#2958) Albert DeFusco
- Add matmul (GH#2904) John A Kirkham
- Support N-D arrays with matmul (GH#2909) John A Kirkham
- Add vdot (GH#2910) John A Kirkham
- Explicit chunks argument for broadcast_to (GH#2943) Stephan Hoyer
- Add meshgrid (GH#2938) John A Kirkham and (GH#3001) Markus Gonser
- Preserve singleton chunks in fftshift/ifftshift (GH#2733) John A Kirkham
- Fix handling of negative indexes in vindex and raise errors for out of bounds indexes (GH#2967) Stephan Hoyer
- Add flip, flipud, fliplr (GH#2954) John A Kirkham
- Add float_power ufunc (GH#2962) (GH#2969) John A Kirkham
- Compatibility for changes to structured arrays in the upcoming NumPy 1.14 release (GH#2964) Tom Augspurger
- Add block (GH#2650) John A Kirkham
- Add frompyfunc (GH#3030) Jim Crist
- Add the return_stored option to store for chaining stored results (GH#2980) John A Kirkham
DataFrame¶
- Fixed naming bug in cumulative aggregations (GH#3037) Martijn Arts
- Fixed dd.read_csv when names is given but header is not set to None (GH#2976) Martijn Arts
- Fixed dd.read_csv so that passing instances of CategoricalDtype in dtype will result in known categoricals (GH#2997) Tom Augspurger
- Prevent bool() coercion from calling compute (GH#2958) Albert DeFusco
- DataFrame.read_sql() on an empty database table returns an empty dask dataframe (GH#2928) Apostolos Vlachopoulos
- Compatibility for reading Parquet files written by PyArrow 0.8.0 (GH#2973) Tom Augspurger
- Correctly handle the column name (df.columns.name) when reading in dd.read_parquet (GH#2973) Tom Augspurger
- Fixed dd.concat losing the index dtype when the data contained a categorical (GH#2932) Tom Augspurger
- Add dd.Series.rename (GH#3027) Jim Crist
- DataFrame.merge() now supports merging on a combination of columns and the index (GH#2960) Jon Mease
- Removed the deprecated dd.rolling* methods, in preparation for their removal in the next pandas release (GH#2995) Tom Augspurger
- Fix metadata inference bug in which single-partition series were mistakenly special cased (GH#3035) Jim Crist
- Add support for Series.str.cat (GH#3028) Jim Crist
Core¶
- Improve 32-bit compatibility (GH#2937) Matthew Rocklin
- Change task prioritization to avoid upwards branching (GH#3017) Matthew Rocklin
0.16.0 / 2017-11-17¶
This is a major release. It includes breaking changes, new protocols, and a large number of bug fixes.
Array¶
- Add atleast_1d, atleast_2d, and atleast_3d (GH#2760) (GH#2765) John A Kirkham
- Add allclose (GH#2771) John A Kirkham
- Remove random.different_seeds from Dask Array API docs (GH#2772) John A Kirkham
- Deprecate vnorm in favor of dask.array.linalg.norm (GH#2773) John A Kirkham
- Reimplement unique to be lazy (GH#2775) John A Kirkham
- Support broadcasting of Dask Arrays with 0-length dimensions (GH#2784) John A Kirkham
- Add asarray and asanyarray to Dask Array API docs (GH#2787) James Bourbeau
- Support unique’s return_* arguments (GH#2779) John A Kirkham
- Simplify _unique_internal (GH#2850) (GH#2855) John A Kirkham
- Avoid removing some getter calls in array optimizations (GH#2826) Jim Crist
DataFrame¶
- Support pyarrow in dd.to_parquet (GH#2868) Jim Crist
- Fixed DataFrame.quantile and Series.quantile returning nan when missing values are present (GH#2791) Tom Augspurger
- Fixed DataFrame.quantile losing the result .name when q is a scalar (GH#2791) Tom Augspurger
- Fixed dd.concat to return a dask.Dataframe when concatenating a single series along the columns, matching pandas’ behavior (GH#2800) James Munroe
- Fixed default inplace parameter for DataFrame.eval to match the pandas default for pandas >= 0.21.0 (GH#2838) Tom Augspurger
- Fix exception when calling DataFrame.set_index on text column where one of the partitions was empty (GH#2831) Jesse Vogt
- Do not raise exception when calling DataFrame.set_index on empty dataframe (GH#2827) Jesse Vogt
- Fixed bug in Dataframe.fillna when filling with a Series value (GH#2810) Tom Augspurger
- Deprecate old argument ordering in dd.to_parquet to better match convention of putting the dataframe first (GH#2867) Jim Crist
- df.astype(categorical_dtype) -> known categoricals (GH#2835) Jim Crist
- Test against Pandas release candidate (GH#2814) Tom Augspurger
- Add more tests for read_parquet(engine='pyarrow') (GH#2822) Uwe Korn
- Remove unnecessary map_partitions in aggregate (GH#2712) Christopher Prohm
- Fix bug calling sample on empty partitions (GH#2818) @xwang777
- Error nicely when parsing dates in read_csv (GH#2863) Jim Crist
- Cleanup handling of passing filesystem objects to PyArrow readers (GH#2527) @fjetter
- Support repartitioning even if there are no divisions (GH#2873) @Ced4
- Support reading/writing to hdfs using pyarrow in dd.to_parquet (GH#2894, GH#2881) Jim Crist
Core¶
- Allow tuples as sharedict keys (GH#2763) Matthew Rocklin
- Calling compute within a dask.distributed task defaults to distributed scheduler (GH#2762) Matthew Rocklin
- Auto-import gcsfs when gcs:// protocol is used (GH#2776) Matthew Rocklin
- Fully remove dask.async module, use dask.local instead (GH#2828) Thomas Caswell
- Compatibility with bokeh 0.12.10 (GH#2844) Tom Augspurger
- Reduce test memory usage (GH#2782) Jim Crist
- Add Dask collection interface (GH#2748) Jim Crist
- Update Dask collection interface during XArray integration (GH#2847) Matthew Rocklin
- Close resource profiler process on __exit__ (GH#2871) Jim Crist
- Fix S3 tests (GH#2875) Jim Crist
- Fix port for bokeh dashboard in docs (GH#2889) Ian Hopkinson
- Wrap Dask filesystems for PyArrow compatibility (GH#2881) Jim Crist
0.15.3 / 2017-09-24¶
Array¶
- Add masked arrays (GH#2301)
- Add *_like array creation functions (GH#2640)
- Indexing with unsigned integer array (GH#2647)
- Improved slicing with boolean arrays of different dimensions (GH#2658)
- Support literals in top and atop (GH#2661)
- Optional axis argument in cumulative functions (GH#2664)
- Improve tests on scalars with assert_eq (GH#2681)
- Fix norm keepdims (GH#2683)
- Add ptp (GH#2691)
- Add apply_along_axis (GH#2690) and apply_over_axes (GH#2702)
DataFrame¶
- Added Series.str[index] (GH#2634)
- Allow the groupby by param to handle columns and index levels (GH#2636)
- DataFrame.to_csv and Bag.to_textfiles now return the filenames to which they have written (GH#2655)
- Fix combination of partition_on and append in to_parquet (GH#2645)
- Fix for parquet file schemes (GH#2667)
- Repartition works with mixed categoricals (GH#2676)
0.15.2 / 2017-08-25¶
Array¶
- Remove spurious keys from map_overlap graph (GH#2520)
- where works with non-bool condition and scalar values (GH#2543) (GH#2549)
- Improve compress (GH#2541) (GH#2545) (GH#2555)
- Add argwhere, _nonzero, and where(cond) (GH#2539)
- Generalize vindex in dask.array to handle multi-dimensional indices (GH#2573)
- Add choose method (GH#2584)
- Split code into reorganized files (GH#2595)
- Add linalg.norm (GH#2597)
- Add diff, ediff1d (GH#2607), (GH#2609)
- Improve dtype inference and reflection (GH#2571)
DataFrame¶
0.15.1 / 2017-07-08¶
0.15.0 / 2017-06-09¶
Array¶
- Add dask.array.stats submodule (GH#2269)
- Support ufunc.outer (GH#2345)
- Optimize fancy indexing by reducing graph overhead (GH#2333) (GH#2394)
- Faster array tokenization using alternative hashes (GH#2377)
- Added the matmul @ operator (GH#2349)
- Improved coverage of the numpy.fft module (GH#2320) (GH#2322) (GH#2327) (GH#2323)
- Support NumPy’s __array_ufunc__ protocol (GH#2438)
Bag¶
0.14.2 / 2017-05-03¶
Array¶
- Add da.indices (GH#2268), da.tile (GH#2153), da.roll (GH#2135)
- Simultaneously support drop_axis and new_axis in da.map_blocks (GH#2264)
- Rechunk and concatenate work with unknown chunksizes (GH#2235) and (GH#2251)
- Support non-numpy container arrays, notably sparse arrays (GH#2234)
- Tensordot contracts over multiple axes (GH#2186)
- Allow delayed targets in da.store (GH#2181)
- Support interactions against lists and tuples (GH#2148)
- Constructor plugins for debugging (GH#2142)
- Multi-dimensional FFTs (single chunk) (GH#2116)
DataFrame¶
0.14.1 / 2017-03-22¶
Array¶
- Micro-optimize optimizations (GH#2058)
- Change slicing optimizations to avoid fusing raw numpy arrays (GH#2075) (GH#2080)
- Dask.array operations now work on numpy arrays (GH#2079)
- Reshape now works in a much broader set of cases (GH#2089)
- Support deepcopy python protocol (GH#2090)
- Allow user-provided FFT implementations in da.fft (GH#2093)
Bag¶
DataFrame¶
- Fix to_parquet with empty partitions (GH#2020)
- Optional npartitions='auto' mode in set_index (GH#2025)
- Optimize shuffle performance (GH#2032)
- Support efficient repartitioning along time windows like repartition(freq='12h') (GH#2059)
- Improve speed of categorize (GH#2010)
- Support single-row dataframe arithmetic (GH#2085)
- Automatically avoid shuffle when setting index with a sorted column (GH#2091)
- Improve integer-NA handling in read_csv (GH#2098)
0.14.0 / 2017-02-24¶
Array¶
Bag¶
DataFrame¶
- Support non-uniform categoricals (GH#1877), (GH#1930)
- Groupby cumulative reductions (GH#1909)
- DataFrame.loc indexing now supports lists (GH#1913)
- Improve multi-level groupbys (GH#1914)
- Improved HTML and string repr for DataFrames (GH#1637)
- Parquet append (GH#1940)
- Add dd.demo.daily_stock function for teaching (GH#1992)
Delayed¶
Core¶
- Improve windows path parsing in corner cases (GH#1910)
- Rename tasks when fusing (GH#1919)
- Add top level persist function (GH#1927)
- Propagate errors= keyword in byte handling (GH#1954)
- Dask.compute traverses Python collections (GH#1975)
- Structural sharing between graphs in dask.array and dask.delayed (GH#1985)
0.13.0 / 2017-01-02¶
Array¶
- Mandatory dtypes on dask.array. All operations maintain dtype information and UDF functions like map_blocks now require a dtype= keyword if it can not be inferred. (GH#1755)
- Support arrays without known shapes, such as those that arise when slicing arrays with arrays or converting dataframes to arrays (GH#1838)
- Support mutation by setting one array with another (GH#1840)
- Tree reductions for covariance and correlations. (GH#1758)
- Add SerializableLock for better use with distributed scheduling (GH#1766)
- Improved atop support (GH#1800)
- Rechunk optimization (GH#1737), (GH#1827)
DataFrame¶
- Add map_overlap for custom rolling operations (GH#1769)
- Add shift (GH#1773)
- Add Parquet support (GH#1782) (GH#1792) (GH#1810), (GH#1843), (GH#1859), (GH#1863)
- Add missing methods combine, abs, autocorr, sem, nsmallest, first, last, prod (GH#1787)
- Approximate nunique (GH#1807), (GH#1824)
- Reductions with multiple output partitions (for operations like drop_duplicates) (GH#1808), (GH#1823) (GH#1828)
- Add delitem and copy to DataFrames, increasing mutation support (GH#1858)
Delayed¶
- Changed behaviour for delayed(nout=0) and delayed(nout=1): delayed(nout=1) does not default to out=None anymore, and delayed(nout=0) is also enabled. That is, functions that return tuples of length 1 or 0 can now be handled correctly. This is especially handy when functions with a variable number of outputs are wrapped by delayed. A trivial example: delayed(lambda *args: args, nout=len(vals))(*vals)
0.12.0 / 2016-11-03¶
DataFrame¶
- Return a series when functions given to dataframe.map_partitions return scalars (GH#1515)
- Fix type size inference for series (GH#1513)
- dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (GH#1565)
- Fix head parser error in dataframe.read_csv when some lines have quotes (GH#1495)
- Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (GH#1483)
- Add dataframe.select_dtypes, which mirrors the pandas method (GH#1556)
- dataframe.read_hdf now supports reading Series (GH#1564)
- Support Pandas 0.19.0 (GH#1540)
- Implement select_dtypes (GH#1556)
- String accessor works with indexes (GH#1561)
- Add pipe method to dask.dataframe (GH#1567)
- Add indicator keyword to merge (GH#1575)
- Support Series in read_hdf (GH#1575)
- Support Categories with missing values (GH#1578)
- Support inplace operators like df.x += 1 (GH#1585)
- Str accessor passes through args and kwargs (GH#1621)
- Improved groupby support for single-machine multiprocessing scheduler (GH#1625)
- Tree reductions (GH#1663)
- Pivot tables (GH#1665)
- Add clip (GH#1667), align (GH#1668), combine_first (GH#1725), and any/all (GH#1724)
- Improved handling of divisions on dask-pandas merges (GH#1666)
- Add groupby.aggregate method (GH#1678)
- Add dd.read_table function (GH#1682)
- Improve support for multi-level columns (GH#1697) (GH#1712)
- Support 2d indexing in loc (GH#1726)
- Extend resample to include DataFrames (GH#1741)
- Support dask.array ufuncs on dask.dataframe objects (GH#1669)
Array¶
- Add information about how the dask.array chunks argument works (GH#1504)
- Fix field access with non-scalar fields in dask.array (GH#1484)
- Add concatenate= keyword to atop to concatenate chunks of contracted dimensions
- Optimized slicing performance (GH#1539) (GH#1731)
- Extend atop with concatenate= (GH#1609), new_axes= (GH#1612), and adjust_chunks= (GH#1716) keywords
- Add clip (GH#1610), swapaxes (GH#1611), round (GH#1708), repeat
- Automatically align chunks in atop-backed operations (GH#1644)
- Cull dask.arrays on slicing (GH#1709)
Bag¶
Administration¶
- Added changelog (GH#1526)
- Create new threadpool when operating from thread (GH#1487)
- Unify example documentation pages into one (GH#1520)
- Add versioneer for git-commit based versions (GH#1569)
- Pass through node_attr and edge_attr keywords in dot visualization (GH#1614)
- Add continuous testing for Windows with Appveyor (GH#1648)
- Remove use of multiprocessing.Manager (GH#1653)
- Add global optimizations keyword to compute (GH#1675)
- Micro-optimize get_dependencies (GH#1722)
0.11.0 / 2016-08-24¶
Major Points¶
DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.
The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.
Breaking Changes¶
- The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://...') instead.
- Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks
0.10.2 / 2016-07-27¶
- More Dataframe shuffles now work in distributed settings, ranging from setting-index to hash joins, to sorted joins and groupbys.
- Dask passes the full test suite when run in Python’s optimized -OO mode.
- On-disk shuffles were found to produce wrong results in some highly-concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.
- Fixed a growth of open file descriptors that occurred under large data communications
- Support ports in the --bokeh-whitelist option to dask-scheduler to better route web interface messages behind non-trivial network settings
- Some improvements to resilience to worker failure (though other known failures persist)
- You can now start an IPython kernel on any worker for improved debugging and analysis
- Improvements to dask.dataframe.read_hdf, especially when reading from multiple files and docs
0.10.0 / 2016-06-13¶
Major Changes¶
- This version drops support for Python 2.6
- Conda packages are built and served from conda-forge
- The dask.distributed executables have been renamed from dfoo to dask-foo. For example dscheduler is renamed to dask-scheduler
- Both Bag and DataFrame include a preliminary distributed shuffle.
Bag¶
- Add task-based shuffle for distributed groupbys
- Add accumulate for cumulative reductions
DataFrame¶
- Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and much more efficient.)
- Add support for new Pandas rolling API with improved communication performance on distributed systems.
- Add groupby.std/var
- Pass through S3/HDFS storage options in read_csv
- Improve categorical partitioning
- Add eval, info, isnull, notnull for dataframes
Distributed¶
- Rename executables like dscheduler to dask-scheduler
- Improve scheduler performance in the many-fast-tasks case (important for shuffling)
- Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.
- Support maximum buffer sizes in streaming queues
- Improve Windows support when using the Bokeh diagnostic web interface
- Support compression of very-large-bytestrings in protocol
- Support clean cancellation of submitted futures in Joblib interface
Other¶
- All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.
- Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.
0.9.0 / 2016-05-11¶
API Changes¶
- dask.do and dask.value have been renamed to dask.delayed
- dask.bag.from_filenames has been renamed to dask.bag.read_text
- All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')
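Protocol-prefixed paths such as 's3://bucket/keys*.csv' carry the storage backend in the path itself. The sketch below shows one way such a prefix could be separated from the rest of the path using only the Python standard library. The helper name split_protocol and the "file" default are assumptions made for illustration; this is not Dask's own implementation.

```python
from urllib.parse import urlsplit


def split_protocol(urlpath, default="file"):
    """Split 'protocol://rest' into (protocol, rest).

    Hypothetical helper for illustration only; not part of Dask.
    Paths without a protocol fall back to `default` (an assumed value).
    """
    parts = urlsplit(urlpath)
    if not parts.scheme:
        return default, urlpath
    # netloc holds the bucket or host; path holds the key, including any glob
    return parts.scheme, parts.netloc + parts.path


print(split_protocol("s3://bucket/keys*.csv"))
print(split_protocol("data/local.csv"))
```

A reader function could then look up a backend by the returned protocol name. Paths containing '?' or '#' would need extra care, since urlsplit treats those as query and fragment separators.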
Array¶
- Add support for scipy.LinearOperator
- Improve optional locking to on-disk data structures
- Change rechunk to expose the intermediate chunks
Bag¶
- Rename from_filenames to read_text
- Remove from_s3 in favor of read_text('s3://...')
DataFrame¶
- Fixed numerical stability issue for correlation and covariance
- Allow no-hash from_pandas for speedy round-trips to and from pandas objects
- Generally reengineered read_csv to be more in line with Pandas behavior
- Support fast set_index operations for sorted columns
Delayed¶
- Rename do/value to delayed
- Rename to/from_imperative to to/from_delayed
Distributed¶
- Move s3 and hdfs functionality into the dask repository
- Adaptively oversubscribe workers for very fast tasks
- Improve PyPy support
- Improve work stealing for unbalanced workers
- Scatter data efficiently with tree-scatters
Other¶
- Add lzma/xz compression support
- Raise a warning when trying to split unsplittable compression types, like gzip or bz2
- Improve hashing for single-machine shuffle operations
- Add new callback method for start state
- General performance tuning
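The warning about unsplittable compression types can be pictured with a small standard-library sketch. A gzip or bz2 stream can only be decoded from its start, so a reader that normally cuts a file into byte ranges has to fall back to a single block. The function name plan_blocks, its parameters, and the set of formats are assumptions for illustration, not Dask's actual code.

```python
import warnings

# Assumption: the formats listed here follow the examples named in the entry
# above (gzip, bz2); the real set of unsplittable formats may differ.
UNSPLITTABLE = {"gzip", "bz2"}


def plan_blocks(size, blocksize, compression=None):
    """Return (offset, length) pairs for reading `size` bytes in blocks.

    Hypothetical helper: warns and returns a single block when the
    compression format cannot be read from the middle of the stream.
    """
    if compression in UNSPLITTABLE and blocksize < size:
        warnings.warn(
            "Cannot split %s-compressed data; reading it as one block" % compression
        )
        return [(0, size)]
    return [
        (start, min(blocksize, size - start))
        for start in range(0, size, blocksize)
    ]


print(plan_blocks(100, 30))
```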
0.8.1 / 2016-03-11¶
Array¶
- Bugfix for range slicing that could periodically lead to incorrect results.
- Improved support and resiliency of arg reductions (argmin, argmax, etc.)
Bag¶
- Add zip function
DataFrame¶
- Add corr and cov functions
- Add melt function
- Bugfixes for io to bcolz and hdf5
0.8.0 / 2016-02-20¶
Array¶
- Changed default array reduction split from 32 to 4
- Linear algebra, tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.
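To make the change in the default reduction split concrete, here is a plain-Python sketch of a tree reduction. It assumes that the "split" refers to how many intermediate results are combined at each level, which the entry does not spell out. The function tree_reduce and its split_every parameter are illustrative names, not Dask's internals.

```python
def tree_reduce(combine, items, split_every=4):
    """Reduce `items` by combining groups of at most `split_every` at a time.

    Returns (result, depth), where depth is the number of combine levels.
    Illustrative sketch only; not Dask's reduction code.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    items = list(items)
    if not items:
        raise ValueError("cannot reduce an empty sequence")
    depth = 0
    while len(items) > 1:
        # Combine neighbouring groups; each pass is one level of the tree.
        items = [
            combine(items[i:i + split_every])
            for i in range(0, len(items), split_every)
        ]
        depth += 1
    return items[0], depth


chunks = range(64)
print(tree_reduce(sum, chunks, split_every=4))   # (2016, 3)
print(tree_reduce(sum, chunks, split_every=32))  # (2016, 2)
```

A smaller split gives a deeper tree in which each combine step handles fewer inputs at once, while a larger split gives a shallower tree with bigger combine steps.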
Bag¶
- Add tree reductions
- Add range function
- Drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)
DataFrame¶
- Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series
- Add Series categorical accessor, series.nunique, drop the .columns attribute for series.
- read_csv fixes (multi-column parse_dates, integer column names, etc.)
- Internal changes to improve graph serialization
Other¶
- Documentation updates
- Add from_imperative and to_imperative functions for all collections
- Aesthetic changes to profiler plots
- Moved the dask project to a new dask organization
0.7.6 / 2016-01-05¶
Array¶
- Improve thread safety
- Tree reductions
- Add view, compress, hstack, dstack, vstack methods
- map_blocks can now remove and add dimensions
DataFrame¶
- Improve thread safety
- Extend sampling to include replacement options
Imperative¶
- Removed optimization passes that fused results.
Core¶
- Removed dask.distributed
- Improved performance of blocked file reading
- Serialization improvements
- Test Python 3.5
0.7.4 / 2015-10-23¶
This was mostly a bugfix release. Some notable changes:
- Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17
- Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox
- Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues
- Change dask.get to point to dask.async.get_sync by default
- Allow visualization functions to accept general graphviz graph options like rankdir='LR'
- Add reshape and ravel to dask.array
- Support the creation of dask.arrays from dask.imperative objects
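The birthday-paradox effect behind the repeated random blocks mentioned above can be quantified with a short calculation. If each of n blocks draws its seed uniformly from N possible values, the chance that at least two blocks share a seed is 1 - (N/N)((N-1)/N)...((N-n+1)/N). The sketch below evaluates that product; the seed-space size used in the second call is an assumption for illustration, since the entry does not state how seeds were drawn.

```python
def collision_probability(n_blocks, n_seeds):
    """Probability that at least two of `n_blocks` blocks share a seed
    when each seed is drawn uniformly from `n_seeds` values."""
    p_unique = 1.0
    for i in range(n_blocks):
        p_unique *= (n_seeds - i) / n_seeds
    return 1.0 - p_unique


# Classic birthday setting: 23 draws from 365 values is just over 50%.
print(collision_probability(23, 365))

# Assumed 32-bit seed space: collisions stop being negligible
# once an array has many thousands of blocks.
print(collision_probability(10_000, 2**32))
```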
Deprecation¶
This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.
Future development in distributed computing for dask is happening here: https://distributed.readthedocs.io . General feedback on that project is most welcome from this community.
0.7.3 / 2015-09-25¶
Diagnostics¶
- A utility for profiling memory and cpu usage has been added to the dask.diagnostics module.
DataFrame¶
This release improves coverage of the pandas API. Among other things it includes nunique, nlargest, quantile. Fixes encoding issues with reading non-ascii csv files. Performance improvements and bug fixes with resample. More flexible read_hdf with globbing. And many more. Various bug fixes in dask.imperative and dask.bag.
0.7.0 / 2015-08-15¶
DataFrame¶
This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.
- New operations: query, rolling operations, drop
- Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations
Bag¶
- Fixed a bug in fold with a null default argument
Array¶
- New operations: da.fft module, da.image.imread
Infrastructure¶
- The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.
- All collections (Array, Bag, DataFrame) inherit from a common base class
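The idea of deterministic, hash-based keys can be sketched with the standard library alone. The helper make_key below is hypothetical, not the function Dask uses, and it assumes that repr() of the arguments is stable between runs.

```python
import hashlib


def make_key(name, *args):
    """Build a deterministic key from an operation name and its arguments.

    Hypothetical helper for illustration; assumes repr() of the arguments
    is stable across runs, which holds for simple values like ints and strs.
    """
    payload = repr((name, args)).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:32]
    return "%s-%s" % (name, digest)


# The same operation on the same inputs always yields the same key,
# so a cache indexed by key could recognise repeated work.
print(make_key("add", 1, 2) == make_key("add", 1, 2))
print(make_key("add", 1, 2) == make_key("add", 1, 3))
```

Such keys are longer than simple counters, matching the trade-off described above, but they stay consistent between computations.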
0.6.1 / 2015-07-23¶
Distributed¶
- Improved (though not yet sufficient) resiliency for dask.distributed when workers die
DataFrame¶
- Improved writing to various formats, including to_hdf, to_castra, and to_csv
- Improved creation of dask DataFrames from dask Arrays and Bags
- Improved support for categoricals and various other methods
Array¶
- Various bug fixes
- Histogram function
Scheduling¶
- Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results
Other¶
- Added the dask.do function for explicit construction of graphs with normal python code
- Traded pydot for graphviz library for graph printing to support Python3
- There is also a gitter chat room and a stackoverflow tag