API

Top level user functions:

DataFrame Implements out-of-core DataFrame as a sequence of pandas DataFrames
DataFrame.add(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator add).
DataFrame.append(other) Append rows of other to the end of this frame, returning a new object.
DataFrame.apply(func[, axis, args, columns]) Parallel version of pandas.DataFrame.apply
DataFrame.assign(**kwargs) Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
DataFrame.astype(dtype) Cast object to input numpy.dtype
DataFrame.cache([cache]) Evaluate DataFrame and store in local cache
DataFrame.categorize([columns])
DataFrame.column_info Return DataFrame.columns
DataFrame.columns
DataFrame.compute(**kwargs)
DataFrame.corr([method, min_periods]) Compute pairwise correlation of columns, excluding NA/null values
DataFrame.count([axis]) Return Series with number of non-NA/null observations over requested axis.
DataFrame.cov([min_periods]) Compute pairwise covariance of columns, excluding NA/null values
DataFrame.cummax([axis, skipna]) Return cumulative maximum over requested axis.
DataFrame.cummin([axis, skipna]) Return cumulative minimum over requested axis.
DataFrame.cumprod([axis, skipna]) Return cumulative product over requested axis.
DataFrame.cumsum([axis, skipna]) Return cumulative sum over requested axis.
DataFrame.describe() Generate various summary statistics, excluding NaN values.
DataFrame.div(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.drop(labels[, axis]) Return new object with labels in requested axis removed.
DataFrame.drop_duplicates(**kwargs) Return DataFrame with duplicate rows removed, optionally only
DataFrame.dropna([how, subset]) Return object with labels on given axis omitted where alternately any
DataFrame.dtypes Return data types
DataFrame.fillna(value) Fill NA/NaN values using the specified method
DataFrame.floordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator floordiv).
DataFrame.get_division(n) Get nth division of the data
DataFrame.groupby(key, **kwargs) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
DataFrame.head([n, compute]) First n rows of the dataset
DataFrame.iloc Not implemented
DataFrame.index Return dask Index instance
DataFrame.iterrows() Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples() Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
DataFrame.join(other[, on, how, lsuffix, ...]) Join columns with other DataFrame either on index or on a key column.
DataFrame.known_divisions Whether divisions are already known
DataFrame.loc Purely label-location based indexer for selection by label.
DataFrame.map_partitions(func[, columns]) Apply Python function on each DataFrame block
DataFrame.mask(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.max([axis, skipna]) This method returns the maximum of the values in the object.
DataFrame.mean([axis, skipna]) Return the mean of the values for the requested axis
DataFrame.merge(right[, how, on, left_on, ...]) Merge DataFrame objects by performing a database-style join operation by columns or indexes.
DataFrame.min([axis, skipna]) This method returns the minimum of the values in the object.
DataFrame.mod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator mod).
DataFrame.mul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.ndim Return dimensionality
DataFrame.nlargest([n, columns]) Get the rows of a DataFrame sorted by the n largest values of columns.
DataFrame.npartitions Return number of partitions
DataFrame.pow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator pow).
DataFrame.quantile([q, axis]) Approximate row-wise and precise column-wise quantiles of DataFrame
DataFrame.query(expr, **kwargs)
DataFrame.radd(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator radd).
DataFrame.random_split(p[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
DataFrame.rdiv(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.rename([index, columns]) Alter axes input function or functions.
DataFrame.repartition([divisions, ...]) Repartition dataframe along new divisions
DataFrame.reset_index() For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc.
DataFrame.rfloordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator rfloordiv).
DataFrame.rmod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator rmod).
DataFrame.rmul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator rmul).
DataFrame.rpow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator rpow).
DataFrame.rsub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator rsub).
DataFrame.rtruediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.sample(frac[, replace, random_state]) Random sample of items
DataFrame.set_index(other[, drop, sorted]) Set the DataFrame index (row labels) using an existing column
DataFrame.set_partition(column, divisions, ...) Set explicit divisions for new column index
DataFrame.std([axis, skipna, ddof]) Return sample standard deviation over requested axis.
DataFrame.sub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator sub).
DataFrame.sum([axis, skipna]) Return the sum of the values for the requested axis
DataFrame.tail([n, compute]) Last n rows of the dataset
DataFrame.to_bag([index]) Convert to a dask Bag of tuples of each row.
DataFrame.to_castra([fn, categories, ...]) Write DataFrame to Castra on-disk store
DataFrame.to_csv(filename[, get]) Write DataFrame to a comma-separated values (csv) file
DataFrame.to_hdf(path_or_buf, key[, mode, ...]) Activate the HDFStore.
DataFrame.to_delayed() Convert dataframe into dask Values
DataFrame.truediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.var([axis, skipna, ddof]) Return unbiased variance over requested axis.
DataFrame.visualize([filename, format, ...])
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Rolling Operations

rolling.rolling_apply(arg, window, *args, ...) Generic moving function application.
rolling.rolling_chunk(func, part1, part2, ...)
rolling.rolling_count(arg, window, *args, ...) Rolling count of number of non-NaN observations inside provided window.
rolling.rolling_kurt(arg, window, *args, ...) Unbiased moving kurtosis.
rolling.rolling_max(arg, window, *args, **kwargs) Moving maximum.
rolling.rolling_mean(arg, window, *args, ...) Moving mean.
rolling.rolling_median(arg, window, *args, ...) Moving median.
rolling.rolling_min(arg, window, *args, **kwargs) Moving minimum.
rolling.rolling_quantile(arg, window, *args, ...) Moving quantile.
rolling.rolling_skew(arg, window, *args, ...) Unbiased moving skewness.
rolling.rolling_std(arg, window, *args, **kwargs) Moving standard deviation.
rolling.rolling_sum(arg, window, *args, **kwargs) Moving sum.
rolling.rolling_var(arg, window, *args, **kwargs) Moving variance.
rolling.rolling_window(arg, window, *args, ...) Applies a moving window of type window_type and size window on the data.

Create DataFrames

from_array(x[, chunksize, columns]) Read dask DataFrame from any sliceable array
from_bcolz(x[, chunksize, categorize, ...]) Read dask Dataframe from bcolz.ctable
from_castra(x[, columns]) Load a dask DataFrame from a Castra.
read_csv(filename[, blocksize, chunkbytes, ...]) Read CSV files into a Dask.DataFrame
from_dask_array(x[, columns]) Convert dask Array to dask DataFrame
from_delayed(dfs[, metadata, divisions, ...]) Create DataFrame from many dask.delayed objects
from_pandas(data[, npartitions, chunksize, ...]) Construct a dask object from a pandas object.

DataFrame Methods

class dask.dataframe.DataFrame

Implements out-of-core DataFrame as a sequence of pandas DataFrames

Parameters:

dask: dict

The dask graph to compute this DataFrame

name: str

The key prefix that specifies which keys in the dask comprise this particular DataFrame

columns: list of str

Column names. This metadata aids usability

divisions: tuple of index values

Values along which we partition our blocks on the index
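
To make the role of divisions concrete, here is a minimal pure-Python sketch of how a tuple of division values can be used to locate the partition that holds a given index value. It assumes the convention that partition i covers index values from divisions[i] up to (but not including) divisions[i + 1], with the last partition also including the final value. The helper name partition_for and the sample divisions are hypothetical and are not part of the dask API.

```python
from bisect import bisect_right

# Hypothetical divisions for a frame with three partitions.
divisions = (0, 10, 20, 30)


def partition_for(index_value, divisions):
    """Return the position of the partition whose range contains index_value."""
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(index_value)
    # bisect_right counts the division values <= index_value; the last
    # partition is clamped so that it also includes the final division value.
    return min(bisect_right(divisions, index_value) - 1, len(divisions) - 2)


print(partition_for(0, divisions))   # 0
print(partition_for(10, divisions))  # 1
print(partition_for(30, divisions))  # 2
```

Because the divisions are sorted, the lookup is a binary search, so a label-based selection only needs to touch the partitions whose ranges overlap the requested labels.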

add(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.radd

Notes

Mismatched indices will be unioned together
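
The effect of fill_value and of the index union is easiest to see on a small case. The sketch below uses plain pandas, whose add method this entry documents; the frames left and right are made-up sample data.

```python
import pandas as pd

# Two frames whose indexes only partly overlap.
left = pd.DataFrame({"A": [1.0, 2.0]}, index=["x", "y"])
right = pd.DataFrame({"A": [10.0, 20.0]}, index=["y", "z"])

# Plain operator: labels present on only one side produce NaN.
plain = left + right
print(plain)    # x: NaN, y: 12.0, z: NaN

# With fill_value, the side that lacks a label is treated as 0.
filled = left.add(right, fill_value=0)
print(filled)   # x: 1.0, y: 12.0, z: 20.0
```

In both results the index is the union of the two input indexes.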

append(other)

Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.

Parameters:

other : DataFrame or Series/dict-like object, or list of these

The data to append.

ignore_index : boolean, default False

If True, do not use the index labels.

verify_integrity : boolean, default False

If True, raise ValueError on creating index with duplicates.

Returns:

appended : DataFrame

See also

pandas.concat
General function to concatenate DataFrame, Series or Panel objects

Notes

Dask doesn’t support the following argument(s).

  • ignore_index
  • verify_integrity

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))    
>>> df    
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))    
>>> df.append(df2)    
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)    
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

apply(func, axis=0, args=(), columns='__no_default__', **kwds)

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  1. The user must specify axis=1 explicitly.
  2. The user should provide output columns.

Parameters:

func: function

Function to apply to each row (only axis=1 is supported)

axis: {0 or ‘index’, 1 or ‘columns’}, default 0

  • 0 or ‘index’: apply function to each column (NOT SUPPORTED)
  • 1 or ‘columns’: apply function to each row

columns: list, scalar or None

If a list is given, the result is a DataFrame whose columns are the specified list. Otherwise, the result is a Series whose name is the given scalar, or None (no name). If the columns keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.

args : tuple

Positional arguments to pass to function in addition to the array/series

Additional keyword arguments will be passed as keywords to the function

Returns:

applied : Series or DataFrame depending on columns keyword
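
As an illustration of the row-wise form, the following sketch uses plain pandas, which this method mimics. The function row_total and the sample data are hypothetical. The dask-specific columns keyword is left out because this is ordinary pandas code.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})


def row_total(row, offset):
    # `row` is a Series holding one row; `offset` arrives through `args`.
    return row["x"] + row["y"] + offset


# axis=1 applies the function to each row; args supplies extra positionals.
result = df.apply(row_total, axis=1, args=(100,))
print(result.tolist())  # [111, 122, 133]
```

The function returns a scalar per row, so the result is a Series. In the dask version described above, that corresponds to passing a scalar (or None) as columns.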

assign(**kwargs)

Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

New in version 0.16.0.

Parameters:

kwargs : keyword, value pairs

keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

df : DataFrame

A new DataFrame with the new columns in addition to all the existing columns.

Notes

Since kwargs is a dictionary, the order of your arguments may not be preserved. To make things predictable, the columns are inserted in alphabetical order, at the end of your DataFrame. Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.

Examples

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})    

Where the value is a callable, evaluated on df:

>>> df.assign(ln_A = lambda x: np.log(x.A))    
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585

Where the value already exists and is inserted:

>>> newcol = np.log(df['A'])    
>>> df.assign(ln_A=newcol)    
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585

astype(dtype)

Cast object to input numpy.dtype. Return a copy when copy = True (be really careful with this!)

Parameters:

dtype : numpy.dtype or Python type

raise_on_error : raise on invalid input

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

Notes

Dask doesn’t support the following argument(s).

  • copy
  • raise_on_error

cache(cache=<type 'dict'>)

Evaluate DataFrame and store in local cache

Uses chest by default to store data on disk

column_info

Return DataFrame.columns

corr(method='pearson', min_periods=None)

Compute pairwise correlation of columns, excluding NA/null values

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation

Returns:

y : DataFrame

count(axis=None)

Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame

numeric_only : boolean, default False

Include only float, int, boolean data

Returns:

count : Series (or DataFrame if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only

cov(min_periods=None)

Compute pairwise covariance of columns, excluding NA/null values

Parameters:

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result.

Returns:

y : DataFrame

Notes

y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).
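
The N-1 normalization can be checked by hand. The sketch below computes one off-diagonal entry manually and compares it with pandas' cov, which this method mirrors. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 9.0]})

n = len(df)
dev_a = df["a"] - df["a"].mean()   # deviations from the mean of a
dev_b = df["b"] - df["b"].mean()   # deviations from the mean of b

# Sum of products of deviations, divided by N-1 (unbiased estimator).
manual = (dev_a * dev_b).sum() / (n - 1)

print(manual)                      # 11.5 / 3, about 3.8333
print(df.cov().loc["a", "b"])      # same value
```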

cummax(axis=None, skipna=True)

Return cumulative maximum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : Series

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out

cummin(axis=None, skipna=True)

Return cumulative minimum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : Series

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out

cumprod(axis=None, skipna=True)

Return cumulative product over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : Series

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out

cumsum(axis=None, skipna=True)

Return cumulative sum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : Series

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out
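
The skipna parameter shared by the cumulative methods is easiest to understand on a column that contains a missing value. This is a small sketch in plain pandas, whose behaviour these methods mirror. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"v": [1.0, float("nan"), 2.0]})

# skipna=True: the NaN position stays NaN, but the running sum continues.
skipped = df.cumsum(skipna=True)["v"]
print(skipped.tolist())      # [1.0, nan, 3.0]

# skipna=False: once a NaN is met, every later value is NaN as well.
propagated = df.cumsum(skipna=False)["v"]
print(propagated.tolist())   # [1.0, nan, nan]
```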

describe()

Generate various summary statistics, excluding NaN values.

Parameters:

percentiles : array-like, optional

The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.

include, exclude : list-like, ‘all’, or None (default)

Specify the form of the returned result. Either:

  • None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
  • A list of dtypes or strings to be included/excluded. To select all numeric types use numpy.number. To select categorical objects use type object. See also the select_dtypes documentation, e.g. df.describe(include=['O'])
  • If include is the string ‘all’, the output column-set will match the input one.

Returns:

summary: NDFrame of summary statistics

See also

DataFrame.select_dtypes

Notes

Dask doesn’t support the following argument(s).

  • percentiles
  • include
  • exclude

div(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

drop(labels, axis=0)

Return new object with labels in requested axis removed.

Parameters:

labels : single label or list-like

axis : int or axis name

level : int or level name, default None

For MultiIndex

inplace : bool, default False

If True, do operation inplace and return None.

errors : {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and existing labels are dropped.

New in version 0.16.1.

Returns:

dropped : type of caller

Notes

Dask doesn’t support the following argument(s).

  • level
  • inplace
  • errors

drop_duplicates(**kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

take_last : deprecated

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame
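
The three keep options differ only in which member of each duplicate group survives. A minimal sketch in plain pandas, with hypothetical sample data:

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})

# Only column "k" is considered when deciding what counts as a duplicate.
first = df.drop_duplicates(subset="k", keep="first")
last = df.drop_duplicates(subset="k", keep="last")
none = df.drop_duplicates(subset="k", keep=False)

print(first["v"].tolist())  # [1, 3]
print(last["v"].tolist())   # [2, 3]
print(none["v"].tolist())   # [3]
```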

dropna(how='any', subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof

Pass tuple or list to drop on multiple axes

how : {‘any’, ‘all’}

  • any : if any NA values are present, drop that label
  • all : if all values are NA, drop that label

thresh : int, default None

int value : require that many non-NA values

subset : array-like

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

inplace : boolean, default False

If True, do operation inplace and return None.

Returns:

dropped : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • axis
  • thresh
  • inplace
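
The two supported parameters, how and subset, can be compared on a small frame. This sketch uses plain pandas with made-up data.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"a": [1.0, nan, nan], "b": [1.0, 2.0, nan]})

# how="any": drop a row if at least one value is missing.
any_missing = df.dropna(how="any")
print(list(any_missing.index))  # [0]

# how="all": drop a row only if every value is missing.
all_missing = df.dropna(how="all")
print(list(all_missing.index))  # [0, 1]

# subset: only look at column "b" when deciding which rows to drop.
only_b = df.dropna(subset=["b"])
print(list(only_b.index))       # [0, 1]
```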

dtypes

Return data types

eval(expr, inplace=None, **kwargs)

Evaluate an expression in the context of the calling DataFrame instance.

Parameters:

expr : string

The expression string to evaluate.

inplace : bool

If the expression contains an assignment, whether to return a new DataFrame or mutate the existing.

WARNING: inplace=None currently falls back to True, but in a future version, will default to False. Use inplace=True explicitly rather than relying on the default.

New in version 0.18.0.

kwargs : dict

See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:

ret : ndarray, scalar, or pandas object

See also

pandas.DataFrame.query, pandas.DataFrame.assign, pandas.eval

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> from numpy.random import randn    
>>> from pandas import DataFrame    
>>> df = DataFrame(randn(10, 2), columns=list('ab'))    
>>> df.eval('a + b')    
>>> df.eval('c = a + b')    

fillna(value)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:

  • pad / ffill: propagate last valid observation forward to next valid
  • backfill / bfill: use NEXT valid observation to fill gap

axis : {0, ‘index’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : Series

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • method
  • axis
  • inplace
  • limit
  • downcast
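
Since only value is supported, the two common forms are a scalar and a per-column dict. A short sketch in plain pandas, with made-up data:

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"a": [1.0, nan], "b": [nan, 4.0], "c": [nan, nan]})

# Scalar: every missing value is replaced by 0.
zeros = df.fillna(0)
print(zeros.values.tolist())   # [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0]]

# Dict: one fill value per column; columns not in the dict are left alone.
per_column = df.fillna({"a": -1.0, "b": -2.0})
print(per_column["a"].tolist())           # [1.0, -1.0]
print(per_column["b"].tolist())           # [-2.0, 4.0]
print(per_column["c"].isna().tolist())    # [True, True]
```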

floordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

get_division(n)

Get nth division of the data

groupby(key, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping function / list of functions, dict, Series, or tuple / list of column names

Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s).

  • by
  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()    
>>> data.groupby(['col1', 'col2'])['col3'].mean()    

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()    
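
The examples above refer to a frame named data that is never defined. A self-contained variant in plain pandas, with hypothetical sample values, might look like this:

```python
import pandas as pd

data = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "y", "x", "x"],
    "col3": [1.0, 2.0, 3.0, 5.0],
})

# Group by one column: the result is indexed by the group keys.
by_one = data.groupby("col1")["col3"].mean()
print(by_one["a"], by_one["b"])   # 1.5 4.0

# Group by two columns: the result has a hierarchical (two-level) index.
by_two = data.groupby(["col1", "col2"])["col3"].mean()
print(by_two[("b", "x")])         # 4.0
print(by_two.index.nlevels)       # 2
```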

head(n=5, compute=True)

First n rows of the dataset

Caveat: this only checks the first n rows of the first partition, so it can return fewer than n rows when that partition is short.
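
The caveat can be illustrated with a toy model in which each partition is a plain Python list. This is a sketch of the behaviour described here, not dask's implementation. Both helper functions and the sample partitions are hypothetical.

```python
# Three partitions of very different sizes.
partitions = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]


def head_first_partition(partitions, n=5):
    """Look only at the first partition, as the caveat describes."""
    return partitions[0][:n]


def head_all_partitions(partitions, n=5):
    """Keep reading later partitions until n rows have been collected."""
    rows = []
    for part in partitions:
        rows.extend(part[: n - len(rows)])
        if len(rows) == n:
            break
    return rows


print(head_first_partition(partitions))  # [1, 2]
print(head_all_partitions(partitions))   # [1, 2, 3, 4, 5]
```

Reading only the first partition is cheap because nothing else has to be computed, at the cost of a short result when that partition holds fewer than n rows.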

iloc

Not implemented

index

Return dask Index instance

isnull()

Return a boolean same-sized object indicating if the values are null.

See also

notnull
boolean inverse of isnull

iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

Returns:

it : generator

A generator that iterates over the rows of the frame.

See also

itertuples
Iterate over DataFrame rows as namedtuples of the values.
iteritems
Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])    
    >>> row = next(df.iterrows())[1]    
    >>> row    
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)    
    float64
    >>> print(df['int'].dtype)    
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

itertuples()

Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.

Parameters:

index : boolean, default True

If True, return the index as the first element of the tuple.

name : string, default “Pandas”

The name of the returned namedtuples or None to return regular tuples.

See also

iterrows
Iterate over DataFrame rows as (index, Series) pairs.
iteritems
Iterate over (column name, Series) pairs.

Notes

Dask doesn’t support the following argument(s).

  • index
  • name

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},    
                      index=['a', 'b'])
>>> df    
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():    
...     print(row)
...
Pandas(Index='a', col1=1, col2=0.10000000000000001)
Pandas(Index='b', col1=2, col2=0.20000000000000001)

join(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None)

Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.

Parameters:

other : DataFrame, Series with name field set, or list of DataFrame

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame

on : column name, tuple/list of column names, or array-like

Column(s) to use for joining, otherwise join on index. If multiple columns are given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

how : {‘left’, ‘right’, ‘outer’, ‘inner’}

How to handle indexes of the two objects. Default: ‘left’ for joining on index, None otherwise

  • left: use calling frame’s index
  • right: use input frame’s index
  • outer: form union of indexes
  • inner: use intersection of indexes

lsuffix : string

Suffix to use from left frame’s overlapping columns

rsuffix : string

Suffix to use from right frame’s overlapping columns

sort : boolean, default False

Order result DataFrame lexicographically by the join key. If False, preserves the index order of the calling (left) DataFrame

Returns:

joined : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • sort
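
The how options and the suffix parameters can be compared on two small frames that share a column name. The sketch below is plain pandas. The frames and the suffixes "_l" and "_r" are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"v": [20, 30, 40]}, index=["b", "c", "d"])

# Both frames have a column "v", so suffixes are needed to tell them apart.
left_join = left.join(right, how="left", lsuffix="_l", rsuffix="_r")
inner_join = left.join(right, how="inner", lsuffix="_l", rsuffix="_r")
outer_join = left.join(right, how="outer", lsuffix="_l", rsuffix="_r")

print(list(left_join.columns))     # ['v_l', 'v_r']
print(list(left_join.index))       # ['a', 'b', 'c']  (calling frame's index)
print(list(inner_join.index))      # ['b', 'c']       (intersection)
print(sorted(outer_join.index))    # ['a', 'b', 'c', 'd']  (union)
```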

known_divisions

Whether divisions are already known

loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  

map_partitions(func, columns='__no_default__', *args, **kwargs)

Apply Python function on each DataFrame block

When using map_partitions you should provide either the column names (if the result is a DataFrame) or the name of the Series (if the result is a Series). The output type will be determined by the type of columns.

Parameters:

func : function

Function applied to each block

columns : tuple or scalar

Column names or name of the output. Defaults to names of data itself. When tuple is passed, DataFrame is returned. When scalar is passed, Series is returned.

Examples

When str is passed as columns, the result will be Series.

>>> df.map_partitions(lambda df: df.x + 1, columns='x')  

When tuple is passed as columns, the result will be DataFrame.

>>> df.map_partitions(lambda df: df.head(), columns=df.columns)  
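
Conceptually, map_partitions calls the function once per block and keeps the results as the blocks of a new collection. The sketch below models the blocks as a plain list of pandas DataFrames. It is a toy illustration and not dask's implementation. The helper map_partitions_sketch and the sample data are hypothetical.

```python
import pandas as pd

# A "dask-like" frame modelled as a list of pandas blocks.
partitions = [
    pd.DataFrame({"x": [1, 2]}, index=[0, 1]),
    pd.DataFrame({"x": [3, 4]}, index=[2, 3]),
]


def map_partitions_sketch(func, partitions):
    """Apply func to every block independently and return the new blocks."""
    return [func(part) for part in partitions]


# Each block yields a Series, so the overall result is Series-like.
pieces = map_partitions_sketch(lambda df: df.x + 1, partitions)
result = pd.concat(pieces)
print(result.tolist())  # [2, 3, 4, 5]
```

Because the function runs per block, the library cannot know the output type without running it, which is why the output column names or Series name are requested up front.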

mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

Try to cast the result back to the input type (if possible).

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error
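
mask and where are mirror images of each other, which a side-by-side sketch in plain pandas makes clear. The sample frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
cond = df > 2   # boolean frame: [False, False, True, True]

# mask: keep entries where cond is False, replace the rest with `other`.
masked = df.mask(cond, other=0)
print(masked["a"].tolist())  # [1, 2, 0, 0]

# where: keep entries where cond is True, replace the rest with `other`.
kept = df.where(cond, other=0)
print(kept["a"].tolist())    # [0, 0, 3, 4]
```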

max(axis=None, skipna=True)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

max : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mean(axis=None, skipna=True)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

mean : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), npartitions=None)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

  • left: use only keys from left frame (SQL: left outer join)
  • right: use only keys from right frame (SQL: right outer join)
  • outer: use union of keys from both frames (SQL: full outer join)
  • inner: use intersection of keys from both frames (SQL: inner join)

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame

suffixes : 2-length sequence (tuple, list, ...)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

New in version 0.17.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.

Notes

Dask doesn’t support the following argument(s).

  • sort
  • copy
  • indicator

Examples

>>> A              >>> B    
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')    
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
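
To compare the four how options side by side, the sketch below rebuilds the A and B frames from the example in plain pandas and counts the rows each join type produces. It is run on pandas objects, not dask ones.

```python
import pandas as pd

A = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4]})
B = pd.DataFrame({"rkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8]})

# Number of result rows for each join type.
sizes = {
    how: len(A.merge(B, left_on="lkey", right_on="rkey", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(sizes)  # {'inner': 4, 'left': 5, 'right': 5, 'outer': 6}

# Overlapping column names receive the default suffixes.
outer = A.merge(B, left_on="lkey", right_on="rkey", how="outer")
print(sorted(outer.columns))
```
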
min(axis=None, skipna=True)
This method returns the minimum of the values in the object.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

min : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmod

Notes

Mismatched indices will be unioned together

mul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmul

Notes

Mismatched indices will be unioned together
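
The interaction between index alignment and fill_value can be seen in a short plain-pandas sketch. The two frames are invented and overlap on only one index label.

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({"x": [10.0, 20.0]}, index=[1, 2])

plain = a * b                    # labels 0 and 2 exist on one side only -> NaN
filled = a.mul(b, fill_value=1)  # the missing side is treated as 1

print(list(filled.index))    # [0, 1, 2], the union of both indexes
print(plain["x"].tolist())   # [nan, 20.0, nan]
print(filled["x"].tolist())  # [1.0, 20.0, 20.0]
```
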

ndim

Return dimensionality

nlargest(n=5, columns=None)

Get the rows of a DataFrame sorted by the n largest values of columns.

New in version 0.17.0.

Parameters:

n : int

Number of items to retrieve

columns : list or str

Column name or names to order by

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

DataFrame

Notes

Dask doesn’t support the following argument(s).

  • keep

Examples

>>> df = DataFrame({'a': [1, 10, 8, 11, -1],    
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')    
    a  b   c
3  11  c   3
1  10  b   2
2   8  d NaN
notnull()

Return a boolean same-sized object indicating if the values are not null.

See also

isnull
boolean inverse of notnull
npartitions

Return number of partitions

pow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator pow).

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rpow

Notes

Mismatched indices will be unioned together

quantile(q=0.5, axis=0)

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

axis : {0, 1, ‘index’, ‘columns’} (default 0)

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
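
For orientation, this is the shape of the result when a list of quantiles is requested. The sketch uses plain pandas on made-up data, where the values are exact. As the description above notes, the dask result may be approximate.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

# One row per requested quantile, one column per input column.
q = df.quantile([0.25, 0.5, 0.75])
print(q)
```
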

radd(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator radd).

Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.add

Notes

Mismatched indices will be unioned together

random_split(p, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

p : list of floats

Fractions for each piece, one entry per output; they should sum to one.

random_state: int or np.random.RandomState

If int create a new RandomState with this as the seed

Otherwise draw from the passed RandomState

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
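
One way to picture a pseudorandom row-wise split is to draw a uniform number per row and assign the row to a piece according to the cumulative fractions. The sketch below does this for a single pandas DataFrame. The function random_split_sketch is hypothetical, accepts only an int seed or None, and is not taken from the dask source.

```python
import numpy as np
import pandas as pd

def random_split_sketch(df, p, random_state=None):
    """Split df row-wise into len(p) pieces with expected sizes p[i] * len(df)."""
    rng = np.random.RandomState(random_state)
    edges = np.cumsum(p)               # e.g. [0.8, 0.9, 1.0]
    draws = rng.uniform(size=len(df))  # one number in [0, 1) per row
    labels = np.searchsorted(edges, draws, side="right")
    # Guard against cumsum landing slightly below 1.0 through rounding.
    labels = np.minimum(labels, len(p) - 1)
    return [df[labels == i] for i in range(len(p))]

df = pd.DataFrame({"x": range(1000)})
a, b, c = random_split_sketch(df, [0.8, 0.1, 0.1], random_state=123)
print(len(a), len(b), len(c))
```

Every row lands in exactly one piece, and repeating the call with the same seed gives the same pieces.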
rdiv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

rename(index=None, columns=None)

Alter axes using an input function or functions. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Alternatively, change Series.name with a scalar value (Series only).

Parameters:

index, columns : scalar, list-like, dict-like or function, optional

Scalar or list-like will alter the Series.name attribute, and raise on DataFrame or Panel. dict-like or functions are transformations to apply to that axis’ values

copy : boolean, default True

Also copy underlying data

inplace : boolean, default False

Whether to return a new DataFrame. If True then value of copy is ignored.

Returns:

renamed : DataFrame (new object)

See also

pandas.NDFrame.rename_axis

Examples

>>> s = pd.Series([1, 2, 3])    
>>> s    
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name") # scalar, changes Series.name    
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels    
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels    
0    1
3    2
5    3
dtype: int64
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})    
>>> df.rename(2)    
...
TypeError: 'int' object is not callable
>>> df.rename(index=str, columns={"A": "a", "B": "c"})    
   a  c
0  1  4
1  2  5
2  3  6
repartition(divisions=None, npartitions=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list

List of partitions to be used

npartitions : int

Number of partitions of output, must be less than npartitions of input

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
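
The meaning of a divisions list can be sketched on a plain pandas frame. The sketch assumes that each value marks the first index label of a partition and that the final value is the inclusive upper bound of the last partition. The helper split_by_divisions is hypothetical and only illustrates that reading.

```python
import pandas as pd

def split_by_divisions(df, divisions):
    """Cut df into len(divisions) - 1 pieces along its index."""
    parts = []
    last = len(divisions) - 2
    for i in range(len(divisions) - 1):
        lo, hi = divisions[i], divisions[i + 1]
        if i == last:
            keep = (df.index >= lo) & (df.index <= hi)  # last piece is closed
        else:
            keep = (df.index >= lo) & (df.index < hi)   # others are half-open
        parts.append(df[keep])
    return parts

df = pd.DataFrame({"x": range(21)}, index=range(21))
parts = split_by_divisions(df, [0, 5, 10, 20])
print([len(part) for part in parts])  # [5, 5, 11]
```
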
reset_index()

For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:

level : int, str, tuple, or list, default None

Only remove the given levels from the index. Removes all levels by default

drop : boolean, default False

Do not try to insert index into dataframe columns. This resets the index to the default integer index.

inplace : boolean, default False

Modify the DataFrame in place (do not create a new object)

col_level : int or str, default 0

If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.

col_fill : object, default ‘’

If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

Returns:

resetted : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • level
  • drop
  • inplace
  • col_level
  • col_fill
rfloordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator rfloordiv).

Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

rmod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator rmod).

Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mod

Notes

Mismatched indices will be unioned together

rmul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mul

Notes

Mismatched indices will be unioned together

rpow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.pow

Notes

Mismatched indices will be unioned together

rsub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.sub

Notes

Mismatched indices will be unioned together
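
The reflected operators differ from their forward counterparts only in operand order, which a short plain-pandas comparison of sub and rsub makes visible. The sample frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

forward = df.sub(10)     # df - 10
reflected = df.rsub(10)  # 10 - df

print(forward["x"].tolist())    # [-9, -8, -7]
print(reflected["x"].tolist())  # [9, 8, 7]
```
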

rtruediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or np.random.RandomState

If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.
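
The roles of frac, replace and random_state can be tried out on a plain pandas frame, which accepts the same three arguments. This sketch says nothing about how dask distributes the sample across partitions.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

half = df.sample(frac=0.5, random_state=42)
again = df.sample(frac=0.5, random_state=42)

print(len(half))             # 5
print(half.index.is_unique)  # True, since replace defaults to False
print(half.equals(again))    # True: same seed, same rows
```
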

set_index(other, drop=True, sorted=False, **kwargs)

Set the DataFrame index (row labels) using an existing column

This operation in dask.dataframe is expensive. If the input column is sorted then we accomplish the set_index in a single full read of that column. However, if the input column is not sorted then this operation triggers a full shuffle, which can take a while and only works on a single machine (not distributed).

Parameters:

other: Series or label

drop: boolean, default True

Delete columns to be used as the new index

sorted: boolean, default False

Set to True if the new index column is already sorted

Examples

>>> df.set_index('x')  
>>> df.set_index(df.x)  
>>> df.set_index(df.timestamp, sorted=True)  
set_partition(column, divisions, **kwargs)

Set explicit divisions for new column index

>>> df2 = df.set_partition('new-index-column', divisions=[10, 20, 50])  

See also

set_index

std(axis=None, skipna=True, ddof=1)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

std : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
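
The effect of ddof is easiest to see on a small series whose sums can be checked by hand. The sketch uses plain pandas and made-up values.

```python
import math
import pandas as pd

# Mean is 5 and the squared deviations sum to 32.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1: sqrt(32 / 7)
population_std = s.std(ddof=0)  # ddof=0: sqrt(32 / 8) = 2.0

print(sample_std, population_std)
```
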
sub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rsub

Notes

Mismatched indices will be unioned together

sum(axis=None, skipna=True)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

sum : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
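
A consequence of looking at the last partition only is that fewer than n rows can come back. The sketch below models partitions as a list of pandas DataFrames. The helper tail_sketch and the data are hypothetical.

```python
import pandas as pd

# Hypothetical partitions: the last one holds only two rows.
partitions = [
    pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2]),
    pd.DataFrame({"x": [4, 5]}, index=[3, 4]),
]

def tail_sketch(parts, n=5):
    """Return the last n rows of the final partition only."""
    return parts[-1].tail(n)

result = tail_sketch(partitions, n=3)
print(len(result))  # 2, not 3: earlier partitions are never consulted
```
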

to_bag(index=False)

Convert to a dask Bag of tuples of each row.

Parameters:

index : bool, optional

If True, the index is included as the first element of each tuple. Default is False.
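
The row tuples described above can be previewed on a plain pandas frame with itertuples. This is only an analogy for the element layout, with the index first when it is requested. It does not create a dask Bag.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}, index=[10, 20])

without_index = list(df.itertuples(index=False, name=None))
with_index = list(df.itertuples(index=True, name=None))

print(without_index)  # [(1, 'x'), (2, 'y')]
print(with_index)     # [(10, 1, 'x'), (20, 2, 'y')]
```
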

to_castra(fn=None, categories=None, sorted_index_column=None, compute=True, get=<function get_sync>)

Write DataFrame to Castra on-disk store

See https://github.com/blosc/castra for details

See also

Castra.to_dask

to_csv(filename, get=<function get_sync>, **kwargs)

Write DataFrame to a comma-separated values (csv) file

Parameters:

path_or_buf : string or file handle, default None

File path or object, if None is provided the result is returned as a string.

sep : character, default ‘,’

Field delimiter for the output file.

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘"’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

Write multi_index columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

New in version 0.16.0.

Notes

Dask doesn’t support the following argument(s).

  • path_or_buf
  • sep
  • na_rep
  • float_format
  • columns
  • header
  • index
  • index_label
  • mode
  • encoding
  • compression
  • quoting
  • quotechar
  • line_terminator
  • chunksize
  • tupleize_cols
  • date_format
  • doublequote
  • escapechar
  • decimal
to_delayed()

Convert dataframe into dask Values

Returns a list of values, one value per partition.

to_hdf(path_or_buf, key, mode='a', append=False, complevel=0, complib=None, fletcher32=False, get=<function get_sync>, **kwargs)

Activate the HDFStore.

Parameters:

path_or_buf : the path (string) or HDFStore object

key : string

identifier for the group in the store

mode : optional, {‘a’, ‘w’, ‘r’, ‘r+’}, default ‘a’

'r'

Read-only; no data can be modified.

'w'

Write; a new file is created (an existing file with the same name would be deleted).

'a'

Append; an existing file is opened for reading and writing, and if the file does not exist it is created.

'r+'

It is similar to 'a', but the file must already exist.

format : ‘fixed(f)|table(t)’, default is ‘fixed’

fixed(f) : Fixed format

Fast writing/reading. Not-appendable, nor searchable

table(t) : Table format

Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data

append : boolean, default False

For Table formats, append the input data to the existing table

complevel : int, 1-9, default 0

If a complib is specified compression will be applied where possible

complib : {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None

If complevel is > 0 apply compression to objects written in the store wherever possible

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum

dropna : boolean, default False.

If true, ALL nan rows will not be written to store.

truediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

var(axis=None, skipna=True, ddof=1)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

var : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error
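
The callable forms of cond and other described above can be tried in plain pandas. Both callables receive the Series itself. The data is made up, and the sketch is run on a pandas Series, not a dask one.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Keep values greater than 2; replace the rest with their negation.
result = s.where(lambda x: x > 2, lambda x: -x)

print(result.tolist())  # [-1, -2, 3, 4]
```
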

Series Methods

class dask.dataframe.Series

Out-of-core Series object

Mimics pandas.Series.

Parameters:

dsk: dict

The dask graph to compute this Series

_name: str

The key prefix that specifies which keys in the dask comprise this particular Series

name: scalar or None

Series name. This metadata aids usability

divisions: tuple of index values

Values along which we partition our blocks on the index

add(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.radd

append(other)

Concatenate two or more Series.

Parameters:

to_append : Series or list/tuple of Series

verify_integrity : boolean, default False

If True, raise Exception on creating index with duplicates

Returns:

appended : Series

Notes

Dask doesn’t support the following argument(s).

  • to_append
  • verify_integrity

Examples

>>> s1 = pd.Series([1, 2, 3])    
>>> s2 = pd.Series([4, 5, 6])    
>>> s3 = pd.Series([4, 5, 6], index=[3,4,5])    
>>> s1.append(s2)    
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s3)    
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With verify_integrity set to True:

>>> s1.append(s2, verify_integrity=True)    
ValueError: Indexes have overlapping values: [0, 1, 2]
apply(func, convert_dtype=True, name='__no_default__', args=(), **kwds)

Parallel version of pandas.Series.apply

This mimics the pandas version except for the following:

  1. The user should provide output name.
Parameters:

func: function

Function to apply

convert_dtype: boolean, default True

Try to find better dtype for elementwise function results. If False, leave as dtype=object

name: list, scalar or None, optional

If a list is given, the result is a DataFrame whose columns are the given list. Otherwise, the result is a Series whose name is the given scalar or None (no name). If the name keyword is not given, dask tries to infer the result type from the beginning of the data. This inference may take some time and lead to unexpected results.

args: tuple

Positional arguments to pass to function in addition to the array/series

Additional keyword arguments will be passed as keywords to the function

Returns:

applied : Series or DataFrame depending on name keyword

astype(dtype)

Cast object to input numpy.dtype. Return a copy when copy=True (be really careful with this!)

Parameters:

dtype : numpy.dtype or Python type

raise_on_error : raise on invalid input

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

Notes

Dask doesn’t support the following argument(s).

  • copy
  • raise_on_error
between(left, right, inclusive=True)

Return boolean Series equivalent to left <= series <= right. NA values will be treated as False

Parameters:

left : scalar

Left boundary

right : scalar

Right boundary

Returns:

is_between : Series
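To illustrate the inclusive boundaries and the treatment of NA values described above, here is a minimal sketch using a plain pandas Series. This is an assumption for illustration: the sample values are made up, and a dask Series would build the same result lazily rather than eagerly.

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in a float Series.
s = pd.Series([1.0, 5.0, None, 10.0])

# Both boundaries are included by default: left <= value <= right.
is_between = s.between(2, 10)

# Comparisons with NaN are False, so the missing entry is reported as False.
print(is_between.tolist())
```

The right boundary (10.0) is reported as True, while the missing entry is reported as False rather than NA.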

cache(cache=<type 'dict'>)

Evaluate DataFrame and store in local cache

Uses chest by default to store data on disk

clip(lower=None, upper=None)

Trim values at input threshold(s).

Parameters:

lower : float or array_like, default None

upper : float or array_like, default None

axis : int or string axis name, optional

Align object with lower and upper along the given axis.

Returns:

clipped : Series

Notes

Dask doesn’t support the following argument(s).

  • axis

Examples

>>> df    
          0         1
0  0.335232 -1.256177
1 -1.367855  0.746646
2  0.027753 -1.176076
3  0.230930 -0.679613
4  1.261967  0.570967
>>> df.clip(-1.0, 0.5)    
          0         1
0  0.335232 -1.000000
1 -1.000000  0.500000
2  0.027753 -1.000000
3  0.230930 -0.679613
4  0.500000  0.500000
>>> t    
0   -0.3
1   -0.2
2   -0.1
3    0.0
4    0.1
dtype: float64
>>> df.clip(t, t + 1, axis=0)    
          0         1
0  0.335232 -0.300000
1 -0.200000  0.746646
2  0.027753 -0.100000
3  0.230930  0.000000
4  1.100000  0.570967
column_info

Return Series.name

corr(other, method='pearson', min_periods=None)

Compute correlation with other Series, excluding missing values

Parameters:

other : Series

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

correlation : float

count()

Return number of non-NA/null observations in the Series

Parameters:

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series

Returns:

nobs : int or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
cov(other, min_periods=None)

Compute covariance with Series, excluding missing values

Parameters:

other : Series

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

covariance : float

Normalized by N-1 (unbiased estimator).
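The N-1 normalization can be checked by hand. The sketch below uses plain pandas Series with made-up values (an assumption for illustration) and compares the library result with a manual computation.

```python
import pandas as pd

# Hypothetical sample data.
a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([2.0, 4.0, 6.0])

# Covariance as computed by pandas.
cov_pandas = a.cov(b)

# The same quantity by hand: sum of products of deviations, divided by N - 1.
n = len(a)
cov_manual = ((a - a.mean()) * (b - b.mean())).sum() / (n - 1)

print(cov_pandas, cov_manual)
```

Both values agree; dividing by N instead of N - 1 would give a smaller (biased) estimate.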

cummax(axis=None, skipna=True)

Return cumulative cummax over requested axis.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : scalar

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out
cummin(axis=None, skipna=True)

Return cumulative cummin over requested axis.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : scalar

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out
cumprod(axis=None, skipna=True)

Return cumulative cumprod over requested axis.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : scalar

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out
cumsum(axis=None, skipna=True)

Return cumulative cumsum over requested axis.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : scalar

Notes

Dask doesn’t support the following argument(s).

  • dtype
  • out
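Because the data is split into partitions, a cumulative operation has to carry a running value from one partition to the next. The following is a toy sketch of that idea in pure Python, with partitions modelled as plain lists. It is not dask's actual implementation, and the function name `partitioned_cumsum` is hypothetical.

```python
from itertools import accumulate


def partitioned_cumsum(partitions):
    """Cumulative sum over a list of partitions (plain lists of numbers)."""
    result = []
    offset = 0  # total of all earlier partitions
    for part in partitions:
        # Cumulative sum within this partition, shifted by the running offset.
        local = [offset + value for value in accumulate(part)]
        result.append(local)
        if local:
            offset = local[-1]
    return result


parts = [[1, 2, 3], [4, 5], [6]]
cumulative = partitioned_cumsum(parts)
print(cumulative)
```

Flattening the result gives the same values as a cumulative sum over the unpartitioned data.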
describe()

Generate various summary statistics, excluding NaN values.

Parameters:

percentiles : array-like, optional

The percentiles to include in the output. Should all be in the interval [0, 1]. By default percentiles is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.

include, exclude : list-like, ‘all’, or None (default)

Specify the form of the returned result. Either:

  • None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
  • A list of dtypes or strings to be included/excluded. To select all numeric types use numpy.number. To select categorical objects use type object. See also the select_dtypes documentation, e.g. df.describe(include=[‘O’])
  • If include is the string ‘all’, the output column-set will match the input one.
Returns:

summary: NDFrame of summary statistics

See also

DataFrame.select_dtypes

Notes

Dask doesn’t support the following argument(s).

  • percentiles
  • include
  • exclude
div(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv
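The effect of fill_value is easiest to see with missing data on either side. This sketch uses plain pandas Series with made-up values (an assumption for illustration); the other flexible arithmetic methods documented here (add, sub, mul, and so on) treat fill_value the same way.

```python
import math

import pandas as pd

# Hypothetical sample data with missing values in different positions.
a = pd.Series([1.0, None, 4.0, None], index=list("abcd"))
b = pd.Series([2.0, 2.0, None, None], index=list("abcd"))

# A value missing on one side only is replaced by fill_value before dividing.
# Where both sides are missing (label "d"), the result stays missing.
result = a.div(b, fill_value=1.0)
print(result)
```

Here "b" is computed as 1.0 / 2.0 and "c" as 4.0 / 1.0, while "d" remains NaN.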

drop_duplicates(**kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

take_last : deprecated

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame

dropna()

Return Series without null values

Returns:

valid : Series

inplace : boolean, default False

Do operation in place.

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
dtype

Return data type

fillna(value)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:

  • pad / ffill : propagate last valid observation forward to next valid
  • backfill / bfill : use NEXT valid observation to fill gap

axis : {0, ‘index’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : Series

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • method
  • axis
  • inplace
  • limit
  • downcast
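The difference between a scalar value and a dict value can be sketched with a plain pandas Series. The data is made up, and whether a given dask version accepts every form of value is not confirmed here.

```python
import pandas as pd

# Hypothetical sample data with two missing entries.
s = pd.Series([1.0, None, None], index=["a", "b", "c"])

# A scalar fills every hole with the same value.
filled_scalar = s.fillna(0)

# A dict fills only the index labels it names; "c" is not in the dict,
# so it stays missing.
filled_dict = s.fillna({"b": 9.0})

print(filled_scalar.tolist())
print(filled_dict.tolist())
```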
floordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rfloordiv

get_division(n)

Get nth division of the data

groupby(index, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping function / list of functions, dict, Series, or tuple / list of column names

Called on each element of the object index to determine the groups. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s).

  • by
  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()    
>>> data.groupby(['col1', 'col2'])['col3'].mean()    

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()    
head(n=5, compute=True)

First n rows of the dataset

Caveat: this only checks the first n rows of the first partition.
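One consequence of this caveat is that fewer than n rows can come back when the first partition is short. The toy model below shows this in pure Python, with partitions modelled as plain lists. It is a hypothetical sketch, not dask's implementation.

```python
def head(partitions, n=5):
    """Return up to n rows, looking only at the first partition."""
    return partitions[0][:n]


# Hypothetical dataset of six rows split into two partitions.
partitions = [[10, 11], [12, 13, 14, 15]]

first_rows = head(partitions, n=5)
print(first_rows)
```

Although the dataset holds six rows, only the two rows of the first partition are returned.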

iloc

Not implemented

index

Return dask Index instance

isin(other)

Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.

Parameters:

values : set or list-like

The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

New in version 0.18.1.

Support for values as a set

Returns:

isin : Series (bool dtype)

Raises:

TypeError

  • If values is a string

See also

pandas.DataFrame.isin

Notes

Dask doesn’t support the following argument(s).

  • values

Examples

>>> s = pd.Series(list('abc'))    
>>> s.isin(['a', 'c', 'e'])    
0     True
1    False
2     True
dtype: bool

Passing a single string as s.isin('a') will raise an error. Use a list of one element instead:

>>> s.isin(['a'])    
0     True
1    False
2    False
dtype: bool
isnull()

Return a boolean same-sized object indicating if the values are null.

See also

notnull
boolean inverse of isnull
iteritems()

Lazily iterate over (index, value) tuples

known_divisions

Whether divisions are already known

loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
map(arg, na_action=None)

Map values of Series using input correspondence (which can be a dict, Series, or function)

Parameters:

arg : function, dict, or Series

na_action : {None, ‘ignore’}

If ‘ignore’, propagate NA values

Returns:

y : Series

same index as caller

Examples

>>> x    
one   1
two   2
three 3
>>> y    
1  foo
2  bar
3  baz
>>> x.map(y)    
one   foo
two   bar
three baz
map_partitions(func, columns='__no_default__', *args, **kwargs)

Apply Python function on each DataFrame block

When using map_partitions you should provide either the column names (if the result is a DataFrame) or the name of the Series (if the result is a Series). The output type will be determined by the type of columns.

Parameters:

func : function

Function applied to each block

columns : tuple or scalar

Column names or name of the output. Defaults to names of data itself. When tuple is passed, DataFrame is returned. When scalar is passed, Series is returned.

Examples

When a str is passed as columns, the result will be a Series.

>>> df.map_partitions(lambda df: df.x + 1, columns='x')  

When a tuple is passed as columns, the result will be a DataFrame.

>>> df.map_partitions(lambda df: df.head(), columns=df.columns)  
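Conceptually, map_partitions calls the function once per pandas block and stitches the results back together. The sketch below imitates that with plain pandas. The two-block split and the helper name `add_one` are assumptions for illustration; dask performs the equivalent work lazily.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# Hypothetical partitioning: two blocks of two rows each.
blocks = [df.iloc[:2], df.iloc[2:]]


def add_one(block):
    # Receives one pandas DataFrame block and returns a Series named "x".
    return block.x + 1


# Apply the function block by block, then concatenate the pieces.
result = pd.concat([add_one(block) for block in blocks])
print(result)
```

Since the function returns a Series for each block, the combined result is a Series, which is why a scalar name is what columns should describe in that case.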
mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible)

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error
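A short sketch of mask on a plain pandas Series, with made-up values (an assumption for illustration). The pandas method where is shown next to it for contrast, since it applies the opposite rule.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
cond = s > 2

# mask: keep entries where cond is False, take `other` where cond is True.
masked = s.mask(cond, 0)

# where: keep entries where cond is True, take `other` where cond is False.
kept = s.where(cond, 0)

print(masked.tolist())
print(kept.tolist())
```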
max(axis=None, skipna=True)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

max : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mean(axis=None, skipna=True)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

mean : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
min(axis=None, skipna=True)

This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

min : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmod

mul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmul

ndim

Return dimensionality

nlargest(n=5)

Return the largest n elements.

Parameters:

n : int

Return this many descending sorted values

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

take_last : deprecated

Returns:

top_n : Series

The n largest values in the Series, in sorted order

See also

Series.nsmallest

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Examples

>>> import pandas as pd    
>>> import numpy as np    
>>> s = pd.Series(np.random.randn(10**6))  # size must be an int, not the float 1e6
>>> s.nlargest(10)  # only sorts up to the N requested    
notnull()

Return a boolean same-sized object indicating if the values are not null.

See also

isnull
boolean inverse of notnull
npartitions

Return number of partitions

nunique()

Return number of unique elements in the object.

Excludes NA values by default.

Parameters:

dropna : boolean, default True

Don’t include NaN in the count.

Returns:

nunique : int

Notes

Dask doesn’t support the following argument(s).

  • dropna
pow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rpow

quantile(q=0.5)

Approximate quantiles of Series

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles
radd(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.add

random_split(p, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

p : list of floats

Fractions of rows to place in each piece, as in the examples below; these should sum to one.

random_state: int or np.random.RandomState

If int create a new RandomState with this as the seed

Otherwise draw from the passed RandomState

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
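One way to picture a pseudorandom row-wise split is to draw a uniform number for each row and send the row to the piece whose probability interval contains it. The pure-Python sketch below does that with rows modelled as a plain list. It is a hypothetical illustration, not dask's implementation, and the free function `random_split` here only mirrors the method's name.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_split(rows, p, random_state=None):
    """Split rows into len(p) pieces with approximate fractions p."""
    rng = random.Random(random_state)
    bounds = list(accumulate(p))  # cumulative fractions, e.g. [0.8, 0.9, 1.0]
    pieces = [[] for _ in p]
    for row in rows:
        # Draw a number in [0, 1) and locate its interval; the min() guards
        # against floating-point rounding in the last cumulative bound.
        i = min(bisect_right(bounds, rng.random()), len(p) - 1)
        pieces[i].append(row)
    return pieces


rows = list(range(1000))
a, b, c = random_split(rows, [0.8, 0.1, 0.1], random_state=123)
a2, b2, c2 = random_split(rows, [0.8, 0.1, 0.1], random_state=123)
print(len(a), len(b), len(c))
```

Every row lands in exactly one piece, the piece sizes only approximate the requested fractions, and reusing the same random_state reproduces the same split.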
rdiv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

repartition(divisions=None, npartitions=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list

List of partitions to be used

npartitions : int

Number of partitions of output, must be less than npartitions of input

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
resample(rule, how=None, closed=None, label=None)

Convenience method for frequency conversion and resampling of regular time-series data.

Parameters:

rule : string

the offset string or object representing target conversion

axis : int, optional, default 0

closed : {‘right’, ‘left’}

Which side of bin interval is closed

label : {‘right’, ‘left’}

Which bin edge label to label bucket with

convention : {‘start’, ‘end’, ‘s’, ‘e’}

loffset : timedelta

Adjust the resampled time labels

base : int, default 0

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

Notes

Dask doesn’t support the following argument(s).

  • axis
  • fill_method
  • convention
  • kind
  • loffset
  • limit
  • base

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')    
>>> series = pd.Series(range(9), index=index)    
>>> series    
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()    
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()    
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()    
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5] #select first 5 rows    
2000-01-01 00:00:00     0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00     1
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00     2
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]    
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):    
...     return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler)    
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
rfloordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.floordiv

rmod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mod

rmul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mul

rpow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.pow

rsub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.sub

rtruediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or np.random.RandomState

If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.
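The roles of frac and random_state can be sketched with a plain pandas Series (an assumption for illustration; how a dask Series draws its sample internally is not shown here).

```python
import pandas as pd

s = pd.Series(range(10))

# Take half of the rows without replacement, using a fixed seed.
half = s.sample(frac=0.5, random_state=42)

# The same seed reproduces the same sample.
half_again = s.sample(frac=0.5, random_state=42)

print(len(half), half.equals(half_again))
```

Without replacement, no row is selected twice, so the index of the sample stays unique.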

std(axis=None, ddof=1, skipna=True)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

std : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
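The effect of ddof can be verified on a small example. This sketch uses a plain pandas Series with made-up values (an assumption for illustration).

```python
import math

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default ddof=1: divide the sum of squared deviations by N - 1.
sample_std = s.std()

# ddof=0: divide by N instead (population standard deviation).
population_std = s.std(ddof=0)

# Sum of squared deviations from the mean (2.5) is 5.0.
print(sample_std, math.sqrt(5.0 / 3))
print(population_std, math.sqrt(5.0 / 4))
```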
sub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rsub

sum(axis=None, skipna=True)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

sum : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.

to_bag(index=False)

Convert to a dask Bag.

Parameters:

index : bool, optional

If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

to_csv(filename, get=<function get_sync>, **kwargs)

Write DataFrame to a comma-separated values (csv) file

Parameters:

path_or_buf : string or file handle, default None

File path or object, if None is provided the result is returned as a string.

sep : character, default ‘,’

Field delimiter for the output file.

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘"’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

write multi_index columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

New in version 0.16.0.

Notes

Dask doesn’t support the following argument(s).

  • path_or_buf
  • sep
  • na_rep
  • float_format
  • columns
  • header
  • index
  • index_label
  • mode
  • encoding
  • compression
  • quoting
  • quotechar
  • line_terminator
  • chunksize
  • tupleize_cols
  • date_format
  • doublequote
  • escapechar
  • decimal
to_delayed()

Convert dataframe into dask Values

Returns a list of values, one value per partition.

to_frame(name=None)

Convert Series to DataFrame

Parameters:

name : object, default None

The passed name should substitute for the series name (if it has one).

Returns:

data_frame : DataFrame

to_hdf(path_or_buf, key, mode='a', append=False, complevel=0, complib=None, fletcher32=False, get=<function get_sync>, **kwargs)

Activate the HDFStore.

Parameters:

path_or_buf : the path (string) or HDFStore object

key : string

identifier for the group in the store

mode : optional, {‘a’, ‘w’, ‘r’, ‘r+’}, default ‘a’

'r'

Read-only; no data can be modified.

'w'

Write; a new file is created (an existing file with the same name would be deleted).

'a'

Append; an existing file is opened for reading and writing, and if the file does not exist it is created.

'r+'

It is similar to 'a', but the file must already exist.

format : ‘fixed(f)|table(t)’, default is ‘fixed’

fixed(f) : Fixed format

Fast writing/reading. Not-appendable, nor searchable

table(t) : Table format

Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data

append : boolean, default False

For Table formats, append the input data to the existing

complevel : int, 1-9, default 0

If a complib is specified compression will be applied where possible

complib : {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None

If complevel is > 0 apply compression to objects written in the store wherever possible

fletcher32 : bool, default False

If applying compression use the fletcher32 checksum

dropna : boolean, default False.

If true, ALL nan rows will not be written to store.

truediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other: Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv

unique()

Return Series of unique values in the object. Includes NA values.

Returns:

uniques : Series
value_counts()

Returns object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:

normalize : boolean, default False

If True then the object returned will contain the relative frequencies of the unique values.

sort : boolean, default True

Sort by values

ascending : boolean, default False

Sort in ascending order

bins : integer, optional

Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data

dropna : boolean, default True

Don’t include counts of NaN.

Returns:

counts : Series

Notes

Dask doesn’t support the following argument(s).

  • normalize
  • sort
  • ascending
  • bins
  • dropna
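
Since none of the optional arguments are supported, the only behaviour to rely on is the default one: counts in descending order with missing values excluded. The sketch below models that default with collections.Counter, using None for missing values; it is an illustration, not dask's implementation.

```python
from collections import Counter


def value_counts(values):
    """Count non-missing values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    # most_common() sorts by count in descending order.
    return counts.most_common()


counts = value_counts(["x", "y", "x", None, "z", "x", "y"])
```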
var(axis=None, ddof=1, skipna=True)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean data. If None, will attempt to use everything, then use only numeric data

Returns:

var : scalar or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
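
The roles of ddof and skipna can be checked by hand. The sketch below is a plain-Python model of the formula, with the divisor N - ddof applied after NaN values are dropped; the function is illustrative and not how dask computes variance across partitions.

```python
import math


def var(values, ddof=1, skipna=True):
    """Variance with divisor N - ddof; NaN values are dropped when skipna is True."""
    if skipna:
        values = [v for v in values if not math.isnan(v)]
    elif any(math.isnan(v) for v in values):
        return math.nan
    n = len(values)
    if n - ddof <= 0:
        return math.nan
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, math.nan]

unbiased = var(data)            # divides by N - 1 = 7
population = var(data, ddof=0)  # divides by N = 8
```

The eight valid values have mean 5 and squared deviations summing to 32, so the default gives 32/7 and ddof=0 gives 4.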
where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array or callable

If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as cond.

other : scalar, NDFrame, or callable

If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1.

A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

try_cast : boolean, default False

try to cast the result back to the input type (if possible)

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Returns:

wh : same type as caller

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • axis
  • level
  • try_cast
  • raise_on_error
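
The selection rule, including the callable forms of cond and other, can be sketched on plain lists. This is a model of the semantics only; the list-based where function below is hypothetical and ignores index alignment.

```python
import math


def where(values, cond, other=math.nan):
    """Keep values where cond is True; take the entry from other elsewhere."""
    # A callable cond or other is evaluated on the input values first.
    mask = cond(values) if callable(cond) else cond
    repl = other(values) if callable(other) else other
    if not isinstance(repl, (list, tuple)):
        repl = [repl] * len(values)  # broadcast a scalar replacement
    return [v if keep else r for v, keep, r in zip(values, mask, repl)]


values = [1, -2, 3, -4]

clipped = where(values, lambda xs: [x > 0 for x in xs], other=0)
flipped = where(values, [True, False, True, False],
                other=lambda xs: [-x for x in xs])
```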

Other functions

dask.dataframe.compute(*args, **kwargs)

Compute several dask collections at once.

Examples

>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)
dask.dataframe.map_partitions(func, metadata, *args, **kwargs)

Apply Python function on each DataFrame block

Parameters:

metadata: _Frame, columns, name

Metadata for output

targets : list

List of target DataFrame / Series.
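
The idea of applying a function block by block can be sketched with partitions modelled as plain lists. Note that this hypothetical helper omits the metadata argument of the documented signature, since plain lists carry no column or name information.

```python
def map_partitions(func, partitions, *args, **kwargs):
    """Apply func independently to every partition, passing extra arguments through."""
    return [func(part, *args, **kwargs) for part in partitions]


def scale(part, factor):
    # Operates on one partition at a time; it never sees the other partitions.
    return [x * factor for x in part]


partitions = [[1, 2, 3], [4, 5], [6]]
doubled = map_partitions(scale, partitions, 2)
```

Because func only ever sees one block, it must not depend on rows that live in another partition.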

dask.dataframe.multi.concat(dfs, axis=0, join='outer', interleave_partitions=False)

Concatenate DataFrames along rows.

  • When axis=0 (default), concatenate DataFrames row-wise:
    • If all divisions are known and ordered, concatenate DataFrames keeping divisions. When divisions are not ordered, specifying interleave_partitions=True allows the DataFrames to be concatenated by interleaving their divisions.
    • If any division is unknown, concatenate DataFrames and reset the result's divisions to unknown (None).
  • When axis=1, concatenate DataFrames column-wise:
    • Allowed if all divisions are known.
    • If any division is unknown, a ValueError is raised.
Parameters:

dfs : list

List of dask.DataFrames to be concatenated

axis : {0, 1, ‘index’, ‘columns’}, default 0

The axis to concatenate along

join : {‘inner’, ‘outer’}, default ‘outer’

How to handle indexes on other axis

interleave_partitions : bool, default False

Whether to concatenate DataFrames ignoring the order of their divisions. If True, the divisions of all inputs are interleaved.

Examples

If all divisions are known and ordered, divisions are kept.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>

Unable to concatenate if divisions are not ordered.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])                               
ValueError: All inputs have known divisions which cannnot be concatenated
in order. Specify interleave_partitions=True to ignore order

Specify interleave_partitions=True to ignore the division order.

>>> dd.concat([a, b], interleave_partitions=True)   
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>

If any division is unknown, the resulting divisions will be unknown.

>>> a                                               
dd.DataFrame<x, divisions=(None, None)>
>>> b                                               
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
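
The division outcomes in the examples above follow a small set of rules, which the sketch below reproduces for two inputs. It is reverse-engineered from the example outputs only; the function name concat_divisions is hypothetical and the real implementation may handle more cases.

```python
def concat_divisions(a, b, interleave_partitions=False):
    """Divisions of a row-wise concat of two frames, given their division tuples."""
    if None in a or None in b:
        # Unknown divisions: keep the partition count, forget the boundaries.
        npartitions = (len(a) - 1) + (len(b) - 1)
        return (None,) * (npartitions + 1)
    if a[-1] < b[0]:
        # Ordered: drop a's closing boundary and continue with b's divisions.
        return a[:-1] + b
    if interleave_partitions:
        # Not ordered: merge every boundary into one sorted sequence.
        return tuple(sorted(set(a) | set(b)))
    raise ValueError("divisions cannot be concatenated in order")


ordered = concat_divisions((1, 3, 5), (6, 8, 10))
interleaved = concat_divisions((1, 3, 5), (2, 3, 6), interleave_partitions=True)
unknown = concat_divisions((None, None), (1, 4, 10))
```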
dask.dataframe.multi.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), npartitions=None)
dask.dataframe.read_csv(filename, blocksize=33554432, chunkbytes=None, collection=True, lineterminator='\n', compression=None, sample=10000, enforce=False, storage_options=None, **kwargs)

Read CSV files into a Dask.DataFrame

This parallelizes the pandas.read_csv function in the following ways:

  1. It supports loading many files at once using globstrings as follows:

    >>> df = dd.read_csv('myfiles.*.csv')  
    
  2. In some cases it can break up large files as follows:

    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks  
    

Internally dd.read_csv uses pandas.read_csv and so supports many of the same keyword arguments with the same performance guarantees.

See the docstring for pandas.read_csv for more information on available keyword arguments.

Parameters:

filename: string

Filename or globstring for CSV files. May include protocols like s3://

blocksize: int or None

Number of bytes by which to cut up larger files

collection: boolean

Return a dask.dataframe if True or list of dask.delayed objects if False

sample: int

Number of bytes to use when determining dtypes

**kwargs: dict

Options to pass down to pandas.read_csv
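
To get a feel for how blocksize translates into a number of pieces, the sketch below cuts a file of a given size into contiguous byte ranges. The helper block_offsets is hypothetical, and it assumes plain fixed-size cuts; an actual reader also has to move block boundaries to the nearest line terminator so that no row is split.

```python
def block_offsets(file_size, blocksize):
    """Return (offset, length) pairs covering file_size bytes in blocksize steps."""
    blocksize = int(blocksize)  # accepts values written as floats, such as 25e6
    return [
        (start, min(blocksize, file_size - start))
        for start in range(0, file_size, blocksize)
    ]


default_blocks = block_offsets(file_size=100_000_000, blocksize=33554432)
smaller_blocks = block_offsets(file_size=100_000_000, blocksize=25e6)
```

With the default blocksize of 33554432 bytes, a 100 MB file yields three blocks, the last one shorter than the others; with 25e6 it yields four equal blocks.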

dask.dataframe.from_array(x, chunksize=50000, columns=None)

Read dask Dataframe from any sliceable array

Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax

x[50000:100000]

and have 2 dimensions:

x.ndim == 2

or have a record dtype:

x.dtype == [('name', 'O'), ('balance', 'i8')]
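
The slicing requirement follows from how the input is cut into pieces of chunksize rows. The sketch below computes those slices and applies them to a plain list standing in for the array; the helper chunk_slices is hypothetical.

```python
def chunk_slices(n_rows, chunksize=50000):
    """Slices that cover n_rows in consecutive pieces of at most chunksize rows."""
    return [
        slice(start, min(start + chunksize, n_rows))
        for start in range(0, n_rows, chunksize)
    ]


x = list(range(120000))           # any object that supports x[start:stop] would do
slices = chunk_slices(len(x))
parts = [x[s] for s in slices]    # one piece per partition
```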
dask.dataframe.from_pandas(data, npartitions=None, chunksize=None, sort=True, name=None)

Construct a dask object from a pandas object.

If given a pandas.Series a dask.Series will be returned. If given a pandas.DataFrame a dask.DataFrame will be returned. All other pandas objects will raise a TypeError.

Parameters:

data : pandas.DataFrame or pandas.Series

The DataFrame/Series with which to construct a dask DataFrame/Series

npartitions : int, optional

The number of partitions of the index to create.

chunksize : int, optional

The size of the partitions of the index.

Returns:

dask.DataFrame or dask.Series

A dask DataFrame/Series partitioned along the index

Raises:

TypeError

If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array
Construct a dask.DataFrame from an array that has record dtype
from_bcolz
Construct a dask.DataFrame from a bcolz ctable
read_csv
Construct a dask.DataFrame from a CSV file

Examples

>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', offset='D'),
 Timestamp('2010-01-03 00:00:00', offset='D'),
 Timestamp('2010-01-05 00:00:00', offset='D'),
 Timestamp('2010-01-06 00:00:00', offset='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', offset='D'),
 Timestamp('2010-01-03 00:00:00', offset='D'),
 Timestamp('2010-01-05 00:00:00', offset='D'),
 Timestamp('2010-01-06 00:00:00', offset='D'))
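
The divisions printed above can be reproduced by taking every chunksize-th value of the sorted index and appending the last value. The sketch below does this with standard-library dates; it is inferred from the example output, the helper divisions_for is hypothetical, and the real function may treat cases such as duplicate index values differently.

```python
import math
from datetime import date, timedelta


def divisions_for(sorted_index, npartitions):
    """Partition boundaries: the first label of each partition plus the final label."""
    chunksize = math.ceil(len(sorted_index) / npartitions)
    starts = sorted_index[::chunksize]
    return tuple(starts) + (sorted_index[-1],)


index = [date(2010, 1, 1) + timedelta(days=i) for i in range(6)]
divisions = divisions_for(index, npartitions=3)
```

Three partitions need four boundaries, which is why the result has one more entry than npartitions.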
dask.dataframe.from_bcolz(x, chunksize=None, categorize=True, index=None, lock=<thread.lock object>, **kwargs)

Read dask Dataframe from bcolz.ctable

Parameters:

x : bcolz.ctable

Input data

chunksize : int, optional

The size of blocks to pull out from ctable. Ideally as large as can comfortably fit in memory

categorize : bool, defaults to True

Automatically categorize all string dtypes

index : string, optional

Column to make the index

lock: bool or Lock

Lock to use when reading or False for no lock (not-thread-safe)

See also

from_array
more generic function not optimized for bcolz
dask.dataframe.rolling.rolling_apply(arg, window, *args, **kwargs)

Generic moving function application.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

func : function

Must produce a single value from an ndarray input

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

args : tuple

Passed on to func

kwargs : dict

Passed on to func

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
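
How window, min_periods and center interact is easiest to see on a short list. The sketch below is a plain-Python model that assumes the data contains no missing values and ignores freq; it is not the pandas or dask implementation, and the centre offset of (window - 1) // 2 is an assumption about how labels are shifted.

```python
import math


def rolling_apply(values, window, func, min_periods=None, center=False):
    """Apply func to each moving window; NaN where too few observations exist."""
    if min_periods is None:
        min_periods = window
    # With center=True the label sits in the middle of its window.
    offset = (window - 1) // 2 if center else 0
    n = len(values)
    out = []
    for i in range(n):
        hi = i + offset + 1
        lo = hi - window
        chunk = values[max(lo, 0):min(hi, n)]
        out.append(func(chunk) if len(chunk) >= min_periods else math.nan)
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0]

right_edge = rolling_apply(values, 3, sum)               # labels on the right edge
centered = rolling_apply(values, 3, sum, center=True)    # labels in the middle
partial = rolling_apply(values, 3, sum, min_periods=1)   # allow short windows
```

With the defaults, the first two results are NaN because fewer than three observations are available; centering moves the NaN positions to both ends instead.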

dask.dataframe.rolling.rolling_chunk(func, part1, part2, window, *args)
dask.dataframe.rolling.rolling_count(arg, window, *args, **kwargs)

Rolling count of number of non-NaN observations inside provided window.

Parameters:

arg : DataFrame or numpy ndarray-like

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

rolling_count : type of caller

Notes

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_kurt(arg, window, *args, **kwargs)

Unbiased moving kurtosis.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_max(arg, window, *args, **kwargs)

Moving maximum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘max’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_mean(arg, window, *args, **kwargs)

Moving mean.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_median(arg, window, *args, **kwargs)

Moving median.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘median’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_min(arg, window, *args, **kwargs)

Moving minimum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘min’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_quantile(arg, window, *args, **kwargs)

Moving quantile.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

quantile : float

0 <= quantile <= 1

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_skew(arg, window, *args, **kwargs)

Unbiased moving skewness.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_std(arg, window, *args, **kwargs)

Moving standard deviation.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_sum(arg, window, *args, **kwargs)

Moving sum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_var(arg, window, *args, **kwargs)

Moving variance.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_window(arg, window, *args, **kwargs)

Applies a moving window of type window_type and size window on the data.

Parameters:

arg : Series, DataFrame

window : int or ndarray

Weighting window specification. If the window is an integer, then it is treated as the window length and win_type is required

win_type : str, default None

Window type (see Notes)

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

mean : boolean, default True

If True computes weighted mean, else weighted sum

axis : {0, 1}, default 0

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

The recognized window types are:

  • boxcar
  • triang
  • blackman
  • hamming
  • bartlett
  • parzen
  • bohman
  • blackmanharris
  • nuttall
  • barthann
  • kaiser (needs beta)
  • gaussian (needs std)
  • general_gaussian (needs power, width)
  • slepian (needs width).

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
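
The difference between mean=True and mean=False can be sketched for the case where window is given as an explicit array of weights. The code below is a plain-Python model with right-edge labels and no missing data; the function is a hypothetical stand-in, and the named window types from the list above are not generated here.

```python
import math


def rolling_window(values, weights, mean=True):
    """Weighted moving mean (or sum) using an explicit weight per window position."""
    window = len(weights)
    total_weight = sum(weights)
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)  # not enough observations yet
            continue
        chunk = values[i - window + 1:i + 1]
        weighted_sum = sum(w * x for w, x in zip(weights, chunk))
        out.append(weighted_sum / total_weight if mean else weighted_sum)
    return out


values = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 2.0, 1.0]  # heaviest weight on the middle observation

weighted_mean = rolling_window(values, weights)
weighted_sum = rolling_window(values, weights, mean=False)
```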