Module tsflex.chunking
Utilities for chunking time-series data before feeding it to the operators.
"""Utilities for chunking time-series data before feeding it to the operators.
"""
__author__ = "Jonas Van Der Donckt"
from .chunking import chunk_data
__all__ = ["chunk_data"]
API reference of tsflex.chunking
.chunking
    (Advanced) tsflex utilities for chunking sequence data.
Functions
def chunk_data(data, fs_dict=None, chunk_range_margin=None, min_chunk_dur=None, max_chunk_dur=None, sub_chunk_overlap=None, copy=True, verbose=False)
def chunk_data(
    data: Union[
        pd.Series,
        pd.DataFrame,
        List[Union[pd.Series, pd.DataFrame]],
        Dict[str, pd.DataFrame],
    ],
    fs_dict: Optional[Dict[str, float]] = None,
    chunk_range_margin: Optional[Union[float, str, pd.Timedelta]] = None,
    min_chunk_dur: Optional[Union[float, str, pd.Timedelta]] = None,
    max_chunk_dur: Optional[Union[float, str, pd.Timedelta]] = None,
    sub_chunk_overlap: Optional[Union[float, str, pd.Timedelta]] = None,
    copy: bool = True,
    verbose: bool = False,
) -> List[List[pd.Series]]:
    """Divide the time-series `data` into same time/sequence-range chunks.

    Does 2 things:

    1. Detects gaps in the `data`(-list) sequence series.
    2. Divides the `data` into chunks, according to the parameter configuration
       and the detected gaps.

    Notes
    -----
    * When you set `fs_dict`, **each item** in `data` is assumed to have a
      **fixed sample frequency**. If you do not set `fs_dict`, the sample
      frequency of each series is inferred as 1 / (maximum time difference) of
      that series.
    * All subsequent series-chunks are matched against the time-ranges of the first
      series. This implies that **the first item in `data` serves as a reference**
      for gap-matching.
    * The term `sub-chunk` refers to the pieces obtained when a chunk exceeds the
      `max_chunk_dur` parameter and is therefore divided further.

    Example
    -------
    ```python
    df_acc  # cols ['ACC_x', 'ACC_y', 'ACC_z', 'ACC_SMV'] - 32 Hz
    df_gyro  # cols ['gyro_x', 'gyro_y', 'gyro_z', 'gyro_area'] - 100 Hz
    chunk_data({'acc': df_acc, 'g': df_gyro}, fs_dict={'acc': 32, 'g': 100})
    ```
    <br>

    .. Note::
        If `chunk_range_margin` / `min_chunk_dur` / `max_chunk_dur` /
        `sub_chunk_overlap` is an int/float, it is interpreted as a numerical
        sequence range and a numerically indexed `data` is assumed.
        **These attributes must all be either time-based or numerical, and must
        match the dtype of the data's index.**

    Parameters
    -----------
    data: Union[pd.Series, pd.DataFrame, List[Union[pd.Series, pd.DataFrame]], Dict[str, pd.DataFrame]]
        The sequence data which will be chunked. Each item in `data` must have a
        monotonically increasing index. We assume that each item in `data` has a
        _nearly-constant_ sample frequency (when there are no gaps) and that all
        indices have the same dtype.
    fs_dict: Dict[str, float], optional
        The sample frequency dict. If set, this dict must contain at least all the
        keys of the items in `data`.

        .. note::
            If you pass a **_DataFrame-dict_** (i.e., a dict with key=str;
            value=DataFrame) to `data`, you can **use** the **corresponding
            DataFrame str-key** in `fs_dict` to describe the `fs` of all columns
            of that DataFrame. See also the example above.
    chunk_range_margin: Union[float, str, pd.Timedelta], optional
        The allowed margin on the start and end of each `ts` chunk for it to be
        seen as covering the same time-range as the chunks of the other `ts`.
        If `None`, the margin is set to:

            2 / min(fs_dict.intersection(data.names).values())

        which is twice the largest sample period (i.e., the period of the lowest
        sample frequency) of the passed `data`, by default None.\n
        * if `pd.Timedelta`, it will be interpreted as a time-range margin
        * if `int` or `float`, it will be interpreted as a numerical range margin
    min_chunk_dur : Union[float, str, pd.Timedelta], optional
        The minimum duration of a chunk, by default None.
        Chunks with durations smaller than this will be discarded (and not
        returned).\n
        * if `pd.Timedelta`, it will be interpreted as a time-based duration
        * if `int` or `float`, it will be interpreted as a numerical range
    max_chunk_dur : Union[float, str, pd.Timedelta], optional
        The maximum duration of a chunk, by default None.
        Chunks with durations larger than this will be divided into smaller
        `sub_chunks`, where each sub-chunk has a maximum duration of
        `max_chunk_dur`.\n
        * if `pd.Timedelta`, it will be interpreted as a time-based duration
        * if `int` or `float`, it will be interpreted as a numerical range
    sub_chunk_overlap: Union[float, str, pd.Timedelta], optional
        The sub-chunk boundary overlap. If set, **half of this overlap is added to
        either side of each `sub_chunk`**. This is especially useful to not lose
        inter-`sub_chunk` data (as the `sub_chunks` together form one continuous
        chunk) when window-based aggregations are performed on these same time
        range output (sub_)chunks.
        This argument is only relevant if `max_chunk_dur` is set.\n
        * if `pd.Timedelta`, it will be interpreted as a time-based overlap
        * if `int` or `float`, it will be interpreted as a numerical range overlap
    copy: boolean, optional
        If True, the returned chunks are copies of the data (so changing their
        content will not raise a `SettingWithCopyWarning`), by default True.
    verbose : bool, optional
        If set, will print more verbose output, by default False

    Returns
    -------
    List[List[pd.Series]]
        A list of same time range chunks.

    """
    if isinstance(data, dict):
        if isinstance(fs_dict, dict):
            # Propagate each DataFrame-level `fs` to the columns of that DataFrame
            out_dict = {}
            for k, fs in fs_dict.items():
                if k in data and isinstance(data[k], pd.DataFrame):
                    out_dict.update({c_name: fs for c_name in data[k].columns})
            fs_dict.update(out_dict)
        # make `data` convertible by `to_series_list()`
        data = list(data.values())

    # Convert the input data
    series_list = to_series_list(data)

    # Assert that there are no duplicate series names
    assert len(series_list) == len(set([s.name for s in series_list]))

    # Assert that the index increases monotonically
    assert all(s.index.is_monotonic_increasing for s in series_list)

    return _dtype_to_chunk_method[AttributeParser.determine_type(data)](
        series_list,
        fs_dict,
        chunk_range_margin,  # type: ignore[arg-type]
        min_chunk_dur,  # type: ignore[arg-type]
        max_chunk_dur,  # type: ignore[arg-type]
        sub_chunk_overlap,  # type: ignore[arg-type]
        copy,
        verbose,
    )
Divide the time-series `data` into same time/sequence-range chunks.

Does 2 things:

1. Detects gaps in the `data`(-list) sequence series.
2. Divides the `data` into chunks, according to the parameter configuration and the detected gaps.
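To make the gap-detection step concrete, here is a minimal sketch that uses only pandas. The helper `find_gap_chunks` and its "twice the sample period" threshold are assumptions made for illustration; they are not taken from the tsflex implementation.

```python
import pandas as pd


def find_gap_chunks(series: pd.Series, fs: float) -> list:
    """Split `series` at every gap in its time index (hypothetical helper)."""
    # Assumption: a gap is any step larger than twice the nominal sample period.
    threshold = pd.Timedelta(seconds=2 / fs)
    step = series.index.to_series().diff()
    # Each gap starts a new chunk; cumsum turns the gap flags into chunk ids.
    chunk_id = (step > threshold).cumsum()
    return [chunk for _, chunk in series.groupby(chunk_id.values)]


# 1 Hz signal with samples at 0..4 s and 10..12 s, i.e. one gap in between
idx = pd.to_datetime("2024-01-01") + pd.to_timedelta(
    [0, 1, 2, 3, 4, 10, 11, 12], unit="s"
)
s = pd.Series(range(8), index=idx, name="sig")
chunks = find_gap_chunks(s, fs=1.0)
print([len(c) for c in chunks])  # [5, 3]
```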
Notes
- When you set `fs_dict`, each item in `data` is assumed to have a fixed sample frequency. If you do not set `fs_dict`, the sample frequency of each series is inferred as 1 / (maximum time difference) of that series.
- All subsequent series-chunks are matched against the time-ranges of the first series. This implies that the first item in `data` serves as a reference for gap-matching.
- The term `sub-chunk` refers to the pieces obtained when a chunk exceeds the `max_chunk_dur` parameter and is therefore divided further.
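The fallback described in the first note (1 / maximum time difference) can be sketched as follows. `infer_fs` is a hypothetical helper written for this illustration, not a tsflex function.

```python
import pandas as pd


def infer_fs(series: pd.Series) -> float:
    """Estimate the sample frequency as 1 / (largest step in the index)."""
    max_diff = series.index.to_series().diff().max()
    if isinstance(max_diff, pd.Timedelta):
        # Time-indexed data: express the step in seconds to obtain Hz.
        return 1.0 / max_diff.total_seconds()
    # Numerically indexed data: the step is already a plain number.
    return 1.0 / float(max_diff)


num_series = pd.Series([1, 2, 3, 4], index=[0.0, 0.5, 1.0, 1.5])
time_index = pd.to_datetime("2024-01-01") + pd.to_timedelta(
    [0, 250, 500, 750], unit="ms"
)
time_series = pd.Series([1, 2, 3, 4], index=time_index)

print(infer_fs(num_series), infer_fs(time_series))
```

In this sketch, a series that contains a gap yields a low estimate, because the largest step is then the gap itself. Passing `fs_dict` explicitly avoids relying on such an estimate.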
Example
df_acc   # cols ['ACC_x', 'ACC_y', 'ACC_z', 'ACC_SMV'] - 32 Hz
df_gyro  # cols ['gyro_x', 'gyro_y', 'gyro_z', 'gyro_area'] - 100 Hz
chunk_data({'acc': df_acc, 'g': df_gyro}, fs_dict={'acc': 32, 'g': 100})
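The example keys `fs_dict` by DataFrame name rather than by column name. The sketch below mirrors the expansion step visible in the function body of the source listing, where each DataFrame-level frequency is propagated to that DataFrame's columns. The helper name `expand_fs_dict` and the tiny placeholder frames are assumptions; unlike the listing, this version returns a new dict instead of updating `fs_dict` in place.

```python
import pandas as pd


def expand_fs_dict(data: dict, fs_dict: dict) -> dict:
    """Return `fs_dict` extended with one entry per DataFrame column."""
    out = dict(fs_dict)  # copy, so the caller's dict stays untouched
    for key, fs in fs_dict.items():
        if key in data and isinstance(data[key], pd.DataFrame):
            out.update({col: fs for col in data[key].columns})
    return out


# Placeholder frames that only carry the column names from the example
df_acc = pd.DataFrame(0.0, index=range(3), columns=["ACC_x", "ACC_y", "ACC_z", "ACC_SMV"])
df_gyro = pd.DataFrame(0.0, index=range(3), columns=["gyro_x", "gyro_y", "gyro_z", "gyro_area"])

fs_dict = {"acc": 32, "g": 100}
expanded = expand_fs_dict({"acc": df_acc, "g": df_gyro}, fs_dict)
print(expanded["ACC_x"], expanded["gyro_area"])  # 32 100
```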
Note
If `chunk_range_margin` / `min_chunk_dur` / `max_chunk_dur` / `sub_chunk_overlap` is an int/float, it is interpreted as a numerical sequence range and a numerically indexed `data` is assumed. These attributes must all be either time-based or numerical, and must match the dtype of the data's index.

Parameters
data : Union[pd.Series, pd.DataFrame, List[Union[pd.Series, pd.DataFrame]], Dict[str, pd.DataFrame]]
    The sequence data which will be chunked. Each item in `data` must have a monotonically increasing index. We assume that each item in `data` has a nearly-constant sample frequency (when there are no gaps) and that all indices have the same dtype.
fs_dict : Dict[str, float], optional
    The sample frequency dict. If set, this dict must contain at least all the keys of the items in `data`.
    Note
    If you pass a DataFrame-dict (i.e., a dict with key=str; value=DataFrame) to `data`, you can use the corresponding DataFrame str-key in `fs_dict` to describe the `fs` of all columns of that DataFrame. See also the example above.
chunk_range_margin : Union[float, str, pd.Timedelta], optional
    The allowed margin on the start and end of each `ts` chunk for it to be seen as covering the same time-range as the chunks of the other `ts`. If `None`, the margin is set to:
        2 / min(fs_dict.intersection(data.names).values())
    which is twice the largest sample period (i.e., the period of the lowest sample frequency) of the passed `data`, by default None.
    - if `pd.Timedelta`, it will be interpreted as a time-range margin
    - if `int` or `float`, it will be interpreted as a numerical range margin
min_chunk_dur : Union[float, str, pd.Timedelta], optional
    The minimum duration of a chunk, by default None. Chunks with durations smaller than this will be discarded (and not returned).
    - if `pd.Timedelta`, it will be interpreted as a time-based duration
    - if `int` or `float`, it will be interpreted as a numerical range
max_chunk_dur : Union[float, str, pd.Timedelta], optional
    The maximum duration of a chunk, by default None. Chunks with durations larger than this will be divided into smaller `sub_chunks`, where each sub-chunk has a maximum duration of `max_chunk_dur`.
    - if `pd.Timedelta`, it will be interpreted as a time-based duration
    - if `int` or `float`, it will be interpreted as a numerical range
sub_chunk_overlap : Union[float, str, pd.Timedelta], optional
    The sub-chunk boundary overlap. If set, half of this overlap is added to either side of each `sub_chunk`. This is especially useful to not lose inter-`sub_chunk` data (as the `sub_chunks` together form one continuous chunk) when window-based aggregations are performed on these same time range output (sub_)chunks. This argument is only relevant if `max_chunk_dur` is set.
    - if `pd.Timedelta`, it will be interpreted as a time-based overlap
    - if `int` or `float`, it will be interpreted as a numerical range overlap
copy : boolean, optional
    If True, the returned chunks are copies of the data (so changing their content will not raise a `SettingWithCopyWarning`), by default True.
verbose : bool, optional
    If set, will print more verbose output, by default False.
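The interplay of `max_chunk_dur` and `sub_chunk_overlap` can be sketched with plain boundary arithmetic. `sub_chunk_ranges` is a hypothetical helper, and two of its choices are assumptions rather than documented tsflex behaviour: the padding is applied after the split into windows of at most `max_chunk_dur`, and padded boundaries are clipped to the original chunk range.

```python
import pandas as pd


def sub_chunk_ranges(start, end, max_chunk_dur, sub_chunk_overlap=None):
    """Return (start, end) boundaries of the sub-chunks of one chunk."""
    half = sub_chunk_overlap / 2 if sub_chunk_overlap is not None else None
    ranges = []
    t = start
    while t < end:
        t_next = min(t + max_chunk_dur, end)
        lo, hi = t, t_next
        if half is not None:
            # Pad both sides by half the overlap, clipped to the chunk range.
            lo, hi = max(start, lo - half), min(end, hi + half)
        ranges.append((lo, hi))
        t = t_next
    return ranges


# Numerically indexed chunk covering [0, 25]
numeric = sub_chunk_ranges(0, 25, max_chunk_dur=10, sub_chunk_overlap=2)
print(numeric)  # [(0, 11.0), (9.0, 21.0), (19.0, 25)]

# Time-indexed chunk: the same arithmetic with Timestamp / Timedelta objects
timed = sub_chunk_ranges(
    pd.Timestamp("2024-01-01 00:00"),
    pd.Timestamp("2024-01-01 00:25"),
    max_chunk_dur=pd.Timedelta("10min"),
    sub_chunk_overlap=pd.Timedelta("2min"),
)
print(timed[0])
```

Because neighbouring sub-chunks share `sub_chunk_overlap` worth of data, a window-based aggregation that straddles a sub-chunk boundary still sees the samples on both sides.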
Returns
List[List[pd.Series]]
    A list of same time range chunks.
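One way to consume a return value of this shape is sketched below. The `chunks` list is a hand-built stand-in with the `List[List[pd.Series]]` structure (it is not actual `chunk_data` output), and the sketch assumes that each inner list holds one series per input signal for a shared range.

```python
import pandas as pd

# Hand-built stand-in: two chunks, each holding an "acc" and a "gyro" series
chunks = [
    [
        pd.Series([0.1, 0.2, 0.3, 0.4], index=[0, 1, 2, 3], name="acc"),
        pd.Series([1.0, 1.1], index=[0, 2], name="gyro"),
    ],
    [
        pd.Series([0.5, 0.6, 0.7], index=[10, 11, 12], name="acc"),
        pd.Series([1.2, 1.3], index=[10, 12], name="gyro"),
    ],
]

# Align the series of every chunk on their shared index, one frame per chunk;
# positions that a lower-rate series does not cover become NaN.
frames = [pd.concat(chunk, axis=1) for chunk in chunks]
for frame in frames:
    print(frame.index.min(), frame.index.max(), frame.shape)
```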