Dataset configuration and data preprocessing#
Data preprocessing steps#
Data is fetched for the complete period between the train start and end time.
Data is then aggregated by
aggregation_methodsat givenresolution. This step includes interpolating values that might be missing. For this, the specifiedinterpolation_methodis used with itsinterpolation_limit. The limit specifies how long from last valid data point values will be interpolated.Known periods specified in
known_filter_periodsare dropped from dataset.Row filter is applied. All numerical filtering criteria in the provided list are evaluated and joined by logical
&. This step also involves therow_filter_buffer_size, which removes observations of the given size before and after the ones that have already been filtered out.Data is filtered by global minima/maxima conditions (
low_threshold/high_threshold).- Filter periods is applied for filtering out noisy data points.
See below for some more details.
If applied, the information is written to the metadata as
filtered_periods.
Between each step, the samples of the resulting dataset is compared to n_samples_threshold to ensure at least minimum size is returned.
Filter periods algorithms#
This executes a set of algorithms that will filter out observations. There are currently two algorithms implemented: Rolling median and isolation forest. See this API section for details.
window is used by the rolling median
Parameter |
Method |
Description |
|---|---|---|
|
|
Window for rolling median |
|
|
Number of interquartile (middle 50%) to add/subtract from rolling median |
|
|
If exponential weighted smoothing should be applied to data before isolation forest |
|
|
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. |