Time Series Dataset

Default implementation of gordo_core.base.GordoBaseDataset. gordo_core.time_series.TimeSeriesDataset supports multiple customizable preprocessing and filtering algorithms.

exception gordo_core.time_series.NotEnoughDataWarning

Bases: RuntimeWarning

class gordo_core.time_series.RandomDataset(train_start_date: datetime | str, train_end_date: datetime | str, tag_list: list, **kwargs)

Bases: TimeSeriesDataset

Get a TimeSeriesDataset backed by gordo_core.data_provider.providers.RandomDataProvider


class gordo_core.time_series.TimeSeriesDataset

Bases: DatasetWithProvider

Creates a TimeSeriesDataset backed by the provided data provider.

A TimeSeriesDataset is a dataset backed by time series that are resampled, aligned, and (optionally) filtered.

Parameters:

train_start_date – Earliest possible point in the dataset (inclusive).
train_end_date – Latest possible point in the dataset (exclusive).
tag_list – List of tags to include in the dataset. The elements can be strings, dictionaries, tuples, or gordo_core.SensorTag objects.
target_tag_list – List of tags to set as the dataset y. These are treated the same as tag_list when fetching and pre-processing (resampling), but are split out into the y returned by get_data.
data_provider – A data provider that can provide DataFrames for tags from train_start_date to train_end_date.
resolution –
The bucket size for grouping all incoming time data (e.g. “10T”). Available strings come from the pandas time series offset aliases.

Note

If this parameter is None or False, then no aggregation/resampling is applied to the data.
row_filter – Filter on the rows. Only rows satisfying the filter will be in the dataset. See gordo_core.filter_rows.pandas_filter_rows() for further documentation of the filter format.
known_filter_periods – List of periods to drop in the format [~('2020-04-08 04:00:00+00:00' < index < '2020-04-08 10:00:00+00:00')]. Note the time-zone suffix (+00:00), which is required.
aggregation_methods – Aggregation method(s) to use for the resampled buckets. If a single resample method is provided then the resulting dataframe will have names identical to the names of the series it got in. If several aggregation-methods are provided then the resulting dataframe will have a multi-level column index, with the series-name as the first level, and the aggregation method as the second level. See pandas.Series.resample() for more information on possible aggregation methods.
row_filter_buffer_size – Elements selected for removal by the row_filter also have this number of neighboring elements removed before and after them. Defaults to 0.
asset – Asset with which the tags are associated.
n_samples_threshold – The threshold at which the generated DataFrame is considered to have too few rows of data.
interpolation_method – How missing values should be interpolated: either forward fill (ffill) or linear interpolation (linear_interpolation, the default).
interpolation_limit – How far from the last valid data point values will be interpolated/forward filled. If None, all missing values are interpolated/forward filled. It also serves as the maximum look-back time for finding the latest point before the window's start (if needed).
filter_periods – Applies a series of algorithms that drop noisy data, if specified. See the filter_periods class for details.
kwargs – Deprecated arguments.
Deprecated: asset will be removed in the future.
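
To illustrate the known_filter_periods format described above, the following sketch builds such a string from timezone-aware datetimes using only the standard library. The helper name format_filter_period is hypothetical and not part of gordo_core; it only shows one way to produce the required +00:00 suffix.

```python
from datetime import datetime, timezone


def format_filter_period(start: datetime, end: datetime) -> str:
    """Hypothetical helper (not part of gordo_core): render one period to drop."""
    if start.tzinfo is None or end.tzinfo is None:
        # The documented format requires a time-zone suffix such as +00:00.
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be earlier than end")
    # isoformat(sep=" ") renders a UTC datetime as '2020-04-08 04:00:00+00:00'.
    return f"~('{start.isoformat(sep=' ')}' < index < '{end.isoformat(sep=' ')}')"


# Build the list that would be passed as known_filter_periods.
known_filter_periods = [
    format_filter_period(
        datetime(2020, 4, 8, 4, tzinfo=timezone.utc),
        datetime(2020, 4, 8, 10, tzinfo=timezone.utc),
    )
]
```

Rejecting naive datetimes up front keeps a missing time-zone suffix from going unnoticed until the filter is evaluated.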

data_provider: Descriptor for attributes requiring type gordo_core.data_providers.base.GordoBaseDataProvider

fill_series_nans(series: Series, tag: str | SensorTag, resampling_startpoint: datetime, resampling_endpoint: datetime, resolution: str, interpolation_limit: str) → Series

Try to fill NaNs using a look-back interpolated point.

A point from the past is used to fill NaNs only if it lies within the interpolation limit.

Returns: The unchanged Series, or the Series with NaNs filled where possible.
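
The look-back idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the function fill_leading_gaps is hypothetical, missing values are modeled as None rather than NaN, the limit is a timedelta rather than a string, and the gap is filled by carrying the look-back value forward.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Point = Tuple[datetime, Optional[float]]


def fill_leading_gaps(
    points: List[Point],
    lookback: Optional[Tuple[datetime, float]],
    limit: timedelta,
) -> List[Point]:
    """Hypothetical sketch: fill leading gaps from a point before the window."""
    if lookback is None:
        # No earlier point was found, so the series is returned unchanged.
        return list(points)
    lookback_time, lookback_value = lookback
    filled: List[Point] = []
    leading = True
    for timestamp, value in points:
        if value is not None:
            leading = False  # the first real observation ends the leading gap
        elif leading and timestamp - lookback_time <= limit:
            value = lookback_value  # close enough to the look-back point
        filled.append((timestamp, value))
    return filled


start = datetime(2020, 1, 1)
series = [
    (start, None),
    (start + timedelta(minutes=10), None),
    (start + timedelta(minutes=20), 3.0),
    (start + timedelta(minutes=30), None),
]
result = fill_leading_gaps(
    series,
    lookback=(start - timedelta(minutes=5), 1.5),
    limit=timedelta(minutes=10),
)
```

In this example only the first gap is filled: the second lies 15 minutes after the look-back point, beyond the 10-minute limit, and the last gap is not a leading one.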

get_client_data(build_dataset_metadata: dict) → Tuple[DataFrame, DataFrame | None]

The version of get_data() used by gordo-client.

Parameters: build_dataset_metadata – build_metadata.dataset part of the metadata

get_data() → Tuple[DataFrame, DataFrame | None]: Return X, y data as numpy arrays or pandas DataFrames, given the current state.

get_data_provider() → GordoBaseDataProvider

get_metadata(): Get metadata about the current state of the dataset.

kwargs: Descriptor for attributes requiring type gordo_core.base.GordoBaseDataset

tag_list: Descriptor for attributes requiring a non-empty list of strings

target_tag_list: Descriptor for attributes requiring a non-empty list of strings

to_dict(): Serialize this object into a dict representation, which can be used to initialize a new object using from_dict()
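
The round trip between to_dict() and from_dict() can be pictured with a minimal stand-in class. ExampleDataset below is hypothetical and deliberately much simpler than TimeSeriesDataset; it only demonstrates the serialization pattern, not the actual keys that gordo_core emits.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class ExampleDataset:
    """Hypothetical stand-in, not a gordo_core class."""

    train_start_date: str
    train_end_date: str
    tag_list: List[str]

    def to_dict(self) -> Dict[str, Any]:
        # Serialize every field into a plain dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, config: Dict[str, Any]) -> "ExampleDataset":
        # Rebuild an equivalent object from the dictionary form.
        return cls(**config)


original = ExampleDataset(
    train_start_date="2020-01-01T00:00:00+00:00",
    train_end_date="2020-02-01T00:00:00+00:00",
    tag_list=["tag-1", "tag-2"],
)
restored = ExampleDataset.from_dict(original.to_dict())
```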

train_end_date: Descriptor for attributes requiring valid datetime.datetime attribute

train_start_date: Descriptor for attributes requiring valid datetime.datetime attribute

classmethod with_data_provider(data_provider: dict[str, Any] | GordoBaseDataProvider | None, args: dict[str, Any], *, back_compatibles: dict[tuple[Optional[str], str], tuple[Optional[str], str]] | None = None)

gordo_core.time_series.compat(init)

__init__ decorator for compatibility where the Gordo config file’s dataset keys have drifted from the kwargs that the given dataset actually expects. For example, a config may use a different key for the start date than the train_start_date parameter that TimeSeriesDataset and RandomDataset expect.

Renames old/other acceptable kwargs to the ones that the dataset type expects
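
A decorator of this kind might look roughly like the sketch below. It is not the gordo_core implementation: the decorator name rename_kwargs, the class ExampleDataset, and the legacy keys legacy_start and legacy_end are all hypothetical and chosen only for illustration.

```python
import functools
from typing import Any, Callable, Dict

# Hypothetical mapping from legacy config keys to the expected kwargs.
RENAMES: Dict[str, str] = {
    "legacy_start": "train_start_date",
    "legacy_end": "train_end_date",
}


def rename_kwargs(init: Callable[..., None]) -> Callable[..., None]:
    """Hypothetical __init__ decorator that renames legacy kwargs."""

    @functools.wraps(init)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> None:
        for old, new in RENAMES.items():
            if old in kwargs:
                if new in kwargs:
                    # Refuse ambiguous input instead of silently picking one.
                    raise TypeError(f"got both {old!r} and {new!r}")
                kwargs[new] = kwargs.pop(old)
        return init(self, *args, **kwargs)

    return wrapper


class ExampleDataset:
    """Hypothetical stand-in, not a gordo_core class."""

    @rename_kwargs
    def __init__(self, train_start_date: str, train_end_date: str) -> None:
        self.train_start_date = train_start_date
        self.train_end_date = train_end_date


# A legacy key and a current key can be mixed in the same call.
dataset = ExampleDataset(legacy_start="2020-01-01", train_end_date="2020-02-01")
```

Raising when both the old and the new key are present is a design choice of this sketch; it avoids guessing which value the caller intended.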

Utility functions for working with sensor tag metadata in gordo_core.time_series.TimeSeriesDataset.

gordo_core.metadata.sensor_tags_from_build_metadata(build_dataset_metadata: dict, tag_names: Set[str]) → dict[str, gordo_core.sensor_tag.SensorTag]

Fetch sensor tags information from the metadata. This info should be placed in build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"]

Parameters:

build_dataset_metadata – build_metadata.dataset part of the metadata
tag_names – Contains tag names for which we should fetch information

Returns:

A dictionary keyed by the tag names passed through the tag_names argument.
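
The nested lookup described above can be sketched with plain dictionaries. The function select_tags_metadata is hypothetical, and the shape of each entry under "tags" (a mapping from tag name to a metadata dictionary) is an assumption; the real function returns SensorTag objects rather than raw dictionaries.

```python
from typing import Any, Dict, Set


def select_tags_metadata(
    build_dataset_metadata: Dict[str, Any], tag_names: Set[str]
) -> Dict[str, Any]:
    """Hypothetical sketch: pick per-tag metadata for the requested names."""
    # Path documented for the tag information inside the build metadata.
    tags = build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"]
    missing = tag_names - set(tags)
    if missing:
        raise KeyError(f"no metadata for tags: {sorted(missing)}")
    return {name: tags[name] for name in sorted(tag_names)}


# Assumed metadata layout, for illustration only.
metadata = {
    "dataset_meta": {
        "tag_loading_metadata": {
            "tags": {
                "tag-1": {"name": "tag-1", "note": "hypothetical entry"},
                "tag-2": {"name": "tag-2", "note": "hypothetical entry"},
            }
        }
    }
}
selected = select_tags_metadata(metadata, {"tag-1"})
```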

gordo_core.metadata.tags_to_json_representation(tags: Iterable[str | SensorTag]) → dict