Time Series Dataset

Default implementation of gordo_core.base.GordoBaseDataset. gordo_core.time_series.TimeSeriesDataset supports multiple customizable preprocessing and filtering algorithms.

exception gordo_core.time_series.NotEnoughDataWarning

Bases: RuntimeWarning

class gordo_core.time_series.RandomDataset(train_start_date: datetime | str, train_end_date: datetime | str, tag_list: list, **kwargs)

Bases: TimeSeriesDataset

Get a TimeSeriesDataset backed by gordo_core.data_provider.providers.RandomDataProvider.

Creates a TimeSeriesDataset backed by a provided data provider.

A TimeSeriesDataset is a dataset backed by time series that are resampled, aligned, and (optionally) filtered.

Parameters:
  • train_start_date – Earliest possible point in the dataset (inclusive)

  • train_end_date – Latest possible point in the dataset (exclusive)

  • tag_list – List of tags to include in the dataset. The elements can be strings, dictionaries, tuples, or gordo_core.SensorTag instances.

  • target_tag_list – List of tags to set as the dataset y. These are treated the same as tag_list when fetching and pre-processing (resampling), but are split out into the y value returned by get_data().

  • data_provider – A data provider that can supply dataframes for the tags from train_start_date to train_end_date

  • resolution

    The bucket size for grouping all incoming time data (e.g. “10T”). Available strings come from the pandas time series offset aliases.

    Note

    If this parameter is None or False, then _no_ aggregation/resampling is applied to the data.

  • row_filter – Filter on the rows. Only rows satisfying the filter will be in the dataset. See gordo_core.filter_rows.pandas_filter_rows() for further documentation of the filter format.

  • known_filter_periods – List of periods to drop in the format [~('2020-04-08 04:00:00+00:00' < index < '2020-04-08 10:00:00+00:00')]. Note the time-zone suffix (+00:00), which is required.

  • aggregation_methods – Aggregation method(s) to use for the resampled buckets. If a single method is provided, the resulting dataframe has column names identical to the names of the input series. If several aggregation methods are provided, the resulting dataframe has a multi-level column index, with the series name as the first level and the aggregation method as the second level. See pandas.Series.resample() for more information on possible aggregation methods.

  • row_filter_buffer_size – Every element selected for removal by the row_filter also has this number of elements removed before and after it. Default is 0.

  • asset – Asset with which the tags are associated.

  • n_samples_threshold – The threshold at which the generated DataFrame is considered to have too few rows of data.

  • interpolation_method – How missing values are interpolated. Either forward fill (ffill) or linear interpolation (linear_interpolation, the default).

  • interpolation_limit – How far from the last valid data point values are interpolated or forward filled. If None, all missing values are interpolated or forward filled. It is also used as the maximum look-back time when searching for the latest point before the window’s start (if needed).

  • filter_periods – If specified, runs a series of algorithms that drop noisy data. See the FilterPeriods class for details.

  • kwargs – Deprecated arguments.

  • Deprecated – asset will be removed in the future.
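The known_filter_periods format above can be read as "drop every row whose index lies strictly inside the period". The following is a minimal sketch of that reading in plain Python, not the library's implementation; the helper name drop_period and the sample timestamps are assumptions made for illustration. It also shows that standard Python refuses to order a naive timestamp against a timezone-aware one, which is one plausible reason to keep the +00:00 suffix.

```python
from datetime import datetime, timedelta, timezone

# Period boundaries taken from the example format above.
PERIOD_START = datetime.fromisoformat("2020-04-08 04:00:00+00:00")
PERIOD_END = datetime.fromisoformat("2020-04-08 10:00:00+00:00")


def drop_period(index, start, end):
    """Hypothetical helper: keep timestamps NOT strictly inside (start, end)."""
    return [ts for ts in index if not (start < ts < end)]


# Hourly, timezone-aware timestamps from 02:00 to 12:00 UTC.
first = datetime(2020, 4, 8, 2, tzinfo=timezone.utc)
index = [first + timedelta(hours=h) for h in range(11)]

kept = drop_period(index, PERIOD_START, PERIOD_END)
kept_hours = [ts.hour for ts in kept]

# Ordering a naive timestamp against an aware one raises TypeError.
naive = datetime(2020, 4, 8, 5)
try:
    naive < PERIOD_START
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because both comparisons are strict, the boundary timestamps themselves (04:00 and 10:00) are kept in this sketch.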

class gordo_core.time_series.TimeSeriesDataset(train_start_date: datetime | str, train_end_date: datetime | str, tag_list: Sequence[dict[str, Optional[str]] | str | SensorTag], target_tag_list: Sequence[dict[str, Optional[str]] | str | SensorTag] | None = None, additional_tags: Sequence[dict[str, Optional[str]] | str | SensorTag] | None = None, default_tag: dict[str, Optional[str]] | None = None, data_provider: GordoBaseDataProvider | None = None, resolution: str | None = '10T', row_filter: str | list = '', known_filter_periods: list | None = None, aggregation_methods: str | list[str] | Callable = 'mean', row_filter_buffer_size: int = 0, asset: str | None = None, n_samples_threshold: int = 0, interpolation_method: str = 'linear_interpolation', interpolation_limit: str = '48H', filter_periods: dict | FilterPeriods | None = None, **kwargs)

Bases: DatasetWithProvider

Creates a TimeSeriesDataset backed by a provided data provider.

A TimeSeriesDataset is a dataset backed by time series that are resampled, aligned, and (optionally) filtered.

Parameters:
  • train_start_date – Earliest possible point in the dataset (inclusive)

  • train_end_date – Latest possible point in the dataset (exclusive)

  • tag_list – List of tags to include in the dataset. The elements can be strings, dictionaries, tuples, or gordo_core.SensorTag instances.

  • target_tag_list – List of tags to set as the dataset y. These are treated the same as tag_list when fetching and pre-processing (resampling), but are split out into the y value returned by get_data().

  • data_provider – A data provider that can supply dataframes for the tags from train_start_date to train_end_date

  • resolution

    The bucket size for grouping all incoming time data (e.g. “10T”). Available strings come from the pandas time series offset aliases.

    Note

    If this parameter is None or False, then _no_ aggregation/resampling is applied to the data.

  • row_filter – Filter on the rows. Only rows satisfying the filter will be in the dataset. See gordo_core.filter_rows.pandas_filter_rows() for further documentation of the filter format.

  • known_filter_periods – List of periods to drop in the format [~('2020-04-08 04:00:00+00:00' < index < '2020-04-08 10:00:00+00:00')]. Note the time-zone suffix (+00:00), which is required.

  • aggregation_methods – Aggregation method(s) to use for the resampled buckets. If a single method is provided, the resulting dataframe has column names identical to the names of the input series. If several aggregation methods are provided, the resulting dataframe has a multi-level column index, with the series name as the first level and the aggregation method as the second level. See pandas.Series.resample() for more information on possible aggregation methods.

  • row_filter_buffer_size – Every element selected for removal by the row_filter also has this number of elements removed before and after it. Default is 0.

  • asset – Asset with which the tags are associated.

  • n_samples_threshold – The threshold at which the generated DataFrame is considered to have too few rows of data.

  • interpolation_method – How missing values are interpolated. Either forward fill (ffill) or linear interpolation (linear_interpolation, the default).

  • interpolation_limit – How far from the last valid data point values are interpolated or forward filled. If None, all missing values are interpolated or forward filled. It is also used as the maximum look-back time when searching for the latest point before the window’s start (if needed).

  • filter_periods – If specified, runs a series of algorithms that drop noisy data. See the FilterPeriods class for details.

  • kwargs – Deprecated arguments.

  • Deprecated – asset will be removed in the future.
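The effect of row_filter_buffer_size described above can be pictured on a boolean keep-mask: every row the filter rejects also takes its neighbours with it. This is a minimal sketch under that assumption, written in plain Python; apply_buffer is a hypothetical helper, not a function from gordo_core.

```python
def apply_buffer(keep_mask, buffer_size):
    """Hypothetical helper: widen every rejected row by `buffer_size` rows
    on each side. `keep_mask[i]` is True when row i passed the row filter."""
    n = len(keep_mask)
    result = list(keep_mask)
    for i, keep in enumerate(keep_mask):
        if not keep:
            lo = max(0, i - buffer_size)          # clamp at the first row
            hi = min(n, i + buffer_size + 1)      # clamp at the last row
            for j in range(lo, hi):
                result[j] = False
    return result


mask = [True, True, True, False, True, True, True, True]
unbuffered = apply_buffer(mask, 0)  # default: only the rejected row is dropped
buffered = apply_buffer(mask, 1)    # one extra row before and after
```

With a buffer of 1, the single rejected row at position 3 removes positions 2 to 4; with the default of 0 the mask is unchanged.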

data_provider

Descriptor for attributes requiring type gordo_core.data_providers.base.GordoBaseDataProvider

fill_series_nans(series: Series, tag: str | SensorTag, resampling_startpoint: datetime, resampling_endpoint: datetime, resolution: str, interpolation_limit: str) → Series

Try to fill NaNs using a point found by looking back before the resampling window.

The past point is used to fill NaNs only if it lies within the interpolation limit.

Return type:

The unchanged Series, or the Series after an attempt to fill its NaNs.
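The look-back rule described above can be sketched in plain Python: a past value is used for leading gaps only when it is no older than the interpolation limit. This is one possible reading of the description, not the actual fill_series_nans implementation; the function name fill_from_look_back, the use of None for missing values, and the sample data are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def fill_from_look_back(values, window_start, look_back_point, limit):
    """Hypothetical helper: fill leading gaps (None) with a past value
    if that value is recent enough.

    `look_back_point` is a (timestamp, value) pair, or None if no past
    point was found.
    """
    if look_back_point is None:
        return list(values)
    timestamp, past_value = look_back_point
    if window_start - timestamp > limit:
        # The past point is older than the limit: leave the series unchanged.
        return list(values)
    filled = list(values)
    for i, value in enumerate(filled):
        if value is not None:
            break  # only the leading gap is filled in this sketch
        filled[i] = past_value
    return filled


start = datetime(2020, 1, 3, tzinfo=timezone.utc)
series = [None, None, 5.0, None]
limit = timedelta(hours=48)

recent = (start - timedelta(hours=12), 4.0)  # within the 48-hour limit
stale = (start - timedelta(hours=72), 4.0)   # beyond the limit

filled_recent = fill_from_look_back(series, start, recent, limit)
filled_stale = fill_from_look_back(series, start, stale, limit)
```

In this sketch the trailing gap is left alone, since it is not reachable from the look-back point without passing a valid value.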

get_client_data(build_dataset_metadata: dict) → Tuple[DataFrame, DataFrame | None]

The version of get_data() used by gordo-client.

Parameters:

build_dataset_metadata – build_metadata.dataset part of the metadata

get_data() → Tuple[DataFrame, DataFrame | None]

Return the X and y data as numpy arrays or pandas dataframes, given the current state.

get_data_provider() → GordoBaseDataProvider

get_metadata()

Get metadata about the current state of the dataset.

kwargs

Descriptor for attributes requiring type gordo_core.base.GordoBaseDataset

tag_list

Descriptor for attributes requiring a non-empty list of strings
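A validating descriptor of this kind can be sketched with Python's descriptor protocol. The block below follows only the one-line description above (a non-empty list of strings); the class names NonEmptyStringList and ExampleDataset are hypothetical, and the real descriptor may accept other element types such as dictionaries or SensorTag.

```python
class NonEmptyStringList:
    """Hypothetical data descriptor that accepts only a non-empty list of strings."""

    def __set_name__(self, owner, name):
        # Store the value on the instance under a private attribute name.
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not an instance
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        if not isinstance(value, list) or not value:
            raise ValueError("expected a non-empty list")
        if not all(isinstance(item, str) for item in value):
            raise ValueError("expected every element to be a string")
        setattr(instance, self.private_name, value)


class ExampleDataset:
    """Hypothetical class using the descriptor for validation on assignment."""

    tag_list = NonEmptyStringList()

    def __init__(self, tag_list):
        self.tag_list = tag_list  # goes through NonEmptyStringList.__set__


dataset = ExampleDataset(["tag-1", "tag-2"])
```

Validation runs on every assignment, so an invalid value is rejected both in the constructor and when the attribute is reassigned later.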

target_tag_list

Descriptor for attributes requiring a non-empty list of strings

to_dict()

Serialize this object into a dict representation, which can be used to initialize a new object using from_dict().

train_end_date

Descriptor for attributes requiring a valid datetime.datetime value

train_start_date

Descriptor for attributes requiring a valid datetime.datetime value

classmethod with_data_provider(data_provider: dict[str, Any] | GordoBaseDataProvider | None, args: dict[str, Any], *, back_compatibles: dict[tuple[Optional[str], str], tuple[Optional[str], str]] | None = None)

gordo_core.time_series.compat(init)

__init__ decorator for compatibility where the Gordo config file’s dataset keys have drifted from the kwargs that the given dataset actually expects. For example, a config may still use an older or alternative key for the parameter that TimeSeriesDataset and RandomDataset accept as train_start_date.

Renames old or otherwise acceptable kwargs to the ones that the dataset type expects.
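The renaming behaviour can be sketched as an ordinary __init__ decorator. This is a minimal illustration, not the code of compat itself: the legacy keys in RENAMINGS (legacy_start, legacy_end), the decorator name compat_sketch, and the ExampleDataset class are all hypothetical, and the choice to let an explicitly passed current name win is an assumption.

```python
import functools

# Hypothetical mapping of legacy config keys to current parameter names.
RENAMINGS = {
    "legacy_start": "train_start_date",
    "legacy_end": "train_end_date",
}


def compat_sketch(init):
    """Rename legacy kwargs before calling the wrapped __init__."""

    @functools.wraps(init)
    def wrapper(*args, **kwargs):
        for old, new in RENAMINGS.items():
            if old in kwargs:
                value = kwargs.pop(old)
                # Assumption: an explicitly passed current name takes priority.
                kwargs.setdefault(new, value)
        return init(*args, **kwargs)

    return wrapper


class ExampleDataset:
    """Hypothetical dataset whose __init__ only knows the current names."""

    @compat_sketch
    def __init__(self, train_start_date, train_end_date):
        self.train_start_date = train_start_date
        self.train_end_date = train_end_date


# A config that mixes a legacy key with a current one still works.
dataset = ExampleDataset(legacy_start="2020-01-01", train_end_date="2020-02-01")
```

Keeping the mapping in one place means the dataset constructors themselves never need to know about the old key names.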

Utility functions for working with sensor tag metadata in gordo_core.time_series.TimeSeriesDataset.

gordo_core.metadata.sensor_tags_from_build_metadata(build_dataset_metadata: dict, tag_names: Set[str]) → dict[str, gordo_core.sensor_tag.SensorTag]

Fetch sensor tags information from the metadata. This info should be placed in build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"]

Parameters:
  • build_dataset_metadata – build_metadata.dataset part of the metadata

  • tag_names – Contains tag names for which we should fetch information

Return type:

A dict keyed by the tag names passed through the tag_names argument.
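The nested lookup described above can be sketched with plain dictionaries. The path into the metadata comes from the description; everything else is an assumption made for illustration: that "tags" is a mapping keyed by tag name, that its values are plain dicts (the real function returns SensorTag objects), the sample field names, and the choice to raise KeyError for missing tags.

```python
def tags_from_metadata_sketch(build_dataset_metadata, tag_names):
    """Hypothetical stand-in: return the metadata entries for `tag_names`."""
    tags = build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"]
    missing = set(tag_names) - set(tags)
    if missing:
        # Assumption: unknown tags are treated as an error.
        raise KeyError(f"no metadata for tags: {sorted(missing)}")
    return {name: tags[name] for name in tag_names}


# Sample metadata with made-up tag names and fields.
metadata = {
    "dataset_meta": {
        "tag_loading_metadata": {
            "tags": {
                "tag-1": {"name": "tag-1", "field": "value-1"},
                "tag-2": {"name": "tag-2", "field": "value-2"},
            }
        }
    }
}

found = tags_from_metadata_sketch(metadata, {"tag-1"})
```

The result is keyed by the requested tag names only, so entries for tags that were not asked for are left out.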

gordo_core.metadata.tags_to_json_representation(tags: Iterable[str | SensorTag]) → dict