Time Series Dataset#
Default implementation of gordo_core.base.GordoBaseDataset.
gordo_core.time_series.TimeSeriesDataset supports multiple customizable preprocessing and filtering algorithms;
check here for details.
- exception gordo_core.time_series.NotEnoughDataWarning[source]#
Bases: RuntimeWarning
- class gordo_core.time_series.RandomDataset(train_start_date: datetime | str, train_end_date: datetime | str, tag_list: list, **kwargs)[source]#
Bases: TimeSeriesDataset
Get a TimeSeriesDataset backed by gordo_core.data_provider.providers.RandomDataProvider.
Creates a TimeSeriesDataset backed by a provided data provider.
A TimeSeriesDataset is a dataset backed by timeseries, but resampled, aligned, and (optionally) filtered.
- Parameters:
train_start_date – Earliest possible point in the dataset (inclusive)
train_end_date – Latest possible point in the dataset (exclusive)
tag_list – List of tags to include in the dataset. The elements can be strings, dictionaries, gordo_core.SensorTag instances, or tuples.
target_tag_list – List of tags to set as the dataset y. These are treated the same as tag_list when fetching and pre-processing (resampling), but are split into the y returned from get_data.
data_provider – A data provider which can provide dataframes for tags from train_start_date to train_end_date.
resolution – The bucket size for grouping all incoming time data (e.g. “10T”). Available strings are pandas offset aliases, described in the pandas timeseries documentation.
Note: If this parameter is None or False, then no aggregation/resampling is applied to the data.
row_filter – Filter on the rows. Only rows satisfying the filter will be in the dataset. See gordo_core.filter_rows.pandas_filter_rows() for further documentation of the filter format.
known_filter_periods – List of periods to drop, in the format [~('2020-04-08 04:00:00+00:00' < index < '2020-04-08 10:00:00+00:00')]. Note the time-zone suffix (+00:00), which is required.
aggregation_methods – Aggregation method(s) to use for the resampled buckets. If a single aggregation method is provided, the resulting dataframe has column names identical to the names of the input series. If several aggregation methods are provided, the resulting dataframe has a multi-level column index, with the series name as the first level and the aggregation method as the second level. See pandas.Series.resample() for more information on possible aggregation methods.
row_filter_buffer_size – Elements selected for removal by row_filter also have this number of elements removed fore and aft. Default is 0.
asset – Asset with which the tags are associated.
n_samples_threshold – The threshold at which the generated DataFrame is considered to have too few rows of data.
interpolation_method – How missing values should be interpolated. Either forward fill (ffill) or linear interpolation (default, linear_interpolation).
interpolation_limit – Sets how far from the last valid data point values will be interpolated/forward filled. If None, all missing values are interpolated/forward filled. It is also used as the maximum look-back time when searching for the latest point before the window’s start (if needed).
filter_periods – Runs a series of algorithms that drop noisy data, if specified. See the filter_periods class for details.
kwargs – Deprecated arguments.
Deprecated: asset will be removed in the future.
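The effect of row_filter_buffer_size can be pictured with a small pure-Python sketch. This is not gordo_core's implementation: it assumes the row filter has already produced a boolean keep-mask, and the helper name apply_buffer is hypothetical.

```python
def apply_buffer(keep: list[bool], buffer_size: int) -> list[bool]:
    """Also drop buffer_size elements before and after every dropped element."""
    result = list(keep)
    n = len(keep)
    for i, kept in enumerate(keep):
        if not kept:
            # Widen the removal window around position i, clipped to the list bounds.
            lo = max(0, i - buffer_size)
            hi = min(n, i + buffer_size + 1)
            for j in range(lo, hi):
                result[j] = False
    return result


# Hypothetical keep-mask: only the element at index 3 fails the row filter.
mask = [True, True, True, False, True, True, True]
buffered = apply_buffer(mask, 1)
# buffered == [True, True, False, False, False, True, True]
```

With a buffer size of 0 the mask is returned unchanged, which matches the documented default.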
- class gordo_core.time_series.TimeSeriesDataset(train_start_date: datetime | str, train_end_date: datetime | str, tag_list: Sequence[dict[str, Optional[str]] | str | SensorTag], target_tag_list: Sequence[dict[str, Optional[str]] | str | SensorTag] | None = None, additional_tags: Sequence[dict[str, Optional[str]] | str | SensorTag] | None = None, default_tag: dict[str, Optional[str]] | None = None, data_provider: GordoBaseDataProvider | None = None, resolution: str | None = '10T', row_filter: str | list = '', known_filter_periods: list | None = None, aggregation_methods: str | list[str] | Callable = 'mean', row_filter_buffer_size: int = 0, asset: str | None = None, n_samples_threshold: int = 0, interpolation_method: str = 'linear_interpolation', interpolation_limit: str = '48H', filter_periods: dict | FilterPeriods | None = None, **kwargs)[source]#
Bases: DatasetWithProvider
Creates a TimeSeriesDataset backed by a provided data provider.
A TimeSeriesDataset is a dataset backed by timeseries, but resampled, aligned, and (optionally) filtered.
- Parameters:
train_start_date – Earliest possible point in the dataset (inclusive)
train_end_date – Latest possible point in the dataset (exclusive)
tag_list – List of tags to include in the dataset. The elements can be strings, dictionaries, gordo_core.SensorTag instances, or tuples.
target_tag_list – List of tags to set as the dataset y. These are treated the same as tag_list when fetching and pre-processing (resampling), but are split into the y returned from get_data.
data_provider – A data provider which can provide dataframes for tags from train_start_date to train_end_date.
resolution – The bucket size for grouping all incoming time data (e.g. “10T”). Available strings are pandas offset aliases, described in the pandas timeseries documentation.
Note: If this parameter is None or False, then no aggregation/resampling is applied to the data.
row_filter – Filter on the rows. Only rows satisfying the filter will be in the dataset. See gordo_core.filter_rows.pandas_filter_rows() for further documentation of the filter format.
known_filter_periods – List of periods to drop, in the format [~('2020-04-08 04:00:00+00:00' < index < '2020-04-08 10:00:00+00:00')]. Note the time-zone suffix (+00:00), which is required.
aggregation_methods – Aggregation method(s) to use for the resampled buckets. If a single aggregation method is provided, the resulting dataframe has column names identical to the names of the input series. If several aggregation methods are provided, the resulting dataframe has a multi-level column index, with the series name as the first level and the aggregation method as the second level. See pandas.Series.resample() for more information on possible aggregation methods.
row_filter_buffer_size – Elements selected for removal by row_filter also have this number of elements removed fore and aft. Default is 0.
asset – Asset with which the tags are associated.
n_samples_threshold – The threshold at which the generated DataFrame is considered to have too few rows of data.
interpolation_method – How missing values should be interpolated. Either forward fill (ffill) or linear interpolation (default, linear_interpolation).
interpolation_limit – Sets how far from the last valid data point values will be interpolated/forward filled. If None, all missing values are interpolated/forward filled. It is also used as the maximum look-back time when searching for the latest point before the window’s start (if needed).
filter_periods – Runs a series of algorithms that drop noisy data, if specified. See the filter_periods class for details.
kwargs – Deprecated arguments.
Deprecated: asset will be removed in the future.
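The resolution and aggregation_methods parameters describe bucketed resampling. The sketch below shows the idea in plain Python for a 10-minute bucket and a mean aggregation. It is an illustration only: the library relies on pandas resampling, and the function name bucket_mean and the sample values are made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_mean(samples, bucket: timedelta):
    """Group (timestamp, value) pairs into fixed-width buckets and average each bucket."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    groups = defaultdict(list)
    for ts, value in samples:
        # Floor the timestamp to the start of the bucket that contains it.
        index = (ts - epoch) // bucket
        groups[epoch + index * bucket].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(groups.items())}


t0 = datetime(2020, 4, 8, 4, 0, tzinfo=timezone.utc)
samples = [
    (t0 + timedelta(minutes=3), 1.0),
    (t0 + timedelta(minutes=7), 3.0),
    (t0 + timedelta(minutes=12), 10.0),
]
resampled = bucket_mean(samples, timedelta(minutes=10))
# Two buckets: 04:00 -> 2.0 (mean of 1.0 and 3.0), 04:10 -> 10.0
```

Passing several aggregation methods would, by analogy, produce one value per method for each bucket, which is why the documented result then carries a two-level column index.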
- data_provider#
Descriptor for attributes requiring type
gordo_core.data_providers.base.GordoBaseDataProvider
- fill_series_nans(series: Series, tag: str | SensorTag, resampling_startpoint: datetime, resampling_endpoint: datetime, resolution: str, interpolation_limit: str) Series[source]#
Try to fill NaNs using a point found by looking back before the resampling window.
The past point is used for filling only if it lies no further back than the interpolation limit.
- Return type:
The unchanged Series, or a Series in which NaN filling was attempted.
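A rough sketch of the look-back idea, in plain Python rather than pandas. The exact rule used by fill_series_nans is not spelled out above, so this assumes that only leading gaps are filled and that a gap is filled when it lies within the limit of the look-back point. The function name fill_leading_gaps is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fill_leading_gaps(points, lookback, limit):
    """Fill leading None values from a look-back point, but only within limit of it."""
    lb_ts, lb_value = lookback
    filled = []
    leading = True
    for ts, value in points:
        if value is not None:
            # The first real value ends the leading gap; later gaps are left alone.
            leading = False
        elif leading and ts - lb_ts <= limit:
            value = lb_value
        filled.append((ts, value))
    return filled


start = datetime(2020, 4, 8, 4, 0, tzinfo=timezone.utc)
points = [
    (start, None),
    (start + timedelta(hours=1), None),
    (start + timedelta(hours=2), 5.0),
    (start + timedelta(hours=3), None),
]
# Latest known point before the window start (assumed to come from the data provider).
lookback = (start - timedelta(minutes=30), 4.0)
filled = fill_leading_gaps(points, lookback, timedelta(hours=1))
```

Here only the first point is filled: it is 30 minutes after the look-back point, while the second is 90 minutes after it and exceeds the one-hour limit.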
- get_client_data(build_dataset_metadata: dict) Tuple[DataFrame, DataFrame | None][source]#
The version of get_data() used by gordo-client
- Parameters:
build_dataset_metadata – build_metadata.dataset part of the metadata
- get_data() Tuple[DataFrame, DataFrame | None][source]#
Return X, y data as numpy arrays or pandas dataframes, given the current state
- get_data_provider() GordoBaseDataProvider[source]#
- kwargs#
Descriptor for attributes requiring type
gordo_core.base.GordoBaseDataset
- tag_list#
Descriptor for attributes requiring a non-empty list of strings
- target_tag_list#
Descriptor for attributes requiring a non-empty list of strings
- to_dict()[source]#
Serialize this object into a dict representation, which can be used to initialize a new object using
from_dict()
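The round trip between to_dict() and from_dict() can be sketched with a stand-in class. TinyDataset below is hypothetical and far simpler than TimeSeriesDataset; it only shows the pattern of serializing constructor arguments to a plain dict and rebuilding an equivalent object from it.

```python
from datetime import datetime, timezone


class TinyDataset:
    """Hypothetical stand-in for a dataset class, not part of gordo_core."""

    def __init__(self, train_start_date: datetime, train_end_date: datetime, tag_list: list):
        self.train_start_date = train_start_date
        self.train_end_date = train_end_date
        self.tag_list = list(tag_list)

    def to_dict(self) -> dict:
        # Use ISO 8601 strings so the dict holds only plain, serializable values.
        return {
            "train_start_date": self.train_start_date.isoformat(),
            "train_end_date": self.train_end_date.isoformat(),
            "tag_list": list(self.tag_list),
        }

    @classmethod
    def from_dict(cls, config: dict) -> "TinyDataset":
        return cls(
            train_start_date=datetime.fromisoformat(config["train_start_date"]),
            train_end_date=datetime.fromisoformat(config["train_end_date"]),
            tag_list=config["tag_list"],
        )


original = TinyDataset(
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 2, 1, tzinfo=timezone.utc),
    ["tag-1", "tag-2"],
)
restored = TinyDataset.from_dict(original.to_dict())
```

The restored object serializes to the same dict as the original, which is the property that makes the representation usable for re-initialization.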
- train_end_date#
Descriptor for attributes requiring a valid datetime.datetime attribute
- train_start_date#
Descriptor for attributes requiring a valid datetime.datetime attribute
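Several attributes above are documented as validating descriptors. A minimal sketch of that pattern follows, assuming only an isinstance check; the real descriptors may validate more, and the names DatetimeOnly and Window are hypothetical.

```python
from datetime import datetime, timezone


class DatetimeOnly:
    """Data descriptor that only accepts datetime.datetime values."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if not isinstance(value, datetime):
            raise TypeError(f"{self.name} must be a datetime.datetime, got {type(value).__name__}")
        instance.__dict__[self.name] = value


class Window:
    train_start_date = DatetimeOnly()

    def __init__(self, train_start_date):
        # Assignment goes through DatetimeOnly.__set__, so validation runs here.
        self.train_start_date = train_start_date


window = Window(datetime(2020, 1, 1, tzinfo=timezone.utc))
```

Constructing Window with a string instead of a datetime raises TypeError at assignment time, so invalid values never reach the instance.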
- gordo_core.time_series.compat(init)[source]#
__init__ decorator for compatibility where the Gordo config file’s dataset keys have drifted from what kwargs are actually expected in the given dataset. For example, using train_start_date is common in the configs, but TimeSeriesDataset takes this parameter as train_start_date, as well as RandomDataset.
Renames old/other acceptable kwargs to the ones that the dataset type expects.
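The general shape of such an __init__ decorator can be sketched as follows. This is not the library's code: the mapping RENAMES and the legacy key names in it are invented for illustration, as are rename_kwargs and Example.

```python
import functools

# Hypothetical mapping from legacy config keys to the kwargs the class expects.
RENAMES = {"legacy_start": "train_start_date", "legacy_end": "train_end_date"}


def rename_kwargs(init):
    """Wrap __init__ so legacy keyword names are translated before the call."""

    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        for old, new in RENAMES.items():
            # Only rename when the current name was not passed explicitly.
            if old in kwargs and new not in kwargs:
                kwargs[new] = kwargs.pop(old)
        return init(self, *args, **kwargs)

    return wrapper


class Example:
    @rename_kwargs
    def __init__(self, train_start_date, train_end_date):
        self.train_start_date = train_start_date
        self.train_end_date = train_end_date


example = Example(legacy_start="2020-01-01", train_end_date="2020-02-01")
```

Keeping the translation in a decorator leaves the constructor signature clean while older configuration files continue to work.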
Utility functions for working with sensor tag metadata in gordo_core.time_series.TimeSeriesDataset.
- gordo_core.metadata.sensor_tags_from_build_metadata(build_dataset_metadata: dict, tag_names: Set[str]) dict[str, gordo_core.sensor_tag.SensorTag][source]#
Fetch sensor tag information from the metadata. This information should be placed in build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"].
- Parameters:
build_dataset_metadata – build_metadata.dataset part of the metadata
tag_names – Tag names for which information should be fetched
- Return type:
A dict keyed by the tag names passed through the tag_names argument
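The lookup path described above can be illustrated with plain dictionaries. This sketch assumes that the "tags" entry maps tag names to per-tag dictionaries and that unknown names are skipped; neither detail is confirmed here, and the real function returns SensorTag objects rather than raw dicts. The helper name select_tags and the sample field "field" are hypothetical.

```python
def select_tags(build_dataset_metadata: dict, tag_names: set) -> dict:
    """Return the metadata entries for the requested tag names, keyed by name."""
    tags = build_dataset_metadata["dataset_meta"]["tag_loading_metadata"]["tags"]
    # Assumption: names missing from the metadata are skipped instead of raising.
    return {name: tags[name] for name in tag_names if name in tags}


# Hypothetical metadata with the nesting described in the documentation.
metadata = {
    "dataset_meta": {
        "tag_loading_metadata": {
            "tags": {
                "tag-1": {"name": "tag-1", "field": "asset-a"},
                "tag-2": {"name": "tag-2", "field": "asset-b"},
            }
        }
    }
}
selected = select_tags(metadata, {"tag-1", "tag-3"})
```

Only tag-1 is returned in this example, because tag-3 does not appear in the metadata.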