Filters

Row filter

gordo_core.filters.rows.apply_buffer(mask: Series, buffer_size: int = 0)

Take a mask (boolean series) in which True means keep the value and False means remove it. Every index marked False is ‘expanded’ symmetrically: the buffer_size positions on either side of it are also set to False.

Parameters:

mask – Boolean pandas series
buffer_size – Size to buffer around False values

Examples

>>> import pandas as pd
>>> series = pd.Series([True, True, False, True, True])
>>> series = apply_buffer(series, buffer_size=1)
>>> series
0     True
1    False
2    False
3    False
4     True
dtype: bool

Return type: None
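The expansion described above can be illustrated with a minimal sketch. The helper name expand_false is hypothetical and this is not the gordo_core implementation; it assumes a positional (row-count) buffer and returns a new series instead of relying on any in-place behaviour.

```python
import numpy as np
import pandas as pd


def expand_false(mask: pd.Series, buffer_size: int = 0) -> pd.Series:
    """Return a copy of mask where each False also covers its neighbours."""
    values = mask.to_numpy(copy=True)
    n = len(values)
    # Positions that are False in the original mask.
    for idx in np.flatnonzero(~mask.to_numpy()):
        start = max(0, idx - buffer_size)
        stop = min(n, idx + buffer_size + 1)
        values[start:stop] = False
    return pd.Series(values, index=mask.index)


series = pd.Series([True, True, False, True, True])
buffered = expand_false(series, buffer_size=1)
print(buffered.tolist())  # [True, False, False, False, True]
```

Clamping start and stop to the series bounds keeps the buffer from wrapping around at either end.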

gordo_core.filters.rows.escape_python_identifier(name: str) → str

Escapes special symbols, such as . and -, in a Python variable identifier.

Parameters: name – Python variable name.

gordo_core.filters.rows.pandas_filter_rows(df: DataFrame, filter_str: str | list, buffer_size: int = 0) → DataFrame

Filter a pandas data frame based on a string or list of conditions.

Note

pd.DataFrame.eval() of a list returns a numpy.ndarray and is limited to 100 list items. Evaluating one combined logical string with numexpr through pd.DataFrame.eval() returns a pd.Series, but the string can contain at most 32 logical parts (current dependency) or 242 logical parts (latest release). Therefore, list elements are evaluated iteratively in batches of n=15 (to be safe).
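The batching described in the note can be sketched roughly as follows. The helper eval_filters_in_batches and its batch_size argument are illustrative assumptions, not part of gordo_core; the sketch only shows the idea of joining a few conditions at a time and combining the partial masks.

```python
import numpy as np
import pandas as pd


def eval_filters_in_batches(df, conditions, batch_size=15):
    """AND many conditions together, passing at most batch_size to each eval call."""
    mask = pd.Series(True, index=df.index)
    for start in range(0, len(conditions), batch_size):
        batch = conditions[start:start + batch_size]
        # Parenthesise each condition so that "&" binds as intended.
        expression = " & ".join(f"({cond})" for cond in batch)
        mask &= df.eval(expression)
    return mask


df = pd.DataFrame(list(np.ndindex((3, 3))), columns=list("AB"))
# 42 conditions: more than a single batch of 15 can hold.
conditions = ["A != 1", "B < 3"] + ["A >= 0"] * 40
mask = eval_filters_in_batches(df, conditions)
print(df[mask].index.tolist())  # [0, 1, 2, 6, 7, 8]
```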

Parameters:

df – Dataframe to filter rows from. The argument itself is not modified.
filter_str – String representing the filter. Can be a boolean combination of conditions, where each condition compares a column with another column or with a numeric value. Rows matching the filter are kept. Column names containing spaces must be quoted with backticks; names without spaces may be backtick-quoted or left unquoted. Examples of legal filters are `Tag A` > 5, (`Tag B` > 1) | (`Tag C` > 4), (`Tag D` < 5) and (TagB > 5). The parameter can also be a list, in which case the items are joined by logical “ & “.
buffer_size – Number of rows before and after each row removed by filter_str that are also marked for removal.

Returns:

The dataframe containing only rows matching the filter.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(list(np.ndindex((3,3))), columns=list('AB'))
>>> df
   A  B
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "`A`>1")
   A  B
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "`A`> B")
   A  B
3  1  0
6  2  0
7  2  1
>>> pandas_filter_rows(df, "(`A`>1) | (`B`<1)")
   A  B
0  0  0
3  1  0
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "(`A`>1) & (`B`<1)")
   A  B
6  2  0
>>> pandas_filter_rows(df, ["A>1", "B<1"])
   A  B
6  2  0
>>> pandas_filter_rows(df, ["A!=1", "B<3"])
   A  B
0  0  0
1  0  1
2  0  2
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, ["A!=1", "B<3"], buffer_size=1)
   A  B
0  0  0
1  0  1
7  2  1
8  2  2
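The last example combines a list filter with a buffer. A rough equivalent in plain pandas is sketched below; filter_with_buffer is a hypothetical helper, and the centered rolling minimum is just one way to express the buffer, not necessarily how gordo_core does it.

```python
import numpy as np
import pandas as pd


def filter_with_buffer(df, conditions, buffer_size=0):
    """Keep rows matching all conditions, dropping neighbours of removed rows."""
    # Join the list items with a logical AND, as described for filter_str.
    expression = " & ".join(f"({cond})" for cond in conditions)
    keep = df.eval(expression)
    # A row survives only if every row within buffer_size positions is kept;
    # a centered rolling minimum over the 0/1 mask expresses that.
    window = 2 * buffer_size + 1
    keep = (
        keep.astype(int)
        .rolling(window, center=True, min_periods=1)
        .min()
        .astype(bool)
    )
    return df[keep]


df = pd.DataFrame(list(np.ndindex((3, 3))), columns=list("AB"))
result = filter_with_buffer(df, ["A!=1", "B<3"], buffer_size=1)
print(result.index.tolist())  # [0, 1, 7, 8]
```

Rows 3 to 5 fail the filter, so with a buffer of 1 their neighbours, rows 2 and 6, are removed as well.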

gordo_core.filters.rows.parse_pandas_filter_vars(pandas_filter: str | list[str], with_special_vars: bool = False) → list[str]

Parses a pandas.eval() expression and returns a list of all variables used in it. Uses Python's built-in ast parser under the hood.

Parameters:

pandas_filter – Pandas eval expression
with_special_vars – Include special variables, such as index and math functions like sin and log10, in the output

Examples

>>> vars_list = parse_pandas_filter_vars('Col1 > 0 & Col2 < 100')
>>> sorted(vars_list)
['Col1', 'Col2']
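The ast-based approach can be sketched in a few lines. This is an illustration only: collect_filter_vars and SPECIAL_NAMES are hypothetical, the set of special names is an assumption, and backtick-quoted column names are not handled because they are not valid Python syntax.

```python
import ast

# Assumed set of "special" names; the list used by the library may differ.
SPECIAL_NAMES = {"index", "sin", "cos", "log", "log10", "exp", "sqrt", "abs"}


def collect_filter_vars(expression, with_special_vars=False):
    """Return the sorted names referenced in an eval-style expression."""
    tree = ast.parse(expression, mode="eval")
    # Every bare identifier in the expression appears as an ast.Name node.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    if not with_special_vars:
        names -= SPECIAL_NAMES
    return sorted(names)


print(collect_filter_vars("Col1 > 0 & Col2 < 100"))  # ['Col1', 'Col2']
print(collect_filter_vars("sin(Col1) > 0"))  # ['Col1']
print(collect_filter_vars("sin(Col1) > 0", with_special_vars=True))  # ['Col1', 'sin']
```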

gordo_core.filters.rows.unescape_python_identifier(name: str) → str

Does the opposite of escape_python_identifier(): takes an escaped string and converts it back to the original Python variable identifier.

Parameters: name – Escaped Python variable name.

Filter periods

class gordo_core.filters.periods.FilterPeriods(granularity: str = '10T', filter_method: str = 'median', window: int = 144, n_iqr: int = 5, iforest_smooth: bool = False, contamination: float = 0.03, quantile_lower: float = 0.05, quantile_upper: float = 0.95)

Bases: object

Model class with methods for data pre-processing.

Performs a series of algorithms that drops noisy data.

A rolling median algorithm, an isolation forest algorithm, or both are executed. Each stores its drop periods in a dict-type attribute on the class instance, under object.drop_periods["median"] and object.drop_periods["iforest"] respectively, and the data is filtered accordingly.

Parameters:

granularity – The bucket size for grouping all incoming time data (e.g. “10T”). Available strings are the pandas time series offset aliases.
filter_method – Which method should be used for data cleaning: “median” (default), “iforest”, or “all”, which returns results for both methods.
iforest_smooth – Whether exponentially weighted smoothing is applied to the data before the isolation forest algorithm is run.

filter_data(data: DataFrame)

Method for filtering data. Returns the filtered dataset, a dict containing the dropped periods arranged by filtering method, and the actual predictions from the filter model.

data: Data frame containing already filtered data (global max/min + dropped known periods). The data does not need to be consecutive in time.
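To give a feel for what a rolling median filter with an n_iqr threshold might look like, here is a minimal sketch. It is an assumption about the general technique, not the FilterPeriods implementation: the helper rolling_median_outliers is hypothetical, and the exact bounds, windowing and period construction used by the class may differ.

```python
import numpy as np
import pandas as pd


def rolling_median_outliers(series, window=144, n_iqr=5):
    """Flag points further than n_iqr rolling IQRs from the rolling median."""
    roll = series.rolling(window, min_periods=1)
    median = roll.median()
    iqr = roll.quantile(0.75) - roll.quantile(0.25)
    lower = median - n_iqr * iqr
    upper = median + n_iqr * iqr
    return (series < lower) | (series > upper)


# Synthetic 10-minute data: a repeating 0, 1, 2, 1 pattern with one spike.
index = pd.date_range("2024-01-01", periods=60, freq="10min")
values = np.tile([0.0, 1.0, 2.0, 1.0], 15)
values[30] = 100.0
series = pd.Series(values, index=index)

flagged = rolling_median_outliers(series, window=20, n_iqr=5)
print(series[flagged])  # only the spike at position 30 is flagged
```

Consecutive flagged timestamps could then be merged into start/end pairs to form drop periods.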

exception gordo_core.filters.periods.WrongFilterMethodType

Bases: TypeError