Filters
Row filter
- gordo_core.filters.rows.apply_buffer(mask: Series, buffer_size: int = 0)
Take a mask (boolean series) where True indicates keeping a value and False indicates removing it. This 'expands' the indexes marked as False symmetrically, by buffer_size positions on each side.
- Parameters:
mask – Boolean pandas series
buffer_size – Number of positions to buffer around False values
Examples
>>> import pandas as pd
>>> series = pd.Series([True, True, False, True, True])
>>> series = apply_buffer(series, buffer_size=1)
>>> series
0     True
1    False
2    False
3    False
4     True
dtype: bool
- Return type:
None
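The expansion described for apply_buffer can be sketched as follows. This is a minimal illustration rather than the gordo_core implementation: buffer_mask is a hypothetical helper, and it returns a new series instead of relying on any particular in-place or return behaviour.

```python
import numpy as np
import pandas as pd


def buffer_mask(mask: pd.Series, buffer_size: int = 0) -> pd.Series:
    """Hypothetical helper: copy ``mask`` and widen every False by buffer_size."""
    values = mask.to_numpy(dtype=bool).copy()
    # Positions already marked for removal, taken before any widening so
    # that newly buffered positions do not trigger further expansion.
    false_positions = np.flatnonzero(~values)
    for pos in false_positions:
        start = max(0, pos - buffer_size)
        stop = min(len(values), pos + buffer_size + 1)
        values[start:stop] = False
    return pd.Series(values, index=mask.index)


mask = pd.Series([True, True, False, True, True])
buffered = buffer_mask(mask, buffer_size=1)
print(buffered.tolist())  # [True, False, False, False, True]
```

Clamping start and stop to the series bounds keeps a False near either end from indexing outside the mask.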
- gordo_core.filters.rows.escape_python_identifier(name: str) -> str
Escapes special symbols, such as ".", "-", etc., in a Python variable identifier.
- Parameters:
name – Python variable name.
- gordo_core.filters.rows.pandas_filter_rows(df: DataFrame, filter_str: str | list, buffer_size: int = 0) -> DataFrame
Filter pandas data frame based on list or string of conditions.
Note
pd.DataFrame.eval() of a list returns a numpy.ndarray and is limited to 100 list items. The sparse evaluation, with numexpr, of a combined string expression by pd.DataFrame.eval() can consist of at most 32 (current dependency) or 242 (latest release) logical parts, and returns a pd.Series. Therefore, list elements are evaluated iteratively in batches of n=15 (to be safe).
- Parameters:
df – Dataframe to filter rows from. Does not modify the parameter
filter_str – String representing the filter. Can be a boolean combination of conditions, where conditions are comparisons of column names with either other columns or numeric values. The rows matching the filter are kept. Column names with spaces must be quoted with backticks; names without spaces may be quoted with backticks or left unquoted. Examples of legal filters are:
`Tag A` > 5
(`Tag B` > 1) | (`Tag C` > 4)
(`Tag D` < 5)
(TagB > 5)
The parameter can also be a list, in which case the items are joined by a logical " & ".
buffer_size – Area fore and aft of the application of filter_str to also mark for removal.
- Return type:
The dataframe containing only rows matching the filter
Examples
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(list(np.ndindex((3,3))), columns=list('AB'))
>>> df
   A  B
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "`A`>1")
   A  B
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "`A`> B")
   A  B
3  1  0
6  2  0
7  2  1
>>> pandas_filter_rows(df, "(`A`>1) | (`B`<1)")
   A  B
0  0  0
3  1  0
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, "(`A`>1) & (`B`<1)")
   A  B
6  2  0
>>> pandas_filter_rows(df, ["A>1", "B<1"])
   A  B
6  2  0
>>> pandas_filter_rows(df, ["A!=1", "B<3"])
   A  B
0  0  0
1  0  1
2  0  2
6  2  0
7  2  1
8  2  2
>>> pandas_filter_rows(df, ["A!=1", "B<3"], buffer_size=1)
   A  B
0  0  0
1  0  1
7  2  1
8  2  2
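The batched evaluation of list filters mentioned in the note can be sketched roughly as below. This is an assumption-laden illustration, not the library's code: filter_rows_in_batches and its batch_size argument are hypothetical names, and buffer_size handling is left out.

```python
import pandas as pd


def filter_rows_in_batches(
    df: pd.DataFrame, conditions: list, batch_size: int = 15
) -> pd.DataFrame:
    """Hypothetical helper: AND all conditions, evaluating a few at a time."""
    keep = pd.Series(True, index=df.index)
    for start in range(0, len(conditions), batch_size):
        batch = conditions[start:start + batch_size]
        # Parenthesise each condition so "&" joins whole conditions.
        expression = " & ".join(f"({cond})" for cond in batch)
        # Each batch yields a boolean Series that narrows the running mask.
        keep &= df.eval(expression)
    return df[keep]


df = pd.DataFrame({"A": [0, 0, 0, 1, 1, 1, 2, 2, 2], "B": [0, 1, 2] * 3})
result = filter_rows_in_batches(df, ["A>1", "B<1"], batch_size=1)
print(result.index.tolist())  # [6]
```

Keeping each evaluated expression short is what keeps the number of logical parts per eval() call below the limits quoted in the note.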
- gordo_core.filters.rows.parse_pandas_filter_vars(pandas_filter: str | list[str], with_special_vars: bool = False) -> list[str]
Parses a pandas.eval() expression and returns a list of all variables used in it. Uses Python's built-in ast parser under the hood.
- Parameters:
pandas_filter – Pandas eval expression
with_special_vars – Include special variables, such as index and math functions (sin, log10, etc.), in the output
Examples
>>> vars_list = parse_pandas_filter_vars('Col1 > 0 & Col2 < 100')
>>> sorted(vars_list)
['Col1', 'Col2']
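As a rough idea of how the ast module can extract variable names from such an expression, consider the sketch below. names_in_expression is a hypothetical helper, not part of gordo_core; it only handles expressions that are already valid Python, so backtick-quoted names would have to be rewritten first, and it makes no distinction between columns and special names such as function names.

```python
import ast


def names_in_expression(expression: str) -> list:
    """Hypothetical helper: sorted distinct names used in an expression."""
    # mode="eval" parses a single expression rather than a module body.
    tree = ast.parse(expression, mode="eval")
    # Every bare identifier in the expression appears as an ast.Name node.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names)


print(names_in_expression("(Col1 > 0) & (Col2 < 100)"))  # ['Col1', 'Col2']
```

A call such as sin(Col1) would report both sin and Col1 here, which is the kind of name a flag like with_special_vars presumably includes or excludes.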
- gordo_core.filters.rows.unescape_python_identifier(name: str) -> str
Does the opposite of escape_python_identifier(). Takes an escaped string and converts it to the Python variable identifier.
- Parameters:
name – Escaped Python variable name.
Filter periods
- class gordo_core.filters.periods.FilterPeriods(granularity: str = '10T', filter_method: str = 'median', window: int = 144, n_iqr: int = 5, iforest_smooth: bool = False, contamination: float = 0.03, quantile_lower: float = 0.05, quantile_upper: float = 0.95)
Bases: object
Model class with methods for data pre-processing.
Performs a series of algorithms that drop noisy data.
Either a rolling median or an isolation forest algorithm is executed. Both provide drop periods in a dict-type attribute on the class object, object.drop_periods["iforest"] and object.drop_periods["median"], and data is filtered accordingly.
- Parameters:
granularity – The bucket size for grouping all incoming time data (e.g. “10T”). Available strings are the pandas time series offset aliases.
filter_method – Which method should be used for data cleaning: “median” (default), “iforest”, or “all”, which returns results for both methods.
iforest_smooth – Whether exponentially weighted smoothing should be applied to the data before the isolation forest algorithm is run.
- filter_data(data: DataFrame)
Method for filtering data. Returns the filtered dataset, a dict containing the periods that have been dropped (arranged by filtering method), and the actual predictions from the filter model.
- data
Data frame containing already filtered data (global max/min + dropped known periods). The data does not need to be consecutive in time.
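To give a feel for the rolling median option, here is a minimal sketch of one common way to flag noisy points with a rolling median and an interquartile range. It is an assumption, not FilterPeriods' actual logic: rolling_median_outliers is a hypothetical helper, the centred window and the 25%/75% quantiles are illustrative choices, and only the window and n_iqr names are borrowed from the class signature.

```python
import pandas as pd


def rolling_median_outliers(
    series: pd.Series, window: int = 5, n_iqr: float = 5.0
) -> pd.Series:
    """Hypothetical helper: True where a point is outside median +/- n_iqr * IQR."""
    rolling = series.rolling(window, center=True, min_periods=1)
    median = rolling.median()
    # Interquartile range of the same rolling window (assumed 25%/75%).
    iqr = rolling.quantile(0.75) - rolling.quantile(0.25)
    lower = median - n_iqr * iqr
    upper = median + n_iqr * iqr
    return (series < lower) | (series > upper)


values = pd.Series([1.0, 1.1, 0.9, 1.0, 50.0, 1.0, 1.1, 0.9, 1.0])
flagged = rolling_median_outliers(values, window=5, n_iqr=5)
print(flagged[flagged].index.tolist())  # [4]
```

Consecutive flagged timestamps could then be merged into start/end pairs, which is the sort of information a drop_periods dict would hold.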