Validation
The validation module provides functions for detecting and correcting data quality issues in the Nutrition-Sensitive Food Environment Index (NFEI) workflow.
Accurate data is critical for food environment analysis. Errors in numeric values, especially spatial coordinates, can significantly distort distance calculations, spatial aggregation, and density estimation. Even a small number of incorrect coordinates can lead to misleading conclusions about access and exposure.
This module focuses on robust detection and correction of outliers using the Median Absolute Deviation (MAD) method, which is well suited for skewed and real-world datasets.
The module supports two key tasks:
- Outlier detection, using a robust statistical approach that identifies extreme values without being influenced by them.
- Spatial data correction, where erroneous latitude and longitude values are detected and replaced using representative central values.
These functions are typically applied early in the NFEI workflow to ensure that subsequent spatial and density analyses are based on reliable and consistent data.
Spatial outlier correction
fix_spatial_outliers(df: DataFrame, latitude: str, longitude: str, threshold: float = 3.0, return_outlier_flag: bool = False, outlier_col: str = 'spatial_outlier', show_plots: bool = True) -> pd.DataFrame
Detect and correct spatial coordinate outliers.
This function identifies and corrects outliers in geographic coordinate data using the Median Absolute Deviation (MAD) method. It is designed for NFEI workflows where inaccurate latitude and longitude values can distort distance calculations, spatial aggregation, and density estimation.
The function detects outliers separately in latitude and longitude, then combines both flags. A row is treated as a spatial outlier if either its latitude or longitude is flagged as an outlier.
Outlying coordinates are replaced using the median latitude and longitude calculated from non-outlier observations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input dataframe containing coordinate columns. | required |
| latitude | str | Name of the latitude column. | required |
| longitude | str | Name of the longitude column. | required |
| threshold | float | Modified Z-score threshold used for MAD-based outlier detection. The default is 3.0. | 3.0 |
| return_outlier_flag | bool | If True, adds a boolean column indicating which rows were flagged as spatial outliers. | False |
| outlier_col | str | Name of the outlier flag column when | 'spatial_outlier' |
| show_plots | bool | If True, displays before and after scatter plots showing detected and corrected outliers. | True |
Returns:
| Type | Description |
|---|---|
| DataFrame | Copy of the dataframe with corrected coordinates. If |
Raises:
| Type | Description |
|---|---|
| KeyError | If |
| TypeError | If coordinate columns are not numeric. |
| ValueError | If all rows are flagged as outliers, making it impossible to compute replacement values. |
Notes
This function applies MAD-based outlier detection independently to latitude and longitude, then combines both flags.
Replacement values are computed as:
- median latitude of non-outliers
- median longitude of non-outliers
This ensures that corrected coordinates remain within the central spatial distribution of the dataset.
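The detect-combine-replace logic described above can be sketched in plain pandas. This is a minimal illustration, not the nfei implementation: the names `_mad_flags` and `fix_spatial_sketch` are hypothetical, and the sketch assumes that both coordinates of a flagged row are replaced, which the description above does not state explicitly.

```python
import pandas as pd


def _mad_flags(series, threshold=3.0):
    """Hypothetical helper: modified Z-score flags for a single column."""
    abs_dev = (series - series.median()).abs()
    mad = abs_dev.median()
    if mad == 0:
        # No usable scale estimate, so nothing is flagged.
        return pd.Series(False, index=series.index)
    return 0.6745 * abs_dev / mad > threshold


def fix_spatial_sketch(df, latitude, longitude, threshold=3.0):
    """Illustrative only; not the nfei source code."""
    out = df.copy()
    # Flag each axis independently, then combine with a logical OR.
    flagged = _mad_flags(out[latitude], threshold) | _mad_flags(out[longitude], threshold)
    if flagged.all():
        raise ValueError("All rows flagged; no non-outliers left to compute medians.")
    # Replacement values come from the non-outlier rows only.
    # Assumption: both coordinates of a flagged row are overwritten.
    out.loc[flagged, latitude] = out.loc[~flagged, latitude].median()
    out.loc[flagged, longitude] = out.loc[~flagged, longitude].median()
    return out, flagged


df = pd.DataFrame(
    {
        "lat": [-1.30, -1.31, -1.29, -1.32, -50.0],
        "lon": [36.80, 36.81, 36.82, 36.79, 100.0],
    }
)
fixed, flagged = fix_spatial_sketch(df, "lat", "lon")
```

In this sample only the last row is flagged, and its coordinates are replaced with the medians of the four remaining rows (about -1.305 and 36.805). The input dataframe is left unchanged because the sketch works on a copy.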
Plotting is optional and controlled by show_plots. In production workflows, set show_plots=False to avoid rendering overhead.
Examples:
Correct spatial outliers in coordinate data:
>>> import pandas as pd
>>> import nfei
>>>
>>> df = pd.DataFrame(
...     {
...         "lat": [-1.30, -1.31, -50.0],
...         "lon": [36.80, 36.81, 100.0],
...     }
... )
>>> result = nfei.fix_spatial_outliers(
...     df,
...     latitude="lat",
...     longitude="lon",
...     show_plots=False,
... )
Return an outlier flag column:
>>> result = nfei.fix_spatial_outliers(
...     df,
...     latitude="lat",
...     longitude="lon",
...     return_outlier_flag=True,
...     show_plots=False,
... )
MAD-based outlier detection
mad_based_outlier(values: Series | ndarray, threshold: float = 3.0) -> pd.Series
Detect outliers using the Median Absolute Deviation (MAD) method.
This function identifies outliers in numeric data using the Median Absolute Deviation (MAD) approach. It is used in the NFEI workflow to detect extreme values in skewed distributions, particularly for spatial coordinates where traditional mean-based methods are not robust.
The MAD method is preferred because it is resistant to the influence of extreme values and works well for non-normal data.
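A small comparison may help show why resistance to extreme values matters. The sketch below uses only NumPy and is not part of nfei. It scores the same five values with a classic mean-and-standard-deviation Z-score and with the modified Z-score, both against a threshold of 3.0.

```python
import numpy as np

values = np.array([1.0, 2.0, 2.0, 3.0, 100.0])
threshold = 3.0

# Classic Z-score: the extreme value inflates both the mean and the
# standard deviation, which shrinks its own score (about 2.0 here).
z = np.abs(values - values.mean()) / values.std()

# Modified Z-score: the median and the MAD barely move when one value
# is extreme, so the same observation scores about 66.1.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * np.abs(values - median) / mad

z_flags = z > threshold
mad_flags = modified_z > threshold
```

With these values the classic Z-score flags nothing, while the modified Z-score flags only the value 100.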
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | Series \| ndarray | One-dimensional numeric values, provided as a pandas Series or NumPy array. | required |
| threshold | float | Modified Z-score threshold used to flag outliers. Observations with a modified Z-score greater than this threshold are classified as outliers. The default is 3.0. | 3.0 |
Returns:
| Type | Description |
|---|---|
| Series | Boolean Series where True indicates an outlier. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| TypeError | If |
Notes
The modified Z-score is calculated as:
0.6745 * |x - median| / MAD
where MAD is the median absolute deviation.
If MAD is equal to zero, which happens when more than half of the values equal the median, the modified Z-score is undefined and the function returns False for all observations.
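The formula and the zero-MAD rule can be sketched as follows. This is an illustrative stand-in rather than the nfei source: the name `mad_outlier_sketch` is hypothetical, and input validation and missing-value handling are left out.

```python
import pandas as pd


def mad_outlier_sketch(values, threshold=3.0):
    """Illustrative stand-in for mad_based_outlier; not the nfei source."""
    series = pd.Series(values, dtype="float64")
    abs_dev = (series - series.median()).abs()
    mad = abs_dev.median()

    # Zero MAD: more than half of the values equal the median, so the
    # modified Z-score has no usable denominator. Flag nothing.
    if mad == 0:
        return pd.Series(False, index=series.index)

    modified_z = 0.6745 * abs_dev / mad
    return modified_z > threshold


flags = mad_outlier_sketch([1, 2, 2, 3, 100])
constant = mad_outlier_sketch([5, 5, 5, 5, 50])
```

Here `flags` marks only the value 100. `constant` is all False even though 50 looks extreme, which is a direct consequence of the zero-MAD rule and worth keeping in mind for columns dominated by one repeated value.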
This function is commonly used as a building block for spatial data cleaning, particularly in fix_spatial_outliers.
Examples:
Detect outliers in a numeric series:
>>> import pandas as pd
>>> import nfei
>>>
>>> values = pd.Series([1, 2, 2, 3, 100])
>>> outliers = nfei.mad_based_outlier(values)