Validation

The validation module provides functions for detecting and correcting data quality issues in the Nutrition-Sensitive Food Environment Index (NFEI) workflow.

Accurate data is critical for food environment analysis. Errors in numeric values, especially spatial coordinates, can significantly distort distance calculations, spatial aggregation, and density estimation. Even a small number of incorrect coordinates can lead to misleading conclusions about access and exposure.

This module focuses on robust detection and correction of outliers using the Median Absolute Deviation (MAD) method, which is well suited for skewed and real-world datasets.

The module supports two key tasks:

  • Outlier detection, using a robust statistical approach that identifies extreme values without being influenced by them.
  • Spatial data correction, where erroneous latitude and longitude values are detected and replaced using representative central values.

These functions are typically applied early in the NFEI workflow to ensure that subsequent spatial and density analyses are based on reliable and consistent data.

Spatial outlier correction

fix_spatial_outliers(df: DataFrame, latitude: str, longitude: str, threshold: float = 3.0, return_outlier_flag: bool = False, outlier_col: str = 'spatial_outlier', show_plots: bool = True) -> pd.DataFrame

Detect and correct spatial coordinate outliers.

This function identifies and corrects outliers in geographic coordinate data using the Median Absolute Deviation (MAD) method. It is designed for NFEI workflows where inaccurate latitude and longitude values can distort distance calculations, spatial aggregation, and density estimation.

The function detects outliers separately in latitude and longitude, then combines both flags. A row is treated as a spatial outlier if either its latitude or longitude is flagged as an outlier.

Outlying coordinates are replaced using the median latitude and longitude calculated from non-outlier observations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input dataframe containing coordinate columns. | required |
| latitude | str | Name of the latitude column. | required |
| longitude | str | Name of the longitude column. | required |
| threshold | float | Modified Z-score threshold used for MAD-based outlier detection. | 3.0 |
| return_outlier_flag | bool | If True, adds a boolean column indicating which rows were flagged as spatial outliers. | False |
| outlier_col | str | Name of the outlier flag column when return_outlier_flag=True. | 'spatial_outlier' |
| show_plots | bool | If True, displays before and after scatter plots showing detected and corrected outliers. | True |

Returns:

| Type | Description |
|---|---|
| DataFrame | Copy of the dataframe with corrected coordinates. If return_outlier_flag=True, an additional boolean column is included. |

Raises:

| Type | Description |
|---|---|
| KeyError | If latitude or longitude columns are not found in the dataframe. |
| TypeError | If coordinate columns are not numeric. |
| ValueError | If all rows are flagged as outliers, making it impossible to compute replacement values. |

Notes

This function applies MAD-based outlier detection independently to latitude and longitude, then combines both flags.

Replacement values are computed as:

  • median latitude of non-outliers
  • median longitude of non-outliers

This ensures that corrected coordinates remain within the central spatial distribution of the dataset.
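The detect, combine, and replace steps described above can be sketched in plain pandas. This is a minimal illustration, not the nfei implementation: the helper names `mad_flags` and `replace_spatial_outliers` are hypothetical, and the sketch assumes that both coordinates of a flagged row are replaced, even when only one of them was flagged.

```python
import pandas as pd


def mad_flags(series, threshold=3.0):
    """Hypothetical stand-in for the MAD check (not the nfei implementation)."""
    deviation = (series - series.median()).abs()
    mad = deviation.median()
    if mad == 0:
        # Modified Z-score is undefined, so nothing is flagged.
        return pd.Series(False, index=series.index)
    return 0.6745 * deviation / mad > threshold


def replace_spatial_outliers(df, latitude, longitude, threshold=3.0):
    """Sketch: flag each axis separately, OR the flags, fill with medians."""
    flagged = mad_flags(df[latitude], threshold) | mad_flags(df[longitude], threshold)
    if flagged.all():
        raise ValueError("All rows flagged; no non-outliers to take medians from.")
    out = df.copy()
    # Assumption: both coordinates of a flagged row are replaced.
    out.loc[flagged, latitude] = df.loc[~flagged, latitude].median()
    out.loc[flagged, longitude] = df.loc[~flagged, longitude].median()
    return out, flagged


df = pd.DataFrame(
    {
        "lat": [-1.30, -1.31, -1.29, -1.32, -50.0],
        "lon": [36.80, 36.81, 36.79, 36.82, 36.80],
    }
)
fixed, flagged = replace_spatial_outliers(df, "lat", "lon")
print(flagged.tolist())  # [False, False, False, False, True]
print(fixed.tail(1))
```

In this sketch the last row is flagged on latitude alone, yet its longitude is also reset to the non-outlier median. Every flagged row receives the same replacement point, so corrected rows pile up at one location. Keeping the flag column makes it possible to exclude them from later density estimates if that matters.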

Plotting is controlled by show_plots, which defaults to True. In production or batch workflows, set show_plots=False to avoid rendering overhead.

Examples:

Correct spatial outliers in coordinate data:

>>> import pandas as pd
>>> import nfei
>>>
>>> df = pd.DataFrame(
...     {
...         "lat": [-1.30, -1.31, -50.0],
...         "lon": [36.80, 36.81, 100.0],
...     }
... )
>>> result = nfei.fix_spatial_outliers(
...     df,
...     latitude="lat",
...     longitude="lon",
...     show_plots=False,
... )

Return an outlier flag column:

>>> result = nfei.fix_spatial_outliers(
...     df,
...     latitude="lat",
...     longitude="lon",
...     return_outlier_flag=True,
...     show_plots=False,
... )

MAD-based outlier detection

mad_based_outlier(values: Series | ndarray, threshold: float = 3.0) -> pd.Series

Detect outliers using the Median Absolute Deviation (MAD) method.

This function identifies outliers in numeric data using the Median Absolute Deviation (MAD) approach. It is used in the NFEI workflow to detect extreme values in skewed distributions, particularly for spatial coordinates where traditional mean-based methods are not robust.

The MAD method is preferred because it is resistant to the influence of extreme values and works well for non-normal data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| values | Series \| ndarray | One-dimensional numeric values, provided as a pandas Series or NumPy array. | required |
| threshold | float | Modified Z-score threshold used to flag outliers. Observations with a modified Z-score greater than this threshold are classified as outliers. | 3.0 |

Returns:

| Type | Description |
|---|---|
| Series | Boolean Series where True indicates an outlier. |

Raises:

| Type | Description |
|---|---|
| ValueError | If values is empty. |
| TypeError | If values is not numeric. |

Notes

The modified Z-score is calculated as:

0.6745 * |x - median| / MAD

where MAD is the median absolute deviation.

If MAD is equal to zero, which happens when more than half of the values equal the median, the modified Z-score is undefined and the function returns False for all observations.
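The formula and the zero-MAD rule can be sketched with NumPy as follows. This is an illustrative reimplementation under stated assumptions, not the nfei source: the helper names `modified_z_scores` and `flag_outliers` are hypothetical, and missing values are not handled.

```python
import numpy as np
import pandas as pd


def modified_z_scores(values):
    """Illustrative helper (not part of nfei): 0.6745 * |x - median| / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # No usable spread around the median: return zeros so that
        # no observation can exceed the threshold.
        return np.zeros_like(x)
    return 0.6745 * np.abs(x - median) / mad


def flag_outliers(values, threshold=3.0):
    """Illustrative helper: True where the modified Z-score exceeds threshold."""
    scores = modified_z_scores(values)
    index = values.index if isinstance(values, pd.Series) else None
    return pd.Series(scores > threshold, index=index)


# Median is 2 and MAD is 1, so the score for 100 is 0.6745 * 98.
print(flag_outliers(pd.Series([1, 2, 2, 3, 100])).tolist())
# [False, False, False, False, True]

# MAD is 0 because four of five values equal the median, so nothing is flagged.
print(flag_outliers(pd.Series([5, 5, 5, 5, 9])).tolist())
# [False, False, False, False, False]
```

The second call shows a practical consequence of the zero-MAD rule: in a column dominated by one repeated value, a stray value such as 9 is not flagged.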

This function is commonly used as a building block for spatial data cleaning, particularly in fix_spatial_outliers.

Examples:

Detect outliers in a numeric series:

>>> import pandas as pd
>>> import nfei
>>>
>>> values = pd.Series([1, 2, 2, 3, 100])
>>> outliers = nfei.mad_based_outlier(values)