Spatial

The spatial module provides functions for distance calculation and buffer-based spatial aggregation used in the Nutrition-Sensitive Food Environment Index (NFEI).

Spatial relationships are central to food environment analysis because consumers rarely interact with a single vendor in isolation. They interact with clusters of food vendors, nearby infrastructure, and the broader set of food options available within walking distance or another meaningful exposure radius.

This module supports two main spatial tasks:

Nearest-distance calculation, which measures the closest distance between one set of observations and another, such as households to vendors or vendors to water and sanitation facilities.
Buffer-based feature aggregation, which summarizes the features available within a defined radius around each observation, such as food group diversity within 50 metres or sanitation access within 500 metres.

The module is designed to accept ordinary pandas DataFrames with longitude and latitude columns. For buffer-based feature aggregation, coordinates are projected internally before spatial operations are performed, so users can supply decimal-degree input data and still specify buffers in metres.

These functions help users construct environment-level exposure indicators that complement vendor-level food diversity, availability, density, and infrastructure measures in the NFEI workflow.

Nearest distance

calc_distance(data1: DataFrame, data2: DataFrame, include_col: str | None = None, col_title: str = 'distance_km', data1_lon: str = '_longitude', data1_lat: str = '_latitude', data2_lon: str = '_longitude', data2_lat: str = '_latitude') -> pd.DataFrame

Calculate nearest distance between two sets of observations.

This function calculates the nearest distance from each observation in data1 to observations in data2 using longitude and latitude coordinates. It supports NFEI workflows where users need to measure proximity between food environment features, such as households and vendors, vendors and water points, or vendors and sanitation facilities.

For each row in data1, the function computes the Haversine distance to all rows in data2 and keeps the closest distance.
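The nearest-distance logic described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `haversine_km`, `nearest`, and `EARTH_RADIUS_KM` are hypothetical names, and the 6371 km radius is an assumed value that may differ from the constant the library uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the library's constant may differ


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(origin, candidates):
    """Return (index, distance_km) of the candidate closest to origin.

    origin is a (lon, lat) tuple; candidates is a non-empty list of (lon, lat) tuples.
    """
    if not candidates:
        raise ValueError("candidates must contain at least one point")
    # Distance from the origin to every candidate, then keep the smallest.
    distances = [haversine_km(origin[0], origin[1], lon, lat) for lon, lat in candidates]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


households = [(36.8001, -1.3001), (36.9000, -1.4000)]
vendors = [(36.8000, -1.3000), (36.8500, -1.3500)]
results = [nearest(h, vendors) for h in households]
```

Each entry in `results` pairs the position of the closest vendor with its distance in kilometres, which mirrors the distance column and the optional "closest_{include_col}" column that calc_distance adds.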

Parameters:

Name	Type	Description	Default
`data1`	`DataFrame`	Primary dataframe. Each row receives the nearest distance to observations in `data2`.	required
`data2`	`DataFrame`	Secondary dataframe containing candidate destination points.	required
`include_col`	`str | None`	Optional column from `data2` to bring back from the nearest observation. For example, if `include_col="vendor_type"`, the returned dataframe receives a column named `"closest_vendor_type"`.	`None`
`col_title`	`str`	Name of the distance column to add. Distances are returned in kilometres. The default is `"distance_km"`.	`'distance_km'`
`data1_lon`	`str`	Longitude column name in `data1`. The default is `"_longitude"`.	`'_longitude'`
`data1_lat`	`str`	Latitude column name in `data1`. The default is `"_latitude"`.	`'_latitude'`
`data2_lon`	`str`	Longitude column name in `data2`. The default is `"_longitude"`.	`'_longitude'`
`data2_lat`	`str`	Latitude column name in `data2`. The default is `"_latitude"`.	`'_latitude'`

Returns:

Type	Description
`DataFrame`	Copy of `data1` with a nearest-distance column added. If `include_col` is provided, a `"closest_{include_col}"` column is also added.

Raises:

Type	Description
`KeyError`	If required longitude or latitude columns are missing from either dataframe, or if `include_col` is provided but not found in `data2`.
`ValueError`	If `data2` contains no observations.

Notes

This function applies the Haversine formula directly to geographic coordinates in decimal degrees and returns distances in kilometres.

It is appropriate for nearest-neighbour distance calculations. It does not perform buffer-based spatial joins. For buffer-based aggregation, use features_proximity_agg.

Examples:

Calculate the nearest vendor distance for each household:

>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "vendor_type": ["shop", "kiosk"],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.calc_distance(
...     data1=households,
...     data2=vendors,
...     include_col="vendor_type",
... )

Use custom coordinate column names:

>>> result = nfei.calc_distance(
...     data1=households.rename(
...         columns={"_longitude": "lon", "_latitude": "lat"}
...     ),
...     data2=vendors.rename(
...         columns={"_longitude": "x", "_latitude": "y"}
...     ),
...     data1_lon="lon",
...     data1_lat="lat",
...     data2_lon="x",
...     data2_lat="y",
...     col_title="nearest_vendor_km",
... )

Buffer-based feature aggregation

features_proximity_agg(df1: DataFrame, df2: DataFrame, buffer: float, col_to_agg: list[str] | None = None, self_count: bool = False, include_sum: bool = False, method: str = 'sum', df1_lat: str = '_latitude', df1_lon: str = '_longitude', df2_lat: str = '_latitude', df2_lon: str = '_longitude', overall_title: str = 'Overall_aggregate', drop_col_to_agg: bool = False, input_crs: str | int = 'EPSG:4326', projected_crs: str | int | None = None) -> pd.DataFrame

Aggregate features within a spatial buffer.

This function aggregates features from df2 within a specified buffer around each observation in df1. It supports NFEI workflows where users need to construct environment-level indicators, such as food diversity within 50 metres of each vendor or access to sanitation facilities within a specified distance.

The function accepts ordinary pandas DataFrames with longitude and latitude columns. Internally, it converts the data to projected GeoDataFrames, creates buffers around df1 points, performs a spatial join, and aggregates matching df2 observations within each buffer.

Parameters:

Name	Type	Description	Default
`df1`	`DataFrame`	Primary dataframe. A buffer is created around each row in this dataframe, and aggregated values are returned at this level.	required
`df2`	`DataFrame`	Secondary dataframe containing the features to aggregate within each buffer.	required
`buffer`	`float`	Buffer radius in metres.	required
`col_to_agg`	`list[str] | None`	List of columns in `df2` to aggregate. Required when `method` is `"sum"`, `"mean"`, or `"max"`. Not required when `method="count"`.	`None`
`self_count`	`bool`	If False and `df1` and `df2` are the same object, each row is excluded from its own buffer before aggregation. If True, each row can contribute to its own buffer.	`False`
`include_sum`	`bool`	If True and `method` is not `"count"`, adds an overall aggregate column by summing the aggregated columns row-wise.	`False`
`method`	`str`	Aggregation method. Must be one of `"sum"`, `"mean"`, `"max"`, or `"count"`.	`'sum'`
`df1_lat`	`str`	Latitude column name in `df1`. The default is `"_latitude"`.	`'_latitude'`
`df1_lon`	`str`	Longitude column name in `df1`. The default is `"_longitude"`.	`'_longitude'`
`df2_lat`	`str`	Latitude column name in `df2`. The default is `"_latitude"`.	`'_latitude'`
`df2_lon`	`str`	Longitude column name in `df2`. The default is `"_longitude"`.	`'_longitude'`
`overall_title`	`str`	Name of the output count column when `method="count"`, or the overall aggregate column when `include_sum=True`.	`'Overall_aggregate'`
`drop_col_to_agg`	`bool`	If True, drops the individual aggregated columns and keeps only `overall_title`. This is only valid when `include_sum=True` and `method` is not `"count"`.	`False`
`input_crs`	`str | int`	Coordinate reference system of the input longitude and latitude coordinates. The default is `"EPSG:4326"`.	`'EPSG:4326'`
`projected_crs`	`str | int | None`	Projected coordinate reference system used for buffer and distance operations. If None, a local UTM CRS is estimated from the input coordinates.	`None`

Returns:

Type	Description
`DataFrame`	Copy of `df1` with buffer-based aggregate columns added.

Raises:

Type	Description
`ValueError`	If `buffer` is less than or equal to zero, if `method` is invalid, if `col_to_agg` is missing when required, or if `drop_col_to_agg=True` is used without `include_sum=True` for non-count aggregation.
`KeyError`	If coordinate columns are missing, or if any column listed in `col_to_agg` is not found in `df2`.

Notes

The buffer argument is in metres because the function performs buffer operations in a projected CRS.
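To make the idea of a local UTM CRS concrete, the sketch below shows one common way to pick a WGS 84 / UTM zone from a longitude and latitude. It is an assumption-laden illustration: `utm_epsg` is a hypothetical helper, the way the library actually estimates the CRS when projected_crs is None is not described here, and the special zone exceptions around Norway and Svalbard and the polar regions are ignored.

```python
import math


def utm_epsg(lon, lat):
    """Return the EPSG code of the standard WGS 84 / UTM zone containing (lon, lat)."""
    # UTM zones are 6 degrees wide, numbered 1 to 60 starting at 180 degrees west.
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # lon == 180 would otherwise give zone 61
    # EPSG 326xx covers the northern hemisphere, 327xx the southern hemisphere.
    base = 32600 if lat >= 0 else 32700
    return base + zone


code = utm_epsg(36.8, -1.3)  # coordinates from the examples in this section
```

For the Nairobi-area coordinates used in the examples, this gives zone 37 south (EPSG:32737). A value chosen this way could be passed explicitly as projected_crs if a fixed CRS is preferred over automatic estimation.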

When df1 and df2 are the same object, self_count=False excludes each observation from its own buffer. This is useful for neighbour-only calculations. Use self_count=True when the focal observation should be included in its own environment, such as when computing food diversity available within a vendor's immediate environment including the vendor itself.
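The effect of self_count can be illustrated with a minimal sketch on points that are already expressed in metres. This is not the library's code, which performs a spatial join on projected GeoDataFrames. `sum_within` is a hypothetical helper, and treating points exactly on the buffer boundary as inside is an assumption.

```python
import math


def sum_within(points, values, radius_m, self_count=False):
    """For each point, sum the values of points within radius_m metres.

    points are (x, y) tuples already expressed in metres (a projected CRS).
    """
    totals = []
    for i, (xi, yi) in enumerate(points):
        total = 0
        for j, (xj, yj) in enumerate(points):
            if i == j and not self_count:
                continue  # neighbour-only: skip the focal point itself
            if math.hypot(xi - xj, yi - yj) <= radius_m:
                total += values[j]
        totals.append(total)
    return totals


# Two vendors 30 m apart and one isolated vendor 200 m away, all selling grains.
points = [(0.0, 0.0), (30.0, 0.0), (200.0, 0.0)]
grains = [1, 1, 1]

neighbours_only = sum_within(points, grains, radius_m=50)
# -> [1, 1, 0]
including_self = sum_within(points, grains, radius_m=50, self_count=True)
# -> [2, 2, 1]
```

With neighbour-only counting, the isolated vendor receives 0 because nothing else falls inside its buffer. With self-counting, it still contributes its own value.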

Output column names for non-count aggregation are generated by appending "_within_{buffer}m" to aggregated column names.

Rows with no matching features within the buffer receive 0 in the added aggregate columns.

Examples:

Count vendors within 100 metres of each household:

>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=households,
...     df2=vendors,
...     buffer=100,
...     method="count",
...     overall_title="vendors_within_100m",
... )

Sum food-group availability within 50 metres of each vendor:

>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8002],
...         "_latitude": [-1.3000, -1.3002],
...         "grains": [1, 1],
...         "legumes_pulses": [1, 0],
...         "other_vegetables": [0, 1],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
... )

Create a single overall aggregate column and drop individual columns:

>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
...     include_sum=True,
...     overall_title="food_group_items_within_50m",
...     drop_col_to_agg=True,
... )

Haversine distance

haversine_vectorized(lon1, lat1, lon2, lat2) -> np.ndarray

Calculate great-circle distances using the Haversine formula.

This function computes the distance between longitude and latitude coordinates using the Haversine formula. It is used in the NFEI spatial workflow for direct distance calculations on geographic coordinates without requiring users to first convert their data into GeoDataFrames.

Distances are returned in kilometres.

Parameters:

Name	Description	Default
`lon1`	Longitude of the first point or points, in decimal degrees.	required
`lat1`	Latitude of the first point or points, in decimal degrees.	required
`lon2`	Longitude of the second point or points, in decimal degrees.	required
`lat2`	Latitude of the second point or points, in decimal degrees.	required

Returns:

Type	Description
`ndarray`	Great-circle distance or distances in kilometres.

Notes

Input coordinates must be in decimal degrees. The function internally converts coordinates to radians before applying the Haversine formula.
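For reference, the standard Haversine formula can be written as follows, where φ denotes latitude and λ denotes longitude, both in radians, and R is the Earth's radius. The value of R used by this function is not stated here; 6371 km is a commonly used mean radius and is only an assumption.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,
       \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\arcsin\!\left(\sqrt{a}\right)
\end{aligned}
```

Because R is expressed in kilometres, the resulting distance d is also in kilometres.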

This function is most useful for direct point-to-point or one-to-many distance calculations. For buffer-based exposure indicators, use features_proximity_agg.

Examples:

Calculate the distance between two points:

>>> import nfei
>>>
>>> distance = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=36.8002,
...     lat2=-1.3002,
... )

Calculate distances from one point to several candidate points:

>>> import numpy as np
>>> distances = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=np.array([36.8002, 36.9000]),
...     lat2=np.array([-1.3002, -1.4000]),
... )