Spatial
The spatial module provides functions for distance calculation and buffer-based spatial aggregation used in the Nutrition-Sensitive Food Environment Index (NFEI).
Spatial relationships are central to food environment analysis because consumers do not interact only with a single vendor. They interact with clusters of food vendors, nearby infrastructure, and the broader set of food options available within walking distance or another meaningful exposure radius.
This module supports two main spatial tasks:
- Nearest-distance calculation, which measures the closest distance between one set of observations and another, such as households to vendors or vendors to water and sanitation facilities.
- Buffer-based feature aggregation, which summarizes the features available within a defined radius around each observation, such as food group diversity within 50 metres or sanitation access within 500 metres.
The module is designed to accept ordinary pandas DataFrames with longitude and latitude columns. For buffer-based feature aggregation, coordinates are projected internally before spatial operations are performed, so users can work with decimal-degree input data while still using metre-based buffers.
These functions help users construct environment-level exposure indicators that complement vendor-level food diversity, availability, density, and infrastructure measures in the NFEI workflow.
Nearest distance
calc_distance(data1: DataFrame, data2: DataFrame, include_col: str | None = None, col_title: str = 'distance_km', data1_lon: str = '_longitude', data1_lat: str = '_latitude', data2_lon: str = '_longitude', data2_lat: str = '_latitude') -> pd.DataFrame
Calculate nearest distance between two sets of observations.
This function calculates the nearest distance from each observation in
data1 to observations in data2 using longitude and latitude
coordinates. It supports NFEI workflows where users need to measure proximity
between food environment features, such as households and vendors, vendors
and water points, or vendors and sanitation facilities.
For each row in data1, the function computes the Haversine distance to
all rows in data2 and keeps the closest distance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data1` | `DataFrame` | Primary dataframe. Each row receives the nearest distance to observations in `data2`. | required |
| `data2` | `DataFrame` | Secondary dataframe containing candidate destination points. | required |
| `include_col` | `str \| None` | Optional column from `data2` … | `None` |
| `col_title` | `str` | Name of the distance column to add. Distances are returned in kilometres. The default is `'distance_km'`. | `'distance_km'` |
| `data1_lon` | `str` | Longitude column name in `data1`. | `'_longitude'` |
| `data1_lat` | `str` | Latitude column name in `data1`. | `'_latitude'` |
| `data2_lon` | `str` | Longitude column name in `data2`. | `'_longitude'` |
| `data2_lat` | `str` | Latitude column name in `data2`. | `'_latitude'` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Copy of `data1` … |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If required longitude or latitude columns are missing from either dataframe, or if … |
| `ValueError` | If … |
Notes
This function uses geographic coordinates directly through the Haversine formula and returns distances in kilometres.
It is appropriate for nearest-neighbour distance calculations. It does not
perform buffer-based spatial joins. For buffer-based aggregation, use
features_proximity_agg.
Examples:
Calculate the nearest vendor distance for each household:
>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "vendor_type": ["shop", "kiosk"],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.calc_distance(
...     data1=households,
...     data2=vendors,
...     include_col="vendor_type",
... )
Use custom coordinate column names:
>>> result = nfei.calc_distance(
...     data1=households.rename(
...         columns={"_longitude": "lon", "_latitude": "lat"}
...     ),
...     data2=vendors.rename(
...         columns={"_longitude": "x", "_latitude": "y"}
...     ),
...     data1_lon="lon",
...     data1_lat="lat",
...     data2_lon="x",
...     data2_lat="y",
...     col_title="nearest_vendor_km",
... )
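The nearest-distance logic described for calc_distance (compute the Haversine distance from every row of data1 to every row of data2, then keep the minimum) can be sketched with NumPy broadcasting. This is an illustration only, not the nfei implementation: the helper name `pairwise_haversine_km` and the Earth radius constant are assumptions, and the final lookup is only one plausible way an attribute of the closest match could be carried along.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; nfei's constant is not stated


def pairwise_haversine_km(lon1, lat1, lon2, lat2):
    """Return an (n1, n2) matrix of great-circle distances in kilometres."""
    # Reshape the first set to a column so broadcasting builds every pair.
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    # Clip guards against rounding slightly outside [0, 1] before the sqrt.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Same coordinates as the household and vendor example above.
household_lon, household_lat = [36.8001, 36.9000], [-1.3001, -1.4000]
vendor_lon, vendor_lat = [36.8000, 36.8500], [-1.3000, -1.3500]
vendor_type = np.array(["shop", "kiosk"])

dist = pairwise_haversine_km(household_lon, household_lat, vendor_lon, vendor_lat)
nearest_idx = dist.argmin(axis=1)        # position of the closest vendor per household
nearest_km = dist.min(axis=1)            # the distance that is kept for each household
nearest_type = vendor_type[nearest_idx]  # attribute of the closest vendor (assumed use)
```

With these coordinates the first household is closest to the shop (about 16 metres away) and the second to the kiosk (just under 8 kilometres away). The full distance matrix grows with the product of the two row counts, so very large inputs may need chunking or a spatial index.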
Buffer-based feature aggregation
features_proximity_agg(df1: DataFrame, df2: DataFrame, buffer: float, col_to_agg: list[str] | None = None, self_count: bool = False, include_sum: bool = False, method: str = 'sum', df1_lat: str = '_latitude', df1_lon: str = '_longitude', df2_lat: str = '_latitude', df2_lon: str = '_longitude', overall_title: str = 'Overall_aggregate', drop_col_to_agg: bool = False, input_crs: str | int = 'EPSG:4326', projected_crs: str | int | None = None) -> pd.DataFrame
Aggregate features within a spatial buffer.
This function aggregates features from df2 within a specified buffer
around each observation in df1. It supports NFEI workflows where users
need to construct environment-level indicators, such as food diversity
within 50 metres of each vendor or access to sanitation facilities within a
specified distance.
The function accepts ordinary pandas DataFrames with longitude and latitude
columns. Internally, it converts the data to projected GeoDataFrames, creates
buffers around df1 points, performs a spatial join, and aggregates
matching df2 observations within each buffer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df1` | `DataFrame` | Primary dataframe. A buffer is created around each row in this dataframe, and aggregated values are returned at this level. | required |
| `df2` | `DataFrame` | Secondary dataframe containing the features to aggregate within each buffer. | required |
| `buffer` | `float` | Buffer radius in metres. | required |
| `col_to_agg` | `list[str] \| None` | List of columns in `df2` … | `None` |
| `self_count` | `bool` | If False and … | `False` |
| `include_sum` | `bool` | If True and … | `False` |
| `method` | `str` | Aggregation method. Must be one of … | `'sum'` |
| `df1_lat` | `str` | Latitude column name in `df1`. | `'_latitude'` |
| `df1_lon` | `str` | Longitude column name in `df1`. | `'_longitude'` |
| `df2_lat` | `str` | Latitude column name in `df2`. | `'_latitude'` |
| `df2_lon` | `str` | Longitude column name in `df2`. | `'_longitude'` |
| `overall_title` | `str` | Name of the output count column when … | `'Overall_aggregate'` |
| `drop_col_to_agg` | `bool` | If True, drops the individual aggregated columns and keeps only … | `False` |
| `input_crs` | `str \| int` | Coordinate reference system of the input longitude and latitude coordinates. The default is `'EPSG:4326'`. | `'EPSG:4326'` |
| `projected_crs` | `str \| int \| None` | Projected coordinate reference system used for buffer and distance operations. If None, a local UTM CRS is estimated from the input coordinates. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Copy of `df1` … |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If … |
| `KeyError` | If coordinate columns are missing, or if any column listed in `col_to_agg` … |
Notes
The buffer argument is in metres because the function performs buffer
operations in a projected CRS.
When df1 and df2 are the same object, self_count=False excludes
each observation from its own buffer. This is useful for neighbour-only
calculations. Use self_count=True when the focal observation should be
included in its own environment, such as when computing food diversity
available within a vendor's immediate environment including the vendor
itself.
Output column names for non-count aggregation are generated by appending
"_within_{buffer}m" to aggregated column names.
Rows with no matching features within the buffer receive 0 in the added aggregate columns.
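The behaviour described in the notes above (a metre-based radius, self-exclusion when df1 and df2 are the same object, and 0 for rows with no match) can be sketched without GeoPandas. This is a simplified stand-in, not the nfei implementation: it swaps the projected buffer and spatial join for a local equirectangular projection and a distance threshold, and the function name `aggregate_within_radius`, the Earth radius constant, and the exact output column formatting are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def aggregate_within_radius(df1, df2, buffer_m, cols, self_count=False,
                            lon="_longitude", lat="_latitude"):
    """Sum `cols` of df2 over rows within `buffer_m` metres of each df1 row."""
    # Project both sets to local metres around the mean latitude of df1.
    # This is a small-area approximation, not the UTM projection the docs describe.
    lat0 = np.radians(df1[lat].mean())
    x1 = EARTH_RADIUS_M * np.radians(df1[lon].to_numpy()) * np.cos(lat0)
    y1 = EARTH_RADIUS_M * np.radians(df1[lat].to_numpy())
    x2 = EARTH_RADIUS_M * np.radians(df2[lon].to_numpy()) * np.cos(lat0)
    y2 = EARTH_RADIUS_M * np.radians(df2[lat].to_numpy())

    # (n1, n2) boolean matrix: True where a df2 point falls inside the radius.
    inside = np.hypot(x1[:, None] - x2[None, :], y1[:, None] - y2[None, :]) <= buffer_m
    if df1 is df2 and not self_count:
        np.fill_diagonal(inside, False)  # drop each point from its own radius

    out = df1.copy()
    values = df2[cols].to_numpy(dtype=float)
    sums = inside.astype(float) @ values  # rows with no match sum to 0
    for j, col in enumerate(cols):
        out[f"{col}_within_{buffer_m}m"] = sums[:, j]
    return out


vendors = pd.DataFrame(
    {
        "vendor_id": [1, 2, 3],
        "_longitude": [36.8000, 36.8002, 36.8100],
        "_latitude": [-1.3000, -1.3002, -1.3100],
        "grains": [1, 1, 1],
        "legumes_pulses": [1, 0, 0],
    }
)
cols = ["grains", "legumes_pulses"]
neighbours_only = aggregate_within_radius(vendors, vendors, 50, cols)
with_self = aggregate_within_radius(vendors, vendors, 50, cols, self_count=True)
```

Vendors 1 and 2 sit roughly 31 metres apart, so each sees the other within 50 metres, while vendor 3 has no neighbours and receives 0 in `neighbours_only`. With `self_count=True`, every vendor's own values are added to its totals.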
Examples:
Count vendors within 100 metres of each household:
>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=households,
...     df2=vendors,
...     buffer=100,
...     method="count",
...     overall_title="vendors_within_100m",
... )
Sum food-group availability within 50 metres of each vendor:
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8002],
...         "_latitude": [-1.3000, -1.3002],
...         "grains": [1, 1],
...         "legumes_pulses": [1, 0],
...         "other_vegetables": [0, 1],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
... )
Create a single overall aggregate column and drop individual columns:
>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
...     include_sum=True,
...     overall_title="food_group_items_within_50m",
...     drop_col_to_agg=True,
... )
Haversine distance
haversine_vectorized(lon1, lat1, lon2, lat2) -> np.ndarray
Calculate great-circle distances using the Haversine formula.
This function computes the distance between longitude and latitude coordinates using the Haversine formula. It is used in the NFEI spatial workflow for direct distance calculations on geographic coordinates without requiring users to first convert their data into GeoDataFrames.
Distances are returned in kilometres.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `lon1` | | Longitude of the first point or points, in decimal degrees. | required |
| `lat1` | | Latitude of the first point or points, in decimal degrees. | required |
| `lon2` | | Longitude of the second point or points, in decimal degrees. | required |
| `lat2` | | Latitude of the second point or points, in decimal degrees. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Great-circle distance or distances in kilometres. |
Notes
Input coordinates must be in decimal degrees. The function internally converts coordinates to radians before applying the Haversine formula.
This function is most useful for direct point-to-point or one-to-many
distance calculations. For buffer-based exposure indicators, use
features_proximity_agg.
Examples:
Calculate the distance between two points:
>>> import nfei
>>>
>>> distance = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=36.8002,
...     lat2=-1.3002,
... )
Calculate distances from one point to several candidate points:
>>> import numpy as np
>>> distances = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=np.array([36.8002, 36.9000]),
...     lat2=np.array([-1.3002, -1.4000]),
... )
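For readers who want to see the formula itself, the following is a minimal scalar sketch of the Haversine calculation using only the standard library. It is not the nfei source: the function name `haversine_km` and the Earth radius constant are assumptions, and the radius used by haversine_vectorized may differ slightly.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; not taken from nfei


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    # Convert decimal degrees to radians before applying the formula.
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine term: a = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    # Central angle c = 2 * asin(sqrt(a)); clamp to guard against rounding above 1.
    c = 2 * math.asin(math.sqrt(min(1.0, a)))
    return EARTH_RADIUS_KM * c


# Same pair of points as the first example above.
d = haversine_km(36.8000, -1.3000, 36.8002, -1.3002)
```

For this pair of points the result is roughly 0.031 km, or about 31 metres.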