Spatial

The spatial module provides functions for distance calculation and buffer-based spatial aggregation used in the Nutrition-Sensitive Food Environment Index (NFEI).

Spatial relationships are central to food environment analysis because consumers do not interact only with a single vendor. They interact with clusters of food vendors, nearby infrastructure, and the broader set of food options available within walking distance or another meaningful exposure radius.

This module supports two main spatial tasks:

  • Nearest-distance calculation, which measures the closest distance between one set of observations and another, such as households to vendors or vendors to water and sanitation facilities.
  • Buffer-based feature aggregation, which summarizes the features available within a defined radius around each observation, such as food group diversity within 50 metres or sanitation access within 500 metres.

The module accepts ordinary pandas DataFrames with longitude and latitude columns. For buffer-based feature aggregation, coordinates are projected internally before spatial operations are performed, so users can supply decimal-degree input data and still specify buffers in metres.

These functions help users construct environment-level exposure indicators that complement vendor-level food diversity, availability, density, and infrastructure measures in the NFEI workflow.

Nearest distance

calc_distance(data1: DataFrame, data2: DataFrame, include_col: str | None = None, col_title: str = 'distance_km', data1_lon: str = '_longitude', data1_lat: str = '_latitude', data2_lon: str = '_longitude', data2_lat: str = '_latitude') -> pd.DataFrame

Calculate nearest distance between two sets of observations.

This function calculates the nearest distance from each observation in data1 to observations in data2 using longitude and latitude coordinates. It supports NFEI workflows where users need to measure proximity between food environment features, such as households and vendors, vendors and water points, or vendors and sanitation facilities.

For each row in data1, the function computes the Haversine distance to every row in data2 and keeps the smallest.
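The nearest-distance rule can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: the helper names haversine_km and nearest are hypothetical, and the Earth radius constant is an assumed mean value that may differ from the one the library uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; the library's constant may differ


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(origin, candidates):
    """Return (index, distance_km) of the candidate closest to origin.

    origin is a (lon, lat) tuple; candidates is a non-empty list of (lon, lat) tuples.
    """
    distances = [haversine_km(origin[0], origin[1], lon, lat) for lon, lat in candidates]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Same coordinates as the household and vendor examples on this page.
households = [(36.8001, -1.3001), (36.9000, -1.4000)]
vendors = [(36.8000, -1.3000), (36.8500, -1.3500)]
results = [nearest(h, vendors) for h in households]
```

With these inputs the first household is matched to the first vendor at roughly 16 metres, and the second household to the second vendor at just under 8 kilometres.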

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data1 | DataFrame | Primary dataframe. Each row receives the nearest distance to observations in data2. | required |
| data2 | DataFrame | Secondary dataframe containing candidate destination points. | required |
| include_col | str \| None | Optional column from data2 to bring back from the nearest observation. For example, if include_col="vendor_type", the returned dataframe receives a column named "closest_vendor_type". | None |
| col_title | str | Name of the distance column to add. Distances are returned in kilometres. The default is "distance_km". | 'distance_km' |
| data1_lon | str | Longitude column name in data1. The default is "_longitude". | '_longitude' |
| data1_lat | str | Latitude column name in data1. The default is "_latitude". | '_latitude' |
| data2_lon | str | Longitude column name in data2. The default is "_longitude". | '_longitude' |
| data2_lat | str | Latitude column name in data2. The default is "_latitude". | '_latitude' |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | Copy of data1 with a nearest-distance column added. If include_col is provided, a "closest_{include_col}" column is also added. |

Raises:

| Type | Description |
| --- | --- |
| KeyError | If required longitude or latitude columns are missing from either dataframe, or if include_col is provided but not found in data2. |
| ValueError | If data2 contains no observations. |

Notes

This function applies the Haversine formula directly to geographic coordinates in decimal degrees and returns distances in kilometres.

It is appropriate for nearest-neighbour distance calculations. It does not perform buffer-based spatial joins. For buffer-based aggregation, use features_proximity_agg.

Examples:

Calculate the nearest vendor distance for each household:

>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "vendor_type": ["shop", "kiosk"],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.calc_distance(
...     data1=households,
...     data2=vendors,
...     include_col="vendor_type",
... )

Use custom coordinate column names:

>>> result = nfei.calc_distance(
...     data1=households.rename(
...         columns={"_longitude": "lon", "_latitude": "lat"}
...     ),
...     data2=vendors.rename(
...         columns={"_longitude": "x", "_latitude": "y"}
...     ),
...     data1_lon="lon",
...     data1_lat="lat",
...     data2_lon="x",
...     data2_lat="y",
...     col_title="nearest_vendor_km",
... )

Buffer-based feature aggregation

features_proximity_agg(df1: DataFrame, df2: DataFrame, buffer: float, col_to_agg: list[str] | None = None, self_count: bool = False, include_sum: bool = False, method: str = 'sum', df1_lat: str = '_latitude', df1_lon: str = '_longitude', df2_lat: str = '_latitude', df2_lon: str = '_longitude', overall_title: str = 'Overall_aggregate', drop_col_to_agg: bool = False, input_crs: str | int = 'EPSG:4326', projected_crs: str | int | None = None) -> pd.DataFrame

Aggregate features within a spatial buffer.

This function aggregates features from df2 within a specified buffer around each observation in df1. It supports NFEI workflows where users need to construct environment-level indicators, such as food diversity within 50 metres of each vendor or access to sanitation facilities within a specified distance.

The function accepts ordinary pandas DataFrames with longitude and latitude columns. Internally, it converts the data to projected GeoDataFrames, creates buffers around df1 points, performs a spatial join, and aggregates matching df2 observations within each buffer.
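The project, buffer, join, and aggregate steps can be approximated without a GIS library. The sketch below is a simplification under stated assumptions: it uses a local equirectangular projection as a stand-in for the projected CRS, treats each buffer as a circle tested by distance, and uses hypothetical helper names (project_local, aggregate_within). It is not the module's internal code.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def project_local(points, lon0, lat0):
    """Equirectangular projection to metres around (lon0, lat0).

    A simple stand-in for a projected CRS; adequate only for small study areas.
    """
    k = math.cos(math.radians(lat0))
    return [
        (EARTH_RADIUS_M * math.radians(lon - lon0) * k,
         EARTH_RADIUS_M * math.radians(lat - lat0))
        for lon, lat in points
    ]


def aggregate_within(focal, features, values, radius_m, method="sum"):
    """Aggregate the values of features lying within radius_m of each focal point."""
    reducers = {"sum": sum, "max": max, "count": len,
                "mean": lambda v: sum(v) / len(v)}
    reduce_ = reducers[method]
    lon0, lat0 = focal[0]  # projection origin: the first focal point
    focal_xy = project_local(focal, lon0, lat0)
    feat_xy = project_local(features, lon0, lat0)
    out = []
    for fx, fy in focal_xy:
        # "Spatial join": a feature matches when it falls inside the circular buffer.
        hits = [v for (x, y), v in zip(feat_xy, values)
                if math.hypot(x - fx, y - fy) <= radius_m]
        out.append(reduce_(hits) if hits else 0)  # empty buffers yield 0
    return out


households = [(36.8001, -1.3001), (36.9000, -1.4000)]
vendors = [(36.8000, -1.3000), (36.8500, -1.3500)]
counts = aggregate_within(households, vendors, [1, 1], radius_m=100, method="count")
```

Here only the first household has a vendor inside its 100 metre buffer, so counts is [1, 0]. A real projected CRS handles larger study areas more accurately than this local approximation.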

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df1 | DataFrame | Primary dataframe. A buffer is created around each row in this dataframe, and aggregated values are returned at this level. | required |
| df2 | DataFrame | Secondary dataframe containing the features to aggregate within each buffer. | required |
| buffer | float | Buffer radius in metres. | required |
| col_to_agg | list[str] \| None | List of columns in df2 to aggregate. Required when method is "sum", "mean", or "max". Not required when method="count". | None |
| self_count | bool | If False and df1 and df2 are the same object, each row is excluded from its own buffer before aggregation. If True, each row can contribute to its own buffer. | False |
| include_sum | bool | If True and method is not "count", adds an overall aggregate column by summing the aggregated columns row-wise. | False |
| method | str | Aggregation method. Must be one of "sum", "mean", "max", or "count". | 'sum' |
| df1_lat | str | Latitude column name in df1. The default is "_latitude". | '_latitude' |
| df1_lon | str | Longitude column name in df1. The default is "_longitude". | '_longitude' |
| df2_lat | str | Latitude column name in df2. The default is "_latitude". | '_latitude' |
| df2_lon | str | Longitude column name in df2. The default is "_longitude". | '_longitude' |
| overall_title | str | Name of the output count column when method="count", or the overall aggregate column when include_sum=True. | 'Overall_aggregate' |
| drop_col_to_agg | bool | If True, drops the individual aggregated columns and keeps only overall_title. This is only valid when include_sum=True and method is not "count". | False |
| input_crs | str \| int | Coordinate reference system of the input longitude and latitude coordinates. The default is "EPSG:4326". | 'EPSG:4326' |
| projected_crs | str \| int \| None | Projected coordinate reference system used for buffer and distance operations. If None, a local UTM CRS is estimated from the input coordinates. | None |
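When projected_crs is None, a local UTM CRS is estimated. How the module does this is not stated here (GeoPandas offers estimate_utm_crs for the same purpose). The sketch below shows the standard UTM zone arithmetic for a single longitude/latitude pair. The function name utm_epsg is hypothetical, and the Norway and Svalbard zone exceptions are ignored.

```python
import math


def utm_epsg(lon, lat):
    """EPSG code of the WGS 84 / UTM zone containing (lon, lat), in decimal degrees.

    Ignores the Norway and Svalbard zone exceptions.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    base = 32600 if lat >= 0 else 32700  # 326xx = northern hemisphere, 327xx = southern
    return base + zone


# Approximate centre of the example coordinates used on this page.
code = utm_epsg(36.85, -1.35)
```

For these coordinates the result is 32737 (WGS 84 / UTM zone 37S). To pin the projection rather than rely on estimation, a value of this form can be passed explicitly, for example projected_crs="EPSG:32737".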

Returns:

| Type | Description |
| --- | --- |
| DataFrame | Copy of df1 with buffer-based aggregate columns added. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If buffer is less than or equal to zero, if method is invalid, if col_to_agg is missing when required, or if drop_col_to_agg=True is used without include_sum=True for non-count aggregation. |
| KeyError | If coordinate columns are missing, or if any column listed in col_to_agg is not found in df2. |
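The documented ValueError conditions can be mirrored in a caller-side pre-flight check, which is useful when arguments come from a configuration file. This is a hypothetical helper, not part of the module. It assumes an empty col_to_agg list counts as missing, and the order of checks is arbitrary.

```python
VALID_METHODS = ("sum", "mean", "max", "count")


def agg_arg_problems(buffer, method="sum", col_to_agg=None,
                     include_sum=False, drop_col_to_agg=False):
    """Return a list of messages for argument combinations documented as invalid."""
    problems = []
    if buffer <= 0:
        problems.append("buffer must be greater than zero (metres)")
    if method not in VALID_METHODS:
        problems.append(f"method must be one of {VALID_METHODS}, got {method!r}")
    elif method != "count":
        if not col_to_agg:  # assumption: an empty list is treated as missing
            problems.append(f"col_to_agg is required when method={method!r}")
        if drop_col_to_agg and not include_sum:
            problems.append("drop_col_to_agg=True requires include_sum=True")
    return problems


def check_agg_args(**kwargs):
    """Raise ValueError if any documented argument rule is violated."""
    problems = agg_arg_problems(**kwargs)
    if problems:
        raise ValueError("; ".join(problems))
```

Calling check_agg_args(buffer=50, method="sum") raises ValueError because col_to_agg is missing, while check_agg_args(buffer=100, method="count") passes silently.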

Notes

The buffer argument is in metres because the function performs buffer operations in a projected CRS.

When df1 and df2 are the same object, self_count=False excludes each observation from its own buffer. This is useful for neighbour-only calculations. Use self_count=True when the focal observation should be included in its own environment, such as when computing food diversity available within a vendor's immediate environment including the vendor itself.
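The effect of self_count can be illustrated with a small neighbour count. This sketch assumes the points are already projected to metres and excludes the focal observation by row position. The function name neighbour_counts is hypothetical.

```python
import math


def neighbour_counts(points_m, radius_m, self_count=False):
    """Count points within radius_m of each point in the same set.

    points_m holds (x, y) coordinates already projected to metres.
    """
    counts = []
    for i, (xi, yi) in enumerate(points_m):
        n = 0
        for j, (xj, yj) in enumerate(points_m):
            if i == j and not self_count:
                continue  # skip the focal observation itself
            if math.hypot(xj - xi, yj - yi) <= radius_m:
                n += 1
        counts.append(n)
    return counts


vendors_m = [(0.0, 0.0), (30.0, 0.0), (200.0, 0.0)]
without_self = neighbour_counts(vendors_m, 50)                 # [1, 1, 0]
with_self = neighbour_counts(vendors_m, 50, self_count=True)   # [2, 2, 1]
```

The isolated third vendor has no neighbours within 50 metres, so it contributes only itself when self_count=True.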

Output column names for non-count aggregation are generated by appending "_within_{buffer}m" to aggregated column names.

Rows with no matching features within the buffer receive 0 in the added aggregate columns.
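The naming and zero-fill rules can be illustrated for a single output row. This is a sketch under stated assumptions: label_outputs is a hypothetical helper, None stands in for "no match", and buffer is interpolated into the column name exactly as passed. How the module formats a float such as 50.0 is not stated here.

```python
def label_outputs(aggregated, buffer, include_sum=False,
                  overall_title="Overall_aggregate", drop_col_to_agg=False):
    """Rename aggregated values for one row and optionally add a row-wise total.

    aggregated maps each source column to its aggregate; None marks "no match".
    """
    # Rows with no matching features receive 0.
    filled = {col: (0 if val is None else val) for col, val in aggregated.items()}
    row = {f"{col}_within_{buffer}m": val for col, val in filled.items()}
    if include_sum:
        total = sum(filled.values())
        if drop_col_to_agg:
            return {overall_title: total}  # keep only the overall column
        row[overall_title] = total
    return row


row = label_outputs(
    {"grains": 2, "legumes_pulses": 1, "other_vegetables": None},
    buffer=50,
)
```

Here row contains the keys grains_within_50m, legumes_pulses_within_50m, and other_vegetables_within_50m, with the last set to 0.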

Examples:

Count vendors within 100 metres of each household:

>>> import pandas as pd
>>> import nfei
>>>
>>> households = pd.DataFrame(
...     {
...         "household_id": [1, 2],
...         "_longitude": [36.8001, 36.9000],
...         "_latitude": [-1.3001, -1.4000],
...     }
... )
>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8500],
...         "_latitude": [-1.3000, -1.3500],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=households,
...     df2=vendors,
...     buffer=100,
...     method="count",
...     overall_title="vendors_within_100m",
... )

Sum food-group availability within 50 metres of each vendor:

>>> vendors = pd.DataFrame(
...     {
...         "vendor_id": [1, 2],
...         "_longitude": [36.8000, 36.8002],
...         "_latitude": [-1.3000, -1.3002],
...         "grains": [1, 1],
...         "legumes_pulses": [1, 0],
...         "other_vegetables": [0, 1],
...     }
... )
>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
... )

Create a single overall aggregate column and drop individual columns:

>>> result = nfei.features_proximity_agg(
...     df1=vendors,
...     df2=vendors,
...     buffer=50,
...     col_to_agg=["grains", "legumes_pulses", "other_vegetables"],
...     method="sum",
...     self_count=True,
...     include_sum=True,
...     overall_title="food_group_items_within_50m",
...     drop_col_to_agg=True,
... )

Haversine distance

haversine_vectorized(lon1, lat1, lon2, lat2) -> np.ndarray

Calculate great-circle distances using the Haversine formula.

This function computes the distance between longitude and latitude coordinates using the Haversine formula. It is used in the NFEI spatial workflow for direct distance calculations on geographic coordinates without requiring users to first convert their data into GeoDataFrames.

Distances are returned in kilometres.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| lon1 | | Longitude of the first point or points, in decimal degrees. | required |
| lat1 | | Latitude of the first point or points, in decimal degrees. | required |
| lon2 | | Longitude of the second point or points, in decimal degrees. | required |
| lat2 | | Latitude of the second point or points, in decimal degrees. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Great-circle distance or distances in kilometres. |

Notes

Input coordinates must be in decimal degrees. The function internally converts coordinates to radians before applying the Haversine formula.
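For reference, the standard Haversine formula is given here, with latitudes \varphi and longitudes \lambda in radians. R is the Earth's radius, about 6371 km for the mean radius. The exact constant used by the module is not stated here.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,
      \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```

For points separated by 0.0002 degrees in both latitude and longitude near the equator, this gives roughly 31 metres.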

This function is most useful for direct point-to-point or one-to-many distance calculations. For buffer-based exposure indicators, use features_proximity_agg.

Examples:

Calculate the distance between two points:

>>> import nfei
>>>
>>> distance = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=36.8002,
...     lat2=-1.3002,
... )

Calculate distances from one point to several candidate points:

>>> import numpy as np
>>> distances = nfei.haversine_vectorized(
...     lon1=36.8000,
...     lat1=-1.3000,
...     lon2=np.array([36.8002, 36.9000]),
...     lat2=np.array([-1.3002, -1.4000]),
... )