So I've been working on this spatial smoothing problem for satellite data, and I kept running into performance issues. The problem is that when you're doing spatial operations like smoothing or finding neighbors, the way your data is stored in memory really matters.
Modern processors use caches - these are small, super-fast memory areas that hold chunks of data from main RAM. If the next piece of data your algorithm needs is already in the cache (cache hit), it's fast. If not (cache miss), the processor has to wait for a slow trip to main memory.
For spatial problems like smoothing satellite data on a hexagonal grid, you're constantly accessing neighboring cells. If those neighbors are scattered randomly in memory, you get tons of cache misses. The solution is to reorder your data so that cells that are close in 2D space are also close in the 1D memory array. This is called improving data locality.
This post looks at two popular techniques for this: Morton curves and Hilbert curves.
A Morton curve maps 2D coordinates (x, y) to a 1D index by interleaving the bits of the x and y coordinates. It's also called a Z-order curve because the path it traces through each 2x2 block of cells looks like a 'Z'.
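To make the interleaving concrete, here's a tiny pure-Python reference (the function name is my own): bit i of x lands in bit 2i of the result and bit i of y in bit 2i+1, so (x, y) = (3, 5) maps to index 39.
def interleave_bits(x, y, bits=16):
    """Reference Morton encoding: bit i of x -> bit 2i, bit i of y -> bit 2i+1."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d
print(interleave_bits(3, 5))  # 0b100111 == 39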
The Hilbert curve is another space-filling curve, built by recursively rotating and reflecting a U-shape. Its big selling point is that cells with consecutive indices are always adjacent in 2D, so it avoids the long diagonal jumps of the Z pattern.
You can use Python with numba to write CUDA kernels and matplotlib to plot the results. The numba library compiles Python functions for execution on the GPU, making it easy to prototype CUDA code.
This script generates the two plots above. It defines the Morton and Hilbert functions as CUDA device functions, then calls them from a kernel to compute the order for a grid of points.
# save_as: spatial-curves/visualize_curves.py
import numpy as np
import matplotlib.pyplot as plt
from numba import cuda
# --- CUDA Device Functions (written in Python for Numba) ---
@cuda.jit(device=True)
def morton2D_device(x, y):
"""Calculates the Morton index for a 2D point."""
xx = x
yy = y
xx = (xx | (xx << 8)) & 0x00FF00FF
xx = (xx | (xx << 4)) & 0x0F0F0F0F
xx = (xx | (xx << 2)) & 0x33333333
xx = (xx | (xx << 1)) & 0x55555555
yy = (yy | (yy << 8)) & 0x00FF00FF
yy = (yy | (yy << 4)) & 0x0F0F0F0F
yy = (yy | (yy << 2)) & 0x33333333
yy = (yy | (yy << 1)) & 0x55555555
return xx | (yy << 1)
@cuda.jit(device=True)
def rot_device(n, x, y, rx, ry):
"""Helper for Hilbert curve calculation."""
if ry == 0:
if rx == 1:
x[0] = n - 1 - x[0]
y[0] = n - 1 - y[0]
# Swap x and y
t = x[0]
x[0] = y[0]
y[0] = t
@cuda.jit(device=True)
def xy2d_device(n, x, y):
"""Calculates the Hilbert index for a 2D point."""
d = 0
s = n // 2
# Numba requires arrays for mutable device function arguments
x_arr = cuda.local.array(1, dtype=np.uint32)
y_arr = cuda.local.array(1, dtype=np.uint32)
x_arr[0] = x
y_arr[0] = y
while s > 0:
rx = (x_arr[0] & s) > 0
ry = (y_arr[0] & s) > 0
d += s * s * ((3 * rx) ^ ry)
rot_device(s, x_arr, y_arr, rx, ry)
s //= 2
return d
# --- CUDA Kernels ---
@cuda.jit
def compute_indices_kernel(points, out_indices, curve_type, n):
"""Kernel to compute indices for a list of points."""
i = cuda.grid(1)
if i < points.shape[0]:
x = points[i, 0]
y = points[i, 1]
if curve_type == 0: # Morton
out_indices[i] = morton2D_device(x, y)
elif curve_type == 1: # Hilbert
out_indices[i] = xy2d_device(n, x, y)
def generate_curve_plot(order, curve_type, filename):
"""Generates and saves a plot for a given curve type."""
n = 2**order
grid_points = np.array(
np.meshgrid(np.arange(n), np.arange(n))).T.reshape(-1, 2).astype(np.uint32)
d_points = cuda.to_device(grid_points)
d_indices = cuda.device_array(grid_points.shape[0], dtype=np.uint32)
threads_per_block = 256
blocks_per_grid = (grid_points.shape[0] + threads_per_block - 1) // threads_per_block
curve_type_id = 0 if curve_type == 'morton' else 1
compute_indices_kernel[blocks_per_grid, threads_per_block](d_points, d_indices, curve_type_id, n)
indices = d_indices.copy_to_host()
# Sort points by their curve index
sorted_indices = np.argsort(indices)
sorted_points = grid_points[sorted_indices]
# Plotting
plt.style.use('dark_background')
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(sorted_points[:, 0], sorted_points[:, 1], 'o-', markersize=3, lw=1, color='#00aaff')
ax.set_title(f'{curve_type.capitalize()} Curve (Order {order})', fontsize=16)
ax.set_aspect('equal')
ax.set_xticks([])
ax.set_yticks([])
plt.tight_layout()
plt.savefig(filename, dpi=150, bbox_inches='tight')
print(f"Saved {filename}")
if __name__ == '__main__':
ORDER = 4 # Creates a 16x16 grid
generate_curve_plot(ORDER, 'morton', 'images/morton_curve.png')
generate_curve_plot(ORDER, 'hilbert', 'images/hilbert_curve.png')
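A note on running it: the kernels need an NVIDIA GPU and a working CUDA toolkit, but if you just want to sanity-check the index functions, Numba's CUDA simulator (enabled by setting the NUMBA_ENABLE_CUDASIM=1 environment variable before launching the script) will run the same kernels on the CPU, slowly but without any GPU at all.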
Which one is actually faster? Let's test it.
The benchmark simulates a key part of your smoothing algorithm: for each cell, we access its neighbors. We'll measure the total time taken to perform this operation across a large grid. The expectation is that better data ordering (Hilbert > Morton > None) will lead to faster access times due to better cache performance.
This script times the neighbor-access task for three data layouts: the original row-major order, Morton order, and Hilbert order. For each layout it walks the cells in storage order, looks up each cell's spatial neighbors, and reads their values through the layout's index map.
# save_as: spatial-curves/benchmark_curves.py
import numpy as np
import matplotlib.pyplot as plt
import time
def get_neighbor_indices(index, width):
"""Gets the 8 neighbor indices for a point in a 2D grid."""
x, y = index % width, index // width
neighbors = []
for dx in [-1, 0, 1]:
for dy in [-1, 0, 1]:
if dx == 0 and dy == 0:
continue
nx, ny = x + dx, y + dy
if 0 <= nx < width and 0 <= ny < width:
neighbors.append(ny * width + nx)
return neighbors
def run_benchmark(data_array, positions, order, width):
    """Accesses the 8 spatial neighbors of every cell, walking the cells in
    storage order and routing the reads through the layout's index maps."""
    start_time = time.perf_counter()
    # This value is just to make sure the loop isn't optimized away
    total_sum = 0.0
    for k in range(len(data_array)):
        # order[k] is the original row-major index of the cell stored at slot k;
        # its 2D neighbors are then located in the (potentially reordered)
        # data_array via the positions map
        original_idx = order[k]
        for neighbor_idx in get_neighbor_indices(original_idx, width):
            total_sum += data_array[positions[neighbor_idx]]
    end_time = time.perf_counter()
    return end_time - start_time
if __name__ == '__main__':
# Use a large grid for a meaningful benchmark
ORDER = 9 # 512x512 grid
WIDTH = 2**ORDER
N_POINTS = WIDTH * WIDTH
# 1. Generate baseline data
print(f"Setting up a {WIDTH}x{WIDTH} grid...")
# These are the values we will access
values = np.random.rand(N_POINTS).astype(np.float32)
# These are the points we will use for ordering
points = np.array(np.meshgrid(np.arange(WIDTH), np.arange(WIDTH))).T.reshape(-1, 2).astype(np.uint32)
# 2. Get Morton and Hilbert indices (using CPU functions for simplicity here)
def morton_encode(x, y):
xx, yy = x, y
xx = (xx | (xx << 8)) & 0x00FF00FF; xx = (xx | (xx << 4)) & 0x0F0F0F0F
xx = (xx | (xx << 2)) & 0x33333333; xx = (xx | (xx << 1)) & 0x55555555
yy = (yy | (yy << 8)) & 0x00FF00FF; yy = (yy | (yy << 4)) & 0x0F0F0F0F
yy = (yy | (yy << 2)) & 0x33333333; yy = (yy | (yy << 1)) & 0x55555555
return xx | (yy << 1)
def xy2d_hilbert(n, x, y):
d = 0
s = n // 2
while s > 0:
rx = (x & s) > 0; ry = (y & s) > 0
d += s * s * ((3 * rx) ^ ry)
if ry == 0:
if rx == 1: x, y = n - 1 - x, n - 1 - y
x, y = y, x
s //= 2
return d
print("Calculating Morton indices...")
morton_indices = np.array([morton_encode(p[0], p[1]) for p in points])
print("Calculating Hilbert indices...")
hilbert_indices = np.array([xy2d_hilbert(WIDTH, p[0], p[1]) for p in points])
    # 3. Create reordered value arrays plus two index maps per layout:
    # order[k] = original index stored at slot k; positions[orig] = slot of orig
    def build_layout(curve_indices):
        order = np.argsort(curve_indices)
        positions = np.empty(N_POINTS, dtype=np.int64)
        positions[order] = np.arange(N_POINTS)
        return values[order], positions, order
    identity = np.arange(N_POINTS)
    morton_values, morton_positions, morton_order = build_layout(morton_indices)
    hilbert_values, hilbert_positions, hilbert_order = build_layout(hilbert_indices)
    # 4. Run benchmarks
    print("Running benchmarks...")
    time_unordered = run_benchmark(values, identity, identity, WIDTH)
    time_morton = run_benchmark(morton_values, morton_positions, morton_order, WIDTH)
    time_hilbert = run_benchmark(hilbert_values, hilbert_positions, hilbert_order, WIDTH)
# 5. Plot results
labels = ['Unordered', 'Morton', 'Hilbert']
times = [time_unordered, time_morton, time_hilbert]
plt.style.use('dark_background')
fig, ax = plt.subplots(figsize=(8, 6))
colors = ['#ff4f4f', '#ffb84f', '#4fff8f']
bars = ax.bar(labels, times, color=colors)
ax.set_ylabel('Execution Time (seconds)', fontsize=12)
ax.set_title(f'Neighbor Access Time on a {WIDTH}x{WIDTH} Grid', fontsize=16)
# Add time labels on bars
for bar in bars:
yval = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2.0, yval, f'{yval:.3f}s', va='bottom', ha='center')
plt.tight_layout()
plt.savefig('images/benchmark_results.png', dpi=150, bbox_inches='tight')
print("Benchmark plot saved to images/benchmark_results.png")
print(f"\nResults:")
print(f"Unordered: {time_unordered:.4f}s")
print(f"Morton: {time_morton:.4f}s")
print(f"Hilbert: {time_hilbert:.4f}s")
When you run the script, you should see the spatially ordered layouts come out ahead of the unordered baseline, with the Hilbert curve expected to perform best. Exact numbers depend on your CPU's cache sizes, and in pure Python interpreter overhead hides part of the cache effect.
Interesting note: in this particular test run, the Morton curve actually performed slightly better than the Hilbert curve. That can happen for a few reasons, including plain measurement noise, the grid size relative to your cache sizes, and the access pattern of the benchmark itself. Generally the Hilbert curve should provide better locality, but the Morton curve can sometimes win thanks to its simpler computation and perfectly reasonable cache behavior for certain access patterns.
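If you want to put a rough number on 'locality' without timing anything, one crude check (CPU-only, and the function names here are my own) is the average index distance between cells that are adjacent in 2D under each ordering. It's only a proxy for cache behavior, and as with the timings the winner can shift with the grid size and the metric you pick, but it shows how differently the two curves jump around:
def morton_index(x, y, bits=10):
    """Bit-interleaved (Z-order) index: x in even bits, y in odd bits."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d
def hilbert_index(n, x, y):
    """Hilbert index on an n x n grid (same algorithm as the CPU version above)."""
    d, s = 0, n // 2
    while s > 0:
        rx, ry = int((x & s) > 0), int((y & s) > 0)
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
def mean_neighbor_jump(index_of, n):
    """Average |index(cell) - index(right/up neighbor)| over an n x n grid."""
    total, count = 0, 0
    for x in range(n):
        for y in range(n):
            d0 = index_of(x, y)
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < n and ny < n:
                    total += abs(index_of(nx, ny) - d0)
                    count += 1
    return total / count
N = 64
print("Morton :", mean_neighbor_jump(lambda x, y: morton_index(x, y), N))
print("Hilbert:", mean_neighbor_jump(lambda x, y: hilbert_index(N, x, y), N))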
For memory-intensive spatial algorithms, this reordering step is a crucial optimization. The one-time cost of reordering your data can pay off many times over during the main computation.
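To make that one-time step concrete, here's a small sketch of what "reorder once, then run the main loop on contiguous data" looks like for an arbitrary neighbor structure (the function and variable names are mine; values is a NumPy array and neighbor_lists is a plain list of lists of original indices):
import numpy as np
def reorder_for_locality(values, neighbor_lists, curve_keys):
    """One-time reordering: permute the values by their curve keys and rewrite
    every neighbor list in terms of the new storage positions."""
    order = np.argsort(curve_keys)                 # new position -> original index
    position = np.empty(len(order), dtype=np.int64)
    position[order] = np.arange(len(order))        # original index -> new position
    new_values = values[order]
    new_neighbors = [[position[nb] for nb in neighbor_lists[orig]] for orig in order]
    return new_values, new_neighbors
# The smoothing loop then only ever touches the reordered arrays:
# for i, nbs in enumerate(new_neighbors):
#     smoothed[i] = (new_values[i] + sum(new_values[j] for j in nbs)) / (1 + len(nbs))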
This spatial indexing concept is directly applicable to large-scale environmental monitoring. I've implemented this in practice for smoothing NDVI (Normalized Difference Vegetation Index) satellite data from Uganda using hexagonal grids.
In my CUDA GIS smoothing project, I worked with 500,104 hexagons covering Kampala, Uganda. The challenge was the same: accessing neighboring cells efficiently for spatial smoothing operations.
The project demonstrates how spatial indexing and memory optimization techniques can make environmental monitoring computationally feasible at scale. When you're processing millions of satellite pixels for vegetation analysis, climate modeling, or land use classification, these optimizations become essential.
Check out the full implementation and performance analysis: CUDA GIS Smoothing Repository
So we've covered Morton and Hilbert curves, but there's another approach that's particularly good for irregular spatial data like hexagonal grids: recursive bisection reordering. This is basically a more sophisticated way to organize your data spatially.
When you're working with hexagonal grids or other irregular spatial structures, your neighbors can be scattered all over memory. Even with Morton/Hilbert curves, you might still have poor cache utilization because the curves work best with regular grids.
The idea is pretty straightforward: recursively split the bounding box of your data, record which quadrant each element falls into at every level, and use the concatenated quadrant bits as a sort key. This creates a hierarchical spatial organization that keeps nearby elements together in memory.
// Extract spatial bounds
float min_x = FLT_MAX, max_x = -FLT_MAX;
float min_y = FLT_MAX, max_y = -FLT_MAX;
for (int i = 0; i < n_elements; i++) {
min_x = min(min_x, coordinates[i].x);
max_x = max(max_x, coordinates[i].x);
min_y = min(min_y, coordinates[i].y);
max_y = max(max_y, coordinates[i].y);
}
// Divide space recursively: for each element, walk down max_levels splits,
// appending the quadrant it falls into at every level to its key and
// shrinking the bounding box to that quadrant before the next split
for (int i = 0; i < n_elements; i++) {
    float lo_x = min_x, hi_x = max_x;
    float lo_y = min_y, hi_y = max_y;
    uint64_t morton = 0;
    for (int level = 0; level < max_levels; level++) {
        float mid_x = (lo_x + hi_x) / 2.0f;
        float mid_y = (lo_y + hi_y) / 2.0f;
        // Assign quadrant based on position within the current box
        uint32_t quadrant = 0;
        if (coordinates[i].x > mid_x) { quadrant |= 1; lo_x = mid_x; } else { hi_x = mid_x; }
        if (coordinates[i].y > mid_y) { quadrant |= 2; lo_y = mid_y; } else { hi_y = mid_y; }
        // Coarse levels go into the most significant bits so they dominate the sort
        morton |= (uint64_t)quadrant << (2 * (max_levels - 1 - level));
    }
    bisection_keys[i].morton = morton;
    bisection_keys[i].original_idx = i;
}
You need a small data structure to carry each element's key through the sort:
struct BisectionKey {
    uint64_t morton;        // Concatenated quadrant bits (coarse levels in the most significant positions)
    uint32_t original_idx;  // Original element index, used to gather the data after sorting
};
Then you sort by the Morton code and reorder your data:
// Sort by spatial hierarchy
thrust::sort(bisection_keys.begin(), bisection_keys.end(),
[](const BisectionKey& a, const BisectionKey& b) {
return a.morton < b.morton;
});
// Reorder all data structures
for (int i = 0; i < n_elements; i++) {
    int src_idx = bisection_keys[i].original_idx;  // element that belongs at slot i
    reordered_data[i] = original_data[src_idx];
}
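One practical note on the sort: if bisection_keys lives in a thrust::device_vector, the comparator has to be callable on the device, so you'd mark the lambda __device__ (and compile with nvcc's --extended-lambda flag) or use a small functor; for a one-time preprocessing step, sorting the keys on the host with std::sort works just as well.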
The algorithm is O(n log n) overall, dominated by the sort, and the benefits can be significant.
In my CUDA GIS smoothing work, I tested recursive bisection reordering on the 500k hexagon dataset. Interesting note: the reordering adds only 4-7% overhead, but you get much better memory access patterns during the smoothing itself. For large datasets where memory bandwidth is the bottleneck, that trade-off is totally worth it.
The key insight is that different spatial indexing methods work better for different data structures. For my hexagonal grid work, recursive bisection was the clear winner because it handles the irregular neighbor patterns much better than the standard space-filling curves.
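To see why it drops in so naturally for irregular data, here's a rough CPU sketch of the same keying idea in Python (names are mine, not from the CUDA project): it only needs each element's centroid, not a regular grid, so it works unchanged for hexagon centers.
import numpy as np
def bisection_keys(xs, ys, max_levels=16):
    """Recursive-bisection sort keys for scattered 2D points (NumPy coordinate arrays)."""
    keys = np.zeros(len(xs), dtype=np.uint64)
    # Per-element bounding boxes shrink level by level, as in the CUDA version above
    lox = np.full(len(xs), xs.min()); hix = np.full(len(xs), xs.max())
    loy = np.full(len(xs), ys.min()); hiy = np.full(len(xs), ys.max())
    for level in range(max_levels):
        midx = (lox + hix) / 2.0
        midy = (loy + hiy) / 2.0
        right = xs > midx
        upper = ys > midy
        quadrant = right.astype(np.uint64) | (upper.astype(np.uint64) << np.uint64(1))
        keys |= quadrant << np.uint64(2 * (max_levels - 1 - level))
        lox = np.where(right, midx, lox); hix = np.where(right, hix, midx)
        loy = np.where(upper, midy, loy); hiy = np.where(upper, hiy, midy)
    return keys
# Usage: order hexagon centroids once, then permute every per-cell array the same way
# order = np.argsort(bisection_keys(centroid_x, centroid_y))
# ndvi_sorted = ndvi[order]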