Converting Large GeoParquet Files to Vector Tiles
Converting large GeoParquet files to vector tiles requires a streaming, chunk-based pipeline that bypasses in-memory DataFrame loading. Read the dataset in row batches with pyarrow, projecting only the columns you need; filter and simplify geometries at the source; stream features as newline-delimited GeoJSON (NDJSON) to tippecanoe; and output optimized .mbtiles or directory-based MVTs. For production workloads, never load multi-gigabyte spatial files into a single geopandas DataFrame. Instead, use spatial partitioning, attribute pruning, and zoom-level coalescing to keep the conversion process under 4 GB of RAM while generating cache-ready tiles.
Core Pipeline Architecture
Large-scale vector tile generation fails when the ingestion step attempts to materialize the entire dataset. GeoParquet’s columnar layout allows you to read only the geometry column and a strict subset of attributes, dramatically reducing I/O overhead. The conversion pipeline should follow a strict three-stage flow:
- Chunked Spatial Read: Partition the GeoParquet file by bounding box or row group. Use pyarrow.parquet.ParquetFile or pyarrow.dataset to scan only required columns and apply predicate pushdown for spatial extent filtering.
- Streaming Serialization: Convert each chunk to GeoJSON features and write them to a temporary NDJSON stream. This format is natively supported by tippecanoe and avoids the memory spike of wrapping millions of features in a JSON array.
- Tile Generation & Caching: Pipe the NDJSON stream directly into tippecanoe with aggressive simplification flags, outputting either a single .mbtiles SQLite database or a directory structure compatible with standard web tile servers.
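The two output targets in the third stage differ only in a couple of flags. The sketch below assumes a hypothetical helper, build_tippecanoe_cmd, which is not part of tippecanoe; it only assembles an argument list from tippecanoe's --output, --output-to-directory, and --no-tile-compression options. The zoom defaults are illustrative.

```python
from typing import List


def build_tippecanoe_cmd(
    output: str,
    min_zoom: int = 5,
    max_zoom: int = 14,
    as_directory: bool = False,
) -> List[str]:
    """Assemble a tippecanoe argument list (hypothetical helper)."""
    cmd = [
        "tippecanoe",
        "--minimum-zoom", str(min_zoom),
        "--maximum-zoom", str(max_zoom),
        "--drop-densest-as-needed",
        "--force",
    ]
    if as_directory:
        # Writes a {z}/{x}/{y}.pbf tree instead of a SQLite file.
        cmd += ["--output-to-directory", output]
        # Optional: without this flag the .pbf files are gzip-compressed,
        # so a static server must send "Content-Encoding: gzip".
        cmd.append("--no-tile-compression")
    else:
        # Single .mbtiles SQLite database.
        cmd += ["--output", output]
    return cmd
```

With no input file on the command line, tippecanoe reads features from standard input, so either list can be passed straight to subprocess.Popen with stdin=subprocess.PIPE.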
When designing GeoParquet Input Processing stages, prioritize row-group alignment with your target tile grid. Misaligned partitions cause redundant geometry reads and increase conversion latency by 30–50%. For authoritative details on how GeoParquet encodes geometries and metadata, consult the official GeoParquet Specification.
Production-Ready Streaming Code
The following Python script demonstrates a memory-safe approach that reads a GeoParquet file in configurable batches, streams GeoJSON features to tippecanoe via standard input, and applies zoom-level optimization. It requires pyarrow>=12.0, shapely>=2.0, and tippecanoe installed on the host system.
```python
import json
import subprocess
import tempfile
from pathlib import Path

import pyarrow.parquet as pq
from shapely import wkb
from shapely.geometry import mapping


def stream_geoparquet_to_mvt(
    parquet_path: str,
    output_path: str,
    max_zoom: int = 14,
    min_zoom: int = 5,
    simplification: float = 1.0,
    chunk_size: int = 100_000,
) -> None:
    """Stream GeoParquet chunks directly to tippecanoe for MVT generation."""
    if not Path(parquet_path).exists():
        raise FileNotFoundError(f"Parquet file not found: {parquet_path}")

    # Open file for chunked iteration (avoids full load)
    pf = pq.ParquetFile(parquet_path)

    # Build tippecanoe command with production-safe defaults
    cmd = [
        "tippecanoe",
        "--output", output_path,
        "--maximum-zoom", str(max_zoom),
        "--minimum-zoom", str(min_zoom),
        "--simplification", str(simplification),
        "--drop-densest-as-needed",
        "--coalesce-densest-as-needed",
        "--force",
        "--quiet",
    ]

    # Send stderr to a temp file rather than a pipe: an unread pipe can fill
    # up and block tippecanoe while we are still writing to its stdin
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as err_log:
        # Start subprocess with stdin pipe for NDJSON streaming
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.DEVNULL,
            stderr=err_log,
            text=True,
            encoding="utf-8",
        )
        try:
            # iter_batches takes batch_size (it has no chunk_size keyword)
            for batch in pf.iter_batches(batch_size=chunk_size):
                # Assumes WKB-encoded geometries in a column named "geometry"
                geom_list = batch.column("geometry").to_pylist()
                # Pre-convert attribute columns to Python lists for faster indexing
                attr_lists = {
                    name: batch.column(name).to_pylist()
                    for name in batch.schema.names
                    if name != "geometry"
                }
                for i, wkb_bytes in enumerate(geom_list):
                    if wkb_bytes is None:
                        continue
                    # Decode WKB -> Shapely -> GeoJSON mapping
                    geom = wkb.loads(wkb_bytes)
                    feature = {
                        "type": "Feature",
                        "geometry": mapping(geom),
                        "properties": {k: v[i] for k, v in attr_lists.items()},
                    }
                    # default=str covers values json cannot encode (e.g. dates)
                    line = json.dumps(feature, separators=(",", ":"), default=str)
                    proc.stdin.write(line + "\n")

            # Signal end of stream
            proc.stdin.close()
            proc.wait()
        except Exception as e:
            proc.kill()
            proc.wait()
            raise RuntimeError(f"Streaming pipeline failed: {e}") from e

        if proc.returncode != 0:
            err_log.seek(0)
            raise RuntimeError(
                f"tippecanoe exited with code {proc.returncode}: {err_log.read()}"
            )

    print(f"✅ Successfully generated tiles at {output_path}")


if __name__ == "__main__":
    # Example usage
    stream_geoparquet_to_mvt(
        parquet_path="data/large_dataset.parquet",
        output_path="output/tiles.mbtiles",
        max_zoom=14,
        min_zoom=4,
        simplification=1.0,
        chunk_size=250_000,
    )
```
Memory & Performance Optimization
Streaming alone isn’t enough. You must tune both the read and tile-generation phases to prevent bottlenecks:
- Attribute Pruning: Only read columns that will appear in tile properties. Every unused column increases RAM pressure during batch deserialization.
- Geometry Simplification: Apply shapely.simplify() or ST_Simplify (via DuckDB) before serialization if your source data contains survey-grade precision. Tippecanoe’s --simplification flag is applied during tile generation, after the NDJSON has already been written and parsed, so it cannot shrink the stream itself; pre-simplifying can reduce NDJSON payload size by 40–60%.
- Zoom-Level Coalescing: Use --drop-densest-as-needed and --coalesce-densest-as-needed to prevent tile bloat at lower zoom levels. For dense urban datasets, cap --maximum-zoom at 15 unless street-level detail is explicitly required.
- Row-Group Alignment: When writing the source GeoParquet file, align row groups to spatial tiles (e.g., 100k rows per group, sorted by ST_Centroid). This allows pyarrow to skip irrelevant chunks entirely during predicate filtering.
For advanced flag combinations and layer configuration, reference the official Tippecanoe Documentation. When integrating this workflow into broader infrastructure, review Automated Generation Pipelines with Tippecanoe for CI/CD patterns and cache-invalidation strategies.
Validation & Deployment
After generation, verify tile integrity before deploying to a CDN or tile server:
- Schema Check: Use mbutil or sqlite3 to inspect the tiles table and confirm zoom_level, tile_column, and tile_row ranges match your expected bounds.
- Render Test: Serve the .mbtiles via tileserver-gl or mbview and pan across extreme zoom levels to check for geometry clipping or missing attributes.
- Size Audit: Compare .mbtiles size across zoom levels and confirm that growth is gradual rather than spiking at one level. If a single zoom level exceeds 2GB, increase --simplification or raise --minimum-zoom.
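The schema check and size audit can both be run with one query against the tiles table. This is a sketch using only Python's built-in sqlite3 module; audit_mbtiles and oversized_zooms are hypothetical helper names, the tile_data column name comes from the MBTiles specification, and the 2GB ceiling is interpreted here as 2 GiB.

```python
import sqlite3
from contextlib import closing
from typing import Dict, List

TWO_GIB = 2 * 1024**3  # assumption: "2GB" read as 2 GiB


def audit_mbtiles(path: str) -> List[Dict]:
    """Summarize tile count, index ranges, and total bytes per zoom level."""
    query = """
        SELECT zoom_level,
               COUNT(*),
               MIN(tile_column), MAX(tile_column),
               MIN(tile_row), MAX(tile_row),
               SUM(LENGTH(tile_data))
        FROM tiles
        GROUP BY zoom_level
        ORDER BY zoom_level
    """
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(query).fetchall()
    return [
        {
            "zoom": zoom,
            "tiles": count,
            "col_range": (col_min, col_max),
            # MBTiles stores rows in TMS order, so y is flipped
            # relative to the XYZ scheme used in most tile URLs.
            "row_range": (row_min, row_max),
            "bytes": size,
        }
        for zoom, count, col_min, col_max, row_min, row_max, size in rows
    ]


def oversized_zooms(report: List[Dict], limit_bytes: int = TWO_GIB) -> List[int]:
    """Return the zoom levels whose stored tile data exceeds the limit."""
    return [entry["zoom"] for entry in report if entry["bytes"] > limit_bytes]
```

Printing the report for output/tiles.mbtiles gives one row per zoom level, which makes a sudden jump in bytes between adjacent levels easy to spot before deployment.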
By enforcing chunked I/O, strict attribute filtering, and direct NDJSON piping, you can reliably convert multi-gigabyte GeoParquet datasets into production-grade vector tiles without exceeding standard container memory limits.