GeoParquet Input Processing for Vector Tile Pipelines

Raw GeoParquet files almost never arrive in a state that Tippecanoe CLI can consume reliably. Mixed CRS values, invalid polygon rings, oversized attribute payloads, and missing spatial metadata all surface as silent failures or outright crashes during tile encoding. This guide covers the full ingestion sequence — schema validation, reprojection, geometry repair, attribute pruning, and spatial partitioning — so every file that enters the tile generation queue is deterministic, memory-efficient, and free of the defects that inflate MBTiles container sizes or corrupt tile boundaries.

Prerequisites

Requirement	Minimum version	Notes
Python	3.10	`venv` or `uv` isolation required
`geopandas`	1.0	`GeoDataFrame.from_arrow()` API
`pyarrow`	14.0	Row-group streaming, schema inspection
`shapely`	2.0	`make_valid()`, vectorised geometry ops
`duckdb`	0.9	Optional; parallel CRS transforms and column pushdown
`pandas`	2.0	Copy-on-write semantics
GDAL/OGR	3.6+	Parquet + FlatGeobuf drivers; needed for `ogr2ogr` conversion
RAM	16 GB	For inputs under 5 GB; 32 GB+ for continental-scale datasets
Storage	NVMe SSD	Chunked read/write requires high-IOPS I/O

The GeoParquet specification requires a geo key in the Parquet file metadata. It must name the primary geometry column and, for each geometry column, the encoding and the geometry types; the CRS and the pre-computed bounding box are optional, and an omitted CRS defaults to OGC:CRS84. Pipelines that skip metadata validation before loading geometry rows risk OOM crashes and silent CRS errors downstream.

Core Concept: GeoParquet Metadata Structure

A well-formed .geoparquet file embeds spatial metadata in the Parquet schema.metadata dict under the key b"geo". The JSON payload follows this structure:

json

{
  "version": "1.0.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "crs": { "id": { "authority": "EPSG", "code": 4326 } },
      "bbox": [-180.0, -90.0, 180.0, 90.0]
    }
  }
}

Key fields to validate before processing:

Field	Expected value	Failure mode if absent
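`version`	`"1.0.0"` or `"1.1.0"`	Reader may apply wrong encoding rules
`primary_column`	matches actual column name	`geopandas` raises `KeyError`
`encoding`	`"WKB"` (WKT is not a GeoParquet encoding; the GeoArrow encodings added in 1.1 are not handled by this pipeline)	Silent geometry `None` values
`crs.id.code`	`4326` or known EPSG; an omitted `crs` means OGC:CRS84, an explicit `null` means unknown	Tile misalignment; no reprojection error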
`bbox`	non-null, covers features	Spatial index and chunked reads fail
`geometry_types`	homogeneous list	Tippecanoe rejects mixed-type layers
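
The checks in the table can be gathered into one function that inspects the decoded geo payload and reports every problem at once. The sketch below is a minimal illustration that uses only the standard library; the function name check_geo_metadata and the wording of the messages are assumptions made for this example, not part of geopandas or pyarrow.

```python
import json


def check_geo_metadata(geo_meta: dict) -> list:
    """Return a list of human-readable problems found in a decoded 'geo' payload."""
    problems = []

    if "version" not in geo_meta:
        problems.append("missing 'version'")

    primary = geo_meta.get("primary_column")
    columns = geo_meta.get("columns", {})
    if primary is None:
        problems.append("missing 'primary_column'")
        return problems
    if primary not in columns:
        problems.append(f"primary_column {primary!r} not described in 'columns'")
        return problems

    col = columns[primary]
    if col.get("encoding") is None:
        problems.append("missing 'encoding'")
    if not col.get("geometry_types"):
        problems.append("'geometry_types' is missing or empty")
    if col.get("crs") is None:
        problems.append("'crs' is absent or null; confirm the CRS before reprojecting")
    bbox = col.get("bbox")
    if bbox is None or len(bbox) not in (4, 6):
        problems.append("'bbox' is missing or malformed")

    return problems


# The sample payload from this section, parsed from JSON
sample = json.loads("""
{
  "version": "1.0.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "crs": {"id": {"authority": "EPSG", "code": 4326}},
      "bbox": [-180.0, -90.0, 180.0, 90.0]
    }
  }
}
""")
print(check_geo_metadata(sample))  # prints []
```

Returning a list rather than raising on the first problem makes it easier to log everything that is wrong with a rejected file in one pass.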

Step-by-Step Implementation

Step 1 — Schema inspection without loading geometry

Read the Parquet metadata footer (a few kilobytes) before allocating any geometry memory. Reject files that are missing the geo key or declare an incompatible geometry encoding.

python

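import json
from pathlib import Path

import pyarrow.parquet as pq


def validate_geoparquet_schema(path: Path) -> dict:
    # read_metadata() touches only the Parquet footer, so no geometry rows are loaded
    meta = pq.read_metadata(str(path))
    arrow_schema = meta.schema.to_arrow_schema()
    raw_meta = arrow_schema.metadata or {}

    if b"geo" not in raw_meta:
        raise ValueError(f"{path.name}: missing 'geo' metadata key; not a valid GeoParquet file")

    geo_meta = json.loads(raw_meta[b"geo"])
    primary_col = geo_meta.get("primary_column", "geometry")
    columns = geo_meta.get("columns", {})
    if primary_col not in columns:
        raise ValueError(f"{path.name}: primary column {primary_col!r} is not described in 'geo' metadata")
    col_info = columns[primary_col]

    # GeoParquet 1.0 defines only "WKB"; the GeoArrow encodings added in 1.1
    # are not decoded by this pipeline, so anything else is rejected here
    if col_info.get("encoding") != "WKB":
        raise ValueError(f"Unsupported geometry encoding: {col_info.get('encoding')}")

    return geo_meta  # pass downstream for CRS inspection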

Verify: geo_meta["columns"]["geometry"]["crs"] is non-null before continuing.
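
The GeoParquet specification treats a missing crs key differently from an explicit null: a missing key means OGC:CRS84, while null means the CRS is unknown. The following sketch shows one way to turn a column's metadata into an authority string while keeping those cases apart. The helper name resolve_source_crs is hypothetical, and the sketch assumes the CRS object carries an id member with authority and code, as in the sample payload above.

```python
from typing import Optional


def resolve_source_crs(col_info: dict) -> Optional[str]:
    """Return 'AUTHORITY:CODE' for a column's CRS, or None when it cannot be determined."""
    if "crs" not in col_info:
        # Key omitted entirely: the spec defines the default as OGC:CRS84
        return "OGC:CRS84"

    crs = col_info["crs"]
    if crs is None:
        # Explicit null: the writer declared the CRS as unknown
        return None

    crs_id = crs.get("id") or {}
    authority = crs_id.get("authority")
    code = crs_id.get("code")
    if authority is None or code is None:
        # A CRS definition without an identifier needs manual inspection
        return None
    return f"{authority}:{code}"


print(resolve_source_crs({"crs": {"id": {"authority": "EPSG", "code": 3857}}}))
```

A None result is a signal to stop and ask where the data came from, not to assume longitude/latitude.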

Step 2 — CRS enforcement and reprojection

Tippecanoe expects EPSG:4326 (WGS84 longitude/latitude). Any other CRS requires an explicit to_crs() call per batch. Do not rely on implicit reprojection — always log the source CRS so the transform is auditable.

python

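import logging

import geopandas as gpd

TARGET_CRS = "EPSG:4326"


def enforce_crs(batch, source_crs_auth: str, source_crs_code: int,
                geometry_col: str = "geometry") -> gpd.GeoDataFrame:
    # from_arrow() identifies geometry columns through GeoArrow extension metadata.
    # Batches read straight from a WKB-encoded GeoParquet file usually lack it,
    # so the WKB column is decoded by hand in that case.
    has_geoarrow = any(
        f.metadata and b"ARROW:extension:name" in f.metadata for f in batch.schema
    )
    if has_geoarrow:
        gdf = gpd.GeoDataFrame.from_arrow(batch)
    else:
        df = batch.to_pandas()
        df[geometry_col] = gpd.GeoSeries.from_wkb(df[geometry_col])
        gdf = gpd.GeoDataFrame(df, geometry=geometry_col)
    source_epsg = f"{source_crs_auth}:{source_crs_code}"

    if gdf.crs is None:
        gdf = gdf.set_crs(source_epsg)
        logging.warning("CRS was None; assigned %s from metadata", source_epsg)

    if gdf.crs.to_epsg() != 4326:
        logging.info("Reprojecting %s → %s", source_epsg, TARGET_CRS)
        gdf = gdf.to_crs(TARGET_CRS)

    return gdf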

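For antimeridian-crossing datasets (bbox crosses ±180°), split geometries at the antimeridian with shapely.ops.split as part of this step; otherwise features that straddle ±180° end up with coordinates on both sides of the globe and produce crossing artefacts. Troubleshooting item 5 shows a sketch that operates on longitude/latitude coordinates.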
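
Whether a dataset needs this treatment can often be read from the bbox in the geo metadata. GeoParquet follows the GeoJSON convention for geographic bounding boxes, in which a box that crosses the antimeridian lists a western edge greater than its eastern edge. The sketch below checks for that; the function name bbox_crosses_antimeridian is an assumption for this example, and the check only applies to bboxes expressed in longitude/latitude.

```python
def bbox_crosses_antimeridian(bbox) -> bool:
    """True when a lon/lat bbox declares west > east, i.e. it wraps across the antimeridian."""
    if len(bbox) == 4:
        # [xmin, ymin, xmax, ymax]
        xmin, xmax = bbox[0], bbox[2]
    elif len(bbox) == 6:
        # [xmin, ymin, zmin, xmax, ymax, zmax]
        xmin, xmax = bbox[0], bbox[3]
    else:
        raise ValueError(f"bbox must have 4 or 6 elements, got {len(bbox)}")
    return xmin > xmax


# A box running from 170°E eastwards to 170°W wraps across ±180°
print(bbox_crosses_antimeridian([170.0, -20.0, -170.0, 10.0]))
# A whole-world box does not
print(bbox_crosses_antimeridian([-180.0, -90.0, 180.0, 90.0]))
```

Writers do not always follow the convention, so a False result does not prove that no individual feature straddles the antimeridian.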

Step 3 — Geometry repair and type normalisation

Apply make_valid() to every batch, then filter to only valid features. Explode MULTI* types into single-part geometries if the downstream layer config requires homogeneous types — Tippecanoe’s --detect-shared-borders and --no-simplification-of-shared-nodes flags behave differently on multi-part features.

python

import logging

import geopandas as gpd
from shapely.validation import make_valid  # shapely >= 2.0


def repair_geometries(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # Work on a copy so the caller's frame is untouched; already-valid geometries are left as-is
    gdf = gdf.copy()
    gdf["geometry"] = gdf["geometry"].apply(
        lambda g: make_valid(g) if g is not None and not g.is_valid else g
    )
    # Missing (None) geometries report is_valid == False, so they are dropped here as well
    invalid_count = (~gdf["geometry"].is_valid).sum()
    if invalid_count > 0:
        logging.warning("Dropping %d features still invalid after make_valid()", invalid_count)
        gdf = gdf[gdf["geometry"].is_valid].copy()

    # Explode multi-part to single-part (remove if your pipeline accepts MULTI*).
    # make_valid() can return a GeometryCollection, which explode() also splits,
    # so check geom_type afterwards if the layer must stay polygon-only.
    gdf = gdf.explode(index_parts=False).reset_index(drop=True)
    return gdf

Verify: gdf["geometry"].is_valid.all() is True before passing to step 4.

Also integrate geometry simplification at this stage when pre-processing national- or continental-scale polygon datasets; reducing vertex density before write reduces both GeoParquet file size and Tippecanoe’s peak RAM usage.
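
To make vertex reduction concrete, here is a pure-Python sketch of the Douglas-Peucker algorithm applied to a single coordinate list. It is an illustration of the idea only, not the code path this pipeline uses; in practice the simplify(tolerance, preserve_topology=True) method on a GeoSeries does the work. The function names are made up for this example, and the tolerance is in the units of the coordinates (degrees once the data is in EPSG:4326).

```python
import math


def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. a closed ring whose first and last points coincide)
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop vertices that deviate from the simplified line by no more than tolerance."""
    if len(points) < 3:
        return list(points)

    # Find the interior vertex farthest from the chord joining the endpoints
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i

    if max_dist <= tolerance:
        # Every interior vertex is within tolerance: keep only the endpoints
        return [points[0], points[-1]]

    # Otherwise keep the farthest vertex and recurse on both halves
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 0)]
print(douglas_peucker(line, 0.1))  # prints [(0, 0), (2, 0), (3, 1), (4, 0)]
```

The near-collinear vertex at (1, 0.01) is removed while the peak at (3, 1) survives. Simplifying polygons ring by ring in this way can introduce self-intersections, which is one reason to run it before, not after, the validity check in this step.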

Step 4 — Attribute pruning and type normalisation

Unused columns inflate Parquet row groups and, more critically, bloat tile attribute payloads. Tippecanoe’s -y/--include flags can drop columns at encode time, but pruning before write is faster and reduces memory pressure during the tile generation job. Tighten numeric types — int32 instead of int64, float32 where precision allows — and replace NaN with type-appropriate defaults.

python

import geopandas as gpd

KEEP_COLUMNS = ["id", "name", "feature_class", "population", "geometry"]
INT_COLUMNS  = ["population"]
FLOAT_COLUMNS: list[str] = []  # extend as needed

def prune_attributes(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # Keep only the whitelisted columns that actually exist in this batch
    present = [c for c in KEEP_COLUMNS if c in gdf.columns]
    gdf = gdf[present].copy()

    # Missing integers become 0 so the column can hold a plain int32 dtype
    for col in INT_COLUMNS:
        if col in gdf.columns:
            gdf[col] = gdf[col].fillna(0).astype("int32")

    for col in FLOAT_COLUMNS:
        if col in gdf.columns:
            gdf[col] = gdf[col].astype("float32")

    # Clean string columns: replace missing values with "" and trim surrounding whitespace
    str_cols = gdf.select_dtypes(include="object").columns.difference(["geometry"])
    for col in str_cols:
        gdf[col] = gdf[col].fillna("").str.strip()

    return gdf

For guidance on which attributes to retain versus drop to reduce tile size, the attribute filtering section covers Tippecanoe’s -y, -j, and --exclude flag interactions in detail.
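
Narrowing int64 to int32 is only safe when every value fits in the signed 32-bit range of -2,147,483,648 to 2,147,483,647. A short guard, sketched below with a hypothetical helper named fits_int32, can be run on a column's values before the cast; it is not part of pandas.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_int32(values) -> bool:
    """True when every non-None value can be stored in a signed 32-bit integer."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values if v is not None)


# City-scale populations fit comfortably
print(fits_int32([0, 8_000_000, None]))   # prints True
# A value above 2,147,483,647 does not
print(fits_int32([0, 3_000_000_000]))     # prints False
```

When the check fails, leaving the column as int64 costs a few bytes per row and avoids corrupting the attribute.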

Step 5 — Spatial partitioning and chunked write

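Sort rows by a spatial index (H3 cell or S2 cell) so that each Parquet row group covers a compact geographic area. Tippecanoe does not read Parquet itself; the benefit goes to the readers that feed it, such as ogr2ogr, DuckDB, or pyarrow datasets, which can scan row groups sequentially without random seeks and can apply bbox-based filtering per group. Target row-group sizes of 128–256 MB. Note that the row_group_size argument used below counts rows, not bytes, so choose a row count that lands in that range for your data.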

python

import logging
from pathlib import Path

import geopandas as gpd
import h3  # h3-py >= 4.0, where latlng_to_cell() replaced geo_to_h3()

def add_h3_index(gdf: gpd.GeoDataFrame, resolution: int = 5) -> gpd.GeoDataFrame:
    gdf = gdf.copy()
    # Centroids computed in EPSG:4326 are approximate (geopandas warns about this),
    # which is acceptable because they are only used as a sort key
    centroids = gdf["geometry"].centroid
    gdf["h3_cell"] = [
        h3.latlng_to_cell(pt.y, pt.x, resolution) if pt else ""
        for pt in centroids
    ]
    # The helper column is dropped after sorting so it never reaches the tiles
    return gdf.sort_values("h3_cell").drop(columns=["h3_cell"])

def write_partitioned_geoparquet(gdf: gpd.GeoDataFrame, output_path: Path,
                                  row_group_size: int = 250_000) -> None:
    gdf.to_parquet(
        str(output_path),
        compression="snappy",      # snappy for speed; zstd for smaller archives
        engine="pyarrow",
        row_group_size=row_group_size,  # number of rows per row group, not bytes
        schema_version="1.0.0",
    )
    logging.info("Wrote %d features → %s", len(gdf), output_path)

Verify: pq.read_metadata(output_path).num_row_groups matches expected count, and b"geo" is present in the written schema metadata.
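
Because the target is stated in megabytes while row_group_size counts rows, a conversion is needed. The sketch below derives a row count from an average row size; the helper name rows_per_row_group is hypothetical, and the average row size is something you would estimate yourself, for example by dividing a row group's total_byte_size by its num_rows in the metadata of a sample file.

```python
def rows_per_row_group(target_mb: float, avg_row_bytes: float) -> int:
    """Number of rows that fills roughly target_mb at the given average row size."""
    if target_mb <= 0 or avg_row_bytes <= 0:
        raise ValueError("target_mb and avg_row_bytes must be positive")
    return max(1, int(target_mb * 1024 * 1024 // avg_row_bytes))


# 128 MB groups with rows averaging 1 KiB each
print(rows_per_row_group(128, 1024))  # prints 131072
```

Polygon datasets vary widely in bytes per row, so it is worth re-estimating the average whenever the simplification tolerance or the retained attribute list changes.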

Composing the full pipeline

python

import logging
from pathlib import Path

import geopandas as gpd
import pandas as pd
import pyarrow.parquet as pq

def process_geoparquet(input_path: Path, output_path: Path) -> None:
    geo_meta = validate_geoparquet_schema(input_path)
    col_info  = geo_meta["columns"][geo_meta.get("primary_column", "geometry")]
    # "crs" may be absent or an explicit null; both fall back to EPSG:4326 here,
    # which assumes the coordinates really are longitude/latitude
    crs_info  = (col_info.get("crs") or {}).get("id") or {}
    src_auth  = crs_info.get("authority", "EPSG")
    src_code  = int(crs_info.get("code", 4326))

    parquet_file = pq.ParquetFile(str(input_path))
    chunks: list[gpd.GeoDataFrame] = []

    for idx, batch in enumerate(parquet_file.iter_batches(batch_size=500_000)):
        gdf = enforce_crs(batch, src_auth, src_code)
        gdf = repair_geometries(gdf)
        gdf = prune_attributes(gdf)
        chunks.append(gdf)
        logging.info("Batch %d: %d features retained", idx + 1, len(gdf))

    # All batches are held in memory until this point; see Troubleshooting item 4
    final = gpd.GeoDataFrame(pd.concat(chunks, ignore_index=True), crs=TARGET_CRS)
    final = add_h3_index(final)
    write_partitioned_geoparquet(final, output_path)

For datasets that exceed available RAM, replace the chunks list with a DuckDB INSERT INTO ... SELECT pipeline or a pyarrow.dataset incremental writer to avoid materialising all batches simultaneously.

Optimization Knobs

Parameter	Conservative	Aggressive	Trade-off
`batch_size` in `iter_batches()`	100 000 rows	1 000 000 rows	Higher batch size: faster I/O, higher peak RAM
`row_group_size` in `to_parquet()`	~50 000 rows (about 64 MB)	~200 000 rows (about 256 MB)	Larger groups: faster sequential scan; worse bbox pushdown granularity
H3 resolution for sort	4 (average hex ~1,770 km²)	7 (average hex ~5.2 km²)	Finer resolution: better spatial locality; higher cardinality overhead
Compression codec	`snappy`	`zstd` level 9	`zstd`: 20–40% smaller file; higher CPU cost, most of it at write time
`explode()` multi-part	always	only at z < 10	Exploding increases feature count; simplifies Tippecanoe layer logic
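
One way to keep these knobs together is a small immutable settings object with two presets that mirror the table. This is a sketch only: the class name IngestTuning and its field names are assumptions for this example, and the row counts are the approximate figures from the table, not measured values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngestTuning:
    batch_size: int                          # rows per iter_batches() call
    row_group_rows: int                      # rows per Parquet row group
    h3_resolution: int                       # H3 resolution used for the spatial sort
    compression: str                         # Parquet codec name
    compression_level: Optional[int] = None  # None leaves the codec's default level


# Presets taken from the "Conservative" and "Aggressive" columns
CONSERVATIVE = IngestTuning(100_000, 50_000, 4, "snappy")
AGGRESSIVE = IngestTuning(1_000_000, 200_000, 7, "zstd", 9)

print(CONSERVATIVE)
print(AGGRESSIVE)
```

Passing one object through the pipeline functions, instead of separate keyword arguments, makes it harder to mix a conservative batch size with an aggressive row-group size by accident.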

Integration with Adjacent Pipeline Stages

Once the cleaned .geoparquet file is written, there are two standard paths into Tippecanoe: an ogr2ogr conversion piped over stdin, or an export from DuckDB's spatial extension.

Path A — ogr2ogr conversion to GeoJSON then Tippecanoe:

bash

ogr2ogr -f GeoJSONSeq /vsistdout/ processed.geoparquet \
  | tippecanoe \
      --output=output.pmtiles \
      --layer=features \
      --minimum-zoom=4 \
      --maximum-zoom=14 \
      --drop-densest-as-needed \
      --extend-zooms-if-still-dropping \
      --read-parallel \
      -

The - at the end instructs Tippecanoe to read NDJSON from stdin, eliminating the intermediate GeoJSON file entirely. --read-parallel works with NDJSON stdin and takes advantage of the spatial ordering produced by the H3 sort.
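
When this pipe is launched from Python instead of a shell, both exit codes can be checked, which a plain shell pipeline does not do unless pipefail is set. The sketch below mirrors the flags from the command above without adding any; the function names build_commands and run_pipeline are assumptions for this example, and run_pipeline requires ogr2ogr and tippecanoe to be on the PATH.

```python
import subprocess


def build_commands(src: str, dst: str, layer: str = "features",
                   min_zoom: int = 4, max_zoom: int = 14):
    """Return the ogr2ogr and tippecanoe argument lists used in Path A."""
    ogr_cmd = ["ogr2ogr", "-f", "GeoJSONSeq", "/vsistdout/", src]
    tippecanoe_cmd = [
        "tippecanoe",
        f"--output={dst}",
        f"--layer={layer}",
        f"--minimum-zoom={min_zoom}",
        f"--maximum-zoom={max_zoom}",
        "--drop-densest-as-needed",
        "--extend-zooms-if-still-dropping",
        "--read-parallel",
        "-",
    ]
    return ogr_cmd, tippecanoe_cmd


def run_pipeline(src: str, dst: str) -> None:
    """Stream ogr2ogr output into tippecanoe and fail if either process fails."""
    ogr_cmd, tippecanoe_cmd = build_commands(src, dst)
    ogr = subprocess.Popen(ogr_cmd, stdout=subprocess.PIPE)
    tip = subprocess.Popen(tippecanoe_cmd, stdin=ogr.stdout)
    ogr.stdout.close()  # lets ogr2ogr see a broken pipe if tippecanoe exits early
    tip_rc = tip.wait()
    ogr_rc = ogr.wait()
    if ogr_rc != 0 or tip_rc != 0:
        raise RuntimeError(f"ogr2ogr exited {ogr_rc}, tippecanoe exited {tip_rc}")


ogr_cmd, tippecanoe_cmd = build_commands("processed.geoparquet", "output.pmtiles")
print(" ".join(ogr_cmd), "|", " ".join(tippecanoe_cmd))
```

Calling run_pipeline("processed.geoparquet", "output.pmtiles") would then run the same conversion as the shell version, with a failed ogr2ogr read surfacing as an exception instead of a truncated tileset.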

Path B — DuckDB spatial extension directly:

sql

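-- Export newline-delimited GeoJSON Features from DuckDB for piping into Tippecanoe.
-- Each line needs "type", "geometry", and "properties" keys to be a valid Feature.
-- On DuckDB versions that return the geometry column as a BLOB, wrap it in
-- ST_GeomFromWKB(geometry) before use.
INSTALL spatial; LOAD spatial;
COPY (
  SELECT 'Feature' AS "type",
         ST_AsGeoJSON(geometry)::JSON AS geometry,
         {'id': id, 'name': name, 'feature_class': feature_class,
          'population': population} AS properties
  FROM read_parquet('processed.geoparquet')
  WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON((-30 34,-30 72,50 72,50 34,-30 34))'))
) TO '/dev/stdout' (FORMAT JSON, ARRAY false);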

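This bbox-filtered export takes advantage of the spatial ordering written during step 5: matching features sit in a few adjacent row groups instead of being scattered across the file. Row-group skipping itself depends on the reader being able to test the filter against column statistics (for example numeric bbox columns), so inspect the query plan with EXPLAIN ANALYZE before relying on it. Where skipping applies, I/O for regional tile rebuilds can drop by 60–90%.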
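
The skipping logic itself is easy to picture: a reader that knows each row group's bounding box only has to open the groups whose box overlaps the query box. The sketch below illustrates that test with made-up bounding boxes; it does not call DuckDB or pyarrow, and the function names are assumptions for this example.

```python
def bboxes_intersect(a, b) -> bool:
    """Axis-aligned overlap test for (xmin, ymin, xmax, ymax) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def row_groups_to_read(group_bboxes, query_bbox):
    """Indices of the row groups whose bounding box overlaps the query box."""
    return [i for i, box in enumerate(group_bboxes) if bboxes_intersect(box, query_bbox)]


# Hypothetical per-row-group bounding boxes after a spatial sort
groups = [
    (-10, 35, 5, 45),     # south-western Europe
    (100, -10, 120, 5),   # south-east Asia
    (20, 50, 40, 70),     # north-eastern Europe
    (-80, 30, -60, 45),   # eastern North America
]
europe = (-30, 34, 50, 72)  # same extent as the polygon in the SQL above

print(row_groups_to_read(groups, europe))  # prints [0, 2]
```

With an unsorted file every row group tends to span most of the dataset's extent, so the same test would select all of them, which is why the H3 sort matters for regional rebuilds.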

The processed file also feeds directly into the essential Tippecanoe flags for production builds, where --coalesce-smallest-as-needed and --simplification work most predictably on spatially ordered, single-part geometry inputs.

Troubleshooting

1. ValueError: missing 'geo' metadata key

The file was written by a non-GeoParquet-aware tool (e.g. plain pandas.to_parquet() on a GeoDataFrame with geometry serialised as WKB strings).

bash

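# Inspect raw metadata
python3 -c "
import pyarrow.parquet as pq
m = pq.read_metadata('suspect.parquet').schema.to_arrow_schema().metadata or {}
print(list(m.keys()))
"
# Fix: geopandas.read_parquet() refuses files without 'geo' metadata, so load with
# pandas, decode the WKB column, and re-export with geopandas. This assumes the column
# is named 'geometry' and the coordinates are EPSG:4326; adjust both to match your data.
python3 -c "
import pandas as pd, geopandas as gpd
df = pd.read_parquet('suspect.parquet')
geom = gpd.GeoSeries.from_wkb(df['geometry'])
gdf = gpd.GeoDataFrame(df.drop(columns='geometry'), geometry=geom, crs='EPSG:4326')
gdf.to_parquet('fixed.geoparquet')
"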

2. CRS is None after from_arrow()

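The crs field in the geo metadata blob is null or empty. Once you have confirmed from the data provider or from the coordinate ranges that the values are longitude/latitude, set the CRS explicitly before any spatial operation: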

python

gdf = gdf.set_crs("EPSG:4326", allow_override=True)

3. Tippecanoe tile count is unexpectedly low at high zooms

Caused by explode() not being called — MULTI* geometries are treated as single features and clipped to one tile extent. Confirm geometry type distribution:

python

print(gdf["geometry"].geom_type.value_counts())

If MultiPolygon or MultiLineString count is non-zero, add gdf = gdf.explode(index_parts=False) before the write step.

4. MemoryError during pd.concat(chunks)

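Batch list is materialising the full dataset. Switch to a streaming write that flushes each batch as soon as it is processed. Two caveats apply: rows are written in input order, so the global H3 sort from step 5 is skipped, and the geo metadata key has to be attached by hand because ParquetWriter knows nothing about GeoParquet.

python

import json
import pyarrow as pa
import pyarrow.parquet as papq

# Minimal "geo" payload. Leaving out "crs" means OGC:CRS84 (WGS84 lon/lat) under the
# GeoParquet spec, matching the reprojected data; empty geometry_types means "not known".
GEO_OUT = json.dumps({"version": "1.0.0", "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}}}).encode()

writer = None
for idx, batch in enumerate(parquet_file.iter_batches(batch_size=100_000)):
    gdf = enforce_crs(batch, src_auth, src_code)
    gdf = repair_geometries(gdf)
    gdf = prune_attributes(gdf)
    # to_arrow() returns a generic Arrow object; pa.table() turns it into a pyarrow.Table
    table = pa.table(gdf.to_arrow(index=False))
    table = table.replace_schema_metadata({**(table.schema.metadata or {}), b"geo": GEO_OUT})
    if writer is None:
        writer = papq.ParquetWriter(str(output_path), table.schema,
                                    compression="snappy")
    writer.write_table(table)
if writer:
    writer.close()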

5. Antimeridian artefacts — polygons spanning the dateline render as continent-wide bands

CRS reprojection wraps coordinates across ±180° when the source bbox crosses the dateline.

python

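import geopandas as gpd
from shapely.geometry import LineString
from shapely.ops import split as shp_split

# The antimeridian as a vertical line at longitude 180 (coordinates in degrees)
DATELINE = LineString([(180, -90), (180, 90)])

def split_antimeridian(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    gdf = gdf.copy()
    # Only geometries whose eastern bound reaches 180 are split; None passes through.
    # shapely.ops.split supports line and polygon geometries, not points.
    gdf["geometry"] = gdf["geometry"].apply(
        lambda g: shp_split(g, DATELINE) if g is not None and g.bounds[2] > 179.9 else g
    )
    # split() returns a GeometryCollection; explode() turns its parts into separate rows
    return gdf.explode(index_parts=False).reset_index(drop=True)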

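Call this function immediately after enforce_crs() for datasets with global or Pacific-region coverage. DATELINE is defined at longitude 180 in degrees, so the function only makes sense on longitude/latitude coordinates.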