Converting Large GeoParquet Files to Vector Tiles

Stream multi-gigabyte .geoparquet files directly into tippecanoe using pyarrow chunked reads and an NDJSON pipe. Because Python holds only one batch in memory at a time, this approach converts datasets larger than a few hundred megabytes without the Python side exhausting container memory.

When to Use This Approach

Use chunked NDJSON streaming when any of the following apply:

Your .geoparquet file exceeds 500 MB on disk (loading it via geopandas.read_parquet() will typically require 3–8× that in RAM)
You are running inside a container or CI worker with a 4 GB or 8 GB memory limit
The dataset contains 10 million or more features, regardless of file size
You need to prune attribute columns before they reach the tile encoder, keeping only fields that appear in your map style

If your file is under 200 MB and fits comfortably in available RAM, a single geopandas.read_parquet() + GeoDataFrame.to_file("out.geojson") followed by a tippecanoe invocation is simpler. For everything larger, the approach below is the production standard.

The alternative of converting to GeoJSON first and then running tippecanoe on the JSON file works but doubles disk I/O and temporarily inflates file size by 4–10× because GeoJSON stores coordinates as ASCII text instead of WKB binary.
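To make the text-versus-binary difference concrete, the sketch below encodes the same line as WKB (built by hand with struct, following the standard WKB LineString layout) and as a GeoJSON geometry. The 100 coordinates are made up for illustration, and the comparison ignores Parquet compression, so the ratio it prints is only a lower-bound intuition rather than a measurement of any real dataset.

```python
import json
import struct

# Hypothetical line: 100 vertices with full double precision, made up for illustration.
coords = [(-73.9 - i * 0.000123456789, 40.7 + i * 0.000987654321) for i in range(100)]

# WKB LineString: byte order (1 byte), geometry type 2 (uint32), point count (uint32),
# then two little-endian float64 values per vertex.
wkb_bytes = struct.pack("<BII", 1, 2, len(coords))
wkb_bytes += b"".join(struct.pack("<dd", x, y) for x, y in coords)

# The same geometry as compact GeoJSON text.
geojson_text = json.dumps(
    {"type": "LineString", "coordinates": coords}, separators=(",", ":")
)

wkb_size = len(wkb_bytes)
json_size = len(geojson_text.encode("utf-8"))
print(f"WKB: {wkb_size} bytes, GeoJSON: {json_size} bytes, ratio {json_size / wkb_size:.1f}x")
```

Attribute keys repeated on every feature and the loss of columnar compression widen the gap further in a full GeoJSON file.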

Streaming Pipeline Architecture

The pipeline reads the Parquet file in batches without constructing a full in-memory DataFrame. Each batch is decoded to GeoJSON features and written as newline-delimited JSON directly to tippecanoe’s standard input. The subprocess runs concurrently: it parses and indexes features while Python is still reading later batches, then generates tiles once stdin is closed and all input has arrived.
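The producer/consumer shape described above can be sketched without tippecanoe or a Parquet file. In this minimal example the consumer is a hypothetical stand-in (a one-line Python process that counts the NDJSON lines it receives), and feature_batches() is an invented generator that plays the role of decoded Parquet batches.

```python
import json
import subprocess
import sys

# Stand-in consumer (hypothetical): counts the NDJSON lines it receives on stdin.
# In the real pipeline this process would be tippecanoe.
CONSUMER = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]


def feature_batches(n_batches, batch_size):
    """Yield lists of GeoJSON point features, standing in for decoded Parquet batches."""
    for b in range(n_batches):
        yield [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [float(b), float(i % 90)]},
                "properties": {"batch": b},
            }
            for i in range(batch_size)
        ]


def stream_batches(batches):
    """Write each feature as one JSON line to the consumer and return its reported count."""
    proc = subprocess.Popen(
        CONSUMER, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    for batch in batches:
        for feature in batch:
            proc.stdin.write(json.dumps(feature, separators=(",", ":")) + "\n")
    proc.stdin.close()  # EOF tells the consumer that the stream is complete
    received = int(proc.stdout.read())
    proc.wait()
    return received


if __name__ == "__main__":
    print(stream_batches(feature_batches(n_batches=3, batch_size=1000)))
```

Only one batch exists in Python memory at any moment; the operating system's pipe buffer provides back-pressure when the consumer is slower than the producer.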

Specification Detail

pyarrow.parquet.ParquetFile.iter_batches() and the tippecanoe flags used here have specific constraints:

Parameter	Accepted values	Default	Notes
`batch_size` (pyarrow; `chunk_size` in the script)	positive int	65,536	50k–250k rows is practical; larger batches increase RAM, smaller batches increase Python overhead
`columns` (pyarrow)	list of column names	all columns	Always specify to avoid reading unused attributes
`--maximum-zoom`	0–32	14	Set explicitly; `-zg` asks tippecanoe to guess a maximum zoom from feature density instead
`--minimum-zoom`	0–32	0	Raise to 4–5 for global datasets to avoid oversized low-zoom tiles
`--simplification`	float ≥ 1.0	1.0	Multiplier on the default simplification tolerance; 2.0 doubles the tolerance and removes more vertices
`--drop-densest-as-needed`	flag	off	When a tile would exceed 500 KB, thins features in the densest areas until it fits
`--coalesce-densest-as-needed`	flag	off	When a tile would exceed 500 KB, merges the densest features together instead of dropping them
`--force`	flag	off	Overwrites existing output; needed in automated pipelines that rebuild the same file
`--quiet`	flag	off	Suppresses progress output; recommended in CI to avoid log flooding

Requires pyarrow>=12.0, shapely>=2.0, and tippecanoe>=2.17 installed on the host. The --read-parallel flag is not needed here — tippecanoe reads from a single stdin stream.
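A pipeline can fail fast when one of these dependencies is absent. The helper below is a small sketch, and missing_dependencies() is a name invented for this example. It only checks that the modules are importable and that a tippecanoe executable is on PATH; it does not verify version numbers.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("pyarrow", "shapely"), executables=("tippecanoe",)):
    """Return the names of Python modules and executables that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [exe for exe in executables if shutil.which(exe) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        raise SystemExit(f"Missing dependencies: {', '.join(problems)}")
    print("All dependencies found")
```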

Production Command

The following script is the complete, copy-pasteable implementation. Read the inline comments before adjusting chunk_size or zoom levels:

python

import json
import subprocess
import tempfile
from pathlib import Path

import pyarrow.parquet as pq
from shapely import wkb
from shapely.geometry import mapping


def stream_geoparquet_to_mbtiles(
    parquet_path: str,
    output_path: str,
    *,
    attribute_columns: list[str] | None = None,
    max_zoom: int = 14,
    min_zoom: int = 5,
    simplification: float = 1.0,
    chunk_size: int = 100_000,
    layer_name: str = "data",
) -> None:
    """
    Stream a GeoParquet file to tippecanoe via NDJSON on stdin.

    attribute_columns: columns to include as tile properties (None = all except geometry).
    Python-side peak RAM scales with chunk_size × avg decoded feature size, not total file size.
    """
    source = Path(parquet_path)
    if not source.exists():
        raise FileNotFoundError(f"GeoParquet not found: {parquet_path}")

    pf = pq.ParquetFile(str(source))

    # Determine which columns to read; geometry is always read but never emitted as an attribute
    schema_names = pf.schema_arrow.names
    if attribute_columns is None:
        attr_names = [c for c in schema_names if c != "geometry"]
    else:
        attr_names = [
            c for c in attribute_columns if c in schema_names and c != "geometry"
        ]

    read_columns = attr_names + ["geometry"]

    cmd = [
        "tippecanoe",
        "--output", output_path,
        "--layer", layer_name,
        "--maximum-zoom", str(max_zoom),
        "--minimum-zoom", str(min_zoom),
        "--simplification", str(simplification),
        "--drop-densest-as-needed",
        "--coalesce-densest-as-needed",
        "--force",
        "--quiet",
        # No input path: tippecanoe reads GeoJSON from stdin when no file is named
    ]

    features_written = 0
    # Send stderr to a temp file: an undrained stderr pipe can fill up and
    # block tippecanoe while Python is still writing to stdin
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as stderr_file:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.PIPE,
            stderr=stderr_file,
            text=True,
            encoding="utf-8",
        )
        try:
            for batch in pf.iter_batches(batch_size=chunk_size, columns=read_columns):
                geom_list = batch.column("geometry").to_pylist()
                # Pre-convert attribute columns to Python lists once per batch
                attr_data = {name: batch.column(name).to_pylist() for name in attr_names}

                for i, wkb_bytes in enumerate(geom_list):
                    if wkb_bytes is None:
                        continue
                    geom = wkb.loads(wkb_bytes)
                    feature = {
                        "type": "Feature",
                        "geometry": mapping(geom),
                        "properties": {k: attr_data[k][i] for k in attr_names},
                    }
                    # default=str covers values json cannot encode natively (dates, decimals)
                    proc.stdin.write(
                        json.dumps(feature, separators=(",", ":"), default=str) + "\n"
                    )
                    features_written += 1

            # Closing stdin signals end of input; tippecanoe then builds the tiles
            proc.stdin.close()
            proc.wait()
        except Exception:
            proc.kill()
            proc.wait()
            raise

        if proc.returncode != 0:
            stderr_file.seek(0)
            raise RuntimeError(
                f"tippecanoe exited {proc.returncode}: {stderr_file.read()[:500]}"
            )

    print(f"Done: {features_written:,} features → {output_path}")


if __name__ == "__main__":
    stream_geoparquet_to_mbtiles(
        parquet_path="data/buildings_north_america.geoparquet",
        output_path="dist/buildings.mbtiles",
        attribute_columns=["osm_id", "name", "building", "height"],
        max_zoom=16,
        min_zoom=5,
        simplification=1.5,
        chunk_size=200_000,
        layer_name="buildings",
    )

Key details: columns=read_columns is passed to iter_batches() so that pyarrow applies column projection at the Parquet reader level, which means unused columns are never decompressed. The tippecanoe command names no input file; with no file argument, tippecanoe reads GeoJSON from standard input. Its stderr is written to a temporary file rather than a pipe so that warnings cannot stall the stream, and the captured text is included in the RuntimeError if the build fails.

Interaction Effects

With --simplification and pre-simplification in Python. Tippecanoe’s --simplification operates after features reach the encoder. If your source data contains survey-grade coordinates (millimetre precision), call shapely.simplify(geom, tolerance=0.00001, preserve_topology=True) inside the batch loop before mapping(geom). This shrinks the NDJSON payload by 30–60% and reduces the load on tippecanoe, which still applies its own render-time simplification on top. Combining both is safe; the stages are independent.
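As an illustration of the pre-simplification step, the sketch below applies shapely.simplify() to a hypothetical densely sampled line before it would be passed to mapping(). The input geometry is invented; the tolerance is the one quoted above and is expressed in the units of the data's CRS, so for longitude/latitude data 0.00001 degrees is roughly 1 m at the equator.

```python
import shapely
from shapely.geometry import LineString, mapping

# Hypothetical densely sampled line: 1,000 vertices with tiny jitter
# around a straight east-west path (coordinates in degrees).
dense = LineString([(i * 1e-6, 1e-9 * (i % 3)) for i in range(1000)])

# Tolerance is in CRS units; 0.00001 degrees is roughly 1 m at the equator.
simplified = shapely.simplify(dense, tolerance=0.00001, preserve_topology=True)

before = len(dense.coords)
after = len(simplified.coords)
print(f"vertices: {before} -> {after}")

# The simplified geometry is what would be passed to mapping() in the batch loop.
feature_geometry = mapping(simplified)
```

If your tiles must keep detail finer than about a metre at the maximum zoom, a smaller tolerance is the safer choice.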

With attribute filtering. The attribute_columns parameter in the script above is the Python-side equivalent of tippecanoe’s -y / --include flag. Both can be used together: pass attribute_columns to restrict what pyarrow reads from Parquet (reducing deserialization time and NDJSON payload size), and optionally add -y flags to the cmd list as a belt-and-suspenders check. Only the Python-side filter shrinks the payload; -y takes effect after tippecanoe has already parsed each feature.
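One way to add the belt-and-suspenders flags is a small helper that inserts one --include option per column into the command list. with_include_flags() is a hypothetical name used only for this sketch; it assumes cmd[0] is the program name, as in the script above.

```python
def with_include_flags(cmd, columns):
    """Return a copy of cmd with one --include=NAME option per column,
    placed directly after the program name."""
    include_flags = [f"--include={name}" for name in columns]
    return [cmd[0], *include_flags, *cmd[1:]]


base_cmd = ["tippecanoe", "--output", "dist/buildings.mbtiles", "--force"]
print(with_include_flags(base_cmd, ["name", "height"]))
```

Note that once any --include option is present, tippecanoe drops every attribute that is not listed, so the list must cover every field your style uses.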

With geometry simplification algorithms. Tippecanoe simplifies both lines and polygon rings with Douglas-Peucker by default; recent releases also offer Visvalingam as an opt-in alternative (check tippecanoe --help for your build). The --simplification multiplier scales the simplification tolerance. For polygon-heavy datasets (buildings, land use), values between 1.5 and 2.5 are common production settings. For road networks where topology is critical, stay at 1.0–1.2.

With MBTiles storage limits. The output .mbtiles is a SQLite database. For datasets producing files larger than 5 GB, consider splitting into regional .mbtiles files and merging with tile-join, or targeting PMTiles output by replacing --output dist/buildings.mbtiles with --output dist/buildings.pmtiles (PMTiles output was added in tippecanoe 2.17, the minimum version listed above).

Performance Impact

chunk_size	Approx. peak RAM	Throughput (features/s)	Notes
25 000	~120 MB	~80 000	Use in 512 MB containers
100 000	~400 MB	~180 000	Good default for 4 GB workers
250 000	~900 MB	~250 000	Best throughput if RAM allows
500 000	~1.8 GB	~270 000	Diminishing returns above 250k

Throughput depends on geometry complexity and attribute count. These figures are representative for a polygon dataset with 8 properties at average complexity (buildings, land parcels). Point datasets run 30–50% faster; complex multipolygons (coastlines, administrative boundaries) run 20–40% slower.

The NDJSON pipe approach typically completes a 1 GB .geoparquet file in 4–12 minutes on a 4-core worker. Converting to GeoJSON first and then running tippecanoe on the file takes 2–4× longer due to the intermediate write and the full re-read.
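The table above can be turned into a simple lookup for choosing chunk_size from a container's memory limit. This is a rough sketch: pick_chunk_size() is a hypothetical helper, and the usable_fraction default of 0.2 is an assumed safety margin (not a measured value) that leaves most of the limit for tippecanoe itself, the interpreter, and the operating system.

```python
# (chunk_size, approximate peak RAM in MB) pairs copied from the table above.
CHUNK_RAM_MB = [(25_000, 120), (100_000, 400), (250_000, 900), (500_000, 1800)]


def pick_chunk_size(memory_limit_mb, usable_fraction=0.2):
    """Return the largest tabulated chunk_size whose peak RAM fits the usable budget.

    Falls back to the smallest tabulated chunk_size when nothing fits.
    """
    budget = memory_limit_mb * usable_fraction
    fitting = [size for size, ram in CHUNK_RAM_MB if ram <= budget]
    return max(fitting) if fitting else CHUNK_RAM_MB[0][0]


for limit in (512, 4096, 8192):
    print(limit, "MB limit ->", pick_chunk_size(limit))
```

Because the RAM figures are representative rather than guaranteed, measure peak memory on a sample of your own data before relying on the result.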

Common Mistakes

BrokenPipeError: [Errno 32] Broken pipe when writing to proc.stdin. tippecanoe has exited early, usually due to an invalid GeoJSON feature. Symptoms: the error appears mid-stream, not at the end. Diagnosis:

bash

# Decode the first failing tile to find the bad feature
tippecanoe-decode output.mbtiles 5 16 10 | jq '.features[0]'

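Fix: wrap the WKB decode in a try/except and skip or log the features that fail. The if wkb_bytes is None: continue guard in the script above handles missing geometry; to skip geometries with corrupt WKB as well, catch shapely.errors.ShapelyError around wkb.loads(). Shapely 2.x raises GEOSException, a ShapelyError subclass, for malformed WKB, so catching the older WKBReadingError alone will not work there. Also check tippecanoe's stderr output, which normally states why it rejected the input.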
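A minimal version of that guard might look like the following. decode_or_none() is a hypothetical helper name; the batch loop would call it in place of wkb.loads() and skip the feature whenever it returns None.

```python
from shapely import wkb
from shapely.errors import ShapelyError
from shapely.geometry import Point


def decode_or_none(wkb_bytes):
    """Return a geometry, or None when the value is missing or cannot be parsed as WKB."""
    if wkb_bytes is None:
        return None
    try:
        return wkb.loads(wkb_bytes)
    except ShapelyError:
        # Shapely 2.x raises GEOSException (a ShapelyError subclass) for malformed WKB.
        return None


good = decode_or_none(Point(1.0, 2.0).wkb)   # valid WKB decodes to a geometry
bad = decode_or_none(b"\x01\x02")            # truncated WKB is skipped
print(good, bad)
```

Counting the skipped features and logging the total at the end makes silent data loss visible without flooding CI logs.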

RuntimeError: tippecanoe exited 1: ... tile is too large. A single tile at a low zoom level exceeds the 500 KB MVT limit. This happens when --minimum-zoom is set too low for a dense dataset.

bash

# Increase minimum zoom and add coalescing
tippecanoe \
  --minimum-zoom=4 \
  --drop-densest-as-needed \
  --coalesce-densest-as-needed \
  --simplification=2.0 \
  ...

Raising --minimum-zoom by 1 reduces the number of features per tile at the lowest zoom by roughly 4×.
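The factor of four follows from the tile pyramid: zoom level z contains 4**z tiles, so each additional level splits every tile into four. The sketch below shows the arithmetic under the simplifying assumption that features are spread evenly over the whole world grid; real datasets are clustered, so treat the result as an average only. The feature count is a made-up example.

```python
def mean_features_per_tile(total_features, zoom):
    """Average features per tile if features were spread evenly over all 4**zoom tiles."""
    return total_features / 4 ** zoom


total = 10_000_000  # hypothetical dataset size
for zoom in (4, 5, 6):
    print(zoom, mean_features_per_tile(total, zoom))
```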

Column not found: pyarrow.lib.ArrowInvalid: Field named 'geometry' not found. The geometry column has a non-standard name (common in files exported from QGIS or PostGIS, where it may be geom, wkb_geometry, or shape). Inspect the schema first:

python

import pyarrow.parquet as pq
schema = pq.read_schema("data/input.geoparquet")
print(schema.names)  # find the geometry column name

Then pass the correct column name to iter_batches() and update the batch.column("geometry") call.
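Files that follow the GeoParquet specification also record the geometry column name in the file metadata, under the geo key, as a JSON document with a primary_column field. The sketch below reads that field from the bytes-to-bytes metadata mapping that pyarrow exposes as Schema.metadata. primary_geometry_column() is a hypothetical helper, and the fallback to "geometry" is an assumption for files that lack the metadata.

```python
import json


def primary_geometry_column(schema_metadata, fallback="geometry"):
    """Return the primary geometry column named in GeoParquet 'geo' metadata.

    schema_metadata is the bytes-to-bytes mapping exposed by pyarrow as
    Schema.metadata; it may be None for files without key-value metadata.
    """
    if not schema_metadata or b"geo" not in schema_metadata:
        return fallback
    geo = json.loads(schema_metadata[b"geo"].decode("utf-8"))
    return geo.get("primary_column", fallback)


# Example with hand-built metadata resembling a file exported with a "geom" column.
example_metadata = {
    b"geo": json.dumps({"version": "1.0.0", "primary_column": "geom"}).encode("utf-8")
}
print(primary_geometry_column(example_metadata))
```

With a real file, the mapping comes from pq.read_schema("data/input.geoparquet").metadata, and the returned name replaces the hard-coded "geometry" string.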

Related

Dropping Unused Attributes to Reduce Tile Size — use -y / --include alongside column projection to cut NDJSON payload before it reaches the encoder.
Geometry Simplification Algorithms — choose between Visvalingam and Douglas-Peucker simplification and tune the --simplification multiplier for your geometry type.
Essential Tippecanoe Flags for Production Builds — full reference for --drop-densest-as-needed, --coalesce-densest-as-needed, zoom limits, and layer naming in batch pipelines.