Skip to contents

This vignette shows how to use ddbs_join() to perform fast spatial join operations on large data sets with three different approaches:

1 In-memory: pass sf objects and get an sf result (DuckDB runs under the hood, no persistent DataBase). 2. Connected: pass table names stored in an existing DuckDB connection and get an sf result. 3. Write-to-DB: same as (2), but write the result to a new DuckDB table.

Let’s see a few examples. First, let’s load a few libraries and our sample data:

library(duckspatial)
# library(mapview)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE

# polygons
countries_sf  <- sf::st_read(
    system.file("spatial/countries.geojson",  package = "duckspatial"),
    quiet = TRUE
    )

# random points
set.seed(42)
n <- 10000
points_sf <- data.frame(
  id = 1:n,
  x  = runif(n, min = -180, max = 180),
  y  = runif(n, min =  -90, max =  90)
) |>
  sf::st_as_sf(coords = c("x","y"), crs = 4326)

1) In-memory: pass sf, return sf

The simplest way to perform fast spatial join. You simply pass two sf objects, and ddbs_join() spins up a temporary DuckDB, runs the join, and returns an sf.

  • When to use: quick analysis, prototyping, or when you don’t need to persist intermediate tables.
out_sf1 <- ddbs_join(
  x    = points_sf,
  y    = countries_sf,
  join = "within"
)

# quick peek
# mapview(out_sf1, zcol="NAME_ENGL")

2) Connected: pass table names in DuckDB, return sf

In the second and third approaches, we make use of a connection to an existing DuckDB database. So let’s create a fresh DuckDB connection using the ddbs_create_conn() function, which automatically install and load DuckDB spatial extension to the connection.

# create a fresh DuckDB connection
conn <- duckspatial::ddbs_create_conn()

Now, in this second approach you need first to write your layers to DuckDB, and perform the spatial join by referencing their table names. Like this:

# write data to DuckDB
ddbs_write_vector(conn, points_sf,   "points",    overwrite = TRUE)
ddbs_write_vector(conn, countries_sf, "countries", overwrite = TRUE)

# spatial join inside DuckDB; result returned as sf
out_sf2 <- ddbs_join(
  conn,
  x    = "points",
  y    = "countries",
  join = "within"
)
  • When to use: iterative workflows, larger-than-memory data, or when you’ll run multiple queries on the same tables.

3) Write-to-DB: create a new table with the join result

The output of approaches 1 and 2 is an sf object loaded to your memory. In this third approach, ddbs_join() writes a new table in the DuckDB database. You simply need to the name of the new table.

ddbs_join(
    conn = conn,
    x = "points",
    y = "countries",
    join = "within",
    name = "points_in_countries",
    overwrite = TRUE
)

# use the result in SQL (or read back as sf later)
# DBI::dbReadTable(conn, "points_in_countries") |>
#     sf::st_as_sf(wkt = 'geometry') |> 
#     head()
  • When to use: iterative workflows, larger-than-memory data, or when you’ll run multiple queries on the same tables.

Spatial Join Predicates:

A spatial predicate is really just a function that evaluates some spatial relation between two geometries and returns true or false, e.g., “does a contain b” or “is a within distance x of b”. The join argument accepts the spatial predicates:

  • "ST_Intersects": Whether a intersects b
  • "ST_Contains": Whether a contains b
  • "ST_ContainsProperly": Whether a contains b without b touching a’s boundary
  • "ST_Within": Whether a is within b
  • "ST_Overlaps": Whether a overlaps b
  • "ST_Touches": Whether a touches b
  • "ST_Equals": Whether a is equal to b
  • "ST_Crosses": Whether a crosses b
  • "ST_Covers": Whether a covers b
  • "ST_CoveredBy": Whether a is covered by b
  • "ST_DWithin": x) Whether a is within distance x of b

Clean up

Don’t forget to disconnect from the database.

duckdb::dbDisconnect(conn)