Full-Text Search with DuckDB

· databases · Source ↗

TLDR

  • DuckDB’s fts extension brings BM25-ranked full-text search to any data source with a two-line install, useful for exploratory work on email archives or large text corpora.

Key Takeaways

  • Install is trivial: INSTALL fts; LOAD fts; then PRAGMA create_fts_index('table', 'id', 'col1', 'col2') – no separate service needed.
  • BM25 tuning exposed directly in SQL: k₁ controls term-frequency weight, b controls document-length normalization, conjunctive := 1 requires all terms.
  • Stemming via Snowball (same as Postgres/Elasticsearch); debug unexpected stem misses with the Python snowballstemmer library before blaming the index.
  • Missing vs. Postgres: no ts_headline match highlighting, no phrase queries, no synonym dictionaries – workarounds require piping to grep or tmux.
  • Author’s verdict: good enough for exploratory use; easy to graduate to Postgres or Elasticsearch by dumping DuckDB if richer FTS is needed.

Hacker News Comment Review

  • Commenters note DuckDB’s ability to attach and join across heterogeneous data sources (Parquet, JSON, existing DBs) as the broader draw, with FTS as one capability on top.
  • A practical use case raised: publishing browsable, searchable mailbox archives publicly – current open-source tooling is thin and preprocessing friction is real.
  • ClickHouse Local surfaced as a less-marketed but functionally similar alternative worth benchmarking against DuckDB for local analytical workloads.

Notable Comments

  • @jiehong: asks whether anyone has compared DuckDB and ClickHouse Local directly, flagging ClickHouse as undermarketed but comparable.
  • @rahimnathwani: built a live demo publishing a small email archive via DuckDB and Claude; notes search quality and preprocessing overhead as unsolved problems.

Original | Discuss on HN