A Finnish-English dictionary app swapped a 3 GB SQLite/FTS inflection database for a 10 MB FST binary, achieving a 300x size reduction using BurntSushi’s fst Rust crate.
Key Takeaways
FSTs compress both prefixes and suffixes; Finnish’s agglutinative morphology (millions of inflections sharing a handful of endings) is an ideal fit, unlike tries which only share prefixes.
The fst crate by Andrew Gallant (ripgrep author) maps conjugations and declensions back to source definitions; static load at runtime sidesteps fst’s mutability weakness.
A naïve trie held ~400,000 entries at ~50 MB but could not scale to 40-60 million inflected forms without blowing memory budgets for low-end hardware.
The SQLite hack shipped first and worked; the FST rewrite was only possible because shipping the bad solution provided 9 months of runway to learn better approaches.
The Pro v2 binary is now ~20 MB all-in, smaller than the free v1 build, preserving the “pocket dictionary fits on any laptop” design constraint.
Hacker News Comment Review
One commenter noted the FST/MADFA structure is a rediscovery of the DAWG (Directed Acyclic Word Graph), itself called a DAFSA on Wikipedia, with roots going back roughly 20 years.
Notable Comments
@lscharen: identifies the structure as DAFSA/DAWG, a decades-old data structure the author independently rediscovered through the fst crate.