SymScan

SymScan enables extremely fast discovery of pairs of similar strings within and across large collections. It is available as a CLI tool, a Rust library, and a Python package.

SymScan is a variation on the symmetric deletion algorithm, optimised for bulk searches for similar strings within one large collection or across two (e.g. searching for similar protein sequences among a collection of 10 million). The key algorithmic difference from traditional symmetric deletion is that SymScan uses a sort-merge join instead of hash maps to discover input strings that share deletion variants. This sort-and-scan approach costs an additional factor of O(log N) in expected time complexity (where N is the total number of strings being compared) in exchange for better cache locality and more effective parallelization, and ends up being much faster for the above use case.
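To make the sort-and-scan idea concrete, here is a minimal pure-Python sketch of symmetric deletion with a sort-merge join for the single-collection case. It is an illustration only, not SymScan's implementation or API: the function names `deletion_variants` and `candidate_pairs` and the `max_deletes` parameter are hypothetical, and the sketch does none of the parallelization or memory optimisation described above.

```python
from itertools import combinations, groupby


def deletion_variants(s, max_deletes):
    """Return every string obtainable from s by deleting up to max_deletes characters."""
    variants = {s}
    for k in range(1, min(max_deletes, len(s)) + 1):
        # Choose which character positions to keep; order is preserved.
        for kept in combinations(range(len(s)), len(s) - k):
            variants.add("".join(s[i] for i in kept))
    return variants


def candidate_pairs(strings, max_deletes=1):
    """Return sorted index pairs (i, j), i < j, whose strings share a deletion variant."""
    # Step 1: emit one (variant, index) record per deletion variant.
    records = [
        (variant, idx)
        for idx, s in enumerate(strings)
        for variant in deletion_variants(s, max_deletes)
    ]
    # Step 2: sort, so that records with equal variants become adjacent.
    # This is the step that adds the O(log N) factor relative to hashing.
    records.sort()
    # Step 3: scan the sorted records; each run of equal variants
    # yields the pairs of input strings that share that variant.
    pairs = set()
    for _, run in groupby(records, key=lambda record: record[0]):
        indices = [idx for _, idx in run]
        pairs.update(combinations(indices, 2))
    return sorted(pairs)


print(candidate_pairs(["cat", "cart", "dog", "cot"], max_deletes=1))
# prints [(0, 1), (0, 3)]
```

Here "cat" and "cart" share the variant "cat", and "cat" and "cot" share "ct". Sharing a deletion variant is only a candidate test: two strings that each reach a common variant with at most `max_deletes` deletions can still be further apart than `max_deletes` edits (for example "ab" and "ba" share "a"), so a full implementation would normally follow this step with an exact edit-distance check.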