Library (Python bindings)

Installation

$ pip install symscan

Usage

The Python package exposes two ways to use the symdel algorithm. The first are the pure functins get_neighbors_within() and get_neighbors_across(), which is optimised for one-off computations. In addition, the package provides a memoized implementation in the CachedRef class, which is useful for cases where you know you will be repeatedly querying against a large reference set.

Functional API

symscan.get_neighbors_within(query, max_distance=1, distance_type='levenshtein')

Detect string pairs within an input collection that lie within a threshold edit distance.

The function considers all possible combinations of string pairs from query, and returns all those where the two strings are no more than max_distance edit distance (either Levenshtein or Hamming) units apart.

Important

This function DOES NOT double-count string pairs. As seen in the examples below, each pair is represented once where the row index is always less than the col index. In other words, if you were to interpret the output as a sparse matrix, only the lower triangle will be filled.

Parameters:
queryiterable of str
max_distanceint, default=1

The maximum edit distance at which strings are considered neighbours.

distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”

The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.

Returns:
rowndarray of shape (N,), dtype=uint32

Indices of strings in the query that have neighbors.

colndarray of shape (N,), dtype=uint32

Indices of neighbor strings (i.e. query[row[i]] and query[col[i]] are neighbors).

distsndarray of shape (N,), dtype=uint8

Edit distances between neighbors (i.e. Levenshtein(query[row[i]], query[col[i]]) = dists[i]).

Examples

Look for pairs of similar strings within a string collection. Note how string pairs are not double-counted, each is counted once such that the row coordinate is always less than the col coordinate.

>>> import symscan
>>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"])
>>> row
array([0, 0, 1], dtype=uint32)
>>> col
array([1, 3, 2], dtype=uint32)
>>> dists
array([1, 1, 1], dtype=uint8)

To increase the threshold at which string pairs are considered similar, set max_distance.

>>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"], max_distance=2)
>>> row
array([0, 0, 0, 1, 1], dtype=uint32)
>>> col
array([1, 2, 3, 2, 3], dtype=uint32)
>>> dists
array([1, 2, 1, 1, 2], dtype=uint8)

Hamming distance can be used instead of Levenshtein distance.

>>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"], distance_type="hamming")
>>> row
array([0, 1], dtype=uint32)
>>> col
array([1, 2], dtype=uint32)
>>> dists
array([1, 1], dtype=uint8)
symscan.get_neighbors_across(query, reference, max_distance=1, distance_type='levenshtein')

Detect string pairs across two input collections that lie within a threshold edit distance.

The function considers all string pairs in the cartesian product of query and reference, and returns all those where the two strings are no more than max_distance edit distance (either Levenshtein or Hamming) units apart.

Parameters:
queryiterable of str
referenceiterable of str
max_distanceint, default=1

The maximum edit distance at which strings are considered neighbors.

distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”

The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.

Returns:
rowndarray of shape (N,), dtype=uint32

Indices of strings in the query that have neighbors.

colndarray of shape (N,), dtype=uint32

Indices of neighbor strings (i.e. query[row[i]] and reference[col[i]] are neighbors).

distsndarray of shape (N,), dtype=uint8

Edit distances between neighbors (i.e. Levenshtein(query[row[i]], reference[col[i]]) = dists[i]).

Examples

Look for pairs of similar strings across two collections.

>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"])
>>> row
array([1, 2], dtype=uint32)
>>> col
array([3, 3], dtype=uint32)
>>> dists
array([1, 0], dtype=uint8)

To increase the threshold at which string pairs are considered similar, set max_distance.

>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"], max_distance=2)
>>> row
array([0, 1, 2, 2], dtype=uint32)
>>> col
array([3, 3, 2, 3], dtype=uint32)
>>> dists
array([2, 1, 2, 0], dtype=uint8)

Hamming distance can be used instead of Levenshtein distance.

>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"], max_distance=2, distance_type="hamming")
>>> row
array([0, 1, 2], dtype=uint32)
>>> col
array([3, 3, 3], dtype=uint32)
>>> dists
array([2, 1, 0], dtype=uint8)

Class-based (memoized) API

class symscan.CachedRef(reference, max_distance=1, distance_type='levenshtein')

A class for memoizing the deletion variant calculations for a string collection.

When constructed, the CachedRef instance precomputes and stores the deletion variants for the supplied reference strings as a hashmap. This significantly speeds up subsequent queries against the reference, at the upfront cost of spending extra time to construct the hashmap. This is useful for use-cases where you want to repeatedly query the same reference, especially if the reference is very large. However, for one-off computations, the pure functions get_neighbors_within() and get_neighbors_across() are faster.

Note

When interpreting the index order of method return values, the string collection specified at construction is considered the reference, and any string collections specified during subsequent query calls are considered the query.

Parameters:
referenceiterable of str
max_distanceint, default=1

The maximum edit distance that this CachedRef instance will be able to support in future queries.

distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”

The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.

Methods

get_neighbors_across(query[, max_distance])

The memoized equivalent of get_neighbors_across().

get_neighbors_within([max_distance])

The memoized equivalent of get_neighbors_within().

get_neighbors_across(query, max_distance=1)

The memoized equivalent of get_neighbors_across().

Parameters:
queryiterable of str or CachedRef
max_distanceint, default=1

The maximum edit distance at which strings are considered neighbours. This must not be greater than the max_distance specified when constructing the caller instance.

Returns:
rowndarray of shape (N,), dtype=uint32

Indices of strings in the query that have neighbors.

colndarray of shape (N,), dtype=uint32

Indices of neighbor strings (i.e. query[row[i]] and reference[col[i]] are neighbors).

distsndarray of shape (N,), dtype=uint8

Edit distances between neighbors (i.e. Levenshtein(query[row[i]], reference[col[i]]) = dists[i]).

Examples

Look for pairs of similar strings across the cached reference and a query collection.

>>> import symscan
>>> cached = symscan.CachedRef(["foo", "bar", "baz", "buzz"])
>>> (row, col, dists) = cached.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"])
>>> row
array([1, 2], dtype=uint32)
>>> col
array([3, 3], dtype=uint32)
>>> dists
array([1, 0], dtype=uint8)

It is possible to use a CachedRef instance as the query collection as well.

>>> cached_query = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"])
>>> (row, col, dists) = cached.get_neighbors_across(cached_query)
>>> row
array([1, 2], dtype=uint32)
>>> col
array([3, 3], dtype=uint32)
>>> dists
array([1, 0], dtype=uint8)

To use Hamming distance, this must be specified at construction time.

>>> cached = symscan.CachedRef(["foo", "bar", "baz", "buzz"], max_distance=2, distance_type="hamming")
>>> (row, col, dists) = cached.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], max_distance=2)
>>> row
array([0, 1, 2], dtype=uint32)
>>> col
array([3, 3, 3], dtype=uint32)
>>> dists
array([2, 1, 0], dtype=uint8)
get_neighbors_within(max_distance=1)

The memoized equivalent of get_neighbors_within().

Parameters:
max_distanceint, default=1

The maximum edit distance at which strings are considered neighbours. This must not be greater than the max_distance specified when constructing the caller instance.

Returns:
rowndarray of shape (N,), dtype=uint32

Indices of strings in the cached reference that have neighbors.

colndarray of shape (N,), dtype=uint32

Indices of neighbor strings (i.e. reference[row[i]] and reference[col[i]] are neighbors).

distsndarray of shape (N,), dtype=uint8

Edit distances between neighbors (i.e. Levenshtein(reference[row[i]], reference[col[i]]) = dists[i]).

Examples

Look for pairs of similar strings within a string collection.

>>> import symscan
>>> cached = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"])
>>> (row, col, dists) = cached.get_neighbors_within()
>>> row
array([0, 0, 1], dtype=uint32)
>>> col
array([1, 3, 2], dtype=uint32)
>>> dists
array([1, 1, 1], dtype=uint8)

To use Hamming distance, this must be specified at construction time.

>>> cached = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"], distance_type="hamming")
>>> (row, col, dists) = cached.get_neighbors_within()
>>> row
array([0, 1], dtype=uint32)
>>> col
array([1, 2], dtype=uint32)
>>> dists
array([1, 1], dtype=uint8)