Library (Python bindings)¶
Installation¶
$ pip install symscan
Usage¶
The Python package exposes two ways to use the symdel algorithm. The first are
the pure functins get_neighbors_within() and
get_neighbors_across(), which is optimised for one-off
computations. In addition, the package provides a memoized implementation in
the CachedRef class, which is useful for cases where you
know you will be repeatedly querying against a large reference set.
Functional API¶
- symscan.get_neighbors_within(query, max_distance=1, distance_type='levenshtein')¶
Detect string pairs within an input collection that lie within a threshold edit distance.
The function considers all possible combinations of string pairs from query, and returns all those where the two strings are no more than max_distance edit distance (either Levenshtein or Hamming) units apart.
Important
This function DOES NOT double-count string pairs. As seen in the examples below, each pair is represented once where the row index is always less than the col index. In other words, if you were to interpret the output as a sparse matrix, only the lower triangle will be filled.
- Parameters:
- queryiterable of str
- max_distanceint, default=1
The maximum edit distance at which strings are considered neighbours.
- distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”
The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.
- Returns:
- rowndarray of shape (N,), dtype=uint32
Indices of strings in the query that have neighbors.
- colndarray of shape (N,), dtype=uint32
Indices of neighbor strings (i.e.
query[row[i]]andquery[col[i]]are neighbors).- distsndarray of shape (N,), dtype=uint8
Edit distances between neighbors (i.e.
Levenshtein(query[row[i]], query[col[i]]) = dists[i]).
Examples
Look for pairs of similar strings within a string collection. Note how string pairs are not double-counted, each is counted once such that the row coordinate is always less than the col coordinate.
>>> import symscan >>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"]) >>> row array([0, 0, 1], dtype=uint32) >>> col array([1, 3, 2], dtype=uint32) >>> dists array([1, 1, 1], dtype=uint8)
To increase the threshold at which string pairs are considered similar, set max_distance.
>>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"], max_distance=2) >>> row array([0, 0, 0, 1, 1], dtype=uint32) >>> col array([1, 2, 3, 2, 3], dtype=uint32) >>> dists array([1, 2, 1, 1, 2], dtype=uint8)
Hamming distance can be used instead of Levenshtein distance.
>>> (row, col, dists) = symscan.get_neighbors_within(["fizz", "fuzz", "buzz", "fizzy"], distance_type="hamming") >>> row array([0, 1], dtype=uint32) >>> col array([1, 2], dtype=uint32) >>> dists array([1, 1], dtype=uint8)
- symscan.get_neighbors_across(query, reference, max_distance=1, distance_type='levenshtein')¶
Detect string pairs across two input collections that lie within a threshold edit distance.
The function considers all string pairs in the cartesian product of query and reference, and returns all those where the two strings are no more than max_distance edit distance (either Levenshtein or Hamming) units apart.
- Parameters:
- queryiterable of str
- referenceiterable of str
- max_distanceint, default=1
The maximum edit distance at which strings are considered neighbors.
- distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”
The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.
- Returns:
- rowndarray of shape (N,), dtype=uint32
Indices of strings in the query that have neighbors.
- colndarray of shape (N,), dtype=uint32
Indices of neighbor strings (i.e.
query[row[i]]andreference[col[i]]are neighbors).- distsndarray of shape (N,), dtype=uint8
Edit distances between neighbors (i.e.
Levenshtein(query[row[i]], reference[col[i]]) = dists[i]).
Examples
Look for pairs of similar strings across two collections.
>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"]) >>> row array([1, 2], dtype=uint32) >>> col array([3, 3], dtype=uint32) >>> dists array([1, 0], dtype=uint8)
To increase the threshold at which string pairs are considered similar, set max_distance.
>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"], max_distance=2) >>> row array([0, 1, 2, 2], dtype=uint32) >>> col array([3, 3, 2, 3], dtype=uint32) >>> dists array([2, 1, 2, 0], dtype=uint8)
Hamming distance can be used instead of Levenshtein distance.
>>> (row, col, dists) = symscan.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], ["foo", "bar", "baz", "buzz"], max_distance=2, distance_type="hamming") >>> row array([0, 1, 2], dtype=uint32) >>> col array([3, 3, 3], dtype=uint32) >>> dists array([2, 1, 0], dtype=uint8)
Class-based (memoized) API¶
- class symscan.CachedRef(reference, max_distance=1, distance_type='levenshtein')¶
A class for memoizing the deletion variant calculations for a string collection.
When constructed, the CachedRef instance precomputes and stores the deletion variants for the supplied reference strings as a hashmap. This significantly speeds up subsequent queries against the reference, at the upfront cost of spending extra time to construct the hashmap. This is useful for use-cases where you want to repeatedly query the same reference, especially if the reference is very large. However, for one-off computations, the pure functions
get_neighbors_within()andget_neighbors_across()are faster.Note
When interpreting the index order of method return values, the string collection specified at construction is considered the reference, and any string collections specified during subsequent query calls are considered the query.
- Parameters:
- referenceiterable of str
- max_distanceint, default=1
The maximum edit distance that this CachedRef instance will be able to support in future queries.
- distance_type{ “levenshtein”, “hamming” }, default=”levenshtein”
The string metric to use when measuring distance between input strings. When using Hamming distance, strings of different lengths will never be considered neighbors.
Methods
get_neighbors_across(query[, max_distance])The memoized equivalent of
get_neighbors_across().get_neighbors_within([max_distance])The memoized equivalent of
get_neighbors_within().- get_neighbors_across(query, max_distance=1)¶
The memoized equivalent of
get_neighbors_across().- Parameters:
- queryiterable of str or CachedRef
- max_distanceint, default=1
The maximum edit distance at which strings are considered neighbours. This must not be greater than the max_distance specified when constructing the caller instance.
- Returns:
- rowndarray of shape (N,), dtype=uint32
Indices of strings in the query that have neighbors.
- colndarray of shape (N,), dtype=uint32
Indices of neighbor strings (i.e.
query[row[i]]andreference[col[i]]are neighbors).- distsndarray of shape (N,), dtype=uint8
Edit distances between neighbors (i.e.
Levenshtein(query[row[i]], reference[col[i]]) = dists[i]).
Examples
Look for pairs of similar strings across the cached reference and a query collection.
>>> import symscan >>> cached = symscan.CachedRef(["foo", "bar", "baz", "buzz"]) >>> (row, col, dists) = cached.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"]) >>> row array([1, 2], dtype=uint32) >>> col array([3, 3], dtype=uint32) >>> dists array([1, 0], dtype=uint8)
It is possible to use a CachedRef instance as the query collection as well.
>>> cached_query = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"]) >>> (row, col, dists) = cached.get_neighbors_across(cached_query) >>> row array([1, 2], dtype=uint32) >>> col array([3, 3], dtype=uint32) >>> dists array([1, 0], dtype=uint8)
To use Hamming distance, this must be specified at construction time.
>>> cached = symscan.CachedRef(["foo", "bar", "baz", "buzz"], max_distance=2, distance_type="hamming") >>> (row, col, dists) = cached.get_neighbors_across(["fizz", "fuzz", "buzz", "fizzy"], max_distance=2) >>> row array([0, 1, 2], dtype=uint32) >>> col array([3, 3, 3], dtype=uint32) >>> dists array([2, 1, 0], dtype=uint8)
- get_neighbors_within(max_distance=1)¶
The memoized equivalent of
get_neighbors_within().- Parameters:
- max_distanceint, default=1
The maximum edit distance at which strings are considered neighbours. This must not be greater than the max_distance specified when constructing the caller instance.
- Returns:
- rowndarray of shape (N,), dtype=uint32
Indices of strings in the cached reference that have neighbors.
- colndarray of shape (N,), dtype=uint32
Indices of neighbor strings (i.e.
reference[row[i]]andreference[col[i]]are neighbors).- distsndarray of shape (N,), dtype=uint8
Edit distances between neighbors (i.e.
Levenshtein(reference[row[i]], reference[col[i]]) = dists[i]).
Examples
Look for pairs of similar strings within a string collection.
>>> import symscan >>> cached = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"]) >>> (row, col, dists) = cached.get_neighbors_within() >>> row array([0, 0, 1], dtype=uint32) >>> col array([1, 3, 2], dtype=uint32) >>> dists array([1, 1, 1], dtype=uint8)
To use Hamming distance, this must be specified at construction time.
>>> cached = symscan.CachedRef(["fizz", "fuzz", "buzz", "fizzy"], distance_type="hamming") >>> (row, col, dists) = cached.get_neighbors_within() >>> row array([0, 1], dtype=uint32) >>> col array([1, 2], dtype=uint32) >>> dists array([1, 1], dtype=uint8)