go-lcss

Fast Longest Common Substring in Go

APACHE-2.0 License

Stars
44

Overview

Longest Common Substring (don't confuse with Longest Common Subsequence) is a well-known computer science problem of finding the longest substring which appears in all the given strings. It can be efficiently solved in log-linear time and linear space. The implemented algorithm uses the code to build suffix arrays which was copy-pasted from the Go's standard library - it is internal to index/suffixarray and unfortunately cannot be invoked directly. There is a blog post which gives some general understanding of that algorithm, though the actual implementation is quite different (and optimized).

How To Use

There are two functions: "basic" and "advanced". The former is straightforward:

import "gopkg.in/vmarkovtsev/go-lcss.v1"

lcss.LongestCommonSubstring([]byte("ABABC"), []byte("BABCA"), []byte("ABCBA"))
// []byte("ABC")

The "advanced" function allows to cache the suffix arrays. It can be useful since the construction of suffix arrays dominates over the run time of the algorithm:


import "gopkg.in/vmarkovtsev/go-lcss.v1"

s1, s2, s3 := []byte("ABABC"), []byte("BABCA"), []byte("ABCBA")
lcss.LongestCommonSubstringWithSuffixArrays(
	[][]byte{s1, s2, s3},
	[][]int{lcss.Qsufsort(s1), lcss.Qsufsort(s2), lcss.Qsufsort(s3)})
// []byte("ABC")

Installation

go get gopkg.in/vmarkovtsev/go-lcss.v1

The code supports building under Go >= 1.8.

Contributions

...are pretty much welcome! See contributing.md and code_of_conduct.md.

License

Apache 2.0, see LICENSE. It allows you to use this code in proprietary projects.