Substring Search Algorithms

FHannes · September 5, 2012

Brute-force

The brute-force substring search algorithm checks if a string matches, by trying to match it at every position. This is the worst possible algorithm you could use to do this, but here is it anyway.

[scar]function StrPos_Brute(const SubStr, Str: string): Integer;

var

I, J, M, N: Integer;

begin

Result := 0;

M := Length(SubStr);

N := Length(Str);

for I := 1 to N - M + 1 do

begin

for J := 1 to M do

if Str[i + J - 1] <> SubStr[J] then

Break;

if J = M + 1 then

begin

Result := I;

Exit;

end;

end;[/scar]

Knuth-Morris-Pratt

The KMP substring search algorithm uses a deterministic finite-state automaton, represented by a 2 dimensional array of jumpback positions for each character at each position in the substring. You may wonder why this uses AnsiString, this is because the algorithm requires an array that is the size of the number of different characters you use, which is a lot easier for 256 ansi characters than 65536 or more unicode characters.

[scar]function StrPos_KMP(const SubStr, Str: AnsiString): Integer;

var

C: Byte;

I, J, N, M, X: Integer;

DFA: array[0..255] of TIntArray;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

for I := 0 to 255 do

SetLength(DFA, M);

DFA[Ord(SubStr[1])][0] := 1;

X := 0;

for I := 2 to M do

begin

for J := 0 to 255 do

DFA[J][i - 1] := DFA[J][X];

C := Ord(SubStr);

DFA[C][i - 1] := I;

X := DFA[C][X];

end;

J := 0;

for I := 1 to N do

begin

if J >= M then Break;

J := DFA[Ord(Str)][J];

end;

if J = M then

begin

Result := I - M;

Exit;

end;

end;[/scar]

Boyer-Moore

The BM substring search algorithm creates a shifting table prior to running the algorithm. The shifting table will indicate how much the substring has to be shifted to the right to potentially match up with the subject string. If for example the character on which your substring match failed is part of the substring, you will shift the substring until the character lines up with the subject string and try to match at that position. Keep in mind that this algorithm matches the substring from the right to the left. Also note that this also uses AnsiString because it builds a table containing data for each character.

[scar]function StrPos_BM(const SubStr, Str: AnsiString): Integer;

var

I, J, N, M, Skip: Integer;

Right: array[0..255] of Integer;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

for I := 0 to 255 do

Right := -1;

for I := 1 to M do

Right[Ord(SubStr)] := I - 1;

I := 0;

while I <= N - M + 1 do

begin

Skip := 0;

for J := M downto 1 do

if SubStr[J] <> Str[i + J - 1] then

begin

Skip := J - Right[Ord(Str[i + J - 1])] - 1;

if Skip < 1 then Skip := 1;

Break;

end;

if Skip = 0 then

begin

Result := I;

Exit;

end;

IncEx(I, Skip);

end;

end;[/scar]

Rabin-Karp

The RB substring search algorithm searches for a string by creating a hash of the number of characters of the substring from the substring and the string. When these hashes match up, it will indicate the substring matches the string. To avoid having to recalculate the entire hash over and over again, the algorithm "subtracts" the first character from the hash and adds the next one, as such, moving the hash over the entire string until it finds a match or reaches the end. The number of available characters is important in the hashing process to avoid collisions, so this is once again for AnsiString. The 997 prime number used in the hash can be changed to any (preferably large) prime number, the reason for this is also to avoid collisions, but you should look into how hashing works to find out why.

[scar]function Hash(const Text: AnsiString; const Len: Integer): Integer;

var

I: Integer;

begin

Result := 0;

for I := 1 to Len do

Result := (Result shl 8 + Ord(Text)) mod 997;

end;

function StrPos_RK(const SubStr, Str: AnsiString): Integer;

var

I, N, M, RM, StrHash, SubStrHash: Integer;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

RM := 1;

for I := 1 to M - 1 do

RM := (256 * RM) mod 997;

StrHash := Hash(Str, M);

SubStrHash := Hash(SubStr, M);

if StrHash = SubStrHash then

begin

Result := 1;

Exit;

end;

for I := M + 1 to N do

begin

StrHash := (StrHash + 997 - RM * Ord(Str[i - M]) mod 997) mod 997;

StrHash := (StrHash * 256 + Ord(Str)) mod 997;

if StrHash = SubStrHash then

begin

Result := I - M + 1;

Exit;

end;

end;[/scar]

Edited September 6, 2012 by Freddy

LordJashin · September 6, 2012

Interesting, always wondered about the methods, and now you showed how to use them in SCAR, Nice!

another bookmark I can add .

Janilabo · September 6, 2012

Brute-force is easy piece of a cake, but those other 2 are AWESOME functions!

Thanks for sharing em, Freddy.

FHannes · September 6, 2012

I'll also add the Rabin-Karp substring search algorithm in a bit. If any of them require more explanation btw, feel free to ask.

Basically, you should look at the KMP like this... It builds that DFA, which you can visualize like this:

It is a map that tells you, for example, if you're at position 5 in the string, and the string has the character A in that position, where you have to jump from there to continue matching. If you had to match an A at that position, it will jump you to the next position in the string, if you had to match something else, it will jump to wherever it has to start over, which isn't always from the start, depending on your search string.

Edited September 6, 2012 by Freddy

shadowrecon · September 6, 2012

This is awesome Freddy, i think having examples of functions like these can really help others out! You are a professional! lol

MarkD · September 7, 2012

I've experimented with something similar in the past in VHDL. KMP was the big winner there.

FHannes · September 7, 2012

Hmm, I'm not sure about that... It will depend on the entered strings, but I beleive that generally Boyer-Moore is faster than KMP.

MarkD · September 7, 2012

I'm not sure we tried Boyer-Moore, so that could be.

Also, FPGA is one big emulation. It could be one works quicker on FPGA and another on PC. I'm not that good at it, just glad I know how it works and how to deal with it.

Sign In

Substring Search Algorithms

Recommended Posts

FHannes

Link to comment

Share on other sites

LordJashin

Link to comment

Share on other sites

Janilabo

Link to comment

Share on other sites

FHannes

Link to comment

Share on other sites

shadowrecon

Link to comment

Share on other sites

MarkD

Link to comment

Share on other sites

FHannes

Link to comment

Share on other sites

MarkD

Link to comment

Share on other sites

Join the conversation

Browse

Activity