Jump to content
FHannes

Substring Search Algorithms

Recommended Posts

Brute-force

The brute-force substring search algorithm checks if a string matches, by trying to match it at every position. This is the worst possible algorithm you could use to do this, but here is it anyway.

 

[scar]function StrPos_Brute(const SubStr, Str: string): Integer;

var

I, J, M, N: Integer;

begin

Result := 0;

M := Length(SubStr);

N := Length(Str);

for I := 1 to N - M + 1 do

begin

for J := 1 to M do

if Str[i + J - 1] <> SubStr[J] then

Break;

if J = M + 1 then

begin

Result := I;

Exit;

end;

end;

end;[/scar]

 

 

Knuth-Morris-Pratt

The KMP substring search algorithm uses a deterministic finite-state automaton, represented by a 2 dimensional array of jumpback positions for each character at each position in the substring. You may wonder why this uses AnsiString, this is because the algorithm requires an array that is the size of the number of different characters you use, which is a lot easier for 256 ansi characters than 65536 or more unicode characters.

 

[scar]function StrPos_KMP(const SubStr, Str: AnsiString): Integer;

var

C: Byte;

I, J, N, M, X: Integer;

DFA: array[0..255] of TIntArray;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

for I := 0 to 255 do

SetLength(DFA, M);

DFA[Ord(SubStr[1])][0] := 1;

X := 0;

for I := 2 to M do

begin

for J := 0 to 255 do

DFA[J][i - 1] := DFA[J][X];

C := Ord(SubStr);

DFA[C][i - 1] := I;

X := DFA[C][X];

end;

J := 0;

for I := 1 to N do

begin

if J >= M then Break;

J := DFA[Ord(Str)][J];

end;

if J = M then

begin

Result := I - M;

Exit;

end;

end;[/scar]

 

 

Boyer-Moore

The BM substring search algorithm creates a shifting table prior to running the algorithm. The shifting table will indicate how much the substring has to be shifted to the right to potentially match up with the subject string. If for example the character on which your substring match failed is part of the substring, you will shift the substring until the character lines up with the subject string and try to match at that position. Keep in mind that this algorithm matches the substring from the right to the left. Also note that this also uses AnsiString because it builds a table containing data for each character.

 

[scar]function StrPos_BM(const SubStr, Str: AnsiString): Integer;

var

I, J, N, M, Skip: Integer;

Right: array[0..255] of Integer;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

for I := 0 to 255 do

Right := -1;

for I := 1 to M do

Right[Ord(SubStr)] := I - 1;

I := 0;

while I <= N - M + 1 do

begin

Skip := 0;

for J := M downto 1 do

if SubStr[J] <> Str[i + J - 1] then

begin

Skip := J - Right[Ord(Str[i + J - 1])] - 1;

if Skip < 1 then Skip := 1;

Break;

end;

if Skip = 0 then

begin

Result := I;

Exit;

end;

IncEx(I, Skip);

end;

end;[/scar]

 

 

Rabin-Karp

The RB substring search algorithm searches for a string by creating a hash of the number of characters of the substring from the substring and the string. When these hashes match up, it will indicate the substring matches the string. To avoid having to recalculate the entire hash over and over again, the algorithm "subtracts" the first character from the hash and adds the next one, as such, moving the hash over the entire string until it finds a match or reaches the end. The number of available characters is important in the hashing process to avoid collisions, so this is once again for AnsiString. The 997 prime number used in the hash can be changed to any (preferably large) prime number, the reason for this is also to avoid collisions, but you should look into how hashing works to find out why.

 

[scar]function Hash(const Text: AnsiString; const Len: Integer): Integer;

var

I: Integer;

begin

Result := 0;

for I := 1 to Len do

Result := (Result shl 8 + Ord(Text)) mod 997;

end;

 

function StrPos_RK(const SubStr, Str: AnsiString): Integer;

var

I, N, M, RM, StrHash, SubStrHash: Integer;

begin

Result := 0;

M := Length(SubStr);

if M = 0 then Exit;

N := Length(Str);

RM := 1;

for I := 1 to M - 1 do

RM := (256 * RM) mod 997;

StrHash := Hash(Str, M);

SubStrHash := Hash(SubStr, M);

if StrHash = SubStrHash then

begin

Result := 1;

Exit;

end;

for I := M + 1 to N do

begin

StrHash := (StrHash + 997 - RM * Ord(Str[i - M]) mod 997) mod 997;

StrHash := (StrHash * 256 + Ord(Str)) mod 997;

if StrHash = SubStrHash then

begin

Result := I - M + 1;

Exit;

end;

end;

end;[/scar]

Edited by Freddy
Link to comment
Share on other sites

I'll also add the Rabin-Karp substring search algorithm in a bit. If any of them require more explanation btw, feel free to ask.

 

Basically, you should look at the KMP like this... It builds that DFA, which you can visualize like this:

image3.png

 

It is a map that tells you, for example, if you're at position 5 in the string, and the string has the character A in that position, where you have to jump from there to continue matching. If you had to match an A at that position, it will jump you to the next position in the string, if you had to match something else, it will jump to wherever it has to start over, which isn't always from the start, depending on your search string.

Edited by Freddy
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.



×
  • Create New...