och
The purpose is to implement the technique of free text searching with suffix arrays.
Assume that the input file tobe.txt consists of the following text ( _ means blank and $ means end-of-file):
To_be_or_not_to_be$Then, this text file contains the following suffixes: (at positions 0 - 17):
0:To_be_or_not_to_be$ 1:o_be_or_not_to_be$ 2:_be_or_not_to_be$ 3:be_or_not_to_be$ 4:e_or_not_to_be$ 5:_or_not_to_be$ 6:or_not_to_be$ 7:r_not_to_be$ 8:_not_to_be$ 9:not_to_be$ 10:ot_to_be$ 11:t_to_be$ 12:_to_be$ 13:to_be$ 14:o_be$ 15:_be$ 16:be$ 17:e$If we ignore all suffixes not starting with a letter, we are left with 13 suffixes. We sort these suffixes in alphabetic order. When comparing strings, we do not differ between upper and lower case letters, this is why to_be can be placed before To_be... We also ignore non-letter characters, this is why To_be... is placed before t_to... Alltogether, we get the following sorted set of suffixes:
16:be$ 3:be_or_not_to_be$ 17:e$ 4:e_or_not_to_be$ 9:not_to_be$ 14:o_be$ 1:o_be_or_not_to_be$ 6:or_not_to_be$ 10:ot_to_be$ 7:r_not_to_be$ 13:to_be$ 0:To_be_or_not_to_be$ 11:t_to_be$We can now in a fairly compact way represent an efficient search structure for all suffixes by storing their positions in a suffix array, which easily can be written to a file for future use:
16 3 17 4 9 14 1 6 10 7 13 0 11If we use this array to search for the string ²otto², a binary search will give the following result
compare at pos 1: otto > o_be_or_not_to_be
compare at pos 7: otto < r_not_to_be
compare at pos 6: otto > or_not_to_b
compare at pos 10: otto = ot_to_be
and we find the query string at position 10.
Example of session:
> java mySearch tobe.txt tobe.suf Welcome to my free text seareching in the file tobe.txt. Write a query string (end by Q): TO 2 hits found at 13 0 Write a query string (end by Q): .--. o ()? 4 hits found at 14 1 6 10 Write a query string (end by Q): Tobbe No hit Write a query string (end by Q): Beo 1 hit found at 3 Write a query string (end by Q): q Goodbye. >Testa på ett stort textdokument, som t ex Bibeln (finns att ladda ner som text på webben.)