Datastrukturer och databaser för STS

och

Programmeringsteknik II för F

Arne Andersson


Laborations- och programmeringsuppgift 4: Textsökning.


Java-dokumentation hittar du på http://java.sun.com/docs/
Assignment: An index for free text searching.

The purpose is to implement the technique of free text searching with suffix arrays.

Assume that the input file tobe.txt consists of the following text ( _ means blank and $ means end-of-file):


              To_be_or_not_to_be$
Then, this text file contains the following suffixes: (at positions 0 - 17):
 0:To_be_or_not_to_be$
 1:o_be_or_not_to_be$
 2:_be_or_not_to_be$
 3:be_or_not_to_be$
 4:e_or_not_to_be$
 5:_or_not_to_be$
 6:or_not_to_be$
 7:r_not_to_be$
 8:_not_to_be$
 9:not_to_be$
10:ot_to_be$
11:t_to_be$
12:_to_be$
13:to_be$
14:o_be$
15:_be$
16:be$
17:e$
If we ignore all suffixes not starting with a letter, we are left with 13 suffixes. We sort these suffixes in alphabetic order. When comparing strings, we do not differ between upper and lower case letters, this is why to_be can be placed before To_be... We also ignore non-letter characters, this is why To_be... is placed before t_to... Alltogether, we get the following sorted set of suffixes:
16:be$
 3:be_or_not_to_be$
17:e$
 4:e_or_not_to_be$
 9:not_to_be$
14:o_be$
 1:o_be_or_not_to_be$
 6:or_not_to_be$
10:ot_to_be$
 7:r_not_to_be$
13:to_be$
 0:To_be_or_not_to_be$
11:t_to_be$
We can now in a fairly compact way represent an efficient search structure for all suffixes by storing their positions in a suffix array, which easily can be written to a file for future use:
16 3 17 4 9 14 1 6 10 7 13 0 11
If we use this array to search for the string ²otto², a binary search will give the following result

compare at pos 1: otto > o_be_or_not_to_be
compare at pos 7: otto < r_not_to_be
compare at pos 6: otto > or_not_to_b
compare at pos 10: otto = ot_to_be

and we find the query string at position 10.

The assignment:

  1. Write a program that takes a text file as input and produces a file containing a sorted set of positions for all suffixes starting by a character, as illustrated above. Note that the output file should only consist of integers. The generation shuld be done in a fairly efficient way. Suggestion:
    1. Generate an array containing only the relevant suffixes. In the example above, this would be 0 1 3 4 6 7 9 10 11 13 14 16 17
    2. Sort this array with an efficient sorting algorithm. Use your own comparison function which takes as input two positions in the text.
  2. Write a new program that takes as input a text file and a suffix file, and which lets the user write query strings. The program shall read query strings from standard input, one per input line. For each query string, the program shall return the number of hits and, if the number of hits is at most 10, their positions. When searching for a string, non-letter characters should be ignored both in the query string and in the text file. Also, no difference should be made between upper and lower case characters. The input string Q ends the session.

Example of session:

> java mySearch tobe.txt tobe.suf
Welcome to my free text seareching in the file tobe.txt.
Write a query string (end by Q): TO
2 hits found at          13 0
Write a query string (end by Q):             .--. o   ()?
4 hits found at          14 1 6 10
Write a query string (end by Q):    Tobbe
No hit
Write a query string (end by Q): Beo
1 hit found at           3
Write a query string (end by Q): q
Goodbye.
>
Testa på ett stort textdokument, som t ex Bibeln (finns att ladda ner som text på webben.)