In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or set of documents (or database). The purpose of an inverted index is to allow fast, full-text searches. It is the most popular data structure used in document retrieval systems, used on a large scale (for example) in search engines.
There are two main variants of inverted indexes: An inverted file index (or just inverted file) contains a list of references to documents, for each word. A full inverted index additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created.
For example, indexing these three (very short) documents:
| File | Contents |
|---|---|
| File0 | it is what it is |
| File1 | what is it |
| File2 | it is a banana |
results in the following inverted file index:
| Word | Set of Files |
|---|---|
| a | {2} |
| banana | {2} |
| is | {0, 1, 2} |
| it | {0, 1, 2} |
| what | {0, 1} |
(Typically a small integer is used to represent each file, so "0", "1", and "2" means the documents File0, File1, and File2.)
An AND search for the terms “what”,
“is” and “it” would give the set
(“∩” means intersection):
{0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}
That means all three search terms appear only in documents 0 and 1.
An OR search for the same terms would give
{0, 1, 2}.
An AND search for the terms “banana” and “what”
results in no files found (we say the number of hits was zero.)
This index doesn't support PHRASE searches, however. For that you need a full inverted index. With the same three documents, we get the following full inverted index. The number pairs are document numbers and the position of the word within that document (starting with zero for the first word). So the entry:
banana: {(2, 3)}
means the word “banana” is in the third document
(File2), and it is the fourth word in that document
(position 3).
Here is the full inverted index for the three sample files:
| Word | Set of (File,Position) Pairs |
|---|---|
| a | {(2, 2)} |
| banana | {(2, 3)} |
| is | {(0, 1), (0, 4), (1, 1), (2, 1)} |
| it | {(0, 0), (0, 3), (1, 2), (2, 0)} |
| what | {(0, 2), (1, 0)} |
For AND and OR searches, the position doesn't matter and can be ignored. (Doing AND and OR searches is fairly straight-forward, and is left as an exercise for the reader.) Now if we run a phrase search for “what is it”, we get “hits” for all three search terms in File0 and File1. But the terms occur consecutively only in File1. You can determine this by first creating a set of all the documents that contain the first word. Then for each document in that set, check if the next word (“is”) appears in the same document, but in the next position. (Note a given word may appear in several positions in a document, so you need to check the word following each occurrence.) If not, drop that document from the set. Repeat for each word, until you run out of words (and the resulting set of documents are just those that contain the phrase) or until you run out of documents (in which case, the phrase doesn't appear in any of the documents).
This is a group project. Students should form their own groups of no more than four students per group (unless permission to form a larger group is given). It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet.
Build a simple GUI search engine that can do AND, OR, and PHRASE searches on a small set of text files. The user should be able to say the type of search to do, and enter some search terms. The results should be a list of files that match the search.
It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the user should be able to have the inverted index file(s) updated.
(Note that with HTML or Word documents, you would need to extract a plain text version before indexing. In this project, all the files are plain ASCII text.)
The inverted index must be stored in one or more file(s), and that file should be read whenever your application starts. The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents). The file format is up to you, but should have a format that is fast and simple to search. However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the index can be kept in memory. All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index. Note, the names (pathnames) of the files as well as their last modification time must be stored as well. It is your choice to use a single file or multiple files to hold the persistent data. (Don't use any DBMS however, just files.) In any case, your file format(s) must be documented completely, so that someone else, without access to your source code could use your file(s).
A copy of your group's Java source code files (including the file format document), your individual ratings (see below), and the answers to the following questions:
You only need to submit one time for your whole group. Be sure the names of all group members are listed on all files. You should use the CVS server, email, Facebook, Skype, or any means you wish, to communicate within your group.
Grading will be done for the group's results, and individuals in the group will have their grades adjusted by peer ratings. A rating of each team member's level of participation should be sent by every member to the instructor directly. Be sure to include yourself in the ratings! The rating is a number from 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work).
You can send as email to (preferred). If email is a problem for some reason, you may turn in a hard-copy. In this case the pages should be readable, dated, and stapled together. Your name should appear on the first page.
Please see your syllabus for more information about projects, and about submitting projects.