Web Mining 2008-2009
Mauro Brunato, Elisa Cilia
[Back to the course page]
Second assignment: A simple word index
| Assigned: | 2008-03-05 |
| Due: | 2008-03-19 |
| What to hand in: | URL to targzipped source code; instructions for compiling and running. |
The purpose of this assignment is to
create a word index for the pages that have been fetched by the crawler.
Instructions
- Let us define a word as a sequence of two or more
letters with the following conditions:
- it is delimited by two non-letters, or by the start of file and a
non-letter, or by a non-letter and the end of file.
- it does not occur within a tag.
- Create a text file, called stopwords.txt,
with common words that we don't want to be indexed
(is, and, ...), and populate it with
a few words.
- Populate the web pages created as part of the first assignment with
some English text.
- Extend your fetching program so that it accepts as parameters both an initial URL (as before) and the filename
stopwords.txt.
Your program must recursively fetch all pages that are linked from the initial URL (as before) and, in addition, it must create three files:
- dictionary.txt associating an integer ID to every word
(not in stopwords.txt)
occurring in any downloaded file;
- index.txt that associates to every document the list of word IDs
that occur in it with their positions (when recording the positions, also take into account the offset of the stopword placeholders).
You can represent each element of the list as a pair (word ID, pos) or as a pair (word ID, [list of positions]).
- inverse.txt that associates to every word the list of documents
it appears in with the associated frequency.
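The word definition and the stopword handling described above can be sketched as follows. This is only a minimal illustration under several assumptions that the assignment does not fix: "letters" are taken to be ASCII letters, words are folded to lowercase, a tag is anything between `<` and `>`, and a position is the ordinal of the word within the page, with stopwords still consuming a position (one possible reading of "stopword placeholders"). The names `load_stopwords` and `tokenize` are hypothetical.

```python
import re

# Naive tag matcher: anything between '<' and '>' (assumption; it does not
# handle comments or scripts that contain '>' characters).
TAG_RE = re.compile(r"<[^>]*>")
# A word is a maximal run of two or more ASCII letters (assumption: ASCII only).
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def load_stopwords(lines):
    """Build a lowercase stopword set from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def tokenize(html, stopwords):
    """Return (position, word) pairs for the non-stopwords of a page.

    Tags are replaced by a space, so they delimit words but contribute none.
    Stopwords are skipped in the output but still advance the position counter.
    """
    text = TAG_RE.sub(" ", html)
    tokens = []
    for pos, match in enumerate(WORD_RE.finditer(text)):
        word = match.group().lower()
        if word not in stopwords:
            tokens.append((pos, word))
    return tokens
```

In a real run the stopword set would come from something like `load_stopwords(open("stopwords.txt"))`. Single letters such as "a" are not words under the definition, so in this sketch they neither appear in the output nor advance the position.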
Note: it is convenient to modify the fetching program so that the file
URLs.txt associates to every URL also a document ID.
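Once every URL has a document ID, the contents of the three files can be built in a single pass over the tokenized pages. The sketch below is one possible layout, not a prescribed one: it assumes the input is a mapping from document ID to a list of (position, word) pairs with stopwords already removed, assigns word IDs in order of first appearance, and uses the (word ID, [list of positions]) representation. The function names and the line formats are hypothetical, since the assignment leaves the file formats open.

```python
def build_indexes(docs):
    """Build the three structures from docs: {doc ID: [(position, word), ...]}."""
    dictionary = {}  # word -> word ID (assigned in order of first appearance)
    index = {}       # doc ID -> {word ID: [positions]}
    inverse = {}     # word ID -> {doc ID: frequency}
    for doc_id in sorted(docs):
        postings = index.setdefault(doc_id, {})
        for pos, word in docs[doc_id]:
            word_id = dictionary.setdefault(word, len(dictionary))
            postings.setdefault(word_id, []).append(pos)
            by_doc = inverse.setdefault(word_id, {})
            by_doc[doc_id] = by_doc.get(doc_id, 0) + 1
    return dictionary, index, inverse


def dictionary_lines(dictionary):
    """One 'wordID word' line per word, ordered by word ID."""
    by_id = sorted(dictionary.items(), key=lambda item: item[1])
    return [f"{word_id} {word}" for word, word_id in by_id]


def index_lines(index):
    """One 'docID wordID:pos,pos ...' line per document."""
    lines = []
    for doc_id, postings in sorted(index.items()):
        parts = []
        for word_id, positions in sorted(postings.items()):
            joined = ",".join(str(p) for p in positions)
            parts.append(f"{word_id}:{joined}")
        lines.append(f"{doc_id} " + " ".join(parts))
    return lines


def inverse_lines(inverse):
    """One 'wordID docID:frequency ...' line per word."""
    lines = []
    for word_id, by_doc in sorted(inverse.items()):
        parts = [f"{doc_id}:{freq}" for doc_id, freq in sorted(by_doc.items())]
        lines.append(f"{word_id} " + " ".join(parts))
    return lines
```

Writing each list of lines to dictionary.txt, index.txt and inverse.txt is then a matter of joining them with newlines. Keeping the in-memory structures separate from the formatting makes it easy to switch to the (word ID, pos) pair representation if preferred.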