- Tuesday 1430-1630, room 106
- Thursday 1330-1530, room 105
See the detailed program below
- Web crawling
- Web page indexing
- Information retrieval
- Unsupervised learning: clustering
- PageRank and HITS
- Search engine attack and defense strategies
Exams will consist of a written test (exercises) and an oral test (discussion of theory and lab assignments).
Assignments are compulsory. Collaboration is allowed,
but every student must hand in their own material.
Timely delivery is taken into account at the exam.
To hand in an assignment, send an email to both teachers (see addresses above) containing:
- URL of the requested document or tar-gzipped source code (NOT attached to the email, but available in the student's web space);
- instructions for compilation and execution;
- name of the student.
Soumen Chakrabarti
Mining the Web - Discovering knowledge from hypertext data
Morgan Kaufmann - Elsevier, 2003.
Slide handouts (4 slides per sheet):
Slide contents collected in a single document ("article" format):
Text of the written exams:
- 2009-02-17 (PDF, 83KB)
- 2009-02-19 (PDF, 83KB)
- 2009-02-24 (PDF, 64KB)
- 2009-02-26 (PDF, 40KB)
- 2009-03-03 (PDF, 52KB)
- 2009-03-06 (PDF, 40KB)
- 2009-03-10 (PDF, 34KB)
- 2009-03-11 (PDF, 60KB)
- 2009-03-17 (PDF, 38KB)
- 2009-03-24 (PDF, 29KB)
- 2009-03-31 (PDF, 52KB)
- 2009-04-02 (PDF, 40KB)
- February 17, 2009 (Brunato)
Crawling: parallel page fetching, DNS prefetching, two-level
hashing for URLs, robots exclusion file, robot traps, network structure of a crawler.
- February 19, 2009 (Cilia)
Introduction to Information Retrieval: the classical IR system architecture.
Text operations, direct and inverted indexing, queries.
First lab assignment: building blocks for a simple web crawler.
- February 24, 2009 (Cilia)
Indexing: batch indexing and updates, index compression techniques.
Recall and precision.
- February 26, 2009 (Cilia)
Performance evaluation: recall-precision plots, interpolated precision, Break Even Point (BEP), F-measure.
Vector-space model (VSM): document and query representation.
- March 3, 2009 (Cilia)
VSM: proximity measures in TFIDF-space. Relevance feedback in VSM, Rocchio's method.
Probabilistic model of retrieval.
- March 6, 2009 (Cilia)
Probabilistic Relevance Feedback: Odds ratio and Bayesian Inference Networks.
Second lab assignment: creating a word index.
- March 10, 2009 (Brunato)
Advanced Issues: spamming, internal tag structure and hyperlinking.
Maximum likelihood parameter estimates; the likelihood ratio test for term distribution dependence.
- March 11, 2009 (Brunato)
Approximate term matching.
The Jaccard similarity index, straightforward calculation,
probabilistic definition, approximation by means of
a randomized algorithm based on permutations.
- March 17, 2009 (Brunato)
Exercises on the estimation of the Jaccard similarity index.
Introduction to PageRank.
- March 24, 2009 (Brunato)
Introduction to HITS.
- March 31, 2009 (Cilia)
Introduction to clustering, agglomerative clustering.
- April 2, 2009 (Brunato)
Clustering by quantization error minimization: the "hard" and "soft" k-means algorithms.
Geometric embedding: Kohonen's Self Organizing Maps (SOMs).
- April 7, 2009 (Brunato)
Clustering and visualization via embeddings: Multidimensional scaling, FastMap, projection and subspaces.
Singular value decomposition (see, for example, this tutorial).
Latent Semantic Indexing.
- April 9, 2009 (Brunato)
Examples of Latent Semantic Indexing.
Generative models for document distributions: the binary and multinomial model.
- April 17, 2009 (Brunato)
Generative distributions for document clustering. Mixture models.
The Expectation-Maximization algorithm.
- April 21, 2009 (Cilia)
Exercises on Latent Semantic Indexing.
- April 23, 2009 (Brunato)
Introduction to collaborative filtering. Probabilistic models (Gibbs sampling) for the estimation of collaborative filtering parameters.
Multiple Cause Mixture Model (MCMM).
The MapReduce framework. [Reference: the Google paper]
- April 28, 2009 (Cilia)
Exercises on generative models and parameter estimation.
- April 30, 2009 (Brunato)
Exercises on MapReduce.
- May 5, 2009 (Brunato)
Exercises on MapReduce and on combined document/term clustering.
- May 7, 2009 (Cilia)
Exercises on the least squares method for parameter estimation and relevance feedback.
- May 14, 2009 (Brunato)
SVD and the Netflix contest.
Location-aware recommender system and the MapReduce framework.
Supervised learning basics: classification, learning by examples, generalization properties.
Polynomial discriminators and k-Nearest-Neighbors.
- May 19, 2009 (Cilia)
Supervised learning scenario: text categorization, topic tagging, design of a learning system.
Overview of classification strategies.
Evaluating text classifiers: benchmarks, cross-validation, measures of accuracy.
- May 21, 2009 (Brunato)
Exercises on TFIDF representation, similarity, PageRank, MapReduce.
- May 26, 2009 (Cilia)
Exercises on least squares and maximum likelihood estimation.
Exercises on measures of accuracy.
- May 28, 2009 (Brunato)
Examples of real-world data mining and applications of link analysis techniques:
dimensionality reduction, PageRank, HITS.
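As an illustration of the randomized estimate of the Jaccard similarity index listed for the March 11 lecture, here is a minimal Python sketch. It is not taken from the course material: the function names (`jaccard`, `minhash_estimate`), the default number of permutations, and the convention for empty sets are assumptions made for the example. It relies on the fact that, under a uniformly random permutation of the union of two sets, the probability that both sets have the same lowest-ranked element equals their Jaccard index.

```python
import random


def jaccard(a, b):
    """Exact Jaccard index |A intersect B| / |A union B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


def minhash_estimate(a, b, num_permutations=200, seed=0):
    """Estimate the Jaccard index with random permutations.

    For each random permutation of the union, check whether the
    lowest-ranked element of A coincides with the lowest-ranked
    element of B; the fraction of coincidences is the estimate.
    """
    a, b = set(a), set(b)
    universe = sorted(a | b)  # sorted so that a fixed seed is reproducible
    if not universe:
        return 1.0  # same assumed convention as in jaccard()
    if not a or not b:
        return 0.0  # one empty set: no common element is possible
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_permutations):
        order = universe[:]
        rng.shuffle(order)
        rank = {item: position for position, item in enumerate(order)}
        first_a = min(a, key=rank.__getitem__)
        first_b = min(b, key=rank.__getitem__)
        if first_a == first_b:
            matches += 1
    return matches / num_permutations


if __name__ == "__main__":
    doc1 = {"web", "page", "crawler"}
    doc2 = {"page", "crawler", "index"}
    print("exact:   ", jaccard(doc1, doc2))
    print("estimate:", minhash_estimate(doc1, doc2))
```

The two toy documents share 2 of 4 distinct terms, so the exact index is 0.5, and the estimate approaches that value as `num_permutations` grows. A practical system would replace explicit permutations with hash functions and keep only the per-document minima, which is beyond the scope of this sketch.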
Page maintained by Mauro Brunato