Data Mining - Autumn 2005


Data Mining (Informationsutvinning) Fourth Year Course, Period 2, 2005 (1DL105 and 1DL111)

Instructor:

Kostis Sagonas <kostis at it dot uu dot se> (he is a terrible e-mail responder)
Office Hours: Mondays 14:00-15:00 (Office: 1352)

Assistants: (Office: 1310)

Latest News (check periodically!)

18/6: The final re-exam of the course is scheduled for 17th August 2006, 9–14 in Polacksbacken, Skrivsal (in the same place as the December and the April exams). Please make sure that you register in advance.
20/2: The re-exam of the course is scheduled for 20th April 2006, 9–14 in Polacksbacken, Skrivsal (in the same place as the December exam). Please make sure that you register in advance.
15/12: The absolute deadline for handing in your "free" late assignment is the date of the Spring re-exam (usually around Easter). After that date, no assignments will be accepted. But you are of course encouraged to hand it in earlier than that.
15/12: In the December exam (as well as in subsequent ones) you are allowed to have with you the following items:
1. An English <-> Swedish (or your mother tongue) dictionary.
2. A simple (i.e., non-programmable) calculator.
3. A "cheat-sheet" (i.e., one A4 paper) with notes about topics from the course that you think are important to be reminded of. It can contain hand-written notes or printed ones in any font.
Best of luck!
11/12: The final exam of last year's course is now available. (You need to start your browser from the uu.it domain to be able to access it.)
11/12: Reading instructions for all lectures appear below.
30/11: The fourth and last assignment is out. Submission deadline is 21/12/2005, 17:00.
24/11: Note that it is important that you sign up for the exam. This should be done at the latest 14 days before the exam (i.e., you have to register by December 5th). If you do not sign up for the exam, we cannot guarantee you a seat. More information on how to register appears on this page.
17/11: The assignments' homepage has been updated with a link to the assignments' results.
17/11: The third assignment is available. There are two submission deadlines: December 5th for the contest and 13/12/2005, 17:00 for all others. By sometime on Monday 28/11/2005 the assistants will put some sets of correct results for the given data sets on the assignments' homepage.
14/11: The second assignment is out. Submission deadline is 28/11/2005, 12:00.
11/11: NOTE! All questions related to the assignments should be sent to the assistants, not to the instructor. So should the assignments' solutions themselves.
3/11: The first assignment is out. Submission deadline is 14/11/2005, 17:00.
28/10: In order to access the lecture notes and the registration list you must start a browser from a machine within the uu.se domain.
28/10: The first lecture of the course is on Monday 31 October, 13:15-15:00, in Polacksbacken MIC 1211.

After the first lecture, to register for the course you have to send the instructor an e-mail with the following information:
1. your name,
2. personnummer,
3. Uppsala University e-mail address (NOTE: not a hotmail, gmail, yahoo, ... address),
4. the course number (or number of points) for which you are registering.
After sending the e-mail, you should check the registration list. Only after you see your name in the list can you assume that you are indeed registered for the course. The registration list will close (its final version will be sent to Kansliet) on Friday November 11th. After that date, no additions or changes to it will be made.

Course Description

Data mining, or knowledge discovery from data repositories, has during the last few years emerged as one of the most exciting fields in computer science. Data mining aims at finding useful regularities in large data sets. Interest in the field is motivated by the growth of computerized data collections which are routinely kept by many organizations and commercial enterprises, and by the high potential value of patterns discovered in those collections. For instance, bar code readers at supermarkets produce extensive amounts of data about purchases. An analysis of this data can reveal previously unknown, yet useful information about the shopping behavior of the customers.

Data mining refers to a set of techniques that have been designed to efficiently find interesting pieces of information or knowledge in large amounts of data. Association rules, for instance, are a class of patterns that tell which products tend to be purchased together. There is currently a large commercial interest in the area, both for the development of data mining software and for the offering of consulting services on data mining.
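To make the association-rule idea above a little more concrete, here is a minimal sketch of how the support and confidence of a rule such as {beer} -> {diapers} can be computed. The baskets, the function names (support, confidence, pair_rules), and the thresholds are illustrative assumptions, not course material. The brute-force enumeration of item pairs is only meant to show the definitions; it is not one of the efficient algorithms covered in the lectures.

```python
from itertools import combinations

# Hypothetical shopping baskets (illustrative data, not from the course).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def pair_rules(baskets, min_support, min_confidence):
    """Brute-force list of rules {a} -> {b} that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # the pair is not frequent enough to yield a rule
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, baskets)
            if conf >= min_confidence:
                rules.append((lhs, rhs, pair_support, conf))
    return rules

if __name__ == "__main__":
    for lhs, rhs, sup, conf in pair_rules(baskets, 0.6, 0.9):
        print(f"{{{lhs}}} -> {{{rhs}}}  support={sup:.2f}  confidence={conf:.2f}")
```

With these made-up baskets, every basket that contains beer also contains diapers, so that rule has high confidence, while the reverse rule does not. Checking every pair like this quickly becomes too expensive on real data, which is why dedicated algorithms for finding frequent itemsets exist.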

In this course we explore how this interdisciplinary field brings together techniques from databases, statistics, machine learning, and information retrieval. We will discuss the main data mining methods currently used, including data cleaning, clustering and classification techniques, algorithms for association rule mining, text indexing and searching algorithms, how search engines rank pages, and recent techniques for web mining and for privacy-preserving data mining. Designing algorithms for these tasks is difficult because the input data sets are typically very large, and the tasks may be very complex. One of the main focuses in the field is the integration of these algorithms with relational databases and the mining of information from semi-structured data. We will examine the additional complications that come up in this case.

Topics Covered

Introduction to Data Mining
Classification Techniques
Clustering Techniques
Association Rules
Web Mining
Search Engines
Data Mining and Privacy

Assignments & Exam

The course will have a total of four assignments: one on classification, one on clustering, one on association rules, and one on web mining. Students taking the course for 5 rather than 4 points will need to do an extra sub-assignment for the third assignment. On all assignments, you can work in pairs. Assignment deadlines are strict but, if you really need it, you are allowed to be late on one (but only one) assignment. More information on the assignments.

Besides assignments, there will also be a written, final examination; see the schedule below.

Recommended Literature


Margaret H. Dunham.
Data Mining: Introductory and Advanced Topics.
Prentice Hall, 2002.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
Introduction to Data Mining.
Addison-Wesley, 2005.

Additional Required Reading Material

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.
A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim.
CURE: An Efficient Clustering Algorithm for Large Databases.
In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 73-84, June 1998.
Rakesh Agrawal, Tomasz Imielinski, and Arun Swami.
Mining Association Rules between Sets of Items in Large Databases.
In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 207-216, May 1993.
Rakesh Agrawal and Ramakrishnan Srikant.
Fast Algorithms for Mining Association Rules.
In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487-499, September 1994.
Jong Soo Park, Ming-Syan Chen, and Philip S. Yu.
An Effective Hash Based Algorithm for Mining Association Rules. (Also available in PDF.)
In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 175-186, May 1995.
Rakesh Agrawal and Ramakrishnan Srikant.
Mining Sequential Patterns.
In Proceedings of the International Conference on Data Engineering (ICDE), pages 3-14, March 1995.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The PageRank Citation Ranking: Bringing Order to the Web.
Technical report 1999-66, Stanford University, 1998.
David Gibson, Jon Kleinberg and Prabhakar Raghavan.
Inferring Web Communities from Link Topologies.
In Proceedings of ACM Hypertext'98: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems, pages 225-234, June 1998.

Additional Recommended Reading Material

Tomasz Imielinski and Heikki Mannila.
A Database Perspective on Knowledge Discovery.
In Communications of the ACM, 39(11):58-64, November 1996.
Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava.
Web Mining: Information and Pattern Discovery on the World Wide Web.
In Proceedings of the 9th International Conference on Tools with Artificial Intelligence, pages 558-567, November 1997.
Rakesh Agrawal and Ramakrishnan Srikant.
Mining Sequential Patterns: Generalizations and Performance Improvements.
In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT), pages 3-17, March 1996.
Sergey Brin and Lawrence Page.
The Anatomy of a Large-Scale Hypertextual Web Search Engine.
In Proceedings of the Seventh International World Wide Web Conference (WWW7).
Also in the Journal of Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
Privacy-related material
- Data Mining: Staking a Claim on Your Privacy.
  Office of the Information and Privacy Commissioner, Ontario, 1998.
- David Banisar.
  Privacy and Human Rights 2000: An international survey of privacy laws and developments.
  Electronic Privacy Information Center, 2000.
- Pretty Poor Privacy: An Assessment of P3P and Internet Privacy.
  Electronic Privacy Information Center, June 2000.

Lecture Slides

Lecture 1	Lecture 2	Lecture 3	Lecture 4	Lecture 5	Lecture 6
Lecture 7	Lecture 8	Lecture 9	Lecture 10	Lecture 11	Lecture 12

Tutorial Slides

Reading Instructions (for the Exam)

Below is the stuff you have to know to take and hopefully pass the exam, not necessarily the stuff you need to read to know about the topics covered in the course.

The table contains chapter suggestions, depending on the book you follow. Paper numbers are those of the required reading material. The table should be read as follows: for each set of lectures, read what appears in the last column and either what is in column two or what is in column three. Chapter suggestions of the form "i – j" mean reading from chapter i to chapter j (both inclusive).

Lectures	Read from the Dunham book	Read from the Tan et al. book	Additional required reading
1	Ch 1.1–1.6	Ch 1
2	Ch 2.2, 2.3, 2.9, 2.10, 3.1–3.4	Ch 2.1–2.4
3	Ch 4.1–4.3	Ch 4.1, 4.2, 5.2	Slides
4	Ch 4.4, 4.7	Ch 4.3, 4.5, 5.1	Slides
5	Ch 5.1–5.3, 5.5.3	Ch 8.1, 8.2	Slides
6	Ch 5.5.1, 5.5.2, 5.5.5, 5.6	Ch 8.3, 8.4, 9.5	Slides; Papers 1 & 2
7	Ch 6.1–6.2.3	Ch 6.1–6.3.1	Papers 3 & 4
8	Slides; Paper 5	Ch 6.3, 6.4; Paper 5
9	Slides	Ch 6.7, 7.3
10	Slides	Ch 7.4	Paper 6
11	Ch 7		Slides
12			Slides; Papers 7 & 8
13 & 14	Nothing	Nothing	Nothing

Class Schedule

All lectures and tutorials take place at Polacksbacken 1211.

Num	Date	Time	Topics Covered
1	Monday 31/10	13-15	Introduction to Data Mining
2	Tuesday 1/11	13-15	Overview of Data Mining Techniques
3	Wednesday 2/11	8-10	Classification (1)
Tut 1	Wednesday 2/11	10-12	Introduction to MATLAB
Tut 2	Monday 7/11	10-12	Tutorial on Assignment 1
4	Wednesday 9/11	10-12	Classification (2)
5	Thursday 10/11	10-12	Clustering (1): Partitional Techniques
6	Tuesday 15/11	13-15	Clustering (2): Hierarchical Techniques
Tut 3	Tuesday 15/11	15-17	Tutorial on Assignment 2
7	Wednesday 16/11	15-17	Association Rules (1): Frequent Itemsets
8	Thursday 17/11	10-12	Association Rules (2): Fast Algorithms & Rule Generation
9	Tuesday 22/11	15-17	Association Rules (3): Evaluation of Association Patterns
10	Wednesday 23/11	15-17	Mining Sequential Patterns
11	Monday 28/11	13-15	Web Content Mining
12	Tuesday 29/11	13-15	Search Engines
13	Tuesday 6/12	13-15	Data Mining and Privacy (1)
14	Tuesday 6/12	15-17	Data Mining and Privacy (2)
15	Thursday 15/12	13-15	Course Overview, Course Evaluation & Exam Tips
	Monday 19/12	8-13	EXAM

Last modified: Sun Jun 18 13:40:59 2006.
