Syllabus

Introduction and motivation

A brief introduction to the concept of distributed systems. The goal is to explain the importance of a course about theory of distributed systems; in this sense, we will show that the design and implementation of effective distributed systems is not easy; issues related to ``imperfect'' computation and communication makes it substantially more difficult than designing centralized algorithms.

Lectures: 21/9
Slides: [pdf]
Additional notes: [pdf]

Reading material:

Basic abstractions

This lecture presents the basic abstractions we use to model a distributed system composed of computational entities (processes) communicating by exchanging messages

Lectures: 23/9, 28/9
Slides: [pdf]

Reading material and topics:

Time and Global States

This lecture introduces the concept of logical time, showing how it can be used to order distributed events and debug distributed applications.

Lectures: 28/9, 30/9
Slides: [pdf]
Reading material and topics: [BM93]

Broadcast

This lecture explains how to obtain reliable communication ``many-to-many'', from multiple nodes to multiple nodes. The relation between this problem and more complex forms of coordination is explained.

Lectures: 5/10, 7/10
Slides: [pdf]
Reading material and topics: [HT93]

Epidemic Protocols

This set of lectures will show how biology-inspired techniques (epidemics) can be used to obtain protocols characterized by efficiency and robustness, with the only drawback that guarantees are not deterministic any more, but only probablistic.

Lectures: 7/10
Slides: [pdf]
Reading material and topics: [DGH+87]

Consensus

This set of lectures is devoted to one of the main problems in distributed computing: how to reach an agreement among a collection of distributed processes. We will first show that Consensus cannot be reached in an asynchronous system [FLP85], and we will later shows how consensus can be solved ``in practice'' in real systems [CT96].

Lectures: 19/10, 21/10
Reading material and topics:

Additional reading material:

Aggregation

This lecture introduces the concept of aggregation, which is the collective name of a set of functions that provide statistical information about a system.

Lectures: 26/10
Slides: [pdf]
Reading material and topics:

Distributed transactions

A transaction is the execution of a sequence of actions that must be either entirely completed or aborted, independently of other transactions. A distributed transaction involves several servers and/or databases, and needs special protocols to be implemented.

Lectures: 28/10
Slides: [pdf]
Reading material and topics: [BT93]

Consistency and Replication

This lecture introduces the concept of consistency policies for replicated systems.

Lectures: 2/11, 4/11
Slides: [pdf]
Reading material and topics: [Tv07, Chapter 7], [Gho06, Chapter 16]

Additional reading material:

Group Communication

This lecture discusses a middleware tool called group communication, where multiple groups of processes or objects cooperate through simple primitives to provide replicated services to clients.

Lectures: 4/11
Slides: [pdf]
Reading material:

Peer Sampling

This lecture introduces the peer sampling problem, which is the large-scale equivalent of group membership in traditional systems.

Lectures: 9/11
Slides: [pdf]
Topics and Reading material:

Lab - Peersim

This lecture introduces the Peersim simulator, which will be used for the laboratory projects.

Lectures: 11/11 Slides: [pdf]
Additional material: Peersim web site

P2P Systems

This lecture discusses the differences between structured and unstructured P2P systems and briefly introduces some well-known protocols.

Lectures: 18/11, 23/11)
Slides: [pdf], [pdf], [pdf]
Reading material and topics:

Paxos

This short lecture discusses Paxos, a very practical Consensus protocol which is actually used in several real projects - including the Google file system.

Lectures: 30/11
Slides: [pdf]
Reading material: [Lam01]
Additional reading material

Checkpointing

Lectures: 9/12
Slides: [pdf]
Reading material: [GPJ10]
Additional reading material: [BvS08]

Practical Byzantine Fault Tolerance

We give a look to a paper that has recently re-started the research in the area of byzantine fault tolerance.

Lectures: 2/12
Slides: [pdf]
Reading material: [CL99]
Additional reading material:

Checkpointing

In rollback-recovery systems, processes achieve fault-tolerance by using ``stable storage'' to periodically save recovery information that can be used to re-start execution in case of failures.

Lectures: 14/12
Slides: [pdf]
Reading material: [EAWJ02]

P2P Video Streaming

This lecture introduces novel results on peer-to-peer video streaming..

Lectures: 16/12 (last year)
Slides: [pdf]
Reading material: [DBGM02,ZLLY05]
Additional reading material: [hCRSZ02,SF06]

Bibliography

CDK05
George Coulouris, Jean Dollimore, and Tim Kindberg.
Distributed systems (4th ed.): concepts and design.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[Bibtex].

GR04
Rachid Guerraoui and Louis Rodrigues.
Reliable Distributed Programming.
Springer-Verlag, Berlin, Germany, 2004.
[Bibtex].

Tv07
Andrew Tanenbaum and Maarten van Steen.
Distributed Systems: Principles and Paradigms (2nd ed.).
Prentice Hall, 2007.
[Bibtex].

AS87
Bowen Alpern and Fred B. Schneider.
Recognizing safety and liveness.
Distributed Computing, 2(3):117-126, 1987.
[PDF], [Bibtex].

HT93
Vassos Hadzilacos and Sam Toueg.
A modular approach to fault-tolerant broadcasts and related problems.
In Sape Mullender, editor, Distributed Systems (2nd edition). Addison-Wesley, 1993.
Chapter 5.
[PDF], [Bibtex].

BM93
Ozalp Babaoglu and Keith Marzullo.
Consistent global states of distributed systems: Fundamental concepts and mechanisms.
In Sape Mullender, editor, Distributed Systems (2nd edition). Addison-Wesley, 1993.
Chapter 4.
[PDF], [Bibtex].

DGH+87
Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard E. Sturgis, Daniel C. Swinehart, and Douglas B. Terry.
Epidemic algorithms for replicated database maintenance.
In In Proceedings of the 6th annual ACM Symposium on Principles of Distributed Computing Systems (PODC'87), pages 1-12, 1987.
[PDF], [Bibtex].

FLP85
Michael J. Fischer, Nancy A. Lynch, and M. S. Paterson.
Impossibility of distributed consensus with one faulty processor.
Journal of the ACM, 32(2):374-382, April 1985.
[PDF], [Bibtex].

CT96
Tushar Deepak Chandra and Sam Toueg.
Unreliable failure detectors for reliable distributed systems.
Journal of the ACM, 43(2):225-267, 1996.
[PDF], [Bibtex].

MR99
Achour Mostéfaoui and Michel Raynal.
Solving consensus using chandra-toueg's unreliable failure detectors: A general quorum-based approach.
In In Proceedings of the 13th International Symposium on Distributed Computing (DISC'00), pages 49-63, Bratislava, Slavak Republic, 1999.
[PDF], [Bibtex].

AT98
Marcos Kawazoe Aguilera and Sam Toueg.
Correctness proof of ben-or’s randomized consensus algorithm.
Technical Report TR98-1682, Cornell University, 1998.
[PDF], [Bibtex].

Tus96
Vassos Hadzilacos Sam Toueg Tushar Deepak Chandra.
The weakest failure detector for solving consensus.
Journal of the ACM, 43(4):685-722, 1996.
[PDF], [Bibtex].

Gue00
Rachid Guerraoui.
Indulgent algorithms.
In In Proceedings of the 19th annual ACM Symposium on Principles of Distributed Computing Systems (PODC'00), pages 49-63, Portland, OR, 2000.
[PDF], [Bibtex].

AT98
Marcos Kawazoe Aguilera and Sam Toueg.
Failure detection and randomization: A hybrid approach to solve consensus.
SIAM Journal of Computing, 1998.
[PDF], [Bibtex].

BO83
Michael Ben-Or.
Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract).
In In Proceedings of the 2nd annual ACM Symposium on Principles of Distributed Computing Systems (PODC'83), pages 27-30, 1983.
[PDF], [Bibtex].

Asp03
James Aspnes.
Randomized protocols for asynchronous consensus.
Distributed Computing, 16(2-3):165-175, September 2003.
[PDF], [Bibtex].

JMB05
Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu.
Gossip-based aggregation in large dynamic networks.
ACM Trans. Comput. Syst., 23(1):219-252, August 2005.
[PDF], [Bibtex].

vRBV03
Robbert van Renesse, Kenneth Birman, and Werner Vogels.
Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining.
ACM Trans. Comput. Syst., 21(2):164-206, May 2003.
[PDF], [Bibtex].

BT93
Ozalp Babaoglu and Sam Toueg.
Understanding non-blocking atomic commitment.
In Sape Mullender, editor, Distributed Systems (2nd edition). Addison-Wesley, 1993.
Chapter 6.
[PDF], [Bibtex].

Gho06
Sukumar Ghosh.
Distributed Systems: An Algorithmic Approach.
Chapman & Hall/CRC, 2006.
[Bibtex].

BMST93
Navin Budhiraja, Keith Marzullo, Fred Schneider, and Sam Toueg.
The primary-backup approach.
In Sape Mullender, editor, Distributed Systems (2nd edition). Addison-Wesley, 1993.
Chapter 8.
[PDF], [Bibtex].

Sch93
Fred Schneider.
Replication management using the state machine approach.
In Sape Mullender, editor, Distributed Systems (2nd edition). Addison-Wesley, 1993.
Chapter 7.
[PDF], [Bibtex].

CKV01
Gregory Chockler, Idit Keidar, and Roman Vitenberg.
Group communication specifications: a comprehensive study.
ACM Comput. Surv., 33(4):427-469, 2001.
[PDF], [Bibtex].

JGKv04
Márk Jelasity, Rachid Guerraoui, Anne-Marie Kermarrec, and Maarten van Steen.
The peer sampling service: Experimental evaluation of unstructured gossip-based implementations.
In Hans-Arno Jacobsen, editor, Proceedings of Middleware 2004, ACM/IFIP/USENIX International Middleware Conference, volume 3231 of Lecture Notes in Computer Science, pages 79-98, Toronto, Canada, 2004. Springer.
[PDF], [Bibtex].

VGv05
Spyros Voulgaris, Daniela Gavidia, and Maarten van Steen.
CYCLON: Inexpensive membership management for unstructured p2p overlays.
J. Network Syst. Manage., 13(2), 2005.
[PDF], [Bibtex].

JHv07
Gian Paolo Jesi, David Hales, and Maarten van Steen.
Identifying malicious peers before it's too late: A decentralized secure peer sampling service.
In Proceedings of the First International Conference on Self-Adaptive and Self-Organizing Systems (SASO'07), pages 237-246, Boston, MA, 2007. IEEE Computer Society.
[PDF], [Bibtex].

SMK+01
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications.
In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), pages 149-160, San Diego, CA, 2001. ACM, ACM Press.
[PDF], [Bibtex].

MM02
P. Maymounkov and D. Mazieres.
Kademlia: A peer-to-peer information system based on the XOR metric.
In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS’02), pages 258-263. Springer, 2002.
[PDF], [Bibtex].

Coh03
B. Cohen.
Incentives Build Robustness in BitTorrent.
In Proc. of the Workshop on Economics of Peer-to-Peer Systems, 2003.
[PDF], [Bibtex].

Lam01
L. Lamport.
Paxos made simple.
ACM SIGACT News, 32(4):18-25, 2001.
[PDF], [Bibtex].

CGR07
Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone.
Paxos made live: an engineering perspective.
In Indranil Gupta and Roger Wattenhofer, editors, Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC'07), pages 398-407, Portland, Oregon, USA, August 2007. ACM.
[PDF], [Bibtex].

GPJ10
Maarten van Steen Gian Paolo Jesi, Alberto Montresor.
Secure peer sampling.
Elsevier Computer Networks – Special Issue on Collaborative Peer-to-Peer Systems, 2010.
accepted.
[PDF], [Bibtex].

BvS08
Arno Bakker and Maarten van Steen.
Puppetcast: A secure peer sampling protocol.
In Proc. of the European Conference onComputer Network Defense, pages 3-10. IEEE Computer Society, 2008.
[PDF], [Bibtex].

CL99
Miguel Castro and Barbara Liskov.
Practical byzantine fault tolerance.
In Proc. of the 3rd Symposium on Operating Systems Design and Implementation (OSDI'99), New Orleans, USA, February 1999. USENIX Association.
[PDF], [Bibtex].

KCW+07
R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin.
Zyzzyva: Speculative byzantine fault tolerance.
In Proc. of the ACM Symposium on Operating Systems Principles (SOSP'07), Stevenson, WA, October 2007. ACM.
[PDF], [Bibtex].

EAWJ02
E. N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson.
A survey of rollback-recovery protocols in message-passing systems.
ACM Computing Surveys, 34(3):375-408, 2002.
[PDF], [Bibtex].

DBGM02
Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina.
Streaming live media over peers.
Technical Report 2002-21, Stanford InfoLab, 2002.
[PDF], [Bibtex].

ZLLY05
Xinyan Zhang, Jiangchuan Liu, Bo Li, and Y.-S.P. Yum.
Coolstreaming/donet: a data-driven overlay network for peer-to-peer live media streaming.
volume 3, pages 2102-2111 vol. 3, March 2005.
[Bibtex].

hCRSZ02
Yang hua Chu, S.G. Rao, S. Seshan, and Hui Zhang.
A case for end system multicast.
Selected Areas in Communications, IEEE Journal on, 20(8):1456-1471, October 2002.
[Bibtex].

SF06
Thomas Silverston and Olivier Fourmaux.
P2p iptv measurement: A comparison study.
CoRR, abs/cs/0610133, 2006.
[PDF], [Bibtex].



Alberto Montresor 2009-12-12