Final Project

Let me know if you have questions about any individual topic.

The goal of this project is to build on what you've learned from the course, including lectures, readings, guest speakers, and analyzing research papers. There are approximately half a bazillion different projects you could do.

Possible Topics

The project is open-ended by design. These topics are not mutually exclusive; you may pull ideas from multiple topics. There are lots of tutorials out there that you can follow that may inspire ideas. (Some tutorials may be out of date.)

Topics are in no particular order, although related ones are placed near each other.

Solve a new data analysis problem using Hadoop on Amazon/Google Cloud

Choose a problem with a large data set that you can solve using MapReduce. Pose questions and answer them by analyzing the data. This topic expects a larger/different problem/data set than the next two topics.

Possible Data Sets/Problems:

You can find many other resources online, for both these data sets and others.

Compare MapReduce implementations in at least one other language on Amazon/Google Cloud

You can implement MapReduce solutions with a variety of languages, e.g., Python and Hive. For each pair of students, add at least one additional language.
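
As a point of reference when comparing languages, the three MapReduce phases can be sketched in plain Python. This is only an in-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and word counting is assumed as the example problem.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (a real cluster does this across nodes)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory data set."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. Writing the same three functions in a second language is one way to make the comparison concrete.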

Compare Cloud Domains/Infrastructures

You can get educational accounts from other service providers, e.g., Microsoft Azure, Oracle, IBM.

Use Spark to solve a problem

Apache Spark is an open-source distributed processing system for large data sets. Use Spark to solve a data problem.

Using Other Cloud-based Services

Choose another web service (i.e., something other than Hadoop) to explore.

Implement a Distributed Application

Examples: an email client (which must adhere to the email protocol and talk to a mail server), a web application server (which runs server-side code, dynamically generates HTML, and returns it in the response), a game, ... Use appropriate technologies, such as sockets, Java RMI, ...
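
To give a feel for the web application server example, here is a minimal sketch using Python's standard `http.server` module. It assumes a single-threaded server on localhost; the names `render_page`, `DynamicHandler`, and `serve` are made up for illustration, and a real project would need routing, error handling, and concurrency.

```python
import html
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_page(path: str, now: datetime) -> str:
    """Build an HTML page on the fly from the requested path and the current time."""
    safe_path = html.escape(path)  # escape user-controlled text before embedding it
    return (
        "<html><body>"
        f"<h1>You requested {safe_path}</h1>"
        f"<p>Generated at {now.isoformat()}</p>"
        "</body></html>"
    )


class DynamicHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a freshly generated page."""

    def do_GET(self) -> None:
        body = render_page(self.path, datetime.now()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving requests on localhost (port 8000 is an arbitrary choice)."""
    HTTPServer(("localhost", port), DynamicHandler).serve_forever()
```

Calling `serve()` and visiting `http://localhost:8000/hello` in a browser should show a page that changes on every reload.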

You could use a web service API, e.g., Yelp Fusion or Twitter, to solve a problem. What kinds of data management, failure, scalability, and reliability issues should your application handle?
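
One common way to handle transient failures when calling a remote service is to retry with exponential backoff. The sketch below is an assumption about what such handling might look like, not a requirement: `call_with_retries` is a hypothetical helper, `operation` stands in for whatever API call your application makes, and treating only `ConnectionError` as retryable is a simplification.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on ConnectionError with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so that the delay can be replaced in tests. Deciding which errors are safe to retry, and whether a repeated request could duplicate work on the server, is part of the reliability question above.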

Literature Review/Research Proposal

Explore a topic to find out what the state of the art is in a distributed systems domain, e.g., volunteer computing, sensor networks, cluster computing, distributed file systems, ... What are the limitations? What are the next steps? Where is the field going? What is an interesting open problem you'd be interested in exploring?

I recommend using LaTeX, BibTeX, and Zotero for this type of project. ShareLaTeX and other online applications are available to make using LaTeX easier.

Intermediate Steps:

Other broad topics

For each topic, I am providing a link to a page or a paper to get you started.

Project Proposal

Submit a one-page project proposal (single-spaced) containing the following information:

Project Presentation

During the last week of classes (Dec 4-8), you will give a presentation about your project and preliminary results. In the presentation, you should cover:

All teammates should present for approximately equal amounts of time.


We're expecting about 5 minutes of discussion, plus we need a few minutes to switch between teams. Therefore:

Inspiration: Present Like Steve Jobs


Write a document that describes your system.

Recommendation: Reuse figures from your presentation to help organize and complete the writeup.


We can put this in GitHub, under the Final Project Assignment, even though each team will have very different projects.

Include the following files:

  1. Your writeup (PDF).
  2. All the files for your source code.
  3. Instructions for running your code.
  4. Sample output, which can be part of your writeup.