Final Project
Let me know if you have questions about any individual topic.
- Project Proposal: Monday, November 27, 11:59 p.m.
- Project preliminary presentations: Last week of class (Dec 4-8)
- Final Project Writeup: Due Finals Week - Dec 15 at noon.
The goal of this project is to build on what you've learned from the course, including lectures, readings, guest speakers, and analysis of research papers. There are approximately half a bazillion different projects you could do.
Possible Topics
The project is open-ended by design. These topics are not mutually exclusive; you may pull ideas from multiple topics. There are lots of tutorials out there that you can follow that may inspire ideas. (Some tutorials may be out of date.)
Topics are in no particular order, although related ones are placed near each other.
Solve a new data analysis problem using Hadoop on Amazon/Google Cloud
Choose a problem with a large data set that you can solve using MapReduce. Pose questions and answer them by analyzing the data. This topic expects a larger/different problem/data set than the next two topics.
Possible Data Sets/Problems:
- Google’s n-gram data to learn trends in how words are used
- USENET (message board) corpus
- Data.gov
- Contextual Advertising
- ZebraNet (although it does not qualify as a "large" data set)
- Clickstream analysis
- Genetics
You can find many other resources online, for both these data sets and others.
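To make the map, shuffle, and reduce phases concrete before you move to Hadoop, here is a minimal single-process sketch in Python. The record format (word, year, count), which is only loosely modeled on n-gram data, and all function names are assumptions made for illustration; a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict

# Hypothetical records loosely modeled on n-gram data: (word, year, count).
RECORDS = [
    ("cloud", 1990, 3),
    ("cloud", 2010, 40),
    ("hadoop", 2010, 12),
    ("cloud", 2010, 5),
]

def map_phase(record):
    """Emit one (key, value) pair per record; the key is (word, decade)."""
    word, year, count = record
    decade = year - year % 10
    yield (word, decade), count

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    for key, total in sorted(run_job(RECORDS).items()):
        print(key, total)
```

Posing a different question about the data usually means changing only the key emitted in `map_phase` and the aggregation in `reduce_phase`.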
Compare at least one other language for implementing MapReduce on Amazon/Google Cloud
You can implement MapReduce solutions with a variety of languages, e.g., Python and Hive. For each pair of students, add at least one additional language.
- Provide links to the tutorials you followed
- Compare and contrast your solutions with the Java solution with respect to development time, solution readability/maintainability, runtime, ...
Compare Cloud Domains/Infrastructures
You can get educational accounts from other service providers, e.g., Microsoft Azure, Oracle, IBM.
- Provide links to the tutorials you followed.
- Compare and contrast the service providers, e.g., documentation, usability, ...
Use Spark to solve a problem
Apache Spark is an open-source distributed processing system for large data sets. Use Spark to solve a data problem.
- Provide links to the tutorials you followed
- Compare and contrast your Spark solution with the MapReduce solution with respect to development time, solution readability/maintainability, runtime, ...
Using Other Cloud-based Services
Choose another web service (i.e., something other than Hadoop) to explore.
- Provide links to the tutorials you followed
- What is the service used for? What are the pros and cons of the service? What equivalent services do other providers offer?
Implement a Distributed Application
Examples: an email client (which must adhere to the email protocol and talk to a mail server), a web application server (an application that runs server-side code, dynamically generates HTML, and returns it in its responses), a game, ... Use appropriate technologies, such as sockets, Java RMI, ...
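As one illustration of the sockets option, here is a minimal sketch of a TCP server and client using Python's standard library. It is not a full application: the uppercase "protocol", the function names, and the use of an OS-assigned port on localhost are all assumptions made for the demo.

```python
import socket
import threading

def serve_once(server_sock):
    """Accept one connection, read one message, reply with it uppercased."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def request(host, port, message):
    """Connect, send one message, and return the server's reply as text."""
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message.encode("utf-8"))
        return client.recv(1024).decode("utf-8")

def demo():
    """Start a one-shot server in a thread and send it a single request."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        # Port 0 asks the OS for any free port (an assumption for this demo).
        server_sock.bind(("127.0.0.1", 0))
        server_sock.listen(1)
        server_sock.settimeout(5)  # do not wait forever if no client arrives
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server_sock,))
        worker.start()
        reply = request("127.0.0.1", port, "hello")
        worker.join()
        return reply

if __name__ == "__main__":
    print(demo())
```

A single `recv` call is enough for this tiny message, but a real application needs message framing (for example, a length prefix), because TCP may deliver a message in several pieces.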
You could also use a web service API, e.g., Yelp Fusion or Twitter, to solve a problem. What kinds of data management, failure, scalability, and reliability issues should your application handle?
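One failure issue that comes up with any remote API is a transient error. The sketch below retries a call with exponential backoff. The names `call_with_retries` and `make_flaky_lookup` and all parameters are hypothetical, and the "remote call" is simulated; nothing here uses the real Yelp or Twitter client libraries.

```python
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run operation(); on ConnectionError wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)

def make_flaky_lookup(failures_before_success):
    """Build a stand-in for a remote call that fails a fixed number of times."""
    state = {"calls": 0}

    def flaky_lookup():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated transient failure")
        return {"status": "ok", "calls": state["calls"]}

    return flaky_lookup

if __name__ == "__main__":
    print(call_with_retries(make_flaky_lookup(2)))
```

Passing `sleep` as a parameter is a design choice that lets you test the retry logic without actually waiting.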
Literature Review/Research Proposal
Explore a topic to find out what the state-of-the-art is in various distributed system domains, e.g., volunteer computing, sensor networks, cluster computing, distributed file systems, ... What are the limitations? What are the next steps? Where is the field going? What is an interesting open problem you'd be interested in exploring?
I recommend using LaTeX, BibTeX, and Zotero for this type of project. ShareLaTeX and other online applications are available to make using LaTeX easier.
Intermediate Steps:
- Identify the world for your topic; that is, search for all papers relevant to your topic. Do not read the papers at this stage; determine relevance based on title and abstract. Search digital libraries (ACM, IEEE, ...) and recent conferences and workshops that cover your topic, and then use the bibliographies of recent papers to identify earlier relevant papers. The deliverable is a nicely formatted reference list created with BibTeX and LaTeX, a paragraph that explains how you performed your search, and one sentence describing the overall topic you are investigating.
- Understand the timeline, overall contributions, relative merits, and limitations of the work embodied in the state-of-the-art in your topic. You need to read only the abstract, introduction, related work, and conclusions sections of each paper. Do this reading in chronological order (or reverse chronological order) of paper publication dates to obtain some sense of how the research has evolved over the years. Then develop an outline that groups papers focusing on very similar problems, with a section of the outline for each paper. For each paper, be sure to include a subpart for the problem addressed, the contribution, the findings of any evaluation of the contribution, and the limitations.
- Bring this together into a paper. Read some Background and Related Work sections of published papers to see how they are written. A good literature survey does not just devote a separate paragraph to every paper in the field in an arbitrary order.
Other broad topics
Each topic below comes with a page or a paper to get you started.
- Volunteer Computing: BOINC, SETI@Home, A list of projects on Wikipedia
- Sensor Networks: Volcano Monitoring -- one paper, ZebraNet, Smart Homes, e.g., at UVA
- Internet of Things (IoT)
- Peer-to-Peer Networks
Project Proposal
Submit a one-page project proposal (single-spaced) containing the following information:
- Project title
- Team members (2-4)
- Description of project, including motivation, problem statement, and your approach to the problem.
- A sketch of a work plan, including how work will be divided up among teammates
- Expected outcomes/results (e.g., implementation, evaluation, analysis, ...)
Project Presentation
During the last week of classes (Dec 4-8), you will give a presentation about your project and preliminary results. In the presentation, you should cover:
- the problem you're trying to solve
- why that problem is important and interesting (you know your problem well; others don't)
- your approach to the problem (methodology, tools, technologies)
- some preliminary results
- lessons and takeaways so far
- the most interesting snags you ran into and how you solved them
- your plan to complete the project
- code or a demonstration of the running program, as appropriate
- discussion/tradeoff questions
All teammates should present for approximately equal amounts of time.
Time
We expect about 5 minutes of discussion, plus a few minutes for switching between teams. Therefore:
- 1-person project: 10 minutes
- 2-person project: 15 minutes
- 3- or 4-person project: 20 minutes
Inspiration: Present Like Steve Jobs
Writeup
Write a document that describes your system.
- Make clear why the problem is important and what motivated you to solve it
- Include any assumptions that you made, along with any problems that you were unable to solve
- Describe and analyze your results, where results are broadly defined
- Include evaluation, as appropriate
- In your conclusions, talk about how you would improve or extend the project given more time
Recommendation: Reuse figures from your presentation to help organize and complete the writeup.
Submission
Submit your project on GitHub, under the Final Project Assignment, even though each team's project will be very different.
Include the following files:
- Your writeup (PDF).
- All the files for your source code.
- Instructions for running your code
- Sample output, which can be part of your writeup.