Project 3: Inverted Index
Due: November 13 11:59 p.m.
The goal of this project is to become familiar with Hadoop/MapReduce, a popular model for programming distributed systems. MapReduce was developed by Google; Hadoop is Apache's open-source implementation of it.
Part 0: Set Up
Eclipse Set Up
Clone the GitHub assignment repository.
Click on the InvertedIndex project within Working Tree and check out the project.
You may need to set the source directory. Go to "Configure Build Path", remove the existing source directory, and add "src/main/java" as the source directory.
Word Count Code
Create a .jar file from your Java project, specifying the main program as the WordCount class. In Eclipse, if you use "Export --> Java --> JAR file", the wizard steps you through the creation of the .jar file.
AWS EMR Overview using Word Count Example
From AWS documentation: "When you finish working with your Starter Account, close your browser tab. Important: If you choose End Lab, you will lose access to your Starter Account. Do not choose End Lab unless you no longer want to use your Starter Account."
- See the steps in Getting Started: Analyzing Big Data with Amazon EMR. I'll add some more specific/clarifying info here:
- Upload your .jar file to an appropriate location on S3, like the bucket you created in the steps above.
- After you create a cluster, add a step using a Custom Jar. Add your jar file. In the arguments, put the directory for the input (s3://wlu-csci325-input/preliminput) and for the output (your bucket).
Tip: Each time you run a Hadoop step you must specify a new output location or the job will fail. You can specify a folder within an existing bucket (what I typically do) or create a new bucket.
Google Cloud with Word Count Example
Setting up with Google Cloud DataProc is similar to Amazon. First, set up your account, following the emailed directions.
- Create a bucket in Google Cloud Storage.
- Download the prelimtest.tar.gz and final.tar.gz files.
- Extract the files (on your computer).
- Upload the [extracted] folders to your Google Cloud Storage bucket.
- Upload your .jar file to that bucket.
- Create a cluster.
- After you create a cluster, create a job. Add your jar file, including the fully qualified name (including the package name) of the WordCount class, e.g., edu.wlu.cs.hadoop.wordcount.WordCount. In the arguments, put YOUR input and output directories in YOUR bucket. For example, gs://wlu-csci325-input/preliminput is the input path in my bucket.
Tip: Each time you run a Hadoop step you must specify a new output location or the job will fail. You can specify a folder within an existing bucket (what I typically do) or create a new bucket.
Part 1: Build an Inverted Index
An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, if given the following 2 documents:
- Doc1:
- Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
- Doc2:
- Buffalo are mammals.
We could construct the following inverted file index:
buffalo -> Doc1, Doc2
are -> Doc2
mammals -> Doc2
Your goal is to build an inverted index of words to the documents that contain them.
Create an index of the files in the prelimtest dataset and then the final dataset. The "starter" input files are in the folder preliminput to give you a smaller set to test on to start.
Your end result will be of the form: (word, docname_list).
The big hiccup here is that the default file format doesn't provide you with the name of the file in your map function. You will have to figure out some way around this. I suggest checking out Mapper.Contexts, InputFormats, and InputSplits.
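One common pattern (a sketch, not a required approach) is to cast the context's InputSplit to a FileSplit inside the map function and read the file name from it. The plain-Java helper below, with the hypothetical name `emitWordFilePairs`, simulates what that map step would emit once it knows the file name; the Hadoop-specific cast is shown in the comment.

```java
import java.util.ArrayList;
import java.util.List;

public class FilenameMapSketch {
    // In an actual Hadoop Mapper, one way to obtain the file name is:
    //   FileSplit split = (FileSplit) context.getInputSplit();
    //   String filename = split.getPath().getName();
    // Here the name is passed in directly to illustrate the (word, filename)
    // pairs the map step would emit.
    static List<String[]> emitWordFilePairs(String filename, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new String[]{token.toLowerCase(), filename});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] p : emitWordFilePairs("Doc2", "Buffalo are mammals.")) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

The reducer's job is then just to collect the distinct file names seen for each word.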
Inverted Index Requirements
- Lowercase the words
- Strip punctuation, but make sure that contractions don't change meaning, e.g., "it's" becoming possessive "its".
- Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of a document. These are called "stop words". You can find "stop word" lists online. Choose one (and justify your choice). Your goal is to remove these words from your final inverted index output.
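A rough sketch of the normalization rules above. The stop-word set here is a tiny illustrative sample (not a recommended list), and the regex is just one possible definition of "punctuation" -- it trims non-letter characters from the edges of a token while keeping internal apostrophes, so contractions survive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Normalize {
    // Tiny illustrative sample; substitute a published stop-word list
    // and justify the choice, per the requirements above.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "are"));

    // Lowercase and trim punctuation from the edges, keeping internal
    // apostrophes so "it's" is not collapsed into possessive "its".
    static String normalize(String token) {
        String t = token.toLowerCase();
        t = t.replaceAll("^[^a-z']+|[^a-z']+$", "");  // trim edge punctuation
        t = t.replaceAll("^'+|'+$", "");              // drop edge quote marks
        return t;
    }

    static boolean keep(String normalized) {
        return !normalized.isEmpty() && !STOP_WORDS.contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println(normalize("It's"));     // it's
        System.out.println(normalize("buffalo.")); // buffalo
        System.out.println(keep("are"));           // false (stop word)
    }
}
```

Whatever rules you choose, apply the identical normalization in both the indexing job and the query program, or lookups will miss.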
Cluster
The default cluster is fine for the smaller input set. For the larger input, you may want to increase the number of nodes involved in the cluster.
Downloading the output files
You can click on each file to download it. If you want to download in bulk, you'll need the gsutil command-line tool.
Resources
- Word Count Code Walk-through
- Apache Hadoop Javadocs
- Analyze Big Data with Hadoop and Getting Started: Analyzing Big Data with Amazon EMR
- Map Reduce paper by Jeff Dean and Sanjay Ghemawat
- Example Code and Libraries--not sure if useful; more libraries than examples
- AWS Toolkit for Eclipse -- not sure if useful
Part 2: Querying Inverted Index
Write a query program that queries your inverted file index. Your program will take as input a user-specified word (or phrase) and return the IDs of the documents that contain that word.
Your program should take the directory location of your inverted index (output) files as a command-line argument.
Handle users entering words in various cases, with various punctuation, as well as stop words.
This code can be in either Java or Python. Put the code in the appropriate source location and document its use in the write up.
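A minimal sketch of the query side in Java, assuming the index output has one word per line in the form word<TAB>docname_list (Hadoop's default TextOutputFormat separates key and value with a tab). The index is parsed from a string here for brevity; your version would read every output file in the directory given on the command line, and should apply the same normalization as the indexer.

```java
import java.util.HashMap;
import java.util.Map;

public class QuerySketch {
    // Parse index lines of the form "word<TAB>doc_list" into a lookup map.
    static Map<String, String> parseIndex(String indexText) {
        Map<String, String> index = new HashMap<>();
        for (String line : indexText.split("\n")) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                index.put(parts[0], parts[1]);
            }
        }
        return index;
    }

    // Normalize the user's word the same way the indexer did
    // (lowercase; a fuller version would also handle stop words),
    // then look it up.
    static String query(Map<String, String> index, String word) {
        String key = word.toLowerCase().replaceAll("^[^a-z']+|[^a-z']+$", "");
        return index.getOrDefault(key, "(no documents found)");
    }

    public static void main(String[] args) {
        String sample = "buffalo\tDoc1, Doc2\nmammals\tDoc2";
        Map<String, String> index = parseIndex(sample);
        System.out.println(query(index, "Buffalo!")); // Doc1, Doc2
    }
}
```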
Optional Extensions
If you get the above working and want to try something else, here are some optional extensions that you can experiment with:
- Exclude the tags (because we're thinking about this for a web search engine), i.e., don't map any word that looks like <tag> or <tag />. What is required to get rid of multi-word tags, e.g., <img src="image.jpg">?
- Instead of creating an inverted file index (which maps words to their document ID), create a full inverted index (which maps words to their document ID + position in the document). How do you specify a word's position in the document? What code in Hadoop will help with this?
- If you created a full inverted index, write a query program on top of the index which returns not only the document IDs but also a text "snippet" from each document showing where the query term appears in the document.
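For the full-inverted-index extension, here is one possible shape for the map output, under the assumption that "position" means the word's offset within the document (Hadoop's TextInputFormat also hands the mapper a byte offset as its key, which is another viable notion of position):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionalSketch {
    // Emit (word, "doc@offset") pairs, where offset is the word's index
    // within the document. In Hadoop, the mapper would keep a per-file
    // running counter, or use the byte-offset key that TextInputFormat
    // already supplies to the map function.
    static List<String[]> emitWithPositions(String doc, String text) {
        List<String[]> pairs = new ArrayList<>();
        int pos = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[]{w, doc + "@" + pos});
                pos++;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] p : emitWithPositions("Doc2", "Buffalo are mammals.")) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

With positions in the index, the snippet extension reduces to seeking into the original document around each recorded offset.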
Part 3: Writeup
As usual, describe the project: an overview/introduction, the architecture, and implementation. Show a snippet of the output (full output will be in GitHub).
Include general thoughts/reflection, including answers to the following questions:
- What are some challenges in searching web pages? How are these challenges similar/different from searching other corpuses/repositories?
- How did you handle punctuation? How did you define "punctuation"? What are the tradeoffs in your decision in terms of the resulting search/results?
- What else would you have done in the inverted index implementation, given more time, energy, resources, etc.?
- How difficult was it to implement the inverted index? How difficult would it be to implement another task, given this experience? What would be straightforward? What would take more time?
- If you used Google Cloud Platform, when did the job start reducing, i.e., at what level of mapping did reducing start?
- What did you think of using AWS/Google Cloud? Tell me the positives and the negatives.
Part 4: Submission
GitHub Classroom will make a snapshot of your repository at the deadline. Your repository should contain:
- Your writeup (PDF).
- All the files for your source code.
- A README, containing instructions for running your code and your jar file, e.g., what command-line arguments are required for your inverted index code and how to run your querying program (example call to run the program--including command-line arguments).
- Your output; it should be small enough to fit on GitHub. If it's not small enough, then share a folder with me on Box.