Project 1: A Web Server
Due: September 29 at 11:59 p.m.
Work on this project in a team of two or three.
Objectives
The goal of this project is to implement a functional web server. This assignment will teach you the basics of distributed programming, client/server structures, and issues in designing and building high-performance servers. While the course lectures will focus on the concepts that enable network communication, it is also important to understand how the fundamental structures of systems make use of the global Internet.
How Does a Web Server Work?
At a high level, a web server listens for connections on a
socket (bound to a specific port on a host machine).
Clients connect to this socket and use a simple text-based
protocol to retrieve files from the server. For example, you
might try the following command from a terminal:
telnet www.cs.wlu.edu 80
Then type: GET /index.html HTTP/1.0
The whole thing together looks something like:
telnet www.cs.wlu.edu 80
Trying 137.113.118.203...
Connected to hydros.cs.wlu.edu.
Escape character is '^]'.
GET /index.html HTTP/1.0
Note: Type two carriage returns after the "GET" command. The command will return to you (in the terminal) the HTML representing the "front page" of the Washington and Lee Computer Science web page. What is the content of the page? Why?
What is happening?
The server is translating relative filenames (such as index.html) to absolute filenames in a local filesystem. For example, you might decide to keep all the files for your server in ~yourusername/cs325/server/files/, which is called the document root. When your server gets a request for index.html (which is the default web page if no file is specified), it will prepend the document root to the specified file and determine whether the file exists and whether the proper permissions are set on the file (typically the file has to be world readable). If the file does not exist, the server returns a "file not found" error. If the file is present but does not have the proper permissions, the server returns a "permission denied" error. (Note what those return codes are.) Otherwise, the server returns an HTTP OK message along with the contents of the file.
Since index.html is the default file, web servers typically translate GET / to GET /index.html. That way, index.html is assumed to be the filename if no explicit filename is present. This is also why the two URLs http://www.cs.wlu.edu and http://www.cs.wlu.edu/index.html return the same results. You should implement this as well.
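As a concrete illustration, a minimal sketch of this translation and permission check might look like the following (the class and method names are invented for this example; canRead() only approximates the "world readable" check by testing whether the server process itself can read the file):

    import java.io.File;

    public class RequestMapper {
        private final String documentRoot;  // e.g., ~yourusername/cs325/server/files/

        public RequestMapper(String documentRoot) {
            this.documentRoot = documentRoot;
        }

        // Translate a relative request path into a file under the document root
        // and decide which status code the response should carry.
        public int statusFor(String requestPath) {
            if (requestPath.equals("/")) {
                requestPath = "/index.html";  // default file when none is specified
            }
            File target = new File(documentRoot, requestPath);
            if (!target.exists()) {
                return 404;  // file not found
            }
            if (!target.canRead()) {
                return 403;  // permission denied
            }
            return 200;      // OK: send the file's contents
        }
    }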
When you type a URL into a web browser, the server retrieves the contents of the requested file. If the file is of type text/html and HTTP/1.0 is being used, the browser will parse the HTML for embedded links (such as images or CSS files) and then make separate connections to the web server to retrieve the embedded files. If a web page contains 4 images, a browser makes a total of five separate connections to the web server to retrieve the HTML and the four image files.
Using HTTP/1.0, a separate connection is used for each requested file. This implies that the TCP connections being used never get out of the slow start phase.
HTTP/1.1 attempts to address this limitation.
When using HTTP/1.1, the server keeps connections to clients open, allowing for "persistent" connections and pipelining of client requests. That is, after the results of a single request are returned (e.g., index.html), the server should by default leave the connection open for some period of time, allowing the client to reuse that connection to make subsequent requests.
Sending an HTTP/1.1 request requires the Host
header, which specifies which host you're requesting the resource from
and, optionally, the port of the host. For example:
telnet www.cs.wlu.edu 80
Trying 137.113.118.203...
Connected to hydros.cs.wlu.edu.
Escape character is '^]'.
GET /index.html HTTP/1.1
Host: www.cs.wlu.edu
Part 1: Implementing the Web Server
Your web server needs to support enough of the HTTP/1.0 and HTTP/1.1 protocols to allow an existing web browser (e.g., Firefox) to connect to your web server and retrieve the contents of the W&L CS home page from your server. (The appropriate files were included in the GitHub repository.) The page will probably not look exactly like the original web page. Note that you DO NOT have to support script parsing (PHP, JavaScript), and you do not have to support HTTP POST requests. You should support images, and you should return appropriate HTTP error messages as needed.
To see which version of HTTP is being used by Firefox, go to about:config in the navigation bar and look for network.http.version.
At a high level, your web server will be structured something like the following:
Forever loop:
    Listen for connections
    Accept new connection from incoming client
    Parse HTTP request
    Ensure well-formed request (return error otherwise)
    Determine if target file exists and if permissions are set properly (return error otherwise)
    Transmit contents of file to client (by performing reads on the file and writes on the socket)
    Close the connection (if HTTP/1.0)
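A bare-bones sketch of that structure in Java might look like this (the class name is illustrative, and the request parsing and response logic are left as comments):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class SimpleWebServer {
        public static void main(String[] args) throws IOException {
            int port = 8888;  // ports between 8888 and 9999; see command-line arguments below
            try (ServerSocket serverSocket = new ServerSocket(port)) {
                while (true) {                              // forever loop
                    Socket client = serverSocket.accept();  // blocks until a client connects
                    // Parse the HTTP request from client.getInputStream(), check the
                    // target file, write the response to client.getOutputStream(), and
                    // close the connection (immediately for HTTP/1.0; after a timeout
                    // for HTTP/1.1).
                }
            }
        }
    }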
Response
The response contains:
- Initial response line (status line): 200 (OK), 3xx (moved), 4xx and 5xx (errors)
- Header lines: information about the response or about the object sent in the message body
- Blank line
- Requested document
You are required to support at least the 200, 400, 403, and 404 status codes.
You should support the Content-Type, Content-Length, and Date headers in your responses. Also, are there any headers you need for HTTP/1.1?
You must at least support HTML, TXT, CSS, JPG, PNG, and GIF files for requested documents.
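As one hypothetical way to organize the header portion of a 200 response, a helper like the one below builds the status line and the required headers; the extension-to-MIME-type map is an assumption about how you might structure this, not a required design:

    import java.io.File;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Map;

    public class ResponseHeaders {
        // MIME types for the required document types.
        private static final Map<String, String> MIME_TYPES = Map.of(
                "html", "text/html",
                "txt", "text/plain",
                "css", "text/css",
                "jpg", "image/jpeg",
                "png", "image/png",
                "gif", "image/gif");

        // Build the status line and headers for a 200 response to the given file.
        static String okHeader(File file, String extension, String httpVersion) {
            String contentType = MIME_TYPES.getOrDefault(extension, "application/octet-stream");
            String date = DateTimeFormatter.RFC_1123_DATE_TIME
                    .format(ZonedDateTime.now(ZoneOffset.UTC));
            return httpVersion + " 200 OK\r\n"
                    + "Date: " + date + "\r\n"
                    + "Content-Type: " + contentType + "\r\n"
                    + "Content-Length: " + file.length() + "\r\n"
                    + "\r\n";  // blank line separating headers from the message body
        }
    }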
Handling Multiple Requests
You have three main choices in how you structure your web server in the context of the above simple structure:
- A multi-threaded approach spawns a new thread for each incoming connection. When the server accepts a connection, it will spawn a thread to parse the request, transmit the file, etc. (This is the more familiar implementation and what you will likely implement. See below for more details.)
- A multi-process approach maintains a worker pool of active processes to which the main server hands off requests. This approach is largely appropriate because of its portability (relative to assuming the presence of a given threads package across multiple hardware/software platforms; not an issue for Java). It incurs increased context-switch overhead relative to a multi-threaded approach.
- An event-driven architecture will keep a list of active
connections and loop over them, performing a little bit of
work on behalf of each connection. For example, there might
be a loop that first checks to see if any new connections are
pending to the server (performing appropriate bookkeeping if
so), and then it will loop over all existing client
connections and send a "block" of file data to each (e.g.,
4096 bytes, or 8192 bytes, matching the granularity of disk
block size). This event-driven architecture has the primary
advantage of avoiding any synchronization issues associated
with a multi-threaded model (though synchronization effects
should be limited in your simple web server) and avoids the
performance overhead of context switching among a number of
threads.
This approach is loosely based on Matt Welsh's Ph.D. thesis. If you successfully implement this approach, you will receive extra credit points.
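A very rough sketch of this event-driven style using java.nio appears below (the class name and buffer size are illustrative; request parsing and file transmission are omitted):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class EventLoopSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8888));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();  // wait until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {        // a new client connection is pending
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {   // do a little work for one connection
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(4096);
                        if (client.read(buf) == -1) {
                            client.close();
                            continue;
                        }
                        // ... parse the request bytes, then register for OP_WRITE and
                        // send the file back one block at a time on later iterations ...
                    }
                }
            }
        }
    }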
In class, we discussed some of the potential performance issues you should consider when designing a multithreaded web server. You should implement the "basic" thread-handling mechanisms we discussed--restricting the number of connections waiting to enter the application.
Extra Credit: (up to 5 pts) Implement a thread pool that restricts the number of available, concurrently executing threads. When a thread is finished handling a request, the thread should be returned to the pool of available threads. If no thread is available, the client connection should wait until there is an available thread. (What should you do if too many clients are waiting for threads?) See Thread Pools Tutorial.
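For reference, here is a minimal sketch of this behavior, assuming the standard java.util.concurrent executors are an acceptable substitute for a hand-rolled pool; the pool size and class name are placeholders:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PooledServer {
        private static final int POOL_SIZE = 8;  // placeholder: justify your choice in the writeup

        public static void main(String[] args) throws IOException {
            ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
            try (ServerSocket serverSocket = new ServerSocket(Integer.parseInt(args[1]))) {
                while (true) {
                    Socket client = serverSocket.accept();
                    // If all POOL_SIZE threads are busy, the task waits in the executor's
                    // queue until a thread is returned to the pool. Note that this queue
                    // is unbounded; deciding what to do when too many clients are waiting
                    // is left to you.
                    pool.execute(() -> handleRequest(client));
                }
            }
        }

        private static void handleRequest(Socket client) {
            // parse the request, send the response, close the socket ...
        }
    }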
Handling Connections
One key issue is determining how long to keep the connection open. The connection timeout needs to be configured in the server and ideally should be dynamic--based on the number of other active connections the server is currently supporting. Thus, if the server is idle, it can afford to leave the connection open for a relatively long period of time. If the server is handling many clients at once, it may not be able to afford to have an idle connection sitting around (consuming kernel/thread resources) for very long. You should develop a simple heuristic to determine this timeout in your server.
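One purely illustrative way to express such a heuristic, assuming you keep a counter of active connections, is sketched below; the constants are placeholders for values you would tune and justify:

    import java.io.IOException;
    import java.net.Socket;
    import java.util.concurrent.atomic.AtomicInteger;

    public class TimeoutPolicy {
        // Incremented when a connection is accepted, decremented when it is closed.
        private static final AtomicInteger activeConnections = new AtomicInteger();

        // Illustrative heuristic: the busier the server, the shorter the idle timeout.
        // The constants (10 s maximum, 500 ms minimum) are placeholders to tune.
        static int idleTimeoutMillis() {
            return Math.max(500, 10_000 / (activeConnections.get() + 1));
        }

        // Apply the timeout so that a blocked read on an idle persistent connection
        // throws SocketTimeoutException, at which point the server can close it.
        static void applyTimeout(Socket client) throws IOException {
            client.setSoTimeout(idleTimeoutMillis());
        }
    }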
Command-line Arguments
The server document directory (the directory that the webserver uses to serve files) is the first command-line argument.
The port that the server listens on is the second command-line argument. (Note that you should use ports between 8888 and 9999.)
Thus, I should be able to run your server as
$ java edu.wlu.cs.WebServer /home/faculty/sprenkle/web_server_files 8888
How can you add command-line arguments to run code in Eclipse?
Non-Goals
A full-fledged web server would do a lot more than what ours is doing. For example, the browser sends more headers besides Host, and the server could work with that information. We're not going to worry about that.
Part 2: Writeup
After you finish implementing your server, write a document that describes your chosen architecture and implementation details. Instead of diving right into your description of the architecture, write it as a technical paper, with an introduction (motivation, goals, challenges). Describe/justify your design decisions (e.g., your "magic numbers" for the number of threads in the pool, the limit on the socket queue, ...). Clearly state anything that does not work correctly. Describe any problems that you encountered and any help that you received. Create figures as appropriate.
In addition to describing the structure of your server, include a discussion that addresses the following questions:
- Web servers often use .htaccess files to restrict access to clients based on their IP address. How could you modify your server to support .htaccess? Remember your code design techniques from CSCI209.
- Since your web servers are restricted to on-campus use only, it is unlikely that you will notice any significant performance differences between HTTP/1.0 and HTTP/1.1. Can you think of a scenario in which HTTP/1.0 may perform better than HTTP/1.1? Can you think of a scenario when HTTP/1.1 outperforms HTTP/1.0? Think about bandwidth, latency, and file size. Consider some of the pros and cons of using a connection per session versus using a connection per object. The difference between the two comes down to the following:
  - Only a single connection is established for all retrieved objects, meaning that slow start is only incurred once (assuming that the pipeline is kept full) and that the overhead of establishing and tearing down a TCP connection is also only incurred once.
  - However, all objects must be retrieved serially in HTTP/1.1, meaning that some of the benefits of parallelism are lost.
Part 3: Submission
Include the following files in your submission, within your GitHub repository.
- Your writeup (PDF).
- All the files for your source code.
Frequently Asked Questions
- How should we test?
- Use JUnit testing to make sure the code works "in the small". If your code is designed well, that should be straightforward. You can use something like Selenium (a browser plugin) to help automate testing from the browser. You can also use telnet to make sure your requests are correct, independent of the browser.
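For example, a JUnit test along these lines (assuming JUnit 4 and a server already running locally on port 8888; adapt the request and assertions to your design) checks the status line of a response:

    import static org.junit.Assert.assertTrue;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import org.junit.Test;

    public class WebServerTest {
        @Test
        public void getIndexReturns200() throws Exception {
            try (Socket socket = new Socket("localhost", 8888);
                 PrintWriter out = new PrintWriter(socket.getOutputStream());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                out.print("GET /index.html HTTP/1.0\r\n\r\n");
                out.flush();
                String statusLine = in.readLine();  // e.g., "HTTP/1.0 200 OK"
                assertTrue(statusLine != null && statusLine.contains("200"));
            }
        }
    }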
Resources
- HTTP 1.0 and 1.1: http://www.jmarshall.com/easy/http/
- w3c HTTP page: http://www.w3.org/Protocols/
- HTTP Wikipedia: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Grading
- Implementation of web server:
- handles requests appropriately (including error responses)
- sends back responses
- handles multiple requests
- Demonstration of testing
- Writeup
- Individual evaluation of team