Project 1: A Web Server
Due: September 29 at 11:59 p.m.
Work on this project in a team of two or three.
Objectives
The goal of this project is to implement a functional web server. This assignment will teach you the basics of distributed programming, client/server structures, and issues in designing and building high-performance servers. While the course lectures will focus on the concepts that enable network communication, it is also important to understand how the fundamental structures of systems make use of the global Internet.
How Does a Web Server Work?
At a high level, a web server listens for connections on a
socket (bound to a specific port on a host machine).
Clients connect to this socket and use a simple text-based
protocol to retrieve files from the server. For example, you
might try the following command from a terminal:
telnet www.cs.wlu.edu 80
Then type: GET /index.html HTTP/1.0
The whole thing together looks something like:
telnet www.cs.wlu.edu 80
Trying 137.113.118.203...
Connected to hydros.cs.wlu.edu.
Escape character is '^]'.
GET /index.html HTTP/1.0
Note: Type two carriage returns after the "GET" command. The command will return to you (in the terminal) the HTML representing the "front page" of the Washington and Lee Computer Science web page. What is the content of the page? Why?
What is happening?
The server is translating relative filenames (such as index.html) to absolute filenames in a local filesystem. For example, you might decide to keep all the files for your server in ~yourusername/cs325/server/files/, which is called the document root. When your server gets a request for index.html (which is the default web page if no file is specified), it will prepend the document root to the specified file and determine whether the file exists and whether the proper permissions are set on the file (typically the file has to be world readable). If the file does not exist, the server returns a "file not found" error. If the file is present but does not have the proper permissions, the server returns a "permission denied" error. (Note what those return codes are.) Otherwise, the server returns an HTTP OK message along with the contents of the file.
Since index.html is the default file, web servers typically translate GET / to GET /index.html. That way, index.html is assumed to be the filename if no explicit filename is present. This is also why the two URLs http://www.cs.wlu.edu and http://www.cs.wlu.edu/index.html return the same results. You should implement this as well.
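As a concrete illustration, a minimal sketch of this translation and permission check might look like the following (the class and method names are invented for this example; canRead() only approximates the "world readable" check by testing whether the server process itself can read the file):

    import java.io.File;

    public class RequestMapper {
        private final String documentRoot;  // e.g., ~yourusername/cs325/server/files/

        public RequestMapper(String documentRoot) {
            this.documentRoot = documentRoot;
        }

        // Translate a relative request path into a file under the document root
        // and decide which status code the response should carry.
        public int statusFor(String requestPath) {
            if (requestPath.equals("/")) {
                requestPath = "/index.html";  // default file when none is specified
            }
            File target = new File(documentRoot, requestPath);
            if (!target.exists()) {
                return 404;  // file not found
            }
            if (!target.canRead()) {
                return 403;  // permission denied
            }
            return 200;      // OK: send the file's contents
        }
    }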
When you type a URL into a web browser, the server retrieves the contents of the requested file. If the file is of type text/html and HTTP/1.0 is being used, the browser will parse the HTML for embedded links (such as images or CSS files) and then make separate connections to the web server to retrieve the embedded files. If a web page contains 4 images, a browser makes a total of five separate connections to the web server to retrieve the HTML and the four image files.
Using HTTP/1.0, a separate connection is used for each requested file. This implies that the TCP connections being used never get out of the slow start phase.
HTTP/1.1 attempts to address this limitation.
When using HTTP/1.1, the server keeps connections to clients open, allowing for "persistent" connections and pipelining of client requests. That is, after the results of a single request are returned (e.g., index.html), the server should by default leave the connection open for some period of time, allowing the client to reuse that connection to make subsequent requests.
Sending an HTTP/1.1 request requires the Host
header, which specifies which host you're requesting the resource from
and, optionally, the port of the host. For example:
telnet www.cs.wlu.edu 80
Trying 137.113.118.203...
Connected to hydros.cs.wlu.edu.
Escape character is '^]'.
GET /index.html HTTP/1.1
Host: www.cs.wlu.edu
Part 1: Implementing the Web Server
Your web server needs to support enough of the HTTP/1.0 and HTTP/1.1 protocols to allow an existing web browser (e.g., Firefox) to connect to your web server and retrieve the contents of the W&L CS home page from your server. (The appropriate files were included in the GitHub repository.) The page will probably not look exactly like the original web page. Note that you DO NOT have to support script parsing (PHP, JavaScript), and you do not have to support HTTP POST requests. You should support images, and you should return appropriate HTTP error messages as needed.
To see which version of HTTP is being used by Firefox, go to about:config in the navigation bar and look for network.http.version.
At a high level, your web server will be structured something like the following:
Forever loop:
    Listen for connections
    Accept new connection from incoming client
    Parse HTTP request
    Ensure well-formed request (return error otherwise)
    Determine if target file exists and if permissions are set properly (return error otherwise)
    Transmit contents of file to client (by performing reads on the file and writes on the socket)
    Close the connection (if HTTP/1.0)
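A bare-bones sketch of that structure in Java might look like this (the class name is illustrative, and the request parsing and response logic are left as comments):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class SimpleWebServer {
        public static void main(String[] args) throws IOException {
            int port = 8888;  // ports between 8888 and 9999; see command-line arguments below
            try (ServerSocket serverSocket = new ServerSocket(port)) {
                while (true) {                              // forever loop
                    Socket client = serverSocket.accept();  // blocks until a client connects
                    // Parse the HTTP request from client.getInputStream(), check the
                    // target file, write the response to client.getOutputStream(), and
                    // close the connection (immediately for HTTP/1.0; after a timeout
                    // for HTTP/1.1).
                }
            }
        }
    }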
Response
The response contains:
- Initial response line (status line): 200 (OK), 3xx (moved), 4xx and 5xx (errors)
- Header lines: information about the response or about the object sent in the message body
- Blank line
- Requested document
You are required to support at least the 200, 400, 403, and 404 status codes.
You should support the Content-Type, Content-Length, and Date headers in your responses. Also, are there any headers you need for HTTP/1.1?
You must at least support HTML, TXT, CSS, JPG, PNG, and GIF files for requested documents.
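As one hypothetical way to organize the header portion of a 200 response, a helper like the one below builds the status line and the required headers; the extension-to-MIME-type map is an assumption about how you might structure this, not a required design:

    import java.io.File;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Map;

    public class ResponseHeaders {
        // MIME types for the required document types.
        private static final Map<String, String> MIME_TYPES = Map.of(
                "html", "text/html",
                "txt", "text/plain",
                "css", "text/css",
                "jpg", "image/jpeg",
                "png", "image/png",
                "gif", "image/gif");

        // Build the status line and headers for a 200 response to the given file.
        static String okHeader(File file, String extension, String httpVersion) {
            String contentType = MIME_TYPES.getOrDefault(extension, "application/octet-stream");
            String date = DateTimeFormatter.RFC_1123_DATE_TIME
                    .format(ZonedDateTime.now(ZoneOffset.UTC));
            return httpVersion + " 200 OK\r\n"
                    + "Date: " + date + "\r\n"
                    + "Content-Type: " + contentType + "\r\n"
                    + "Content-Length: " + file.length() + "\r\n"
                    + "\r\n";  // blank line separating headers from the message body
        }
    }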
Handling Multiple Requests
You have three main choices in how you structure your web server in the context of the above simple structure:
- A multi-threaded approach spawns a new thread for each incoming connection. When the server accepts a connection, it will spawn a thread to parse the request, transmit the file, etc. (This is the more familiar implementation and what you will likely implement. See below for more details.)
- A multi-process approach maintains a worker pool of active processes to which the main server hands off requests. This approach is largely appropriate because of its portability (relative to assuming the presence of a given threads package across multiple hardware/software platforms; not an issue for Java). It incurs increased context-switch overhead relative to a multi-threaded approach.
- An event-driven architecture will keep a list of active
connections and loop over them, performing a little bit of
work on behalf of each connection. For example, there might
be a loop that first checks to see if any new connections are
pending to the server (performing appropriate bookkeeping if
so), and then it will loop over all existing client
connections and send a "block" of file data to each (e.g.,
4096 bytes, or 8192 bytes, matching the granularity of disk
block size). This event-driven architecture has the primary
advantage of avoiding any synchronization issues associated
with a multi-threaded model (though synchronization effects
should be limited in your simple web server) and avoids the
performance overhead of context switching among a number of
threads.
This approach is loosely based on Matt Welsh's Ph.D. thesis. If you successfully implement this approach, you will receive extra credit points.
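A very rough sketch of this event-driven style using java.nio appears below (the class name and buffer size are illustrative; request parsing and file transmission are omitted):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class EventLoopSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8888));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();  // wait until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {        // a new client connection is pending
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {   // do a little work for one connection
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(4096);
                        if (client.read(buf) == -1) {
                            client.close();
                            continue;
                        }
                        // ... parse the request bytes, then register for OP_WRITE and
                        // send the file back one block at a time on later iterations ...
                    }
                }
            }
        }
    }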
In class, we discussed some of the potential performance issues you should consider when designing a multithreaded web server. You should implement the "basic" thread-handling mechanisms we discussed--restricting the number of connections waiting to enter the application.
Extra Credit: (up to 5 pts) Implement a thread pool that restricts the number of available, concurrently executing threads. When a thread is finished handling a request, the thread should be returned to the pool of available threads. If no thread is available, the client connection should wait until there is an available thread. (What should you do if too many clients are waiting for threads?) See Thread Pools Tutorial.
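For reference, here is a minimal sketch of this behavior, assuming the standard java.util.concurrent executors are an acceptable substitute for a hand-rolled pool; the pool size and class name are placeholders:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PooledServer {
        private static final int POOL_SIZE = 8;  // placeholder: justify your choice in the writeup

        public static void main(String[] args) throws IOException {
            ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
            try (ServerSocket serverSocket = new ServerSocket(Integer.parseInt(args[1]))) {
                while (true) {
                    Socket client = serverSocket.accept();
                    // If all POOL_SIZE threads are busy, the task waits in the executor's
                    // queue until a thread is returned to the pool. Note that this queue
                    // is unbounded; deciding what to do when too many clients are waiting
                    // is left to you.
                    pool.execute(() -> handleRequest(client));
                }
            }
        }

        private static void handleRequest(Socket client) {
            // parse the request, send the response, close the socket ...
        }
    }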
Handling Connections
One key issue is determining how long to keep the connection open. The connection timeout needs to be configured in the server and ideally should be dynamic--based on the number of other active connections the server is currently supporting. Thus, if the server is idle, it can afford to leave the connection open for a relatively long period of time. If the server is handling many clients at once, it may not be able to afford to have an idle connection sitting around (consuming kernel/thread resources) for very long. You should develop a simple heuristic to determine this timeout in your server.
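One purely illustrative way to express such a heuristic, assuming you keep a counter of active connections, is sketched below; the constants are placeholders for values you would tune and justify:

    import java.io.IOException;
    import java.net.Socket;
    import java.util.concurrent.atomic.AtomicInteger;

    public class TimeoutPolicy {
        // Incremented when a connection is accepted, decremented when it is closed.
        private static final AtomicInteger activeConnections = new AtomicInteger();

        // Illustrative heuristic: the busier the server, the shorter the idle timeout.
        // The constants (10 s maximum, 500 ms minimum) are placeholders to tune.
        static int idleTimeoutMillis() {
            return Math.max(500, 10_000 / (activeConnections.get() + 1));
        }

        // Apply the timeout so that a blocked read on an idle persistent connection
        // throws SocketTimeoutException, at which point the server can close it.
        static void applyTimeout(Socket client) throws IOException {
            client.setSoTimeout(idleTimeoutMillis());
        }
    }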
Command-line Arguments
The server document directory (the directory that the webserver uses to serve files) is the first command-line argument.
The port that the server listens on is the second command-line argument. (Note that you should use ports between 8888 and 9999.)
Thus, I should be able to run your server as
$ java edu.wlu.cs.WebServer /home/faculty/sprenkle/web_server_files 8888
How can you add command-line arguments to run code in Eclipse?
Non-Goals
A full-fledged web server would do a lot more than what ours is doing. For example, the browser sends more headers besides Host, and the server could work with that information. We're not going to worry about that.
Part 2: Writeup
After you finish implementing your server, write a document that describes your chosen architecture and implementation details. Instead of diving right into your description of the architecture, write it as a technical paper, with an introduction (motivation, goals, challenges). Describe/justify your design decisions (e.g., your "magic numbers" for the number of threads in the pool, the limit on the socket queue, ...). Clearly state anything that does not work correctly. Describe any problems that you encountered and any help that you received. Create figures as appropriate.
In addition to describing the structure of your server, include a discussion that addresses the following questions:
- Web servers often use .htaccess files to restrict access to clients based on their IP address. How could you modify your server to support .htaccess? Remember your code design techniques from CSCI209.
- Since your web servers are restricted to on-campus use only, it is unlikely that you will notice any significant performance differences between HTTP/1.0 and HTTP/1.1. Can you think of a scenario in which HTTP/1.0 may perform better than HTTP/1.1? Can you think of a scenario when HTTP/1.1 outperforms HTTP/1.0? Think about bandwidth, latency, and file size. Consider some of the pros and cons of using a connection per session versus using a connection per object. The difference between the two comes down to the following:
  - Only a single connection is established for all retrieved objects, meaning that slow start is only incurred once (assuming that the pipeline is kept full) and that the overhead of establishing and tearing down a TCP connection is also only incurred once.
  - However, all objects must be retrieved serially in HTTP/1.1, meaning that some of the benefits of parallelism are lost.
Part 3: Submission
Include the following files in your submission, within your GitHub repository.
- Your writeup (PDF).
- All the files for your source code.
Frequently Asked Questions
- How should we test?
- Use JUnit testing to make sure the code works "in the small". If your code is designed well, that should be straightforward. You can use something like Selenium (a browser plugin) to help automate testing from the browser. You can also use telnet to make sure your requests are correct, independent of the browser.
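For example, a JUnit test along these lines (assuming JUnit 4 and a server already running locally on port 8888; adapt the request and assertions to your design) checks the status line of a response:

    import static org.junit.Assert.assertTrue;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import org.junit.Test;

    public class WebServerTest {
        @Test
        public void getIndexReturns200() throws Exception {
            try (Socket socket = new Socket("localhost", 8888);
                 PrintWriter out = new PrintWriter(socket.getOutputStream());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                out.print("GET /index.html HTTP/1.0\r\n\r\n");
                out.flush();
                String statusLine = in.readLine();  // e.g., "HTTP/1.0 200 OK"
                assertTrue(statusLine != null && statusLine.contains("200"));
            }
        }
    }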
Resources
- HTTP 1.0 and 1.1: http://www.jmarshall.com/easy/http/
- w3c HTTP page: http://www.w3.org/Protocols/
- HTTP Wikipedia: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Grading
- Implementation of web server:
- handles requests appropriately (including error responses)
- sends back responses
- handles multiple requests
- Demonstration of testing
- Writeup
- Individual evaluation of team