World Wide Web
Major problems with the net (circa 1994)
- Cryptic command line interface
- Complete lack of organization (also a strength)
- Incredible growth, change
WWW is the best answer to these major problems so far
- GUI, hypertext, searchable, decentralized
WWW - mixes content with menus. Gopher - separate content and menus.
- hypertext on the Net
- Swiss army knife of Net tools (subsumes functionality of ftp, Gopher, archie, news readers)
Publishing model
From few-to-many, with high costs and long delays, to many-to-many
with low costs and no delay.
Business model
Targeting, interactivity, customization
URL - Uniform Resource Locator
a unified means of describing a net resource
three parts: protocol, DNS host name, local page name
solves problem of uniquely identifying pages because DNS names
are unique
identifies type of document by protocol of access
some servers support the idea of default document names (e.g.
Welcome.html, index.html)
variety of protocols
http, ftp, file (local disk), news, gopher, mailto, telnet
a problem with the URL scheme: a URL names one specific host, so there
is no built-in possibility for load sharing
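A minimal Python sketch of the three-part split described above, using the standard library's urllib.parse. The function name url_parts and the default document name are assumptions made for illustration; which default name applies depends on how the server is configured.

```python
from urllib.parse import urlsplit

def url_parts(url, default_doc="index.html"):
    """Split a URL into (protocol, DNS host name, local page name).

    default_doc is an assumed server setting, applied only when the
    path is empty or ends in "/".
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += default_doc
    return parts.scheme, parts.hostname, path

print(url_parts("http://www.site.domain/~person/"))
```

The protocol part is what tells the browser how to fetch the resource; the host part is what goes to DNS.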
Client - Web browser
Very simple original interface: read text, click on highlights to follow
links
Now: graphics, image maps, interactivity, animation
Some browsers know how to interpret and display a variety of data formats,
others use helper apps
Browser history
WorldWideWeb.app for NeXTSTEP, Tim Berners-Lee at CERN
Mosaic at NCSA (University of Illinois), Marc Andreessen
Netscape and Jim Clark
Microsoft and Internet Explorer
Browser features
multithreading - simultaneous downloads/interpretation of images
caching - saves time
extensions to HTML, frames, tables, ...
Browsers progressing towards the universal client app (displacing proprietary
apps)
current problems: HTML isn't very rich
Role of MIME
The web uses MIME (Multipurpose Internet Mail Extensions) to label
the type of data being sent.
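A browser or server has to map a document to a MIME type before it can decide how to display it or which helper app to launch. A short sketch using Python's standard mimetypes module; content_type is a name made up for this example, and the file names are borrowed from examples elsewhere in these notes.

```python
import mimetypes

def content_type(filename, fallback="application/octet-stream"):
    """Guess the MIME type a server might put in its Content-type header."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or fallback

for name in ("Welcome.html", "now8.gif", "cubeneg.jpg"):
    print(name, "->", content_type(name))
```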
Hypertext Transfer Protocol - HTTP
References
Simple page transfer scenario <figure 7-59 Tanenbaum>
- Browser displays default page
- User enters URL (http://www.site.domain/~person)
- Browser library call resolves hostname to IP address via DNS server
- Browser makes TCP connection to port 80 of host
- Browser sends a GET command for page ~person/Welcome.html
- Server www.site.domain sends desired HTML page
- TCP socket connection is closed
- Browser interprets HTML, displays page
- Browser establishes separate connections for images, may use threads
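The client side of the scenario above can be sketched with raw sockets in Python. This is an illustration under assumptions, not how any particular browser is written: build_get and fetch are invented names, the request carries only a Host header, and there is no error handling or HTML interpretation.

```python
import socket

def build_get(path, host):
    """Format a bare HTTP/1.0 GET request for one page."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")

def fetch(host, path, port=80, timeout=10):
    """Resolve the host, connect, send GET, read until the server closes."""
    ip = socket.gethostbyname(host)                 # DNS lookup
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_get(path, host))         # the GET command
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                            # server closed the socket
                break
            chunks.append(data)
    raw = b"".join(chunks)
    head, _, body = raw.partition(b"\r\n\r\n")      # headers, then the page
    return head.decode("iso-8859-1"), body
```

A call such as fetch("www.site.domain", "/~person/Welcome.html") would return the status line and headers as text and the page as bytes. Each image on the page would need another call, which is the one-connection-per-object cost discussed under Performance.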
The server side
The httpd daemon is waiting for incoming connection requests. When
it hears one it:
- allocates a new socket for this client request
- forks a kid process (or creates a thread)
- hands the client off to the kid
- goes back to listening.
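The accept loop above, sketched in Python with one thread per client instead of a forked process. All names here (make_listener, serve, handle) are hypothetical, the reply is a fixed page, and max_requests exists only so the sketch terminates; a real httpd loops forever.

```python
import socket
import threading

def handle(conn):
    """Child: read one request, send a fixed page, close the socket."""
    with conn:
        conn.recv(4096)  # the request is read but not parsed in this sketch
        body = b"<html><body>hello</body></html>"
        conn.sendall(b"HTTP/1.0 200 OK\r\n"
                     b"Content-type: text/html\r\n"
                     b"Content-length: " + str(len(body)).encode() + b"\r\n\r\n"
                     + body)

def serve(listener, max_requests):
    """Parent: accept, hand the client to a thread, go back to listening."""
    for _ in range(max_requests):
        conn, _addr = listener.accept()      # new socket for this client
        threading.Thread(target=handle, args=(conn,)).start()

def make_listener(port=0):
    """Listen on 127.0.0.1; port 0 asks the OS for any free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen(5)
    return listener
```

The parent never touches the request itself, so a slow client only ties up its own child.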
HTTP is built on TCP. Version 1.1 was first published as RFC 2068 in
January 1997. This brings up some questions:
Is HTTP connection-less or connection-oriented?
Is TCP a good match to the idea of web browsing?
What are the advantages and disadvantages of using TCP for HTTP?
How might efficiency be improved?
There are seven built-in methods in HTTP, and a means of extending the
protocol. Each request is answered with a status line, optionally followed
by data (e.g. the URL content from a GET). The built-in methods are
- GET - request to send a web page
An If-Modified-Since header allows the server to reply without
data to this request so a cached copy is used instead, saving transfer
time
- HEAD - request to read a web page's header
Handy if you want to check a URL, get its modification time, or collect
info for indexing.
- PUT - request to store a web page
Web server must be configured to allow this. The server can be set up to
require an authentication header.
- POST - request to append to a named resource
Like PUT, but does an append. Used, for example, to add a message to a
bulletin board.
- DELETE - request to delete a web page
Authentication header may be used. File system ultimately determines
success or failure of this request.
- LINK/UNLINK - requests to connect two existing resources, or to break
an existing connection between them
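The If-Modified-Since behavior of GET can be sketched as a server-side decision. This is a simplified illustration: conditional_get is an invented name, the headers are passed in as a plain dict, and the page's modification time is supplied by the caller rather than read from a file system.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_get(headers, last_modified, body):
    """Return (status, body) for a GET, honoring If-Modified-Since."""
    since = headers.get("If-Modified-Since")
    if since:
        try:
            cached_at = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            cached_at = None                 # unparseable date: ignore header
        if cached_at is not None:
            if cached_at.tzinfo is None:     # treat a zone-less date as GMT
                cached_at = cached_at.replace(tzinfo=timezone.utc)
            if last_modified <= cached_at:
                return 304, b""              # client's cached copy is current
    return 200, body

page_changed = datetime(1997, 2, 10, 22, 45, 55, tzinfo=timezone.utc)
request = {"If-Modified-Since": "Mon, 10 Feb 1997 23:48:22 GMT"}
print(conditional_get(request, page_changed, b"<html>...</html>"))
```

The 304 reply carries no body, which is where the transfer time is saved.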
The request would look like this
GET /mattmarg/ HTTP/1.0
User-Agent: Mozilla/2.0 (Macintosh; I; PPC)
Accept: text/html, */*
Cookie: name=value
Referer: http://www.webmonkey.com/html/96/47/index2a.html
Host: www.grippy.org
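A request like the one above is just lines of text, so a server can take it apart with ordinary string handling. A minimal sketch (parse_request is a made-up name; a real server also has to cope with malformed input and with a request body):

```python
def parse_request(text):
    """Split a raw HTTP request into (method, path, version, headers)."""
    lines = text.strip().splitlines()
    method, path, version = lines[0].split()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break                          # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers

raw = (
    "GET /mattmarg/ HTTP/1.0\r\n"
    "User-Agent: Mozilla/2.0 (Macintosh; I; PPC)\r\n"
    "Referer: http://www.webmonkey.com/html/96/47/index2a.html\r\n"
    "Host: www.grippy.org\r\n"
    "\r\n"
)
method, path, version, headers = parse_request(raw)
print(method, path, version, headers["host"])
```

Header names are lowercased here because HTTP treats them as case-insensitive.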
The status replies are things like
200 OK
304 Not Modified
400 Bad Request
404 Not Found
301 Moved Permanently (a redirect)
A server can send a redirect response when a URL is requested that the
server knows isn't quite right (for example, when you ask for someone's
home page without the trailing /) or when a redirect has been set up for
that URL (e.g. someone moved the page and didn't want everyone to get 404s
on the old URL). The redirect is configured in the .htaccess file in the
URL's directory, or in the server's configuration file.
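How a client might react to the status codes listed above, as a small Python sketch. The function name and the action labels are invented for this example; the Location header is the standard place for a redirect's new URL.

```python
def next_action(status, headers):
    """Decide what a client does with a reply, based on its status code."""
    if status == 200:
        return ("display", None)
    if status == 304:
        return ("use_cache", None)           # cached copy is still good
    if status in (301, 302) and "Location" in headers:
        return ("refetch", headers["Location"])
    if 400 <= status < 500:
        return ("report_error", None)
    return ("unhandled", None)

print(next_action(301, {"Location": "http://www.site.domain/~person/"}))
print(next_action(404, {}))
```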
After the status line come the response headers
HTTP/1.0 200 OK
Date: Mon, 10 Feb 1997 23:48:22 GMT
Server: Apache/1.1.1 HotWired/1.0
Content-type: text/html
Last-Modified: Tue, 11 Feb 1997 22:45:55 GMT
Performance
The first version of HTTP was really inefficient. One request per socket
meant multiple socket setups per page. Various things have been done to
improve performance and waste less bandwidth.
Keep-Alive
To improve efficiency and reduce latency of downloading complex web
pages a server allows for more than a single request to be sent down a
TCP socket connection. These multiple requests may be pipelined (i.e. more
than one can be sent before a response to the first is received). There
is a configured maximum number of requests per socket, plus a timer value
the server uses to tear down idle sockets.
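Connection reuse can be observed with a throwaway local server. The sketch below uses Python's http.server and http.client; Handler, fetch_all and demo are names made up for the example. It sends the requests one after another over a single socket and does not pipeline them. Each reply reports the client's TCP port, so identical ports mean the same connection was reused.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"        # lets the connection stay open

    def do_GET(self):
        body = f"{self.path} via client port {self.client_address[1]}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):        # keep the demo quiet
        pass

def fetch_all(host, port, paths):
    """Fetch several paths over ONE TCP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            results.append((resp.status, resp.read()))  # drain before reuse
    finally:
        conn.close()
    return results

def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)      # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_all("127.0.0.1", server.server_port, ["/a.html", "/b.gif"])
    finally:
        server.shutdown()
        server.server_close()

for status, body in demo():
    print(status, body.decode().strip())
```

The Content-Length header matters here: without it the client could only find the end of a reply by waiting for the socket to close, which defeats the reuse.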
Caching
Browsers keep a disk cache of URLs they've fetched before. The
cache is checked before a URL is downloaded. HTTP 1.1 gives fine-grained
control over what is cached, so a page can specify that some pieces of it
may come from a cache, but that others must be reloaded every time.
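In HTTP 1.1 this control is carried by the Cache-Control header. A simplified Python sketch of the freshness decision; the two function names are invented, only the no-store, no-cache and max-age directives are handled, and anything without an explicit lifetime is treated as needing a reload, which is stricter than real browsers.

```python
def parse_cache_control(value):
    """Turn 'public, max-age=60' into {'public': True, 'max-age': '60'}."""
    directives = {}
    for item in value.split(","):
        item = item.strip().lower()
        if not item:
            continue
        name, _, arg = item.partition("=")
        directives[name] = arg.strip('"') or True
    return directives

def is_fresh(cache_control, age_seconds):
    """May a cached copy of this age be used without asking the server?"""
    d = parse_cache_control(cache_control)
    if "no-store" in d or "no-cache" in d:
        return False
    max_age = d.get("max-age")
    if isinstance(max_age, str) and max_age.isdigit():
        return age_seconds < int(max_age)
    return False      # no explicit lifetime: reload (a simplification)

print(is_fresh("max-age=3600", 10))
print(is_fresh("no-cache", 0))
```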
Proxy servers
Act as intermediaries for HTTP requests. They usually do caching to
save bandwidth. They may also restrict access to certain domains, or to
certain types of content. For example, a corporation or ISP may run a proxy
server with a huge cache to reduce the use of the network connection for
commonly accessed URLs, or to block access to sites such as playboy.com.
An Internet Cache Protocol (ICP) has been defined for proxy server
neighbors to talk to each other and share information, further reducing
redundant access to web servers.
Note that the caching done by a proxy server is common to all users/client
browsers, and not just limited to one single user/browser, so the performance
boost is potentially greater. One large campus proxy server sees hit rates
of between 50 and 59%.
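Why a shared cache can beat per-browser caches, shown on a tiny made-up trace. Everything here is illustrative: the users, the URLs and the resulting rates are invented, and the caches are assumed to be unbounded with nothing ever expiring.

```python
def shared_hit_rate(requests):
    """Fraction of requests a single cache shared by all users can serve."""
    seen, hits, total = set(), 0, 0
    for _user, url in requests:
        total += 1
        if url in seen:
            hits += 1
        else:
            seen.add(url)
    return hits / total if total else 0.0

def per_user_hit_rate(requests):
    """Fraction served when every user has only a private cache."""
    seen, hits, total = set(), 0, 0
    for user, url in requests:
        total += 1
        if (user, url) in seen:
            hits += 1
        else:
            seen.add((user, url))
    return hits / total if total else 0.0

# Hypothetical trace of (user, URL) requests in arrival order.
trace = [("ann", "/a"), ("bob", "/a"), ("ann", "/b"),
         ("cat", "/a"), ("bob", "/b"), ("ann", "/a")]
print("shared:  ", shared_hit_rate(trace))
print("per user:", per_user_hit_rate(trace))
```

On this trace the shared cache serves four of the six requests and the private caches serve one, because most repeats come from a different user than the first fetch.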
Configuration of web server
how child processes are launched
which port it listens on
user and group for httpd to run as
who to mail with problems
server root (different from document root)
where the log files are
the server name you wish to use
Directives
DocumentRoot - where files accessible to the server live
UserDir - how to resolve ~username paths
DirectoryIndex - used when a file is not the last element of
the URL (Welcome.html on burbot)
AccessFileName - for access control; this file can have a password
Redirect - automatically redirects a client to a new location
for a document
ErrorDocument - override the built-in error messages with this
file
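How DocumentRoot, UserDir and DirectoryIndex combine to turn a URL path into a file name, sketched in Python. This is not Apache's algorithm: resolve is an invented function, the directory values are assumptions (public_html and Welcome.html match examples in these notes, the document root and /home are made up), and a path counts as a directory only if it is empty or ends in "/", whereas a real server checks the file system.

```python
import posixpath

def resolve(url_path, document_root="/var/www/htdocs",
            user_dir="public_html", directory_index="Welcome.html",
            home="/home"):
    """Map a URL path to a file path using the three directives."""
    if url_path.startswith("/~"):                    # UserDir case
        user, _, rest = url_path[2:].partition("/")
        base = posixpath.join(home, user, user_dir)
    else:                                            # DocumentRoot case
        base, rest = document_root, url_path.lstrip("/")
    if rest == "" or rest.endswith("/"):             # DirectoryIndex case
        rest += directory_index
    full = posixpath.normpath(posixpath.join(base, rest))
    if full != base and not full.startswith(base + "/"):
        raise ValueError("path escapes the served directory")
    return full

print(resolve("/~erickson/CS452/"))
print(resolve("/"))
```

The final check rejects paths that use ".." to climb out of the served directory.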
Logging
What can you find out? Hits, errors, client types.
Accesses
pm087-07.dialip.mich.net - - [31/Aug/1996:05:00:25 -0400] "GET /~wolfd/now8.gif HTTP/1.0" 200 1135
muskie.csis.gvsu.edu - - [04/Sep/1996:12:20:23 -0400] "GET /~erickson/CS380/Content.html HTTP/1.0" 200 1404
yandr-bh.yr.com - - [04/Sep/1996:12:06:07 -0400] "GET /~vanoflej/HenryMiller/images/cubeneg.jpg HTTP/1.0" 200 1784
148.61.25.48 - - [04/Sep/1996:12:02:15 -0400] "GET /~levinm/237downl.html HTTP/1.0" 200 1606
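Access entries in this layout can be picked apart with a regular expression. A minimal Python sketch, assuming each entry sits on one line in the field order shown above; LOG_LINE and parse_access are names chosen for the example.

```python
import re

LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access(line):
    """Return the fields of one access-log entry, or None if it doesn't match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

line = ('pm087-07.dialip.mich.net - - [31/Aug/1996:05:00:25 -0400] '
        '"GET /~wolfd/now8.gif HTTP/1.0" 200 1135')
entry = parse_access(line)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

From entries like this, hits per page or bytes served per host are a matter of counting.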
Errors
[Fri Aug 30 14:32:50 1996] httpd: access to /home/erickson/public_html/CS452/Course failed for smelt.csis.gvsu.edu, reason: file does not exist from http://www.csis.gvsu.edu/~erickson/CS452/452.html
[Fri Aug 30 15:30:08 1996] httpd: access to /home/erickson/public_html/NetworkingArchives/WindowsSockets/KenCoviak/descript.html failed for barpm3_p9.caribsurf.com, reason: file does not exist from http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&stq=20&fmt=.&q=wsacleanup
Client types
Mozilla/2.0 (compatible; MSIE 3.0; Windows 95)
Mozilla/2.0 (Win16; I)
Mozilla/2.02 (WinNT; I)
OmniWeb/2.0 OWF/1.0
Mozilla/2.02E-KIT (Win95; U; 16bit)
Mozilla/3.0 (Win95; I) via Squid Cache version 1.0.beta17
Mozilla/2.01 (Macintosh; I; PPC)
Mozilla/2.0 (compatible; MSIE 2.1; Windows 3.1)
Mozilla/3.0 (Win16; I) via proxy gateway CERN-HTTPD/3.0
libwww/2.17
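The client strings above can be tallied by browser family with a few lines of Python. The classification rule is a rough heuristic assumed for this example: anything containing MSIE is counted as Internet Explorer, any other Mozilla string as Netscape, and everything else by its product name.

```python
from collections import Counter

def classify(agent):
    """Rough browser family for one User-Agent string (heuristic)."""
    if "MSIE" in agent:
        return "Internet Explorer"       # announces itself as Mozilla-compatible
    product = agent.split("/", 1)[0]
    return "Netscape" if product == "Mozilla" else product

agents = [
    "Mozilla/2.0 (compatible; MSIE 3.0; Windows 95)",
    "Mozilla/2.0 (Win16; I)",
    "Mozilla/2.02 (WinNT; I)",
    "OmniWeb/2.0 OWF/1.0",
    "Mozilla/2.02E-KIT (Win95; U; 16bit)",
    "Mozilla/3.0 (Win95; I) via Squid Cache version 1.0.beta17",
    "Mozilla/2.01 (Macintosh; I; PPC)",
    "Mozilla/2.0 (compatible; MSIE 2.1; Windows 3.1)",
    "Mozilla/3.0 (Win16; I) via proxy gateway CERN-HTTPD/3.0",
    "libwww/2.17",
]
counts = Counter(classify(a) for a in agents)
for family, n in counts.most_common():
    print(family, n)
```

The "via" suffixes on two of the strings show requests that arrived through a proxy.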