Robert Tolksdorf
Freie Universität Berlin
7.1 Introduction
The Web is a network-based information system in which a huge number of components interwork. The main classes of these are Web servers that hold and deliver information, Web clients that request information, and intermediate components that help to make the overall system more efficient. In this chapter, we review these components and their functionalities.
Figure 7.1 shows a configuration of such components. There are two Web servers: one offers some specific information and the other hosts a Web search engine. There is a traditional Web browser as a client, which is configured to use a proxy to access the corporate Web server and to contact the search engine directly. Close to the search engine, a Web crawler retrieves information from the corporate Web server that will be fed into the full text index of the search engine, on which searches are performed later. All components run on different machines connected via IP. They communicate with HTTP to transfer requests for information and the corresponding responses. Figure 7.1 shows a very small excerpt of the real structure of the Web. The Web Server Survey of Netcraft (available at http://www.netcraft.com/Survey) counted 35,424,956 Web sites in January 2003. The HTTP service is available on a substantial fraction of the named computers on the Internet; these were counted as 171,638,297 in the January 2003 edition of the Internet Domain Survey of the Internet Software Consortium (available at http://www.isc.org/ds/WWW-200301). The number of clients cannot be counted. However, the number of people having Internet access can give an indication of it. The Global Internet Statistics survey of Global Reach (available at http://www.glreach.com/globstats) estimates the size of the worldwide online population in September 2002 as 619 million people.
General Operation
The identification of content on the Web is done by URLs. For HTTP communication, they have the format http://<Server's IP address>/<Path to information>. An example URL is http://www.w3.org/Protocols/Activity.html, which identifies the HTML page Protocols/Activity.html on the server www.w3.org. This Web page is stored in the Web server's file system in the form in which it is delivered to the client. Therefore, it is called static content. Another example is http://www.google.com/search?q=server, which addresses the script search on the
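To illustrate this addressing scheme, a client might split a URL into the server's address, the path to the information, and an optional query string roughly as follows. This is a sketch using Python's standard library, not the code of any particular browser:

```python
# Split an HTTP URL into the parts named above: the server's
# address, the path to the information, and an optional query string.
from urllib.parse import urlsplit

def split_http_url(url):
    """Return (host, path, query) for an http:// URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.path or "/", parts.query

# Static content: no query string.
#   split_http_url("http://www.w3.org/Protocols/Activity.html")
# Dynamic content: the query string carries the parameters, as in
#   split_http_url("http://www.google.com/search?q=server")
```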
server www.google.com and passes a parameter string q=server to it. The information returned from the server is generated fresh with each request; therefore, we speak of dynamic content in this case. The resolution of the server's address is part of the standard TCP usage by any client. The resolution of the path information to a specific chunk of data is up to the server. It may map it into some location in a local file system, or to the execution of some program on the server's machine. Web servers in general consist of several subcomponents. Among their duties are the following:

Interfacing with the network: Web servers by definition speak HTTP to communicate with clients. HTTP interaction is initiated by contacting the default port 80 on a machine, with TCP as the transport protocol. Since Web servers can be configured to accept connections on a different port and since HTTP can be directed to any port, this is actually just a default. A Web server has to set up the standard interface for the respective IP communication, which means to listen on port 80, to accept connections, and to perform the standard HTTP communication over the resulting communication socket.

Mapping of static content to the local file system: Web servers have access to some local file system, which is the persistent storage for static contents. How the path to information from the URL is mapped to a path in the file system is up to the server and its configuration. http://www.w3.org/Protocols/Activity.html could lead to the delivery of the file /htdocs/Protocols/Activity.html, /import/htdocs/staticcontent/Protocols/Activity.html, or ~auser/.public_html/Protocols/Activity.html. The mapping could also map to different filenames, for example, to c:\Protocols\Activity.htm. In the extreme case, it is not necessary to access a file system at all; a server could perform some network access or have static hardcoded HTML pages as constants in its program.
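The static mapping could be sketched as follows. The document root /htdocs and the escape check are assumptions for illustration, not the behavior of any particular server:

```python
# Map the path component of a URL to a file below a configured
# document root, the way a server might resolve static content.
import os.path

def map_to_docroot(docroot, url_path):
    """Resolve url_path inside docroot; reject paths that escape it."""
    candidate = os.path.normpath(os.path.join(docroot, url_path.lstrip("/")))
    root = os.path.normpath(docroot)
    if candidate != root and not candidate.startswith(root + os.sep):
        raise ValueError("path escapes the document root")
    return candidate

# map_to_docroot("/htdocs", "/Protocols/Activity.html")
#   resolves to "/htdocs/Protocols/Activity.html"
```

The check against paths that normalize to a location outside the document root is essential in practice; without it, a request path containing ".." could expose arbitrary files.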
Mapping of dynamic content by execution of programs: The server's configuration determines whether a URL denotes static or dynamic content. An example configuration could state that, after performing the static mapping, all files in the directory cgi-bin and ending in .cgi are programs to be executed to generate dynamic content. The server then has to start the program and to connect its standard input and standard output channels with the incoming and outgoing network connection. An important side issue of this operation is security, for example, under what user account the program runs and what permissions it has.

Coordination of the components: Web servers must be able to serve multiple requests at a time. Sequential processing is necessary only for the acceptance of TCP connections on port 80. All other activities of a server can in principle be executed concurrently in different threads. Managing this degree of concurrency needs coordination activities. These have to take care of shared writable resources, for example, log files, deal with processes started for dynamic resources, or optimize the degree of concurrency.
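The combination of sequential acceptance and concurrent handling can be sketched minimally as follows, assuming a fixed canned response instead of real request processing:

```python
# Thread-per-connection sketch: acceptance on the listening socket
# is sequential, handling of each accepted connection is concurrent.
import socket
import threading

def handle_connection(conn):
    """Read one request and answer with a fixed HTTP response."""
    conn.recv(4096)  # the request; headers are ignored in this sketch
    conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nhello")
    conn.close()

def serve(listener, connections=1):
    """Accept `connections` connections, each served in its own thread."""
    for _ in range(connections):
        conn, _addr = listener.accept()   # acceptance is sequential ...
        worker = threading.Thread(target=handle_connection, args=(conn,))
        worker.start()                    # ... handling is concurrent
```

A real server would loop forever, limit the number of worker threads, and coordinate access to shared resources such as log files.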
Dynamic Content
The static content offered by a server is generated by authoring Web pages in advance. This is usually supported by editors for individual pages, by programs that help to handle the complete structure of a site, or by advanced content management systems. Dynamic content is the basis for interaction on the Web: any form that accepts user input has to be processed on the server. Conceptually, the generation of the page that answers the form input is the goal of the server. In practice, of course, side effects such as storing inputs in a database or triggering the delivery of some items purchased with the form are the actual goal of this generation of dynamic content. Figure 7.2 shows a minimal script that generates dynamic content. We can assume that it is accessible by the (imaginary) URL http://www.foo.ba/cgi-bin/serverdate.cgi. We also assume that the server runs on a UNIX system that recognizes such a script as executable content. When a request for /cgi-bin/serverdate.cgi reaches the server www.foo.ba, it executes that script. The echo instructions in this script simply write a fixed string to the standard output. The line echo `/usr/bin/date` invokes the program /usr/bin/date, which outputs the current system date of the server, and forwards its result to the standard output.
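A hypothetical Python equivalent of such a serverdate script would follow the same pattern as the shell version: write a content-type header, a blank line, and a freshly generated body to the standard output.

```python
# CGI-style generation of dynamic content: a header block, an empty
# line, and a body containing the server's current date.
import time

def serverdate_response():
    """Build the output a serverdate CGI script would write to stdout."""
    body = "Current server date: " + time.strftime("%a, %d %b %Y %H:%M:%S")
    return "Content-Type: text/plain\r\n\r\n" + body

if __name__ == "__main__":
    print(serverdate_response())
```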
The Web server sends all output of the script to the Web client in its response, resulting in an answer as in Figure 7.3. The generated content always contains the date of its creation. For interactive Web sites, it is necessary to process input typed by the user into forms with scripts. The Common Gateway Interface (CGI) (see http://www.w3.org/CGI) defines a mechanism to pass these data to scripts. The core idea is to use environment variables to communicate this information. The Web server sets a specific set of about 20 variables and then calls the script (which is often also called a CGI program). An example variable is REMOTE_ADDR, which contains the Internet number from which a client sent the HTTP request. More important is QUERY_STRING, which contains the form input to be processed. Figure 7.4 shows the flow of information in the processing of a CGI request. After filling out a form, the client encodes the input data in the URL and sends it as part of its request to the server. The server identifies phone.cgi as a script and places the information received from the client into environment variables according to the CGI mechanism. The CGI program processes these data, for example, by querying a database accordingly. The results received are then placed in an HTML page, which is communicated to the server via the standard output of the CGI program. The server passes the generated HTML page to the client, which displays it as the result of the request to the server. Many Web servers offer another mechanism to include dynamic content in Web pages, the so-called Server Side Includes (SSI). The basic idea is that the server scans the Web pages to be delivered for a set of special tags. These tags are then replaced by dynamic content by the server. The set of these tags depends on the Web server used. For most Web servers, the SSI tags are actually HTML comments of the form
FIGURE 7.4 The flow of information in CGI processing: the client encodes its inputs in the URL and sends them by HTTP (like http://www.x.y/phone.cgi?name=Mike) to the server, which passes them to the CGI program that queries the database.
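The server's side of this flow can be sketched as follows: place the request data into environment variables and capture the program's standard output as the response body. The variable names follow the CGI convention; the helper itself is an illustration, not any real server's code.

```python
# Run a CGI program the way a server might: copy the CGI variables
# into the environment and collect the program's standard output.
import os
import subprocess
import sys

def run_cgi(program_argv, query_string, remote_addr):
    """Run a CGI program with QUERY_STRING/REMOTE_ADDR set; return its output."""
    env = dict(os.environ)
    env["QUERY_STRING"] = query_string   # the form input to process
    env["REMOTE_ADDR"] = remote_addr     # the client's Internet number
    env["GATEWAY_INTERFACE"] = "CGI/1.1"
    completed = subprocess.run(program_argv, env=env,
                               capture_output=True, text=True)
    return completed.stdout

# Usage sketch: run a small program that echoes the query string, e.g.
#   run_cgi([sys.executable, "-c",
#            "import os; print(os.environ['QUERY_STRING'])"],
#           "name=Mike", "10.0.0.1")
```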
With SSI:
<div align="right"> Last modified: <!--#flastmod file=" "-->. </div>
After expansion of SSI:
<div align="right"> Last modified: Wednesday, 08-Jan-2003 09:30:15 CET. </div>
FIGURE 7.5 The page with SSIs before and after processing by the server.
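The scan-and-replace step behind Figure 7.5 might look like this sketch; the handler table and the tag pattern are simplified assumptions, since the actual tag set depends on the server used:

```python
# Server Side Includes as scan-and-replace: find SSI tags (HTML
# comments of the form <!--#name ...-->) and substitute dynamic text.
import re

SSI_TAG = re.compile(r"<!--#(\w+)[^>]*-->")

def expand_ssi(page, handlers):
    """Replace each SSI tag with the text its handler produces."""
    def substitute(match):
        name = match.group(1)
        handler = handlers.get(name)
        # Unknown tags are left in place, as a comment is harmless HTML.
        return handler(match.group(0)) if handler else match.group(0)
    return SSI_TAG.sub(substitute, page)

# Usage sketch: expand a last-modified tag with a fixed timestamp.
#   expand_ssi('Last modified: <!--#flastmod file="x"-->.',
#              {"flastmod": lambda tag: "Wednesday, 08-Jan-2003 09:30:15 CET"})
```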
Web Browsers
Web browsers are what makes the Web usable: they retrieve information and display it. The success of the Web is due to two mutually reinforcing developments: the availability of more and more information on growing numbers of Web servers, and the availability of better and better browsers as clients. Historically, the first Web browser was written in 1990 on a NeXT machine by Tim Berners-Lee, who keeps a description including a screen shot at http://www.w3.org/People/Berners-Lee/WorldWideWeb.html. The first widespread client application was the Mosaic browser written by Marc Andreessen and Eric Bina at NCSA. It was available for a number of Unix platforms. Andreessen soon commercialized the application by cofounding Netscape and developing the browser of the same name. The rest is history: after a long time of dominating the browser market, Netscape was outnumbered by the installations of Microsoft's Internet Explorer. Today, Internet Explorer is by far the most used Web browser. A Web browser consists of multiple components that fulfill at least the following functionalities:

Accepting user inputs: A Web browser is a program with a GUI that accepts user inputs like mouse clicks, menu selections, or keyboard input. The browser has to interpret these inputs, for example, to determine on which link a user has clicked, what bookmark was selected from a menu, or which URL was typed in. Processing this input can affect the current display, affect the browser's state, or lead to a network access.

Retrieving information from servers: When a URL is determined, the respective Web server has to be contacted, which means extracting the server's IP address from the URL and initiating an HTTP connection with it. Then, the client has to extract the path to information and generate the respective HTTP request. The answer from the Web server has to be read and stored in some internal data format. Then, the network connection has to be closed.
The browser has to be able to react properly to the various response codes as defined in HTTP, for example, by retrieving a redirected URL from another server or by asking the user for necessary authentication credentials and retrying the access. The browser also has to scan the retrieved HTML code for further embedded data, like images, and retrieve these from the network as well.
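The retrieval step, including reaction to redirect response codes, could be sketched like this; it handles plain HTTP only, whereas real browsers deal with many more codes, headers, and error conditions:

```python
# A browser's retrieval step: contact the server named in the URL,
# send the request, and follow redirect response codes by
# re-requesting the Location target.
from urllib.parse import urlsplit
import http.client

def fetch(url, max_redirects=5):
    """Retrieve a URL over HTTP, following redirects; return (status, body)."""
    for _ in range(max_redirects):
        parts = urlsplit(url)
        conn = http.client.HTTPConnection(parts.hostname, parts.port or 80)
        conn.request("GET", parts.path or "/")
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            url = response.getheader("Location")   # the redirected URL
            conn.close()
            continue
        body = response.read()
        conn.close()
        return response.status, body
    raise RuntimeError("too many redirects")
```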
Rendering and displaying information: A Web page has to be rendered by the browser, which means associating the HTML elements with some display style, for example, that headlines with the <h1> tag are displayed in a very large font size and boldfaced. The style can be determined by the HTML standard, by a style specification in the Cascading Style Sheets language (http://www.w3.org/Style/CSS), or be browser-specific. Device-specific abilities also have to be taken into account for rendering. Then, the whole display has to be formatted. The main parameters here are the current width of the browser window and any scaling levels selected. Then, the rendered page has to be displayed in a window according to the windowing and graphic system found on the machine. After this, user inputs are again accepted.

Coordination of activities: Web browsers are usually able to offer multiple windows to the user. This implies that multiple interactions with the network can take place concurrently. Also, embedded elements in a page can be retrieved in parallel. The management of this concurrency needs coordination, which takes care of thread management and optimization of network usage, for example, by limiting the number of concurrent network accesses or ensuring that resources are retrieved only once and then reused.

Management of the user's interaction environment: The activity of browsing the Web is usually supported by some interaction environment that is specific to the user. It includes a set of preferred Web server addresses (the bookmarks or favorites), a user history, the storage of data entered into Web forms and its automatic reuse in further forms, account information for Web passwords, functionalities to export Web information to local programs, etc. The variety of functions is large, and so is the effort spent on them.
Since the basic network access and rendering of information are quite standardized, these user-centered functionalities leave room for browser vendors to come up with unique product features.

Support functionalities: Web information is not limited to HTML and some standard graphic formats. There is a variety of further, often proprietary, formats, for example, for audio. Web browsers usually support a wide variety of such formats and offer an interface for extension with so-called plug-ins that are able to render further formats. Modern Web browsers also include a set of further functionalities that support the user in other Internet services as well. Usually, a Web browser also includes a client for Netnews following the NNTP protocol and for email following the POP3, IMAP, and SMTP protocols. Finally, client-side activity is supported by scripting languages like JavaScript and execution environments like Java or ActiveX.
Other Clients
It is not necessary that Web servers are accessed by Web browsers only. Any program that obeys HTTP can retrieve information from a Web server. Examples are Web site copy programs that retrieve the contents of some Web site by traversing all links starting from some entry page. The content is then stored on a local disk and remains available even if no network connection is available. More important, however, are so-called Web crawlers, which are the basis for any search service on the Web like Google or AltaVista. Historically, one of the first was the WebCrawler (see http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html). It used the same principles as today's search engines. However, in 1994, the Web was significantly smaller: WebCrawler indexed about 50,000 pages. As of Spring 2003, the largest Web search engine is Google, with more than 3 billion pages indexed. A Web search engine is usually a full text index of Web pages on which queries can be placed. The technology of these engines differs in the way they determine the similarity between the query and the contents of the index and in the way they order the results. But the full text index has to be filled with contents retrieved from the Web. Since the Web is extremely large, this task has to be automated. Web crawlers are clients that automatically traverse the Web and feed the retrieved information into a full text index. They operate according to the following generic steps:

1. Initialize a list of URLs to retrieve.
2. Take a URL from the list and test whether it should be retrieved. This tests whether the URL has already been visited; also, the crawler could be configured to focus only on URLs from some specific Internet domains.
3. Retrieve the information referenced by the URL using HTTP.
4. Extract any references from the information (for HTML pages, e.g., the references from the tags <a>, <link>, <meta>, <img>, <object>, and <frameset>) and put them into the URL list.
5. Extract all contents from the retrieved information as text and enter it into the full text index.
6. Extract any metainformation available, like authorship, size, date written, etc., and store it in some database.
7. Repeat the process starting with step 2.

This basic process still defines the algorithm used by search engines. However, today, better strategies for selecting the next URL to follow exist; information can be extracted from more file types aside from HTML, and more metadata are retained.
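The generic steps above can be condensed into the following sketch, where the network is replaced by a caller-supplied fetch_page function so the control flow is visible without real HTTP access:

```python
# Generic crawl loop: maintain a URL list, test each URL, retrieve
# it, extract references and text, and repeat until the list is empty.
def crawl(start_urls, fetch_page, should_visit=lambda url: True):
    """Traverse pages breadth-first; return {url: text} for the index.

    fetch_page(url) returns (links, text) or None if unreachable.
    """
    to_visit = list(start_urls)          # step 1: initialize the URL list
    visited = set()
    index = {}                           # stands in for the full text index
    while to_visit:
        url = to_visit.pop(0)            # step 2: take a URL and test it
        if url in visited or not should_visit(url):
            continue
        visited.add(url)
        page = fetch_page(url)           # step 3: retrieve the information
        if page is None:
            continue
        links, text = page
        to_visit.extend(links)           # step 4: extract the references
        index[url] = text                # steps 5-6: feed text and metadata
    return index                         # step 7 is the loop itself
```

A real crawler would add better URL-selection strategies, politeness delays, and extraction from file types beyond HTML, as noted above.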
As a side effect, the overall traffic in the WAN and the load on Web servers are reduced. In practice, proxy and cache functionality are combined in a single program, leading to a proxy cache. Figure 7.6 shows a configuration where a proxy cache is located behind a firewall. It is contacted by several browsers and in part satisfies requests from a database of documents. The proxy cache contacts Web servers to retrieve documents not in this store. After retrieval, they are stored and then delivered to the clients asking for them.
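The proxy cache's core decision can be sketched in a few lines. The cache is a plain mapping here, whereas real proxy caches also honor expiry and validation headers and bound their storage:

```python
# Proxy cache decision: answer from the local document store when
# possible; otherwise retrieve from the origin server, store, deliver.
def proxy_request(url, cache, origin_fetch):
    """Serve url from cache, or fetch, store, and serve it.

    Returns (document, hit) where hit tells whether the cache answered.
    """
    if url in cache:
        return cache[url], True          # cache hit: no WAN traffic
    document = origin_fetch(url)         # cache miss: contact the server
    cache[url] = document                # store for later requests
    return document, False
```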
7.6 Summary
The Web is conceptually a simple client-server system with clearly defined interfaces. The main components in the Web are clients that access Web servers by HTTP to retrieve information. Mostly, that information is displayed to users by a browser, and often it is also used to build search engines. Intermediate components help in optimizing the performance of the overall system. The most important of these components are proxy caches.