
7
Web Servers, Clients, and Browsers

Robert Tolksdorf
Freie Universität Berlin

7.1 Introduction
7.2 The Architecture of the Web
7.3 Web Servers
    General Operation
    Dynamic Content
7.4 Web Clients
    Web Browsers
    Other Clients
7.5 Intermediate Components
7.6 Summary

7.1 Introduction
The Web is a network-based information system in which a huge number of components interwork. The main classes of these are Web servers, which hold and deliver information; Web clients, which request information; and intermediate components, which help to make the overall system more efficient. In this chapter, we review these components and their functionalities.

7.2 The Architecture of the Web


The Web is a network-based information system in which information is delivered by Web servers to Web clients on request. The components involved in this system can be classified into the following three categories:

1. Web servers hold information and make it available on request. They are accessed by TCP connections on the IP address of the machine they run on and the port they are configured to listen to. On that connection, they communicate with the Hypertext Transfer Protocol (HTTP).

2. Web clients request information from specific servers by issuing a request following HTTP. The information received is then further processed according to the kind of client. The best-known kind of Web client is the Web browser, which displays the information transferred from the server to a human user. The most important invisible kind of Web client is the crawler, which retrieves information from Web servers to feed it into the indexed full-text databases that are the core of Web search engines.

3. Intermediate components transform requests or information. One important kind of such component is the proxy, which acts on behalf of the actual client, for example, because the client is behind a firewall and should not be visible outside. The other important kind of intermediate component is the cache, which also acts as a proxy but stores the information it retrieves and answers further requests for it without having to contact the actual Web server.
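As a minimal sketch of the first two roles, a client can open a TCP connection to port 80 of a server and send a hand-written HTTP request, as in the following Python fragment; the host name is only an example and error handling is omitted.

    import socket

    HOST = "www.example.org"                      # example host; any reachable Web server works

    request = (
        "GET / HTTP/1.1\r\n"
        "Host: " + HOST + "\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

    with socket.create_connection((HOST, 80)) as sock:
        sock.sendall(request.encode("ascii"))     # send the HTTP request over the TCP connection
        response = b""
        while True:
            chunk = sock.recv(4096)               # read until the server closes the connection
            if not chunk:
                break
            response += chunk

    print(response.decode("iso-8859-1")[:300])    # status line, headers, and the start of the body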


Figure 7.1 shows a configuration of such components. There are two Web servers: one offers some specific information and the other hosts a Web search engine. There is a traditional Web browser as a client, which is configured to use a proxy to access the corporate Web server and to contact the search engine directly. Close to the search engine, a Web crawler retrieves information from the corporate Web server that will be fed to the full-text index of the search engine, on which searches are performed later. All components run on different machines connected via IP. They communicate with HTTP to transfer requests for information and the corresponding responses.

Figure 7.1 shows a very small excerpt of the real structure of the Web. The Web Server Survey of Netcraft (available at http://www.netcraft.com/Survey) counted 35,424,956 Web sites in January 2003. The HTTP service is available on a substantial fraction of the named computers on the Internet; these were counted as 171,638,297 in the January 2003 edition of the Internet Domain Survey of the Internet Software Consortium (available at http://www.isc.org/ds/WWW-200301). The number of clients cannot be counted. However, the number of people having Internet access can give an indication of it. The Global Internet Statistics survey of Global Reach (available at http://www.glreach.com/globstats) estimates the size of the worldwide online population in September 2002 as 619 million people.

7.3 Web Servers


Web servers are the source of information in the Web. They are, in general, passive in that they deliver the information they hold on request to clients. Historically, the first Web page on a server was http://nxoc01.cern.ch/hypertext/WWW/TheProject.html. The server ran at the European Organization for Nuclear Research (CERN), where the Web project was invented, and the page contained the description of this project. The first server program available to the public was httpd, developed at CERN by Ari Luotonen, Henrik Frystyk Nielsen, and Tim Berners-Lee. It is still available at http://www.w3.org/Daemon. It soon found a more successful rival in the NCSA (National Center for Supercomputing Applications) HTTPd. Today, the most used Web server is Apache (see http://www.apache.org). It ranks first in the Netcraft Web Server Survey (http://www.netcraft.com/survey), with an estimated share of about 62% in January 2003, ahead of the Microsoft Internet Information Server product family with a share of about 27% of all servers.

General Operation
The identification of content on the Web is done by URLs. For HTTP communication, they have the format http://<Server's IP address>/<Path to information>. An example URL is http://www.w3.org/Protocols/Activity.html, which identifies the HTML page Protocols/Activity.html on the server www.w3.org. This Web page is stored in the Web server's file system in the form in which it is delivered to the client; therefore, it is called static content. Another example is http://www.google.com/search?q=server, which addresses the script search on the server www.google.com and passes a parameter string q=server to it. The information returned from the server is generated fresh with each request; therefore, we speak of dynamic content in this case.

FIGURE 7.1 Components in the Web. The figure shows a Web server with product information, a Web server offering a search engine together with its indexed Web pages, a Web crawler as client, a Web browser as client, and an intermediate proxy.


The resolution of the server's address is part of the standard TCP usage by any client. The resolution of the path information to a specific chunk of data is up to the server. It may map it to some location in a local file system, or to the execution of some program on the server's machine.

Web servers in general consist of several subcomponents. Among their duties are the following (a minimal sketch of the first two follows this list):

- Interfacing with the network: Web servers by definition speak HTTP to communicate with clients. HTTP interaction is initiated by contacting the default port 80 on a machine, with TCP as the transport protocol. Since Web servers can be configured to accept connections on a different port, and since HTTP can be directed to any port, this is actually just a default. A Web server has to set up the standard interface for the respective IP communication, which means listening on port 80, accepting connections, and performing the standard HTTP communication over the resulting socket.

- Mapping of static content to the local file system: Web servers have access to some local file system, which is the persistent storage for static content. How the path to information from the URL is mapped to a path in the file system is up to the server and its configuration. http://www.w3.org/Protocols/Activity.html could lead to the delivery of the file /htdocs/Protocols/Activity.html, /import/htdocs/staticcontent/Protocols/Activity.html, or ~auser/.public_html/Protocols/Activity.html. The mapping could also lead to a different filename, for example, c:\Protocols\Activity.htm. In the extreme case, it is not necessary to access a file system at all: a server could perform some network access or hold static HTML pages hard-coded as constants in its program.

- Mapping of dynamic content by execution of programs: The server's configuration determines whether a URL denotes static or dynamic content. An example configuration could state that, after the static mapping has been performed, all files in the directory cgi-bin ending in .cgi are programs to be executed to generate dynamic content. The server then has to connect the standard input and standard output channels of such a program with the incoming and outgoing network connection and start the program. An important side issue of this operation is security, for example, under which user account the program runs and which permissions it has.

- Coordination of the components: Web servers must be able to serve multiple requests at a time. Sequential processing is necessary only for the acceptance of TCP connections on port 80; all other activities of a server can in principle be executed concurrently in different threads. Managing this degree of concurrency requires coordination, which has to take care of shared writable resources such as log files, deal with processes started for dynamic resources, and optimize the degree of concurrency.
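The first two duties can be sketched in a few lines of Python: the fragment below listens on a port, accepts connections one at a time, maps the request path into a document directory, and returns the file. The port number and document root are assumptions made for the example, and dynamic content, security checks, and concurrency are left out.

    import os
    import socket

    DOC_ROOT = "/htdocs"        # assumed document root
    PORT = 8080                 # port 80 usually needs privileges, so the sketch uses 8080

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(5)
        while True:
            conn, _addr = srv.accept()                       # accept one TCP connection at a time
            with conn:
                data = conn.recv(4096).decode("ascii", "replace")
                request_line = data.split("\r\n", 1)[0]      # e.g. "GET /Protocols/Activity.html HTTP/1.1"
                parts = request_line.split()
                path = parts[1] if len(parts) >= 2 else "/"
                local = os.path.normpath(os.path.join(DOC_ROOT, path.lstrip("/")))
                if local.startswith(DOC_ROOT) and os.path.isfile(local):
                    with open(local, "rb") as f:             # map the URL path to a file and deliver it
                        body = f.read()
                    header = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
                else:
                    body = b"<html><body>Not found</body></html>"
                    header = b"HTTP/1.0 404 Not Found\r\nContent-Type: text/html\r\n\r\n"
                conn.sendall(header + body)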

Dynamic Content
The static content offered by a server is generated by authoring Web pages in advance. This is usually supported by editors for individual pages, by programs that help to handle the complete structure of a site, or by advanced content management systems. Dynamic content is the basis for interaction on the Web: any form that accepts user input has to be processed on the server. Conceptually, the generation of the page that answers the form input is the goal of the server. In practice, of course, side effects such as storing inputs in a database or triggering the delivery of some items purchased with the form are the actual goal of this generation of dynamic content.

Figure 7.2 shows a minimal script that generates dynamic content. We can assume that it is accessible by the (imaginary) URL http://www.foo.ba/cgi-bin/serverdate.cgi. We also assume that the server runs on a UNIX system that recognizes such a script as executable content. When a request for /cgi-bin/serverdate.cgi reaches the server www.foo.ba, it executes that script. The echo instructions in this script simply write strings to the standard output. The line echo `/usr/bin/date` invokes the program /usr/bin/date, which prints the current system date of the server, and forwards its result to the standard output.



#!/bin/sh
echo "Content-type: text/html"
echo
echo "<html> <head> <title>Date here</title> </head> <body>"
echo "The current date here:<b>"
echo `/usr/bin/date`
echo "</b>. Thanks for using this service. </body> </html>"

FIGURE 7.2 A sample script.


Content-type: text/html

<html> <head> <title>Date here</title> </head> <body>
The current date here:<b>
Sun Dec 29 20:54:41 2002
</b>. Thanks for using this service. </body> </html>

FIGURE 7.3 The output of the script.

The Web server sends all output of the script to the Web client in its response, resulting in an answer as in Figure 7.3. The generated content always contains the date of its creation.

For interactive Web sites, it is necessary to process input typed by the user into forms with scripts. The Common Gateway Interface (CGI) (see http://www.w3.org/CGI) defines a mechanism to pass these data to scripts. The core idea is to use environment variables to communicate this information. The Web server sets a specific set of about 20 variables and then calls the script (which is often also called a CGI program). An example variable is REMOTE_ADDR, which contains the Internet address from which a client sent the HTTP request. More important is QUERY_STRING, which contains the form input to be processed.

Figure 7.4 shows the flow of information in the processing of a CGI request. After filling out a form, the client encodes the input data in the URL and sends it as part of its request to the server. The server identifies phone.cgi as a script and places the information received from the client into environment variables according to the CGI mechanism. The CGI program processes these data, for example, by querying a database accordingly. The results received are then placed in an HTML page, which is communicated to the server via the standard output of the CGI program. The server passes the generated HTML page to the client, which displays it as the result of the request to the server.

Many Web servers offer another mechanism to include dynamic content in Web pages, the so-called Server Side Includes (SSI). The basic idea is that the server scans the Web pages to be delivered for a set of special tags. These tags are then replaced by dynamic content by the server. The set of these tags depends on the Web server used. For most Web servers, the SSI tags are actually HTML comments of the form

<!--#directive attribute1=value1 attribute2=value2-->.


This makes the approach compatible with all Web servers: if SSIs are not parsed and replaced, they appear as HTML comments and are ignored by clients. Figure 7.5 shows an example use of SSI: the regular HTML code includes one SSI tag containing the directive flastmod. This tag is replaced by the Web server with the date of the last modification of a file within the Web server's file system; the file's name is given in the attribute file. If the attribute has an empty value, as in the example, the modification date of the file that contains the SSI directive is taken.
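How a server might expand such a directive can be sketched as follows; this is a rough Python illustration under the tag syntax shown above, not the implementation or configuration of any particular server.

    import os
    import re
    import time

    FLASTMOD = re.compile(r'<!--#flastmod\s+file="(.*?)"\s*-->')

    def expand_ssi(html, page_path):
        # Replace each flastmod tag with the modification date of the named file,
        # or of the page itself when the file attribute is empty.
        def replace(match):
            target = match.group(1).strip() or page_path
            mtime = os.path.getmtime(target)
            return time.strftime("%A, %d-%b-%Y %H:%M:%S %Z", time.localtime(mtime))
        return FLASTMOD.sub(replace, html)

    # Example: expand the directives in a page before delivering it to the client.
    # print(expand_ssi(open("index.shtml").read(), "index.shtml"))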

7.4 Web Clients


Web servers are accessed by programs that converse in HTTP and interact with the server accordingly. Only at first glance are all clients Web browsers. There are a number of components in the Web that are not intended for user interaction, such as crawlers, proxies, and caches. We provide an overview of them in the following.

FIGURE 7.4 Processing of a CGI request. The client encodes its form inputs in the URL and sends them by HTTP (e.g., http://www.x.y/phone.cgi?name=Mike). The server places the inputs into environment variables ($QUERY_STRING=name=Mike) and starts the CGI program (phone.cgi), which translates them into a database query (SELECT ...) and receives the query results (<Mike, 2345-78>). The program writes the results, translated to HTML, to its standard output, and the server sends the resulting page back to the client by HTTP (<html> <body> <i>2345-78</i> ...).
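A minimal CGI program in the spirit of the hypothetical phone.cgi above could look like the following Python sketch; the field name and the hard-coded lookup table stand in for the form input and the database of the figure.

    #!/usr/bin/env python3
    # Sketch of a CGI program: read the form input from the environment,
    # look up a hard-coded stand-in for the database, and write HTML to stdout.
    import os
    from urllib.parse import parse_qs

    PHONEBOOK = {"Mike": "2345-78", "Anna": "9921-03"}      # stands in for the database

    query = parse_qs(os.environ.get("QUERY_STRING", ""))    # e.g. "name=Mike"
    name = query.get("name", ["?"])[0]
    number = PHONEBOOK.get(name, "unknown")

    print("Content-type: text/html")
    print()
    print("<html> <body> Phone number of " + name + ": <i>" + number + "</i> </body> </html>")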

With SSI:
<div align="right"> Last modified: <!--#flastmod file="" -->. </div>

After expansion of SSI:
<div align="right"> Last modified: Wednesday, 08-Jan-2003 09:30:15 CET. </div>

FIGURE 7.5 The page with SSIs before and after processing by the server.

Web Browsers
Web browsers are what makes the Web usable: they retrieve information and display it. The success of the Web is due to two mutually reinforcing developments: the availability of more and more information on growing numbers of Web servers and the availability of better browsers as clients. Historically, the first Web browser was written in 1990 on a NeXT machine by Tim Berners-Lee, who keeps a description including a screen shot at http://www.w3.org/People/Berners-Lee/WorldWideWeb.html. The first widespread client application was the Mosaic browser, written by Marc Andreessen and Eric Bina at NCSA. It was available for a number of Unix platforms. Andreessen soon commercialized the application by cofounding Netscape and developing the browser of the same name. The rest is history: after a long time of dominating the browser market, Netscape was overtaken by installations of Microsoft's Internet Explorer. Today, Internet Explorer is by far the most used Web browser.

A Web browser consists of multiple components that fulfill at least the following functionalities:

- Accepting user inputs: A Web browser is a program with a GUI that accepts user inputs like mouse clicks, menu selections, or keyboard input. The browser has to interpret these inputs, for example, to determine which link a user has clicked, which bookmark was selected from a menu, or which URL was typed in. Processing this input can affect the current display, affect the browser's state, or lead to a network access.

- Retrieving information from servers: When a URL is determined, the respective Web server has to be contacted, which means extracting the server's IP address from the URL and initiating an HTTP connection with it. Then, the client has to extract the path to information and generate the respective HTTP request. The answer from the Web server has to be read and stored in some internal data format. Then, the network connection has to be closed. The browser has to be able to react properly to the various response codes defined in HTTP, for example, by retrieving a redirected URL from another server or by asking the user for the necessary authentication credentials and retrying the access. The browser also has to scan the retrieved HTML code for further embedded data, like images, and retrieve these from the network as well (a sketch of this retrieval step follows the list).

- Rendering and displaying information: A Web page has to be rendered by the browser, which means associating the HTML elements with some display style, for example, that headlines marked with the <h1> tag are displayed in a very large, boldfaced font. The style can be determined by the HTML standard, by a style specification in the Cascading Style Sheets language (http://www.w3.org/Style/CSS), or be browser-specific. Device-specific abilities also have to be taken into account for rendering. Then, the whole display has to be formatted; the main parameters here are the current width of the browser window and any scaling level selected. Then, the rendered page has to be displayed in a window according to the windowing and graphics system found on the machine. After this, user inputs are again accepted.

- Coordination of activities: Web browsers are usually able to offer multiple windows to the user. This implies that multiple interactions with the network can take place concurrently. Also, embedded elements in a page can be retrieved in parallel. The management of this concurrency needs coordination, which takes care of thread management and the optimization of network usage, for example, by limiting the number of concurrent network accesses or by ensuring that resources are retrieved only once and then reused.

- Management of the user's interaction environment: The activity of browsing the Web is usually supported by some interaction environment that is specific to the user. It includes a set of preferred Web server addresses (the bookmarks or favorites), a user history, the storage of data entered into Web forms and its automatic reuse in further forms, account information for Web passwords, functionalities to export Web information to local programs, etc. The variety of functions is large, and so is the effort spent on them. Since basic network access and the rendering of information are quite standardized, these user-centered functionalities leave room for browser vendors to come up with unique product features.

- Support functionalities: Web information is not limited to HTML and some standard graphics formats. There is a variety of further, often proprietary formats, for example, for audio. Web browsers usually support a wide range of such formats and offer an interface for extensions with so-called plug-ins that are able to render further formats. Modern Web browsers also include a set of further functionalities that support the user in other Internet services as well. Usually, a Web browser also includes a client for Netnews following the NNTP protocol and for email following the POP3, IMAP, and SMTP protocols. Finally, client-side activity is supported by scripting languages like JavaScript and execution environments like Java or ActiveX.
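To make the retrieval step concrete, the following Python sketch fetches a page (the standard library follows redirects for us) and collects the URLs of embedded images that a browser would then have to retrieve as well; no rendering is attempted, and the start URL is only an example.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class ImageCollector(HTMLParser):
        # Collects the absolute URLs of <img> elements found in a page.
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.images = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for name, value in attrs:
                    if name == "src" and value:
                        self.images.append(urljoin(self.base_url, value))

    URL = "http://www.example.org/"               # example start page
    with urlopen(URL) as response:                # redirects are followed automatically
        final_url = response.geturl()             # may differ from URL after a redirect
        html = response.read().decode("utf-8", "replace")

    collector = ImageCollector(final_url)
    collector.feed(html)
    print("Embedded images to retrieve as well:", collector.images)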

Other Clients
Web servers need not be accessed by Web browsers only. Any program that speaks HTTP can retrieve information from a Web server. Examples are Web site copy programs that retrieve the contents of some Web site by traversing all links starting from some entry page. The content is then stored on a local disk and is available for use even when no network connection is available. More important, however, are the so-called Web crawlers that are the basis for any search service on the Web, like Google or AltaVista. Historically, one of the first was the WebCrawler (see http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html). It used the same principles as today's search engines. However, in 1994 the Web was significantly smaller: WebCrawler indexed about 50,000 pages. As of spring 2003, the largest Web search engine is Google, with more than 3 billion pages indexed.

A Web search engine is usually a full-text index of Web pages on which queries can be placed. The technology of these engines differs in the way they determine the similarity between the query and the contents of the index and in the way they order the results. But the full-text index has to be filled with contents retrieved from the Web. Since the Web is extremely large, this task has to be automated. Web crawlers are clients that automatically traverse the Web and feed the retrieved information into a full-text index. They operate according to the following generic steps:

1. Initialize a list of URLs to retrieve.
2. Take a URL from the list and test whether it should be retrieved. This tests whether the URL has already been visited; the crawler could also be configured to focus only on URLs from specific Internet domains.
3. Retrieve the information referenced by the URL using HTTP.


4. Extract any references from the information (for HTML pages, e.g., the references from the tags <a>, <link>, <meta>, <img>, <object>, and <frameset>) and put them into the URL list.
5. Extract all contents from the retrieved information as text and enter it into the full-text index.
6. Extract any meta-information available, like authorship, size, date written, etc., and store it in some database.
7. Repeat the process starting with step 2.

This basic process still defines the algorithm used by search engines. However, today better strategies exist for selecting the next URL to follow, information can be extracted from more file types besides HTML, and more metadata are retained.
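A highly simplified crawler following these steps might look like the Python sketch below. The seed URL, the page limit, and the in-memory index are assumptions made for the example; a real crawler would respect robots.txt, handle errors more carefully, and persist its data.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkAndTextParser(HTMLParser):
        # Collects outgoing references and the plain text of a page.
        def __init__(self, base_url):
            super().__init__()
            self.base_url, self.links, self.text = base_url, [], []

        def handle_starttag(self, tag, attrs):
            if tag in ("a", "link", "img", "frame"):
                for name, value in attrs:
                    if name in ("href", "src") and value:
                        self.links.append(urljoin(self.base_url, value))

        def handle_data(self, data):
            self.text.append(data)

    def crawl(seed, limit=10):
        to_visit, visited, index = [seed], set(), {}        # step 1: initialize the URL list
        while to_visit and len(visited) < limit:
            url = to_visit.pop(0)
            if url in visited or not url.startswith("http"):
                continue                                    # step 2: skip unwanted or visited URLs
            visited.add(url)
            try:
                with urlopen(url) as resp:                  # step 3: retrieve via HTTP
                    html = resp.read().decode("utf-8", "replace")
            except OSError:
                continue
            parser = LinkAndTextParser(url)
            parser.feed(html)                               # step 4: extract references
            to_visit.extend(parser.links)
            index[url] = " ".join(parser.text)              # step 5: feed text into the index
        return index                                        # step 6 (meta-information) is omitted here

    # index = crawl("http://www.example.org/")              # example seed URL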

7.5 Intermediate Components


Browsers and crawlers communicate directly with Web servers. However, there are also components that act in between clients and servers. The most important ones are proxies and caches, discussed in this section. Historically, the CERN httpd server was also able to act as a cache. Today, many Web servers offer cache functionality either integrated or as configurable modules. A popular stand-alone cache program is the Squid proxy cache, available at http://www.squid-cache.org.

The direct connection between Web clients and servers is not always possible or advisable. For example, in a corporate network, clients should not be fully exposed to the outside world for security reasons. All Internet traffic will be routed through a firewall, which is the single and central interface between the IP-based LAN and the WAN. It can also be advisable to hide the internal network usage from the outside world. A solution to this is to establish a proxy that appears to be the origin of all Web accesses to servers but actually only forwards requests by clients in the LAN and directs the answers back to them. A client is configured to send its requests to a proxy at a fixed address. The proxy program forwards the request to the actual Web server and retrieves the results from there. The results are then forwarded to the original client. This configuration has the consequence that the actual client is never exposed to the outside world in the Internet as a WAN. All traffic is directed through the proxy, to which special security measures can be applied. Since the proxy is used by all LAN clients, it can be extended to offer additional functionality. As a cache, it can store all information retrieved, indexed by the URL used. Any future access for the same URL can then be satisfied from that cache without having to contact the actual server. The result is reduced WAN usage and lower network latency due to the avoidance of WAN connections to servers.


As a side effect, the overall traffic in the WAN and the load on Web servers are reduced. In practice, proxy and cache functionality are combined in a single program, leading to a proxy cache. Figure 7.6 shows a configuration where a proxy cache is located behind a firewall. It is contacted by several browsers and in part satisfies requests from a database of documents. The proxy cache contacts Web servers to retrieve documents not in the store. After retrieval, they are stored and then delivered to the clients asking for them.

FIGURE 7.6 Browsers connected to servers via a proxy cache. Several Web browsers behind a firewall contact a proxy cache, which answers requests from its document database where possible and otherwise retrieves the documents from the Web servers.
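The caching logic itself can be sketched in a few lines of Python: requests are answered from a dictionary keyed by URL when possible and forwarded to the origin server otherwise. This is only an illustration; a real proxy cache such as Squid also handles expiry, cache-control headers, and many concurrent clients.

    from urllib.request import urlopen

    class ProxyCache:
        # Answers requests from a local store when possible, otherwise forwards
        # them to the origin server and keeps the retrieved document.
        def __init__(self):
            self.store = {}                           # URL -> retrieved document

        def fetch(self, url):
            if url in self.store:                     # cache hit: no WAN connection needed
                return self.store[url]
            with urlopen(url) as response:            # cache miss: contact the origin server
                document = response.read()
            self.store[url] = document
            return document

    # cache = ProxyCache()
    # page = cache.fetch("http://www.example.org/")   # first call goes to the origin server
    # page = cache.fetch("http://www.example.org/")   # second call is answered locally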

7.6 Summary
The Web is conceptually a simple client-server system with clearly defined interfaces. The main components in the Web are clients that access Web servers by HTTP to retrieve information. Mostly, that information is displayed to users by a browser; often it is also used to build search engines. Intermediate components help to optimize the performance of the overall system. The most important of these components are proxy caches.

About the Author


Robert Tolksdorf is professor of computer science at the Freie Universität Berlin, heading the group on network-based information systems NBI (http://nbi.inf.fu-berlin.de). His research focuses include the study and development of languages, models, and systems that guide the coordination of activities in networked systems; the study, application, and evaluation of XML technologies for network-based information systems; the study and application of Semantic Web technologies; and the use of novel models for the coordination of agent societies on a large scale, especially models based on swarm intelligence. Tolksdorf has authored several journal articles, four books on Internet and coordination technologies, and about 90 refereed publications, invited book contributions, and research reports. He is a member of the IEEE Computer Society, the ACM, and the German Informatics Society, where he chairs the special interest group on multimedia.

