
Voice enabled web browser

CHAPTER 1 INTRODUCTION
A voice browser is "a device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities." This definition is a broad one. That the system deals with speech is obvious from the first word of the name, but what makes a software system that interacts with the user via speech a "browser"? The answer is that the information the system uses, whether domain data or dialog flow, is dynamic and comes from somewhere on the Internet. From an end user's perspective, the aim is to provide a service similar to what graphical browsers of HTML and related technologies provide today, but on devices that are not equipped with full browsers or even the screens to support them. The situation is exacerbated by the fact that much of today's content depends on scripting languages and third-party plug-ins to work correctly.

Much of the effort concentrates on the telephone as the first voice browsing device. This is not to say that the telephone is the preferred embodiment of a voice browser, only that the number of such access devices is huge and that the telephone sits at the opposite end of the continuum from the graphical browser, which highlights the requirements that make a speech interface viable. It was clear early on that this limit on scope was also needed in order to make progress, given the significant challenges in designing a system that uses or integrates with existing content, or that automatically scales to the features of various access devices.

Department of Computer Science and Engineering, SCEM, Mangalore


CHAPTER 2 MOTIVATION
The primary method of Internet access today continues to be the computer, which has certain advantages as well as some limitations. Computers offer a visual Internet experience that is usually rich in content, but some basic computer skills and knowledge are needed to use them. Computer-based access is also proving insufficient for the professional on the move: in the car, or away from the office or computer, accessing the Web is difficult, if not impossible. In addition, an increasing number of people prefer an interface that allows them to hear and speak rather than see and click or type.

Some existing Internet users have also identified problems with the visual Internet experience. Pages are increasingly full of graphics, advertisement banners and similar elements, which move, flash, and blink as they vie for attention. Some find this information overload annoying, and lament the delays it creates by severely taxing the available bandwidth.

The "Digital Divide"
While computers and their use are on the rise, they are not ubiquitous yet. A large segment of the population still does not have access, both in the United States and in other parts of the world. Internet access is thus limited to a small fraction of the world population, and the majority is left out. This gap between those who can effectively use new information from the Internet and those who cannot is known as the digital divide. Bridging it is the key to ensuring that most people in the world are able to access the Internet. Making computers ubiquitous is not an attractive or feasible solution, at least in the near future, because of various barriers. One key barrier is cost, although the price of a computer has come down significantly in recent years. A simpler and more widely available device, such as the telephone, could provide access to the Internet as well, thus bridging the digital divide.


The "Language Divide"
Today more than eighty percent of website content is written in English. The gap between those who can use this content and those who cannot because of the language barrier is called the language divide.

As the need for alternative access to the Internet becomes more evident, several technology companies are pursuing solutions. Their products include smart cell phones with visual displays, intelligence built into the handset, and voice-activated Web sites. These products address different aspects of the problems outlined above.


CHAPTER 3 THE CHALLENGE


The challenge is to integrate existing technologies, or develop new ones, to make simple, affordable, alternative Internet access possible. As the need for an alternative access method has become evident, technologists continue to make progress toward such solutions. One key area of focus has been voice-based technology, which offers a very natural interface for most people and addresses the limitations described earlier. A voice interface provides an alternative to the visually based interface, and a device such as the telephone provides a readily accessible alternative to the computer. Several existing technologies are keys to the solution, but the problem lies in integrating them into useful applications of greater value than their individual components. These technologies include:

Voice Extensible Markup Language (VoiceXML, or VXML), an XML-based markup language that does for voice dialogs what HTML does for visual Web pages. This technology adds voice capability to a Web site: content authored in VXML is presented in audio format with voice navigation.

Speech Application Language Tags (SALT), a specification for supporting multimodal communication from PCs, cell phones, PDAs and other handheld devices. For example, input can be voice (such as asking for directions) and output can be data (a map pops up). SALT is a lightweight set of extensions to existing markup languages, allowing developers to embed speech enhancements in existing HTML, XHTML and XML pages. As with VoiceXML, applications are portable thanks to the separation from the underlying hardware and platform.

Text-to-speech (TTS), which converts text automatically to synthetic speech. It allows communication between computers and humans through a natural interface, namely speech.

Telephone integration, which is the key to interfacing with computers from a remote location. A protocol is needed to communicate with the computer from a telephone using voice. This also includes multimedia integration (e.g. with .wav files).


Intelligent software agents are needed to automate communication between a telephone and a computer and between a computer and a Web site, to interpret the contents of a web page, to extract the key information that makes sense in audio, to navigate efficiently through web pages, and to manage access to the Internet.

The first technology listed, VoiceXML, is a very elegant solution that leverages technology specifically developed for audio Internet access. However, it requires that Web sites be customized, or VXML-enabled, which means rewriting the web pages in VXML. According to analysts, today there are more than a billion web pages. Assuming that it takes one hour and costs about $100 to rewrite one page, the cost to voice-enable all sites would be about $100 billion. Clearly, it will take several years before the majority of popular pages are VXML-enabled. Today, only a very small portion of all Web pages is voice-enabled using VXML.

The paper presents a solution that integrates the other technologies listed into a useful, audio-based approach for accessing the Internet today. It is independent of the timeline, interest and willingness of content providers to update their pages to be VXML-enabled or SALT-enabled.

Another approach is to provide Internet access over wireless devices such as a Palm Pilot or a cell phone with a screen. However, this method has inherent limitations, such as the small size of the screen and the need for a special phone. The website also needs to be rewritten in WML. Today's wireless Internet industry faces many challenges due to limited bandwidth and small screens. Cell phone based Internet access is very expensive, and users do not want to pay high service fees. Moreover, our eyes and fingers are not changing while the devices are getting smaller and smaller, so existing visual access is going to become even more difficult in the future.
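The cost estimate above is simple arithmetic and can be checked directly. The sketch below uses the figures quoted in the text (one billion pages, one hour and $100 per page). The 2,000 working hours per person-year used to express the effort is an added assumption for illustration, not a figure from the text, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the VXML rewrite estimate.
PAGES = 1_000_000_000          # "more than a billion web pages" (lower bound)
COST_PER_PAGE_USD = 100        # figure quoted in the text
HOURS_PER_PAGE = 1             # figure quoted in the text
HOURS_PER_PERSON_YEAR = 2_000  # assumption, for illustration only


def rewrite_cost_usd(pages: int, cost_per_page: int) -> int:
    """Total cost of rewriting every page once."""
    return pages * cost_per_page


def rewrite_effort_person_years(pages: int, hours_per_page: int) -> float:
    """Total effort, expressed in person-years of work."""
    return pages * hours_per_page / HOURS_PER_PERSON_YEAR


if __name__ == "__main__":
    total = rewrite_cost_usd(PAGES, COST_PER_PAGE_USD)
    effort = rewrite_effort_person_years(PAGES, HOURS_PER_PAGE)
    print(f"cost: ${total / 1e9:.0f}B")
    print(f"effort: {effort:,.0f} person-years")
```

Because the page count is a lower bound, both results should be read as lower bounds as well.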


CHAPTER 4 VOICE BROWSER DOCUMENTS


4.1 Dialog Requirements: "A prioritized list of requirements for spoken dialog interaction which any proposed markup language (or extension thereof) should address." The Dialog Requirements document describes the properties of a voice browser dialog, including a discussion of modalities (input and output mechanisms combined with various dialog interaction capabilities), functionality (system behavior) and the format of a dialog language. The dialog language itself is not defined, but a list of criteria is given that any proposed language should meet. An important requirement of any proposed dialog language is ease of creation. Dialogs can be created with tools ranging from a simple text editor, through more specific tools such as an (XML) structure editor, to tools built specifically to deal with the semantics of the language at hand.

4.2 Grammar Representation Requirements: This document defines a speech recognition grammar specification language that will be generally useful across a variety of speech platforms used in the context of a dialog and synthesis markup environment. When the system or application needs to describe to the speech recognizer what to listen for, one way it can do so is via a format that is both human-readable and machine-readable.

4.3 Model Architecture for Voice Browser Systems Representations: "To assist in clarifying the scope of charters of each of the several subgroups of the W3C Voice Browser Working Group, a representative or model architecture for a typical voice browser application has been developed. This architecture illustrates one possible arrangement of the main components of a typical system, and should not be construed as a recommendation."


4.4 Natural Language Processing Requirements: This document establishes a prioritized list of requirements for natural language processing in a voice browser environment. The data that a voice browser uses to create a dialog can vary from a rigid set of instructions and state transitions, whether declaratively and/or procedurally stated, to a dialog that is created dynamically from information and constraints about the dialog itself. The NLP requirements document describes the requirements of a system that takes the latter approach, using an example paradigm of a set of tasks operating on a frame-based model. Slots in the frame that are optionally filled guide the dialog and provide contextual information used for task selection.

4.5 Speech Synthesis Markup Requirements: This document establishes a prioritized list of requirements for speech synthesis markup which any proposed markup language should address. A text-to-speech system, which is usually a standalone module that does not actually "understand the meaning" of what is spoken, must rely on hints to produce an utterance that is natural and easy to understand and that, moreover, evokes the desired meaning in the listener. In addition to these prosodic elements, the document also describes issues such as multilingual capability, pronunciation of words not in the lexicon, time synchronization, and textual items that require special preprocessing before they can be spoken properly.
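The frame-based model mentioned in 4.4 can be illustrated with a minimal sketch. The `Frame` class, the slot names and the prompt wording below are all hypothetical; they are not taken from the requirements document. The sketch only shows how unfilled slots can drive the next turn of the dialog.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    """A task frame: named slots that the dialog tries to fill."""
    task: str
    required: List[str]
    slots: Dict[str, str] = field(default_factory=dict)

    def fill(self, slot: str, value: str) -> None:
        """Record a value heard from the user for one slot."""
        self.slots[slot] = value

    def missing(self) -> List[str]:
        """Required slots that have not been filled yet, in order."""
        return [s for s in self.required if s not in self.slots]


def next_prompt(frame: Frame) -> Optional[str]:
    """Pick the next system prompt from the first unfilled slot.

    Returns None when the frame is complete and the task can run.
    """
    missing = frame.missing()
    if not missing:
        return None
    return f"Please say the {missing[0]}."


if __name__ == "__main__":
    frame = Frame(task="weather", required=["city", "day"])
    print(next_prompt(frame))
    frame.fill("city", "Mangalore")
    print(next_prompt(frame))
```

The dialog is not scripted as a fixed sequence of states: whatever the user supplies first is stored, and the system asks only for what is still missing.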


CHAPTER 5 SYSTEM ARCHITECTURE


"The architecture diagram was created as an aid to how we structure our work into subgroups. The diagram will help us to pinpoint areas currently outside the scope of existing groups." Although individual voice browser systems are apt to vary considerably, it is reasonable to try to point out architectural commonalities as an aid to discussion, design and implementation. Not all segments of this diagram need be present in any one system, and systems that implement various subsets of this functionality may be organized differently. Systems built entirely from third-party components, with an architecture imposed on them, may contain unused or redundant functional blocks.

Two types of clients are illustrated: telephony and data networking. The fundamental telephony client is, of course, the telephone, either wireline or wireless. The handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes so that ASR barge-in works over audio output. A speakerphone also requires an acoustic echo canceller to remove room echoes. The data network interface requires only acoustic echo cancellation, and only if used with an open microphone, since there is no line echo on data networks. The IP interface is shown for illustration only; other data transport mechanisms can be used as well.

The model architecture is shown below. Solid (green) boxes indicate system components, peripheral solid (yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes indicate information flows.


CHAPTER 6 KEY TECHNOLOGY


The idea of listening to the Internet may at first sound a bit like watching the radio. How does a visual medium rich in icons, text, and images translate itself into an audible format that is meaningful and pleasing to the ear? The answer lies in an innovative integration of three distinct technologies that render visual content into short, precise, easily navigable, and meaningful text that can be converted to audio.

The technologies employed to accomplish this feat are:
1. Speech recognition
2. Text-to-speech translation
3. Artificial intelligence
The steps in which they are applied are document processing and document rendering.

An Intelligent Agent (IA) is located between the user and the Internet (Figure 1). The IA automates the process of rendering information from the Internet to the user in an audio format that is meaningful, precise, easily navigable and pleasant to listen to. Rendering is achieved by using Page Highlights (a method to find and speak the key contents of a page), finding the right and only the relevant contents on a linked page, assembling those contents, and providing easy navigation. These key steps use the information available in the visual web page itself, together with algorithms that draw on features such as text content, color, font size, links, paragraphs, and amount of text. Artificial intelligence techniques are used in this automated rendering process, which is similar to how the human brain renders a visual page by selecting the information of interest. The IA also includes a language translation engine that dynamically translates web contents from one language into another in real time.
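The idea behind Page Highlights can be sketched as a scoring function over the visual blocks of a page. The text names the kinds of features used (font size, links, amount of text) but not the algorithm, so the `Block` fields, the weights and the 15-word cutoff below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One visual block of a page, as a parser might report it."""
    text: str
    font_size: int   # point size of the block's text
    is_link: bool    # True if the block is a hyperlink


def highlight_score(block: Block) -> float:
    """Heuristic score: large, short, linked text looks like a headline."""
    score = float(block.font_size)
    if block.is_link:
        score += 5.0          # assumed bonus for navigable items
    if len(block.text.split()) > 15:
        score -= 10.0         # assumed penalty: long blocks are body text
    return score


def page_highlights(blocks: List[Block], top_n: int = 3) -> List[str]:
    """Return the text of the top_n highest-scoring blocks."""
    ranked = sorted(blocks, key=highlight_score, reverse=True)
    return [b.text for b in ranked[:top_n]]


if __name__ == "__main__":
    page = [
        Block("Site map", 10, True),
        Block("Election results announced", 24, True),
        Block("word " * 20, 12, False),
        Block("Markets close higher", 18, True),
    ]
    for line in page_highlights(page, top_n=2):
        print(line)
```

A real agent would combine many more signals, including color and position, but the structure stays the same: score each block, then speak only the highest-ranked ones.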


The platform incorporates the highest-quality speech recognition and text-to-speech engines from third-party suppliers.

The PhoNET architecture is shown in Figure 2. The process starts with a telephone call placed by a user, who is prompted for a logon pass phrase. The user's pass phrase establishes the first connection to the Web site associated with that phrase and loads the first Web page. The HTML is parsed, as described later, separating text from other media types, isolating URLs from HTML anchors, and isolating the associated anchor titles (including ALT fields) for grammar generation. Grammar generation computes combinations of the words in each title to produce a wide range of alternative ways to say subsets of the title phrase. In this process simple function words ("and," "or," "the," etc.) are not allowed to occur in isolation, where they would be meaningless. Browser control commands such as "go back" and "go home" are mixed in to control typical browser operations (similar to the usual browser buttons). This process typically takes about a second. At the same time, the Web document is described to the user, as discussed later. The user may then speak a navigation or browser command phrase to control browsing. Each navigation command takes the user to a new Web page. If the command is ambiguous, the dialog manager collects the possible interpretations into a description list and asks the user to choose one.
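The grammar generation step described above can be sketched as follows. This is a simplified illustration, not the system's actual algorithm: it assumes that "combinations of the words" means contiguous sub-phrases of each anchor title, and the function-word list, the `cmd:` action labels and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

FUNCTION_WORDS = {"and", "or", "the", "a", "an", "of", "to", "in"}
BROWSER_COMMANDS = {"go back": "cmd:back", "go home": "cmd:home"}


def sub_phrases(title: str) -> Set[str]:
    """All contiguous word runs of a title, except runs made up
    only of function words (which would be meaningless alone)."""
    words = title.lower().split()
    phrases = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            chunk = words[start:end]
            if all(w in FUNCTION_WORDS for w in chunk):
                continue
            phrases.add(" ".join(chunk))
    return phrases


def build_grammar(anchors: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every speakable phrase to the URLs or commands it may mean."""
    grammar: Dict[str, Set[str]] = defaultdict(set)
    for title, url in anchors:
        for phrase in sub_phrases(title):
            grammar[phrase].add(url)
    for phrase, action in BROWSER_COMMANDS.items():
        grammar[phrase].add(action)
    return dict(grammar)


def interpret(grammar: Dict[str, Set[str]], utterance: str) -> List[str]:
    """Return all interpretations; more than one means the dialog
    manager must ask the user to choose."""
    return sorted(grammar.get(utterance.lower().strip(), set()))


if __name__ == "__main__":
    g = build_grammar([("World News", "/world"),
                       ("Sports News", "/sports"),
                       ("The Weather", "/weather")])
    print(interpret(g, "news"))
    print(interpret(g, "the weather"))
```

An utterance that maps to several targets, such as "news" here, is exactly the ambiguous case that the dialog manager resolves by listing the candidates.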


CHAPTER 7 DOCUMENT PROCESSING


Document analysis is performed in the HTML parser, grammar generator, and Hyper Voice processor modules. The typical HTML Web page is first parsed into a list of elements based mostly on the HTML tag structure. Some elements are aggregations (tables, for instance), but the element list is not a full parse tree, which we found was not needed and in some cases actually complicates processing. Images, tables, forms and most text structure elements such as paragraphs are recognized and processed according to their type. Much of the effort in building a robust HTML processor goes into dealing with malformed HTML, such as unclosed tag scopes and overlapping tag scopes. Unfortunately, space does not allow this issue to be addressed fully here.

The location of each image is announced along with any associated caption. This feature can be disabled on a site-by-site basis when the user does not want to hear about images.

Tables are first classified according to purpose, either layout or content. Most tables are actually used for page layout, which can be recognized by the variety and types of data contained in the table cells. Data tables are processed by a parser according to one of a set of table model formats that Phone Browser recognizes. This primarily provides a simple way of reading the table contents row by row, which is often not very satisfying. Alternatively, a transcoder can be used to reconstruct the table in sentential format.

While large-vocabulary dictation speech systems are available, most require speaker training to achieve sufficiently high accuracy for most applications. Phone Browser is intended to be immediately usable without training, so dictation is not yet supported. This also means that creating arbitrary text for messaging is not yet supported.

One additional type of form input is an extension to HTML. A GSL (Grammar Specification Language) or JSGF (Java Speech Grammar Format) specification can be inserted into an HTML anchor using an attribute tag (currently LSPSGSL). Using this method, an application can specify an elaborate input grammar that allows many possible sentences to address the associated hyperlink, and can construct a GET-type form response in which the QUERY_STRING element is built by inserting the text of the speech recognition result. Grammar specifications written this way may represent many thousands of possible sentence inputs, giving the end user great speaking flexibility.
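The first step of this chapter, parsing a page into a flat list of elements rather than a full parse tree, can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the Phone Browser parser: the `FlatPageParser` class and the `(kind, text, target)` tuple layout are hypothetical, and only text, links and images are handled.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Element = Tuple[str, str, str]  # (kind, text, target)


class FlatPageParser(HTMLParser):
    """Collects text, links and images into one flat list."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Element] = []
        self._href: Optional[str] = None   # set while inside <a href=...>
        self._link_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self._href = attr["href"]
            self._link_text = []
        elif tag == "img":
            alt = attr.get("alt") or ""
            if self._href is not None:
                # ALT text of an image inside a link names the link.
                self._link_text.append(alt)
            else:
                self.elements.append(("image", alt, attr.get("src") or ""))

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join(" ".join(self._link_text).split())
            self.elements.append(("link", title, self._href))
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        if self._href is not None:
            self._link_text.append(text)
        else:
            self.elements.append(("text", text, ""))


def parse_page(html: str) -> List[Element]:
    """Parse an HTML string into a flat element list."""
    parser = FlatPageParser()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = ('<p>Top stories<a href="/world">World <b>News</b></a>'
              '<img src="map.png" alt="Map"></p>')
    for element in parse_page(sample):
        print(element)
```

The anchor titles collected this way are the input that grammar generation needs, and the parser does not stop on unclosed tags, which matters given how much real HTML is malformed.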


CHAPTER 8 DOCUMENT-RENDERING
8.1 Rendering Definition: In information technology, the term rendering refers to how information is presented according to the medium: for example, displayed graphically on a screen, read aloud by an audio device, or printed on a piece of paper. In the context of the voice/audio Internet, Web content rendering entails translating information originally intended for visual presentation into a format more suitable for audio.

8.2 Rendering Problem: Computers possess certain superhuman attributes that far outstrip those of mortal man; most notable are their computational capabilities. The common business spreadsheet is a testament to this fact. Other, seemingly more mundane tasks, however, present quite a conundrum for even the most sophisticated of processors. Designing a high-speed, special-purpose computer capable of defeating a grandmaster at chess took the computing industry over 50 years to perfect. Employing strategic thinking is not a computer's forte, because in all the logic embodied in its digitized ones and zeroes there is no inherent cognitive thought. This one powerful achievement of the brain, along with our ability to feel and express emotion, separates the human mind from its computerized equivalent, the central processing unit (CPU).

The relevance of cognitive thought to text rendering may not be immediately obvious, but it is one of the major challenges faced when taking information designed for one medium and rendering it in another. This is because there are no hard and fast objective rules to follow. Computers are very good at following instructions that can be reduced to objective decision points. They are not so good when value judgments are involved. A human being can readily distinguish a cat from a dog, or a relevant news link on a Web page from a link for an advertisement.

Solving the Problem
To solve the rendering problem, some intelligent techniques must be applied. The relevant data must be selected, navigated to its conclusion, and reassembled for presentation in a different medium. All of this must be done for all web pages, dynamically, in real time and in an automated fashion. We have used an Intelligent Agent (IA) that uses various intelligence techniques, including artificial intelligence.

Upon selecting an item of interest, it is common to have to navigate to another Web page to read all the data of interest (just as when following a story in a newspaper). To do so we click on a Web link. When following a page link, the problem of continuity of thought arises, because the newly linked page almost certainly contains data in addition to the thread of information we are attempting to follow. In order to maintain continuity with the item from the previous page, a contextual correlation must be made. Once again, this cognitive process poses a formidable challenge for the computer and requires the application of Intelligent Agent (or artificial intelligence) principles to solve.

The first step involves dynamically removing all the programming constructs and coding tags that instruct a Web browser how to render the data visually. HTML, CHTML, XML, and other languages are typically used for this purpose. Because the data is now being rendered to a different medium, these tags no longer serve any purpose.

It is doubtful that every single data item on a page will be read. Just as when reading a newspaper, we read only items of interest and generally skip advertisements completely. Thus, the important information on a page must be rendered automatically, and when a topic is selected, only the relevant information from the linked page corresponding to that topic needs to be presented. Rendering is achieved by using Page Highlights (a method to find and speak the key contents of a page), finding the right and only the relevant contents on a linked page, assembling those contents, and providing easy navigation.

Finding and Assembling Relevant Information

To find relevant information, the Intelligent Agent (IA) uses various deterministic and non-deterministic algorithms that employ contextual and non-contextual matches, semantic analysis, and learning. This is again very similar to how we use our eyes and brain to find the relevant contents. To ensure real-time performance, the algorithms are simplified as needed while still producing very satisfactory results. Once the relevant contents are determined, they are assembled in an order that makes sense when listened to in audio or viewed on a small screen. A content-rich page with a small number of links makes rendering and navigation easy, since there are only a few choices and one can quickly select a particular topic or section.

The two key media for rendering are audio, using any phone, and visual, using a cell phone screen or PDA. There is good synergy between these two modes from a rendering standpoint. Both need a small amount of meaningful information at a time, which can be heard or viewed with ease and navigated easily. This is achieved by using the Page Highlights mentioned above and by finding relevant contents a column at a time, as we do when we read a newspaper or website. The same column of text can be shown on a small screen, which can comfortably display a column but not a whole page. The contents are then scrolled automatically at various speeds so that they can be viewed and absorbed with ease. The result is a Micro Browser, or true wireless Internet, that needs no rewrite of the website and presents contents in a meaningful way.
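The "contextual match" used to find relevant contents on a linked page can be illustrated with a deliberately simple stand-in: keep only the paragraphs that share keywords with the text of the link the user selected. The Intelligent Agent's real algorithms are not specified in the text, so the stop-word list, the overlap threshold and the function names below are assumptions.

```python
import re
from typing import List, Set

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is"}


def keywords(text: str) -> Set[str]:
    """Lower-cased words of a text, with common stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS}


def relevant_paragraphs(link_text: str, paragraphs: List[str],
                        min_overlap: int = 1) -> List[str]:
    """Keep paragraphs sharing at least min_overlap keywords with the
    selected link, preserving their original order on the page."""
    wanted = keywords(link_text)
    return [p for p in paragraphs
            if len(wanted & keywords(p)) >= min_overlap]


if __name__ == "__main__":
    linked_page = [
        "Subscribe now for half price.",
        "A powerful storm reached the coast on Monday.",
        "Residents were evacuated before the storm arrived.",
        "Markets closed higher today.",
    ]
    for paragraph in relevant_paragraphs("Storm hits the coast", linked_page):
        print(paragraph)
```

Keeping the original order matters for audio: the surviving paragraphs are read as one continuous story, with the advertisement and the unrelated item dropped.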


CHAPTER 9 APPLICATIONS
This technology can find applications for service providers, businesses and governments in the following areas:

For Service Providers
1. Surf and browse the Web
2. Email (send, receive, compose, copy, forward, reply, delete and more)
3. Search the Web
4. Voice portal features such as news, weather, stock quotes, horoscopes and more

For Businesses
1. Airline reservations and tracking
2. Package tracker
3. Reservations
4. E-commerce
5. Customer service
6. Alert service
7. Order confirmation
8. CRM applications

For Governments
1. All key benefits for businesses, plus easy accessibility of all government content, translation of Internet content from English to any other language and vice versa, and Internet access for elderly, visually impaired and blind people in a very simple and cost-effective way


Future Work
The implementation and results of this proposed technology depend on developments in the fields of speech recognition and artificial intelligence. The major challenge of the technology is its complexity, which comes with a high price tag, but this is not much greater than the complexity and cost involved in voice-enabling a web site using present technologies. We believe that voice-based interface options will become an important part of the overall solution for accessing Internet content. An automated approach to voice-enabling sites or creating voice portals would be more practical, and would become more common, than rewriting web contents in different languages and maintaining multiple versions of each web site. The technology can also support more efficient communication between government and citizens, between government and businesses, and between government departments.


CONCLUSION
If a voice browser is to converse with the user, then a description, either explicit or derived and implicit, must exist for the underlying system to "render" into a dialog. Ultimately, it will be up to solution providers to take an inventory of the existing content (if any), development tools, data-access requirements, deployment platforms, and application goals such as cost, security, richness and robustness before they can decide what technology to use. More likely than not, for the time being, multiple content types will be required to deliver the most natural experience on each type of browsing device. This is both a technical limitation and a consequence of users' expectations that the latest and greatest attributes of each modality will be featured in their applications.


