Altamash R. Jiwani, Government College of Engineering Amravati.
Abstract:
The amount of information on the web is growing rapidly, and with it the web creates new challenges for information retrieval, as does the growing number of new users who are inexperienced in the art of web research.

A search engine is an information retrieval system designed to help find information stored on the World Wide Web. It allows one to ask for information on the basis of specific criteria and retrieves a list of items that match those criteria. This list is sorted with respect to some measure of relevance of the results.

In this paper, I present the Google search engine, which has become a prototype of a large-scale search engine. It is in heavy use today and will be used far more in the near future, which can be estimated from the fact that Google has an indexed database of at least 24 million pages.

This paper mainly covers "How Google works?", which includes Google hardware architecture, servers, what Google indexes, features and limitations, Google ranking principles and tips, the Googleplex and a lot more.

Everybody is running after this amazing thing, which has changed the way we surf the net. Technically, they are devising new ways to get a high Google PageRank. This paper provides a lot of tricks and hacks for increasing your page rank in the world's most popular search engine.

Why is Google considered in this paper?

Because Google is the most popular large-scale search engine, and it addresses many of the problems of existing systems. It makes heavy use of the additional structure present in hypertext to provide much higher quality search results. Some of its features include fast crawling technology to gather web documents and keep them up to date, efficient use of storage space to store indices, and a query system with minimal response time. In short, it is "the best navigation service": instead of making things easier for the computer, it makes things easier for the user and makes the computer work harder.

As Google users, we are familiar with the speed and accuracy of a Google search. How exactly does Google manage to find the right results for every query as quickly as it does? All questions like this are answered in this paper.

There is something deeper to learn about Google, like a mystery waiting to be solved. An example is that Google is a company that has built a single very large, custom computer comprising 100,000 servers. It runs their own cluster operating system. They make their big computer even bigger and faster each month while lowering the cost of CPU cycles, making an efficient system with a unique combination of advanced hardware and software. Google has taken the last 10 years of systems software research out of university labs and built their own proprietary, production-quality system. What will they do next with the world's biggest computer and most advanced operating system? That still remains a mystery.

TYPES OF SEARCH ENGINES

There are basically three types of search engines:
1) Those that are powered by robots (called crawlers, ants or spiders).
2) Those that are powered by human submissions.
3) Those that are a hybrid of the two.

Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags and also follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information to a central depository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed.

Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In both cases, when you query a search engine to locate information, you are actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why sometimes a search on a commercial search engine, such as Yahoo! or Google, will return results that are, in fact, dead links: the index can still list a page that has since been moved or removed.

Why will the same search on different search engines produce different results? Part of the answer is that not all indices are going to be exactly the same; it depends on what the spiders find or what the humans submitted. But more important, not every search engine uses the same algorithm to search through the indices.

Google developers:
Larry Page, Co-Founder & President, Google Products
Sergey Brin, Co-Founder & President, Google Technology

Google's Hardware:
To provide sufficient service capacity, Google's physical structure consists of clusters of computers situated around the world, known as server farms. These server farms consist of a large number of commodity-level computers running Linux-based systems that operate with GFS, the Google File System.

It has been speculated that Google has the world's largest computer. The estimate states that Google has up to:
➢ 899 racks
➢ 79,112 machines
➢ 158,224 CPUs
➢ 316,448 GHz of processing power
➢ 158,224 GB of RAM
➢ 6,180 TB of hard drive space

How Google Handles Search Queries?

Let's see how Google processes a query.

When a user enters a query into the search box at Google.com, it is randomly sent to one of many Google clusters. The query will then be handled solely by that cluster. A load balancer that is monitoring the cluster then spreads the request out over the servers in the cluster to make sure the load on the hardware is even.

Then the following process is done:
✓ Determine the documents pointed to by the keywords.
✓ Sort these documents using each one's PageRank.
✓ Provide links to these documents on the Web.
✓ Provide a link to view the cached version of the document in the doc server farm.
✓ Pull an excerpt from the page, using the cached version of the page, to give a quick idea of what it is about.
✓ Return an initial result set of document excerpts and links, with links to retrieve further result sets of matches, rendered as HTML.
✓ By default, Google returns results in sets of ten matches (as an HTML page).
✓ You can change the number of results you want to see on the Google Preferences page.

Google prides itself on the fact that most queries are answered in less than half a second. Considering the number of steps involved in answering a query, you can see that this is quite a technological feat.

The PageRank System

The PageRank algorithm is used to sort pages returned by a Google search request. It is probable that Google uses a variation of the originally published algorithm.
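To make the query-handling steps concrete, here is a minimal Python sketch of looking up the documents for the keywords, sorting them by PageRank and returning a result set of ten excerpts. It is only an illustration under stated assumptions, not Google's implementation: the names Page, build_index, excerpt and search are hypothetical, the PageRank values are taken as given, and the sketch assumes a document must contain every keyword to match.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str        # stands in for the cached copy of the page
    pagerank: float  # assumed to have been computed elsewhere


def build_index(pages):
    """Build an inverted index: word -> set of URLs whose text contains it."""
    index = {}
    for page in pages:
        for word in set(page.text.lower().split()):
            index.setdefault(word, set()).add(page.url)
    return index


def excerpt(text, limit=60):
    """Pull a short excerpt from the cached text."""
    return text if len(text) <= limit else text[:limit].rstrip() + "..."


def search(query, pages, index, start=0, per_page=10):
    """Return one result set of (url, excerpt) pairs, highest PageRank first."""
    words = query.lower().split()
    if not words:
        return []
    # Determine the documents pointed to by the keywords
    # (assumption: every keyword must appear in the document).
    matches = set.intersection(*(index.get(word, set()) for word in words))
    # Sort these documents using each one's PageRank.
    by_url = {page.url: page for page in pages}
    ranked = sorted((by_url[url] for url in matches),
                    key=lambda page: page.pagerank, reverse=True)
    # Return links and excerpts in sets of ten by default.
    return [(page.url, excerpt(page.text))
            for page in ranked[start:start + per_page]]


if __name__ == "__main__":
    # Made-up pages and PageRank values, for illustration only.
    pages = [
        Page("http://a.example/", "google search engine basics", 0.4),
        Page("http://b.example/", "search tips", 0.9),
        Page("http://c.example/", "google search tips and tricks", 0.7),
    ]
    index = build_index(pages)
    for url, snippet in search("google search", pages, index):
        print(url, "-", snippet)
```

Passing a larger start value retrieves the next result set, which corresponds to the "links to retrieve further result sets" step in the list.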
PageRank, named after Larry Page, who came up with it, is one of the ways in which Google determines the importance of a page, which in turn decides where the page will turn up in the results list.

PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is.

The crucial element that makes PageRank work is the nature of the Web itself, which depends almost solely on the use of hyperlinking between pages and sites. In the system that makes Google's PageRank algorithm work, links are a Web popularity contest: if Webmaster A thinks Webmaster B's site has good information (or is cool, or looks good, or is funny), Webmaster A may decide to add a link to Webmaster B's site.

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That's the equation that calculates a page's PageRank. It is the original one, published when PageRank was being developed.

In the equation, 't1 - tn' are the pages linking to page A, 'C' is the number of outbound links that a page has and 'd' is a damping factor, usually set to 0.85.

We can think of it in a simpler way:

a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

This equation shows that a website has a maximum amount of PageRank that is distributed between its pages by internal links. The maximum amount of PageRank in a site increases as the number of pages in the site increases. The more pages that a site has, the more PageRank it has.

Let's consider a 3 page site (pages A, B and C) with no links coming in from the outside. The site's maximum PageRank is the total amount of PageRank in the site. Consider the PageRank for the web pages as:

Page A = 0.15
Page B = 1
Page C = 0.15

Page A has "voted" for page B and, as a result, page B's PageRank has increased. This is looking good for page B. These figures are consistent with a site in which the only link runs from page A to page B and every page starts with a PageRank of 1: after the first pass, pages A and C, which have no inbound links, drop to 1 - 0.85 = 0.15, while page B receives 0.15 + 0.85 * 1 = 1.

After 100 iterations the figures are:

Page A = 0.15
Page B = 0.2775
Page C = 0.15

Since page A now holds only 0.15, page B settles at 0.15 + 0.85 * 0.15 = 0.2775. The total PageRank in the site is now (0.15 + 0.15 + 0.2775) = 0.5775, well below the total of 3 that three fully interlinked pages would hold. Hence you can see that very little linking or poor linking decreases your PageRank.

PageRank is also displayed on the toolbar of your browser if you've installed the Google toolbar (http://toolbar.google.com/).

Google's web crawler: The Googlebot

A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.

Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

There are 2 methods that the Googlebot uses to find a web page:
➢ Either it reaches the webpage after crawling through links,
➢ Or it goes to the page after it has been submitted by the webmaster to www.google.com/addurl.html

By submitting the base link of the site, for example http://wiki.media-culture.org.au/, the Googlebot will go through all links in the index page and every subsequent page, until the entire site has been indexed.

Google File System:
The Google File System is a proprietary file management system developed by Sanjay Ghemawat, Shun-Tak Leung and Urs Holzle for Google as a means to handle the massive number of requests over a large number of server clusters.

The system was designed, like most other distributed file systems, for maximum performance, to handle the large number of users; scalability, to be able to handle inevitable expansions; reliability, to ensure maximum uptime; and availability, to ensure computers are available to handle queries.

Because of Google's decision to use a large number of commodity-level computers instead of a smaller number of server-type systems, the Google File System had to be designed to handle system failures, which resulted in it being designed for constant monitoring of systems, error detection, fault tolerance and automatic recovery. That meant that the clusters would have to hold multiple replicas of the information created by Google's web crawlers.

Because of the size of the Google database, the system had to be designed to handle huge multi-gigabyte files totaling many terabytes in size.

GFS helps ensure that Google has maximum control over the system, while at the same time allowing the system to stay flexible.
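As a rough illustration of why multiple replicas give fault tolerance and automatic recovery, the following Python sketch keeps each chunk of data on several servers, so that a read still succeeds after one machine fails, and then re-copies the data to restore the replica count. This is not GFS's actual design or API: the class ReplicatedStore, its methods, the server names and the default of three replicas are all assumptions made for the example.

```python
class ReplicatedStore:
    """Toy store that keeps every chunk on several servers (replicas)."""

    def __init__(self, server_names, replicas=3):
        if replicas > len(server_names):
            raise ValueError("need at least as many servers as replicas")
        # server name -> {chunk_id: data}
        self.servers = {name: {} for name in server_names}
        self.alive = {name: True for name in server_names}
        self.replicas = replicas

    def _live(self):
        """Monitoring step: list the servers currently marked as alive."""
        return [name for name in self.servers if self.alive[name]]

    def write(self, chunk_id, data):
        """Store the chunk on the first `replicas` live servers."""
        targets = self._live()[: self.replicas]
        if not targets:
            raise RuntimeError("no live servers")
        for name in targets:
            self.servers[name][chunk_id] = data
        return targets

    def read(self, chunk_id):
        """Fault tolerance: serve the chunk from any live replica."""
        for name in self._live():
            if chunk_id in self.servers[name]:
                return self.servers[name][chunk_id]
        raise KeyError(chunk_id)

    def fail(self, name):
        """Simulate a commodity machine dying and losing its disk."""
        self.alive[name] = False
        self.servers[name].clear()

    def copies(self, chunk_id):
        """Count the live servers that hold the chunk."""
        return sum(1 for name in self._live() if chunk_id in self.servers[name])

    def recover(self):
        """Automatic recovery: re-copy chunks that have too few live replicas."""
        live = self._live()
        chunks = {}  # chunk_id -> (data, [servers holding it])
        for name in live:
            for chunk_id, data in self.servers[name].items():
                chunks.setdefault(chunk_id, (data, []))[1].append(name)
        for chunk_id, (data, holders) in chunks.items():
            for name in live:
                if len(holders) >= self.replicas:
                    break
                if name not in holders:
                    self.servers[name][chunk_id] = data
                    holders.append(name)


if __name__ == "__main__":
    store = ReplicatedStore(["s1", "s2", "s3", "s4"])
    store.write("chunk-1", b"crawled pages")
    store.fail("s1")                 # one machine dies
    print(store.read("chunk-1"))     # still readable from another replica
    store.recover()                  # replica count is restored on s4
    print(store.copies("chunk-1"))
```

The sketch keeps everything in memory on purpose; the point is only that with several copies, a single machine failure costs neither data nor availability.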
More on Google

Google almost certainly knows more about you than you would tell your mother. Did you ever search for information about Aids, cancer, mental illnesses or bomb-making equipment? Google knows, because it has put a unique reference number in a permanent cookie on your hard drive (which doesn't expire until 2038). It also knows your internet (IP) address.

Google's privacy policy says that it "notes and saves information such as time of day, browser type, browser language, and IP address with each query. That information is used to verify our records and to provide more relevant services to users. For example, Google may use your IP address or browser language to determine which language to use when showing search results or advertisements."

If you add the Google Toolbar to your Windows browser, then it can send Google information about the pages you view, and Google can update the Toolbar code automatically, without asking you.

Searching Tips and Tricks for Google:

✓ If you want to search Google for the words that will be on the page you want, not for a description of the page or website, then type your query inside square brackets, e.g. [your query].

✓ To limit the scope of a search to a particular file type, use the file type syntax (filetype:). For example, filetype:ppt google finds mentions of Google in PowerPoint slides. Other formats include .pdf (Adobe Acrobat), .doc (Word) and .xls (Excel).

✓ You can use an asterisk (*) as a wildcard.
Example: "George * Bush" finds George W. Bush.
Example: "To * * * to be" finds "To be or not to be".
You can also use this strategy to find email addresses: "email * * <domain>".

✓ To find out who links to a Web page, use the link (link:) syntax. The search link:www.virtualchase.com would perform a reverse link search on the URL mentioned. This is useful to see how popular your site is.

✓ Use quotation marks " " to locate an entire string, e.g. "bill gates conference" will only return results with that exact string.

✓ Mark essential words with a '+'. If a search must contain certain words or phrases, mark them with a + symbol, e.g. +"bill gates" conference will return all results containing "bill gates" but not necessarily those pertaining to a conference.

✓ Negate unwanted words with a '-'. You may wish to search for the term bass, pertaining to the fish, and be returned a list of music links as well. To narrow down your search a bit more, try: bass -music. This will return all results with "bass" and NOT "music".

✓ site:www.cwire.org
This will search only pages which reside on this domain.

✓ related:www.cwire.org
This will display all pages which Google finds to be related to your URL.

✓ spell:word
Runs a spell check on your word.

✓ define:word
Returns the definition of the word.

✓ stocks: [symbol, symbol, etc]
Returns stock information, e.g. stocks: msft

✓ maps:
A shortcut to Google Maps.

✓ phone: name_here
Attempts to look up the phone number for a given name.

✓ cache:
If you include other words in the query, Google will highlight those words within the cached document. For instance, cache:www.cwire.org web will show the cached content with the word "web" highlighted.

✓ info:
The query [info:] will present some information that Google has about that web page. For instance, info:www.cwire.org will show information about the CyberWyre homepage.

✓ weather:
Used to find the weather in a particular city, e.g. weather: new york

✓ allinurl:
If you start a query with [allinurl:], Google will restrict the results to those with all of the query words in the URL. For instance, [allinurl: google search] will return only documents that have both "google" and "search" in the URL.

✓ inurl:
If you include [inurl:] in your query, Google will restrict the results to documents containing that word in the URL. For instance, [inurl:google search] will return documents that mention the word "google" in their URL, and mention the word "search" anywhere in the document (URL or not).

✓ allintitle:
If you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both "google" and "search" in the title.

✓ intitle:
If you include [intitle:] in your query, Google will restrict the results to documents containing that word in the title. For instance, [intitle:google search] will return documents that mention the word "google" in their title, and mention the word "search" anywhere in the document (title or not). Note there can be no space between the "intitle:" and the following word.

✓ allinlinks:
Searches only within links, not text or title.

✓ allintext:
Searches only within text of pages, but not in the links or page title.

Steps to get a high Google Pagerank:

Server Speed: Your website pages must be downloaded nearly at the speed of light. Yes, it's true: Google gives more visibility to websites that are resident on fast servers.

Site Updating: Googlebot has the ability to check out WHEN your pages have been uploaded to the server. Here's a simple hack: upload all your pages every day, even if nothing has changed.

Lots of light HTML pages: Google adores simple websites with hundreds of pages. If you are building a page that (because of its extensive contents) is going to be larger than 50K, split it into two or three pages.

Start out slowly. If possible, begin with a new site that has never been submitted to the search engines or directories. Choose an appropriate domain name, and start out by optimizing just the home page.

Learn basic HTML. Many search engine optimization techniques involve editing the behind-the-scenes HTML code. Your high rankings can depend on knowing which codes are necessary, and which aren't.

Choose keywords wisely. The keywords you think might be perfect for your site may not be what people are actually searching for. To find the optimal keywords for your site, use tools such as WordTracker.

Create a killer Title tag. HTML title tags are critical because they're given a lot of weight with all of the search engines. You must put your keywords into this tag and not waste space with extra words. Do not use the Title tag to display your company name or to say "Home Page." Think of it more as a "Title Keyword Tag" and create it accordingly. Add your company name to the end of this tag, if you must use it.

Create Meaty Meta tags. Meta tags can be valuable, but they are not a magic bullet. Create a Meta Description tag that uses your keywords and also describes your site. The information in this tag often appears under your Title in the search engine results pages.

Use extra "goodies" to boost rankings. Things like headlines, image alt tags, header tags (<H1>, <H2>, etc.), links from other pages, keywords in file names, and keywords in hyperlinks can cumulatively boost search engine rankings. Use any or all of these where they make sense for your site.

Don't expect quick results. Getting high rankings takes time; there's no getting around that fact. Once your site is added to a search engine or directory, its ranking may start out low and then slowly work its way up the ladder. Google measures "click-through popularity," i.e., the more people that click on a particular site, the higher its ranking will go. Be patient and give your site time to mature.

Sites that use frames. Search engines don't index framed sites very well, so if frames aren't necessary, simply remove them for the better ranking of your sites.

Sites that use Dynamic URLs. Most search engines don't list any dynamic URLs from database-driven or script-running sites. If your URL contains any of the following elements:
? & % + = $ cgi-bin .cgi
then it is considered a dynamic URL by search engines. The solution is to make static pages, with URLs not having elements from the provided list.

Sites that use Flash. Flash itself is not the problem; unprofessional use of it is what wastes the effort. Mostly homepages, intro pages, etc. are built using Flash to provide a cool and interactive impact in the browser. But the problem arises when most or all of the page is Flash dependent, because search engines don't index Flash. Another major problem is that links inside Flash can't be crawled by search engines, so the linked pages can't be indexed. The solution in such cases is to provide highly effective TITLE and META tags. To solve the links problem, the standard way is to create a sitemap and link it to each web page via a standard HTML hyperlink tag.

Sites that use Image Maps for navigation. The problem arising from these is that the link crawlers of search engines mostly get jumbled over the image maps and don't spider most of the links. You can overcome it by adding an alternate simple navigational menu or by the standard technique of a sitemap.

Sites that use JavaScript for navigation. Search engines mostly don't follow the links provided in the JavaScript. You can overcome it by adding an alternate simple navigational menu or by the standard technique of a sitemap.

Eventually, you'll see the fruits of your labor, i.e. your site's listing in Google and the rest of the search engines!
Conclusion:
Google is now the world's most powerful website. Google's mission: organize the world's information and make it universally accessible and useful. Google said, "We're about not ever accepting that the way something has been done in the past is necessarily the best way to do it today." But nobody believed it. Now they have done far more than that.

The Google web site is powered by some amazing technology. People often ask, "What are you working on? Isn't search a solved problem?" And all this is because of Google, i.e. the master of the internet, which averages about 250 million searches per day.