
How Does the Google Search Engine Work?

IRWM: Assignment 1

Ganesh B. Solanke
1. Crawling and Indexing
Google navigates the web by crawling: it follows links from page to page and then sorts pages by their content and other factors. Google keeps track of it all in the index.

How Search Works

These processes lay the foundation: Google gathers and organizes information on the web so it can return the most useful results to users. Its index is well over 100,000,000 gigabytes, and Google has spent over one million computing hours building it.

Fig. High-level Google architecture

Finding information by crawling (Googlebot)

It uses software known as “web crawlers” to discover publicly available web pages. The most
well-known crawler is called “Googlebot”. Crawlers look at web pages and follow links on
those pages, much like we would if we were browsing content on the web. They go from link to
link and bring data about those web pages back to Google’s servers.

The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As its crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites and dead links. Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. Google does not accept payment to crawl a site more frequently for its web search results; it cares more about having the best possible results because, in the long run, that is what is best for users and, therefore, for its business.
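To make the crawl loop concrete, the sketch below shows a minimal breadth-first crawler in Python: it starts from a list of seed URLs, fetches each page, extracts its links, and queues newly discovered addresses. The function names, page limit and politeness delay are illustrative assumptions; this is not Googlebot's actual implementation.

# Minimal crawler sketch: breadth-first fetching of pages starting from seed URLs.
# The page limit and politeness delay below are illustrative assumptions.
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=10, delay=1.0):
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)                # avoid fetching the same URL twice
    pages = {}                           # URL -> raw HTML (a stand-in for the store server)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # dead link or fetch error: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)          # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)                # simple politeness delay between fetches
    return pages


# Example use with a placeholder seed URL:
# pages = crawl(["https://example.com/"])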



In Google, web crawling (the downloading of web pages) is done by several distributed crawlers, which browse the World Wide Web using many computers. A URL server sends lists of URLs (uniform resource locators) to be fetched to the crawlers. The web pages that are fetched are then sent to the store server, which compresses and stores the web pages in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
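As a rough illustration of the forward index and its hits (the dictionary layout below is an assumption chosen for readability, not the actual barrel format), a document can be reduced to a list of word-occurrence records keyed by docID:

# Sketch of a forward index keyed by docID, where each entry is a list of "hits".
# The hit fields are a simplified stand-in for the word/position/font-size/
# capitalization records described above.
import re


def make_hits(text):
    """Turn a document's text into a list of simplified hit records."""
    hits = []
    for position, word in enumerate(re.findall(r"\w+", text)):
        hits.append({
            "word": word.lower(),                  # the word itself
            "position": position,                  # position in the document
            "capitalized": word[0].isupper(),      # stand-in for presentation info
        })
    return hits


def build_forward_index(documents):
    """documents: dict mapping docID -> raw text. Returns docID -> list of hits."""
    return {doc_id: make_hits(text) for doc_id, text in documents.items()}


docs = {1: "Google crawls the Web", 2: "The Web is crawled by Googlebot"}
forward_index = build_forward_index(docs)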

The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
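What the sorter accomplishes can be sketched as regrouping the forward index, keyed by docID, into posting lists keyed by word. In the toy Python sketch below, word strings stand in for wordIDs and plain dictionaries stand in for barrels; both are simplifying assumptions.

# Sketch of inverting a forward index (docID -> hits) into an inverted index
# (word -> sorted list of (docID, position) postings).
from collections import defaultdict


def invert(forward_index):
    inverted = defaultdict(list)
    for doc_id, hits in forward_index.items():
        for hit in hits:
            inverted[hit["word"]].append((doc_id, hit["position"]))
    # Sort each posting list by docID, then position, as a stand-in for the
    # wordID-ordered barrels produced by the sorter.
    for word in inverted:
        inverted[word].sort()
    return dict(inverted)


example_forward = {
    1: [{"word": "web", "position": 0}, {"word": "search", "position": 1}],
    2: [{"word": "web", "position": 0}, {"word": "crawler", "position": 1}],
}
inverted_index = invert(example_forward)
# inverted_index == {"web": [(1, 0), (2, 0)], "search": [(1, 1)], "crawler": [(2, 1)]}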

Organizing information by Indexing

The web is like an ever-growing public library with billions of books and no central filing system. Google essentially gathers the pages during the crawl process and then creates an index, so it knows exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When people search, at the most basic level, Google's algorithms look up the search terms in the index to find the appropriate pages.
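At that basic level, answering a multi-word query amounts to looking up each term's posting list and intersecting the resulting document sets. The sketch below assumes the simplified inverted-index layout from the earlier example; it is not Google's actual lookup code.

# Sketch of the basic lookup step: find the documents that contain every term
# of the query by intersecting posting lists.
def lookup(inverted_index, query):
    """inverted_index: word -> list of (docID, position) postings."""
    terms = query.lower().split()
    doc_sets = []
    for term in terms:
        postings = inverted_index.get(term, [])
        doc_sets.append({doc_id for doc_id, _ in postings})
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


index = {
    "google": [(1, 0), (3, 2)],
    "search": [(1, 1), (2, 0), (3, 3)],
}
print(lookup(index, "Google search"))   # {1, 3}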
The search process gets much more complex from there. Google's indexing systems note many different aspects of pages, such as when they were published, whether they contain pictures and videos, and much more. With the Knowledge Graph, Google is continuing to go beyond keyword matching to better understand the people, places and things people care about. New sites, changes to existing sites and dead links are noted and used to update the Google index.

Choices for website owners

Most websites don’t need to set up restrictions for crawling, indexing or serving, so their
pages are eligible to appear in search results without having to do any extra work. That said, site
owners have many choices about how Google crawls and indexes their sites through Webmaster
Tools and a file called “robots.txt”. With the robots.txt file, site owners can choose not to be crawled by Googlebot, or they can provide more specific instructions about how to process pages
on their sites. Site owners have granular choices and can choose how content is indexed on a
page-by-page basis. For example, they can opt to have their pages appear without a snippet (the
summary of the page shown below the title in search results) or a cached version (an alternate
version stored on Google’s servers in case the live page is unavailable). Webmasters can also
choose to integrate search into their own pages with Custom Search.
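As an illustration of how a crawler can honor these instructions, Python's standard library includes a robots.txt parser. In the sketch below, the site URL and user-agent string are placeholders:

# Sketch: checking robots.txt before crawling a page. The site URL and
# user-agent string are placeholders, not real crawl targets.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()                                   # fetch and parse the file

if robots.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")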

2. Algorithms: Ranking and More


Algorithms are computer programs that look for clues in order to give the user back exactly what he or she wants. They help deliver the best possible results through ranking and other methods. Algorithms are the computer processes and formulas that take the user's questions and turn them into answers. Today, Google's algorithms rely on more than 200 unique signals or clues that make it possible to guess what the user might really be looking for. These signals include things like the terms on websites, the freshness of content, the user's region and PageRank.
The indexing process produces all the pages that include the particular words in a query entered by the searcher, but these pages are not sorted by importance or relevance. The documents are therefore ranked so that the most relevant web pages are provided for the search query entered. Evaluation of relevance is based on factors such as the following (a simple scoring sketch follows the list):
• PageRank.
• Authority and trust of the pages which refer to a page.
• The number of times the keywords, phrases and synonyms of the keywords occur on the page.
• Spamming rate.
• The occurrence of the phrase within the document title or URL (Uniform Resource Locator).
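The toy sketch below combines a few of the factors listed above into a single score. The weights and the exact signal set are invented for the example and say nothing about Google's real ranking formula.

# Toy relevance scoring sketch: combine a few of the factors listed above into
# one number. The weights and signal choices are invented for the example.
def relevance_score(doc, query_terms):
    """doc: dict with 'pagerank', 'title', 'url', 'text', 'spam_rate' fields."""
    text_words = doc["text"].lower().split()
    term_hits = sum(text_words.count(t) for t in query_terms)
    in_title = any(t in doc["title"].lower() for t in query_terms)
    in_url = any(t in doc["url"].lower() for t in query_terms)

    score = 0.0
    score += 10.0 * doc["pagerank"]          # link-based importance
    score += 1.0 * term_hits                 # keyword occurrences on the page
    score += 5.0 if in_title else 0.0        # query term appears in the title
    score += 2.0 if in_url else 0.0          # query term appears in the URL
    score -= 20.0 * doc["spam_rate"]         # penalize spammy pages
    return score


doc = {"pagerank": 0.3, "title": "Web Search", "url": "http://example.com/search",
       "text": "how web search engines rank pages", "spam_rate": 0.0}
print(relevance_score(doc, ["search", "rank"]))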

There are many components to the search process and the results page, and Google constantly updates its technologies and systems to deliver better results. Many of these changes involve exciting new innovations, such as the Knowledge Graph or Google Instant. There are other important systems that Google constantly tunes and refines. This list of projects provides a glimpse into the many different aspects of search. Some of them are:

• Autocomplete: Predicts what the user might be searching for. This includes understanding terms with more than one meaning (a small prefix-matching sketch appears after this list).
• Freshness: Shows the latest news and information. This includes gathering timely results
when you’re searching specific dates.
• Google Instant: Displays immediate results as you type.
• Indexing: Uses systems for collecting and storing documents on the web.
• Mobile: Includes improvements designed specifically for mobile devices, such as tablets
and smart phones.
• Query Understanding: Gets to the deeper meaning of the words you type.
• Refinements: Provides features like “Advanced Search,” related searches, and other
search tools, all of which help you fine-tune your search.
• Safe Search: Reduces the amount of adult web pages, images, and videos in our results.



• Search Methods: Creates new ways to search, including “search by image” and “voice
search.”
• Site & Page Quality: Uses a set of signals to determine how trustworthy, reputable, or
authoritative a source is. (One of these signals is PageRank, one of Google’s first
algorithms, which looks at links between pages to determine their relevance.)
• Spelling: Identifies and corrects possible spelling errors and provides alternatives.
• Synonyms: Recognizes words with similar meanings.
• Translation and Internationalization: Tailors results based on your language and country.
• Universal Search: Blends relevant content, such as images, news, maps, videos, and
your personal content, into a single unified search results page.
• User Context: Provides more relevant results based on geographic region, Web
History, and other factors.
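As a small illustration of the autocomplete idea from the list above, prefix matching over a sorted list of past queries can be done with binary search. The query list and the five-result limit are invented example data, not how Google actually predicts queries.

# Minimal autocomplete sketch: return stored queries that start with the typed
# prefix, using binary search over a sorted list.
import bisect


def autocomplete(sorted_queries, prefix, limit=5):
    start = bisect.bisect_left(sorted_queries, prefix)
    results = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break
        results.append(query)
        if len(results) == limit:
            break
    return results


queries = sorted(["google search", "google maps", "graph theory", "google translate"])
print(autocomplete(queries, "google"))
# ['google maps', 'google search', 'google translate']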

Based on all the above clues, Google pulls the relevant documents from the index and ranks them. After the ranking process, they are returned as query results to the user.

PageRank Algorithm: bringing order to the web

Google's most important feature is PageRank, a method that determines the "importance" of a web page by analyzing which other pages link to it, as well as other data. It is a link analysis algorithm. It is not possible for a user to go through all of the millions of pages presented as the output of a search, so all the pages should be weighted according to their priority and presented in the order of their weights and importance. PageRank is an excellent way to prioritize the results of web keyword searches. PageRank is basically a numeric value that represents how important a web page is on the web, and it is calculated by counting citations or backlinks to a given page.

PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where,
PR(A) - PageRank of page A,
PR(Ti) - PageRank of page Ti, which links to page A,
C(Ti) - total number of outbound links on page Ti,
d - damping factor, which is set between 0 and 1 (commonly 0.85),
N - total number of pages on the web.

The PageRanks form a probability distribution over web pages, so the sum of the PageRanks of all web pages will be one.
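A minimal iterative implementation of the formula above is sketched below: each pass applies the equation to every page until the values stabilize. The example link graph, the iteration count and the damping value of 0.85 are illustrative choices.

# Iterative PageRank sketch following the formula above:
# PR(A) = (1 - d)/N + d * sum(PR(T)/C(T) for each page T linking to A).
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}        # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = [t for t in pages if page in links[t]]
            new_pr[page] = (1 - d) / n + d * sum(pr[t] / len(links[t]) for t in incoming)
        pr = new_pr
    return pr


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}     # invented example link graph
ranks = pagerank(graph)
print(ranks)                  # the values sum to roughly 1.0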

Anchor Text

Anchor text is defined as the visible, highlighted, clickable text that is displayed for a hyperlink in an HTML page. Search engines treat anchor text in a special way: it can influence the rank of the page it points to, and anchors often provide more accurate descriptions of web pages than the pages themselves. Anchors may also exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.



Anchor text is given high weight in search engine algorithms. The main goal of search engines is to return highly relevant search results, and anchor text helps by providing better-quality results.
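A rough sketch of extracting anchor text and crediting it to the page the link points to is shown below. The HTML sample and the dictionary used in place of the anchors file are illustrative assumptions.

# Sketch: extract (target URL, anchor text) pairs from a page and credit the
# text to the *target* document, in the spirit of the anchors file described above.
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None
        self.anchors = []                  # list of (target_url, anchor_text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = urljoin(self.base_url, href)

    def handle_data(self, data):
        if self.current_href and data.strip():
            self.anchors.append((self.current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


html = '<p>See the <a href="/paper.pdf">original PageRank paper</a>.</p>'
parser = AnchorExtractor("http://example.com/")
parser.feed(html)

anchor_index = defaultdict(list)           # target URL -> texts that describe it
for target, text in parser.anchors:
    anchor_index[target].append(text)
print(dict(anchor_index))
# {'http://example.com/paper.pdf': ['original PageRank paper']}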

Other Features:
Google stores location information for all hits (the set of all word occurrences), so it can make extensive use of proximity in searching. It also keeps information about some visual presentation details, such as the font size of words: words in a larger or bolder font are weighted more highly than other words. Finally, the full raw HTML of pages is available in the repository.
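A small sketch of how stored hit positions can feed a proximity measure follows; the scoring rule (the inverse of the smallest gap between two query terms) is invented purely for illustration.

# Proximity sketch: given the stored positions of two query terms in one
# document, score the document higher when the terms occur close together.
def proximity_score(positions_a, positions_b):
    """positions_a, positions_b: lists of word positions of two query terms."""
    best_gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / best_gap if best_gap > 0 else 1.0


# "google" at positions [2, 40] and "search" at positions [3, 95] in one document:
print(proximity_score([2, 40], [3, 95]))   # 1.0 -> the terms appear adjacent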

Result Serving
Results are served to the user in different forms, such as images, audio, text, videos, links, Knowledge Graph entries, snippets, news, thumbnails, voice and more, in about one-eighth of a second. Some of these forms are described below.

• Snippets: Shows small previews of information, such as a page’s title and short descriptive text, about each search result (a toy snippet sketch follows this list).
• Knowledge Graph: Provides results based on a database of real world people, places,
things, and the connections between them.
• News: Includes results from online newspapers and blogs from around the world.
• Answers: Displays immediate answers and information for things such as the weather,
sports scores and quick facts.
• Videos: Shows video-based results with thumbnails so you can quickly decide which
video to watch.
• Images: Shows you image-based results with thumbnails so you can decide which page to
visit from just a glance.
• Books: Finds results out of millions of books, including previews and text, from libraries
and publishers worldwide.
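The toy sketch below shows one simple way a snippet can be produced: take a short window of the page text around the first occurrence of a query term. The window size and sample text are arbitrary assumptions, not Google's snippet logic.

# Toy snippet sketch: show a short window of the page text around the first
# occurrence of a query term.
def make_snippet(text, query_terms, window=40):
    lowered = text.lower()
    for term in query_terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            start = max(0, pos - window)
            end = min(len(text), pos + len(term) + window)
            return "..." + text[start:end].strip() + "..."
    return text[:2 * window] + "..."       # fall back to the start of the page


page_text = ("PageRank is a method that assigns an importance value to every "
             "page on the web based on the pages that link to it.")
print(make_snippet(page_text, ["importance"]))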

3. Fighting Spam
Google fights spam through a combination of computer algorithms and manual review. Spam sites attempt to game their way to the top of search results through techniques like repeating keywords over and over, buying links that pass PageRank, or putting invisible text on the screen. This is bad for search because relevant websites get buried, and it is bad for legitimate website owners because their sites become harder to find. The good news is that Google's algorithms can detect the vast majority of spam and demote it automatically. For the rest, Google has teams who manually review sites.
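As a crude illustration of one such automatic check, the sketch below flags pages in which a single word makes up an abnormally large share of the text, a rough proxy for keyword repetition. The 20% threshold is an arbitrary assumption and not a real spam rule.

# Crude keyword-stuffing check: flag a page when any single word makes up an
# unusually large share of its text.
from collections import Counter


def looks_keyword_stuffed(text, threshold=0.20):
    words = text.lower().split()
    if not words:
        return False
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > threshold


spammy = "cheap watches cheap watches cheap watches buy cheap watches now"
print(looks_keyword_stuffed(spammy))   # True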

Identifying Spam

Spam sites come in all shapes and sizes. Some sites are automatically generated gibberish that no human could make sense of. Google also sees sites using subtler spam techniques.



Taking Action

While Google's algorithms address the vast majority of spam, the rest is addressed manually to prevent it from affecting the quality of search results. The numbers may look large out of context, but the web is a really big place: a recent snapshot of Google's index showed that about 0.22% of domains had been manually marked for removal.

Notifying Website Owners

When Google takes manual action on a website, it tries to alert the site's owner to help him or her address the issues. Google wants website owners to have the information they need to get their sites in shape, which is why it has invested substantial resources in webmaster communication and outreach over time.

Listening for Feedback

Manual actions don't last forever. Once website owners clean up their sites to remove spammy content, they can ask Google to review the sites again by filing a reconsideration request. Google processes all of the reconsideration requests it receives and communicates along the way to let site owners know how things are going. Historically, most sites that have submitted reconsideration requests were not actually affected by any manual spam action; often these sites are simply experiencing the natural ebb and flow of online traffic, an algorithmic change, or perhaps a technical problem preventing Google from accessing site content.

Some facts about Google


• Google has been in the search game a long time and has the highest market share among search engines (about 81%).
• Its web-crawler-based service provides both comprehensive coverage of the web and great relevancy.
• Google is much better than the other engines at determining whether a link is an artificial link or a true editorial link.
• Google gives much importance to sites which add fresh content on a regular basis; this is why it likes blogs, especially popular ones.
• Google prefers informational pages to commercial sites.
• A page on a site, or on a subdomain of a site, with significant age or links can rank much better than it otherwise would, even with no external citations.
• Google has aggressive duplicate-content filters that filter out many pages with similar content.
• Crawl depth is determined not only by link quantity but also by link quality. Excessive low-quality links may make a site less likely to be crawled deeply or even included in the index.
• In addition, users can search for twelve different file formats, cached pages, images, news and Usenet group postings.

