
GOOGLE TALK

Ed Austin 12-09-09

Pre Presentation The Google Philosophy


(according to ed)

- Jedis build their own lightsabres (the MS "Eat your own Dog Food")
- Parallelize Everything
- Distribute Everything (to atomic level if possible)
- Compress Everything (CPU cheaper than bandwidth)
- Secure Everything (you can never be too paranoid)
- Cache (almost) Everything
- Redundantize Everything (in triplicate usually)
- Latency is VERY evil

The Anatomy of the Google Architecture


The unofficial Version
V1.0 November 2009

Ed Austin
{ed, edik} @i-dot.com

Section I The Basic Glue


1. Exterior Network (Perimeter Architecture)
2. Data Centre
3. Rack Characteristics
4. Core Server Hardware
5. Operating System Implementation
6. Interior Network Architecture

[Architecture stack diagram, top to bottom: GOOGLE APP ENGINE / GOOGLE APPS (SEARCH, INDEX, CRAWL, GMAIL...); Python, Java, C++, Sawzall, other; GWQ; Mapreduce, BigTable, Chubby Lock; GFS / GFS II; INTERIOR NETWORK (IPv6); RHEL 2.6.X PAE; SERVER HARDWARE; RACK; DC; Exterior Network]

THE PERIMETER How does your data enter the Google empire?

Perimeter Network Security (as known)


[Diagram: Client Browser -> 80/443 -> Firewall -> DMZ / Perimeter (DNS, NetScaler, Squid, GWS) -> Firewall -> Cell (Interior Network, GFS II etc)]

DNS: Load Balanced (.COM = 3, UK only one)

    [ed@d800 ~]$ dig google.com
    .....
    ;; ANSWER SECTION:
    google.com. 223 IN A 74.125.45.100
    google.com. 223 IN A 74.125.53.100
    google.com. 223 IN A 74.125.67.100
    [ed@d800 ~]$

NetScaler: http multiplexing
Squid: Reverse Proxy
GWS: Web Server Farm

Possible Search Traffic Path

Based upon known technologies employed (edge routing and instances not shown)

- DNS load balancing splits traffic (country, .com multiple DNS, other X1) to the FW
- Firewall filters traffic (http/s, smtp, pop etc)
- NetScaler load balancers take the request from the FW, block DoS attacks and ping floods, block non-IPv4/6 traffic and non-80/443 ports, and http multiplex (limited caching capability)
- User request forwarded to Squid (Reverse Proxy), probably a HUGE cache (Petabytes?)
- If not in cache, forwarded to GWS (custom C++ web server); now not using custom Apache?
- GWS sends the request to the appropriate internal (Cell) servers
- Request is processed
- Exterior https via Thawte certs
- Dedicated crawler architecture separate from other infrastructure

PERIMETER NETWORK CACHING


- Uses Squid Reverse Proxy
- Perimeter cache hit rates 30-60% = Huge!
- Dependent on search complexity/user preferences/traffic type
- All image thumbnails cached, much multimedia cached
- Expensive common queries cached (common words, e.g. Obama, edinburgh) as they require significant back-end processing
- On cache flush/update there is a big latency spike and capacity drop: index servers need to do significant work to rebuild the cache
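
The perimeter caching behaviour described on this slide can be illustrated with a small least-recently-used cache in front of a slow back end. This is only a sketch under stated assumptions: `PerimeterCache`, `backend_search`, the capacity of 2 and the query list are hypothetical, and nothing here reflects how Squid or Google actually implement caching.

```python
from collections import OrderedDict


class PerimeterCache:
    """Tiny LRU cache standing in for a reverse proxy (hypothetical)."""

    def __init__(self, backend, capacity=2):
        self.backend = backend          # callable: query -> result
        self.capacity = capacity        # maximum number of cached queries
        self.entries = OrderedDict()    # query -> cached result, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        result = self.backend(query)          # expensive back-end work
        self.entries[query] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def backend_search(query):
    """Placeholder for the index servers behind the cache."""
    return f"results for {query}"


cache = PerimeterCache(backend_search)
for q in ["obama", "edinburgh", "obama", "obama", "rare query"]:
    cache.get(q)
print(f"hit rate: {cache.hit_rate():.0%}")
```

After a flush the `entries` map is empty, so every request falls through to `backend_search` until the cache warms up again, which is the latency spike the slide mentions.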

THE DATA CENTRE Where do they store all that Data?

Worldwide Data Centres


Where is Google located? Last estimates were 36 Data Centers, 300+ GFS II clusters and upwards of 800K machines.

- US (#1)
- Europe (#2)
- Asia (#3)
- South America/Russia (#4)
- Australia on hold

Future: Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina.

The Modular Data Centre


Standard Google Modular DC (Cell) holds 1160 servers / 250KW power consumption in 30 racks (40U). This is the atomic Data Centre building block of Google. A Data Centre would consist of 100s of Modular Cells.

DC architecture is then the aggregation of smaller Cell-level infrastructures, each in its own container: some pure GFS, others BT, others Map, some mixed etc.

MDCs can also be deployed autonomously at the perimeter (stand alone).
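
As a quick sanity check on the Modular DC figures, the arithmetic below derives servers per rack and power per server from the three numbers quoted on the slide. The per-server wattage is only an upper bound under the assumption that the whole 250KW goes to servers; any cooling or networking share is unknown.

```python
SERVERS_PER_CELL = 1160      # from the slide
RACKS_PER_CELL = 30          # from the slide
CELL_POWER_WATTS = 250_000   # 250KW, from the slide

servers_per_rack = SERVERS_PER_CELL / RACKS_PER_CELL
# Assumption: all cell power is consumed by servers (an upper bound).
watts_per_server = CELL_POWER_WATTS / SERVERS_PER_CELL

print(f"servers per rack: {servers_per_rack:.1f}")
print(f"watts per server (upper bound): {watts_per_server:.0f}")
```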

THE RACK
How is a server stored in the Data Centre?

Google Rack (GOOG rack)

Why interesting? The rack implementation! EVERYTHING custom!

Mini Server Size
- Old servers are custom 1U
- New servers are 2U... again a custom design, seemingly 1/3 the width of a normal 2U server
- 40U/80U custom racks (50% each side)

Design
- Huge heating and power issues
- Optimized motherboards
- Work closely with HW MB developers
- Have their own HW builds specified to component level

Servers expected to be expendable
- build redundancy on top of failure

Motherboard directly mounted into rack
- servers have no casing: just bare boards, which assists with heat dispersal issues

THE HARDWARE Millions of exactly what?

Server Hardware

2U Low-Cost (but not slow) Commodity Servers

Standard HW Build (several HW build versions at any one time)
- 2009: currently 2-Way, Dual Core/16GB/1-2TB +- standard
- Both Intel/AMD chipsets
- 1 NIC
- 2 USB
- Looks like they RAID1/mirror the disks for better I/O (read performance)
- SATA 7.2K/10K/15K drives?
- 8 x 2GB DDR3 ECC
- Currently at 7th Gen build (1st Gen 2005 was probably Dual Core/SMP)
- Each server has a 12V battery backup and can run autonomously without external power (lasts 20-30s?)
- Work closely with chip manufacturers to improve design/reduce power: custom Intel chips that can withstand higher heat factors than generic versions

YEAR      | Average Server Specification
1999/2000 | PII/PIII 128MB+
2003/2004 | Celeron 533, PIII 1.4 SMP, 2-4GB DRAM, Dual XEON 2.0/1-4GB/40-160GB IDE - SATA disks via Silicon Images SATA 3114/SATA 3124
2006      | Dual Opteron/Working Set DRAM (4GB+)/2x400GB IDE (RAID0?)
2009      | 2-Way/Dual Core/16GB/1-2TB SATA

THE OPERATING SYSTEM


The Core Software on each of those servers

OPERATING SYSTEM

- 100% Redhat Linux based since 1998 inception
- RHEL (Why not CentOS?)
- 2.6.X Kernel
- PAE
- Custom glibc.. rpc... ipvs...
- Custom FS (GFS II)
- Custom Kerberos
- Custom NFS
- Custom CUPS
- Custom gPXE bootloader
- Custom EVERYTHING.....

Kernel/Subsystem Modifications
- tcmalloc replaces glibc 2.3 malloc: much faster! works very well with threads...
- rpc: the rpc layer extensively modified to provide > perf increase, < latency (52%/40%)
- Significantly modified kernel and subsystems, all IPv6 enabled

- Use Python as the primary scripting language
- Deploy Ubuntu internally (likely for the desktop), also Chrome OS base

Easily the World's largest installed Linux base

THE INTERIOR NETWORK How does your data travel around the Google empire?

INTERIOR NETWORK

ROUTING PROTOCOL
- Internal network is IPv6 (exterior machines can be reached using IPv6)
- Heavily modified version of OSPF as the IRP
- Intra-rack network is 100baseT
- Inter-rack network is 1000baseT
- Inter-DC network pipes unknown but very fast

Technology:
- Juniper, Cisco, Foundry, HP routers and switches

Software:
- ipvs (ip virtual server)

THE MAJOR GLUE The three foundation blocks of Google's Secret Sauce

Section II Google's Major Glue

1. Google File System Architecture - GFS II
2. Google Database - Bigtable
3. Google Computation - Mapreduce
4. Google Scheduling - GWQ

GOOGLE FILE SYSTEM Manages the underlying Data on behalf of the upper layers and ultimately the applications

FILE SYSTEM I GFS v1

- The GFS II cell is Google's fundamental building block: everything can be layered on top of this
- Consists of (highly distributed, Linux based) Master Servers and Chunk Servers
- Chunk Servers serve the data in 64MB chunks to the client directly via Master arbitration

DATA REDUNDANCY/FAULT TOLERANCE?
- Triplicate copies of chunks are kept, often in other clusters / DCs
- Chunks can be pulled from outside the DC! Expensive.... and they try not to! However apps built on top of GFS/BT do this on an ad-hoc basis (e.g. Gmail)
- On chunk loss the Master handles the recovery by sourcing a chunk copy
- Data is compressed using BMDiff/Zippy
- Chunk Server fault-tolerance achieved by heart-beat to the Master ("I am alive..")
- Master failure was problematic for Google (finally down from 2 minutes to 10 seconds)
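
The master/chunk server split described on this slide can be sketched as follows: the master only answers "which servers hold this chunk?", and the client then reads from a chunk server directly. The `Master` class, its round-robin placement, the server names and the file name are all hypothetical choices for illustration; the real placement policy is not described in the slides.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, from the slide
REPLICAS = 3                    # triplicate copies, from the slide


def chunk_index(byte_offset):
    """Which chunk of a file holds the given byte offset."""
    return byte_offset // CHUNK_SIZE


class Master:
    """Toy master: maps (file, chunk index) to chunk server names."""

    def __init__(self, chunk_servers):
        self.chunk_servers = list(chunk_servers)
        self.locations = {}
        self._next = itertools.cycle(self.chunk_servers)

    def allocate(self, filename, index):
        # Round-robin placement: an assumption made for this sketch only.
        replicas = [next(self._next) for _ in range(REPLICAS)]
        self.locations[(filename, index)] = replicas
        return replicas

    def lookup(self, filename, byte_offset, dead=()):
        """Return live replicas; the client then reads from one directly."""
        replicas = self.locations[(filename, chunk_index(byte_offset))]
        return [server for server in replicas if server not in dead]


master = Master(["cs1", "cs2", "cs3", "cs4"])
master.allocate("/web/crawl.dat", 0)
master.allocate("/web/crawl.dat", 1)
# A server that missed its heartbeat is skipped when serving lookups.
print(master.lookup("/web/crawl.dat", 70 * 1024 * 1024, dead={"cs1"}))
```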

FILE SYSTEM I GFS II


- GFS II "Colossus": Version 2 improves in many ways (is a complete rewrite)
- Elegant Master failover (no more 2s delays...)
- Chunk size is now 1MB, likely to improve latency for serving data other than indexing (for example GMail); this was the rationale behind the change
- Master can store more chunk metadata (therefore more chunks addressable, up to 100 million) = also more Chunk Servers

However, according to a Google engineer they have only ever lost one 64MB chunk (in GFS I) during its entire production deployment (2004-2008?), so assumed extremely reliable
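
To see why a smaller chunk size pushes more metadata onto the master, the sketch below counts the chunks needed for the same file at 64MB and at 1MB chunk sizes. The 1GB file size is a hypothetical example, not a figure from the talk.

```python
MB = 1024 * 1024


def chunks_needed(file_bytes, chunk_bytes):
    """Number of chunks required to hold a file (integer ceiling division)."""
    return -(-file_bytes // chunk_bytes)


example_file = 1024 * MB   # hypothetical 1GB file

old_count = chunks_needed(example_file, 64 * MB)   # GFS v1 chunk size
new_count = chunks_needed(example_file, 1 * MB)    # GFS II chunk size
print(old_count, new_count, new_count // old_count)
```

Each chunk needs its own entry on the master, so the 64-fold increase in chunk count is consistent with the slide's point that the master had to be able to hold more chunk metadata.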

GOOGLE DATABASE Accesses the underlying Data on behalf of the upper layers and ultimately the applications

Bigtable I - Introduction

What is it? Google's database implementation since 2004

- Used internally for all large scale applications (Search, Indexing, GMail etc)
- Similar to a sharded database implementation

GOALS
- Huge scalability to many PBs (web database currently around 40 billion URLs)
- Tight latency
- Highly efficient scans over textual data
- Fault tolerant
- Load balanceable
- Eliminate Google's dependency on an external provider

Bigtable II

How is Data Referenced?

Distributed Multi-Dimensional Sparse Map. Simple addressing model using a triple:

(row, column, {timestamp}) -> cell contents

ROWS
- Rows (arbitrary length, usually 10-100 bytes, max <=64KB)
- Rows stored lexicographically
- example row (URL)

COLUMNS
- example column (contents:, PR, anchor1: ..)

TIMESTAMP (OPTIONAL?)
- timestamp (various API func args, i.e. ALL, LATEST)
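
The (row, column, timestamp) -> contents addressing model can be sketched as a plain dictionary keyed by row and column, with a short list of timestamped versions per cell. `SparseTable`, its method names and the default of three retained versions are assumptions for illustration, not the Bigtable API.

```python
class SparseTable:
    """Toy sparse map: (row, column) -> versions sorted newest first."""

    def __init__(self, max_versions=3):
        self.cells = {}                 # only cells that were written exist
        self.max_versions = max_versions

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]    # garbage collect older versions

    def get(self, row, column, timestamp=None):
        """Latest value, or the newest value at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, row_prefix):
        """Row keys with the given prefix, in lexicographic order."""
        return sorted({r for r, _ in self.cells if r.startswith(row_prefix)})


table = SparseTable()
for ts in (1, 2, 3, 4):
    table.put("uk.co.bbc.news", "contents:", ts, f"<HTML> v{ts}")
print(table.get("uk.co.bbc.news", "contents:"))
```

Omitting the timestamp plays the role of a LATEST read; passing one selects an older retained version.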

Bigtable III Table Structure

Example of the URL bigtable

ROW (10-100 bytes, <=64KB) | language: | contents: | anchorx:
au.aaa                     |           |           |            (tablet 1, 100-200MB size)
---- Server/Cell Boundary ----
uk.co.bbc.news             | ENG       | <HTML>    |            (tablet ..., 100-200MB size)
---- Server/Cell Boundary ----
html.test/za.zzzzz         |           |           |            (tablet n, 100-200MB size)

- Studying the contents: column shows three versions of the contents of a page (current, cached and ?)
- Presumably all other columns are timestamped, so could be used in a comparative way (such as anchor increase/decrease) OTF in the SERPs algorithm
- Must use a combination of timestamp differences between n page versions (n = 3, rest garbage collected) and crawled anchors
- What else does the table hold? Possibly PR (or OTF) and other search related weightings
- Google keeps much more info for ranking purposes than it did in 1999: Webtable hinted at 100+ columns!
- How do page units affect the URL reversal of the URL bigtable?
- Do a table's tablets cross a cluster's namespace (yes if unified, else no?)


C++ Bigtable mutate of some anchors

    // open table
    Table *T = OpenOrDie("/bigtable/web/bigtable");
    // write new anchor and delete old anchor
    RowMutation r1(T, "uk.co.bbc.news");
    r1.Set("anchor:www.abc.org", "CNN");
    r1.Delete("anchor:www.def.com");
    Operation op;
    Apply(&op, &r1);  // atomic mutate to the columns

Other primitives such as DeleteCells(), DeleteRow(), Scanner (read arbitrary cells in a row)
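
The row keys in the URL table use the host name with its components reversed (news.bbc.co.uk becomes uk.co.bbc.news), so that pages from the same domain sort next to each other. A minimal sketch of that transformation follows; the function names `reverse_host` and `row_key`, and the choice to drop a bare "/" path, are assumptions for illustration.

```python
from urllib.parse import urlsplit


def reverse_host(host):
    """news.bbc.co.uk -> uk.co.bbc.news"""
    return ".".join(reversed(host.split(".")))


def row_key(url):
    """Reversed host plus path; expects a URL with a scheme (http://...)."""
    parts = urlsplit(url)
    path = parts.path if parts.path != "/" else ""
    return reverse_host(parts.hostname or "") + path


urls = [
    "http://news.bbc.co.uk/",
    "http://www.abc.org/",
    "http://sport.bbc.co.uk/",
]
# Sorting the keys places both bbc.co.uk hosts next to each other.
print(sorted(row_key(u) for u in urls))
```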

Bigtable IV

How are tables broken down in storage? For example Webtable is billions of pages!
- Large tables broken (split) into tablets at row boundaries
- Tablets discontiguous (assists in fault-tolerance), spread over the DC but try to keep one copy in the same rack
- Tablet size is approximately 100-200MB of compressed data
- Load balanced: migrate tablets from heavily loaded machines to lightly loaded ones
- Heavily used tablets probably stay in the working set (cached)
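
Because tablets are contiguous row ranges split at row boundaries, finding the tablet for a row is a binary search over the sorted split keys, and load balancing only changes which server a tablet is assigned to. The sketch below assumes a hypothetical `TabletMap` with made-up split keys and server names.

```python
import bisect


class TabletMap:
    """Toy tablet directory: sorted split keys -> tablet -> server."""

    def __init__(self, split_keys, servers):
        # split_keys[i] is the last row key held by tablet i; the final
        # tablet holds every row after the last split key.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need one server entry per tablet")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def tablet_for(self, row):
        return bisect.bisect_left(self.split_keys, row)

    def server_for(self, row):
        return self.servers[self.tablet_for(row)]

    def migrate(self, tablet, new_server):
        """Reassign a tablet, e.g. away from a heavily loaded machine."""
        self.servers[tablet] = new_server


tablets = TabletMap(["com.zzz", "org.zzz"], ["ts1", "ts2", "ts3"])
print(tablets.server_for("uk.co.bbc.news"))
```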

GOOGLE MAPREDUCE Computes the underlying Data on behalf of the applications

Mapreduce I

Map Reduction can be seen as a way to exploit massive parallelism by breaking a task down into constituent parts and executing on multiple processors

The major functions are MAP & REDUCE (with a number of intermediate steps)
- MAP: break task down into parallel steps
- REDUCE: combine results into final output

[Figure: a 2-pipeline Map Reduction]
- Shown is a 2-pipeline Map Reduction (there are 24 Map Reductions in the indexing pipeline)
- Mappers & Reducers usually run on separate processors (with 90% loss of reducers the job still completed!)
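
The MAP and REDUCE steps, plus the intermediate grouping step between them, can be illustrated with the usual word count example. This sketch runs sequentially in one process; in a real deployment the map and reduce calls would be spread over many machines. All function names here are illustrative.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """MAP: break one document into independent (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Intermediate step: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """REDUCE: combine the values for one key into a final result."""
    return word, sum(counts)


def map_reduce(documents):
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))   # each call could run in parallel
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(map_reduce({"d1": "the cat", "d2": "the dog the"}))
```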

Mapreduce II

LANGUAGE BINDINGS
- C++, Java, Python, Sawzall

DEPLOYED
- Implemented 2004; before this MySQL?

STATISTICS
- In September 2009 Google ran 3,467,000 MR jobs with an average 475 sec completion time, averaging 488 machines per MR and utilising 25.5K machine years
- Technique extensively used by Yahoo with Hadoop (similar architecture to Google) and Facebook (since '06 multiple Hadoop clusters, one being 2500 CPU/1PB with HBase)
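
The September 2009 figures are mutually consistent, which the arithmetic below checks: jobs times average duration times average machines, converted to years. Multiplying averages only approximates the true total, so this is a plausibility check rather than a reconstruction of how the published number was derived.

```python
JOBS = 3_467_000          # MR jobs in September 2009, from the slide
AVG_SECONDS = 475         # average completion time, from the slide
AVG_MACHINES = 488        # average machines per job, from the slide
SECONDS_PER_YEAR = 365 * 24 * 3600

machine_seconds = JOBS * AVG_SECONDS * AVG_MACHINES
machine_years = machine_seconds / SECONDS_PER_YEAR

print(f"{machine_years / 1000:.1f}K machine years")
```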

Chubby Lock

Google's Distributed File Locking Service for Bigtable
- Provides mutex support for data access (atomic access to column data)
- Used to synchronize access to shared resources
- Consists of a Master and Slaves (designated by election)
- Failover consists of a Slave replacing the functionality of a Master
- Also serves as an ultra-fast high availability file server for small files (100s of bytes)
- Provides an ACL for tablet authentication (row and column data)
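
The two behaviours listed above, exclusive locks on named resources and a slave taking over when the master fails, can be sketched in a few lines. The `LockService` class is hypothetical, and the "lowest name wins" election is a deliberately simple stand-in, not the consensus protocol a real lock service would need.

```python
class LockService:
    """Toy single-process lock service with master failover."""

    def __init__(self, replicas):
        self.replicas = sorted(replicas)
        self.master = self.elect()
        self.locks = {}    # lock path -> current holder

    def elect(self):
        # Stand-in election: the lowest replica name wins (assumption).
        if not self.replicas:
            raise RuntimeError("no replicas left")
        return self.replicas[0]

    def fail(self, replica):
        """Remove a failed replica; re-elect if it was the master."""
        self.replicas.remove(replica)
        if replica == self.master:
            self.master = self.elect()

    def acquire(self, path, holder):
        """Grant the lock if it is free or already held by `holder`."""
        if self.locks.get(path, holder) != holder:
            return False
        self.locks[path] = holder
        return True

    def release(self, path, holder):
        if self.locks.get(path) == holder:
            del self.locks[path]


service = LockService(["c", "a", "b"])
print(service.master, service.acquire("/bt/tablet1", "ts1"))
```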

GOOGLE WORKQUEUE Provides Resource Management for the Computational Jobs

GWQ Google Workqueue

Batch Submission/Scheduler System
- Software to submit Mapreduce jobs to a Cell/Cluster
- Arbitrates (process priorities), schedules, allocates resources, handles process failover, reports status, collects results
- Often Workqueue is overlaid on a GFS cluster, i.e. GFS cluster jobs are not computationally bound; also seems to co-locate tasks near data = just disk I/O, not network I/O (on the Chunk Server?)
- Workqueue can manage many tens of thousands of machines
- Launched via API or command line (sawzall example shown)

    saw --program code.szl --workqueue testing \
        --input_files /gfs/cluster1/2005-02-0[1-7]/submits.* \
        --destination /gfs/cluster2/$USER/output@100
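
The idea of co-locating tasks with their data can be sketched as a greedy scheduler that prefers a machine already holding the task's input chunk and only falls back to a remote machine when no local slot is free. The `schedule` function, its arguments and the machine names are hypothetical; the real Workqueue policy is not documented in the talk.

```python
def schedule(tasks, chunk_locations, free_slots):
    """Greedy placement.

    tasks: {task name: input chunk}
    chunk_locations: {chunk: [machines holding a copy]}
    free_slots: {machine: number of free task slots}
    Returns {task name: machine, or None if the task must stay queued}.
    """
    slots = dict(free_slots)    # work on a copy
    placement = {}
    for task, chunk in tasks.items():
        # Prefer machines that hold the data: disk I/O only.
        local = [m for m in chunk_locations.get(chunk, []) if slots.get(m, 0) > 0]
        # Otherwise any machine with a free slot: network I/O needed.
        candidates = local or [m for m, n in slots.items() if n > 0]
        if not candidates:
            placement[task] = None
            continue
        machine = max(candidates, key=lambda m: slots[m])
        slots[machine] -= 1
        placement[task] = machine
    return placement


plan = schedule(
    tasks={"t1": "c1", "t2": "c2", "t3": "c1"},
    chunk_locations={"c1": ["m1"], "c2": ["m2"]},
    free_slots={"m1": 1, "m2": 1, "m3": 1},
)
print(plan)
```

Here t3 wants the same chunk as t1 but m1 is already full, so it is placed on m3 and would have to read its input over the network.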

Section III Some more Glue


1. Languages employed
2. Development Environment
3. Google App Engine
4. Network Security
5. Future Google Architecture Advances
6. Odds n Sods
7. DIY Google

DEVELOPMENT LANGUAGES

Usual Suspects
- Initially Python, Java, C++

Sawzall (since 2006)
- equivalent to Hadoop's Pig Latin
- written in C++
- interpreted, bytecode output JIT'd

An internal procedural language employed to solve map reduction problems. The few published Google papers employ Sawzall in the algorithm examples. Runs in the Map phase; Aggregators run in the Reduce phase (from each Sawzall Map instance) to get the final output.

- Transparent parallelization: no specialist distributed systems knowledge required (good for developer)
- Simple datatypes: 64-bit signed int, float, string, byte and a few unique ones such as time
- Much string regexp support
- Compound types: arrays, tuples
- Typesafe (and declarations) similar to Pascal (probably an LL(1) lang?)
- Similar to Algol, C syntax (no pointers though!)
- No processing of exceptions (no exception handlers)
- Shorter than corresponding C++ code by a factor of 10
- Early versions could not write into Bigtable. Now implemented?
- Output sometimes pipelined into MySQL for further analysis
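
The split between per-record code in the Map phase and aggregators in the Reduce phase can be imitated in Python. The sketch below mirrors the shape of a Sawzall program that emits each input value into three sum tables; `SumTable`, `process_record` and `run` are illustrative names, and this is an analogy rather than Sawzall itself.

```python
from collections import defaultdict


class SumTable:
    """Aggregator that adds up everything emitted to it."""

    def __init__(self):
        self.value = 0

    def emit(self, x):
        self.value += x


def process_record(x, tables):
    """Per-record logic: sees one input value and only emits."""
    tables["count"].emit(1)
    tables["total"].emit(x)
    tables["sum_of_squares"].emit(x * x)


def run(records):
    tables = defaultdict(SumTable)
    for record in records:          # each record could be handled on any machine
        process_record(float(record), tables)
    return {name: table.value for name, table in tables.items()}


print(run(["1", "2", "3"]))
```

Because the per-record function never reads the tables, records can be processed in any order on any machine, which is what makes the parallelization transparent to the developer.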

GOOGLE APP ENGINE

Using Application Platform technology stack

Allows a developer to leverage components of Google technology (but not necessarily primary infrastructure, i.e. the usual business resources)
- Supports Python, Java
- Bigtable support (via GQL)
- Uses GFS as underlying FS: usual fault-tolerance/load-balancing
- Task Queue similar to GWQ?
- Code exposed to Google
- No support for subprocess spawning; more importantly none of the google mapreduce library is made available
- Isolates computational aspects to single servers, but the I/O is probably the google standard implementation underneath
- Therefore computationally intensive tasks are more problematic = keeping your resource usage under control

Security

Rack Board Level (possible scenario)
- gPXE on the board goes through a DHCP/tftp sequence to pull over an encrypted image (this is not expensive as it is done once per boot and boots are not usual)
- Image is pulled from a Secure Image Distribution Server (and held encrypted on these)
- Once at the board end the image is OTF decrypted and booted as normal RHEL
- 02/09: Google engineer didn't dispute this and seemed to concur, adding that in-core encryption might be a possibility (R/T decryption might not be that expensive). This possibly means cryptography is used throughout the lifetime of the image, including components outside the working set but sensitive parts of the in-core OS (OTF decrypted)

Enterprise
- Kerberos is used throughout the enterprise
- They have an automated issuance system for SSL certificates, used by internal (secure) infrastructure to validate https/TLS and generic SSL connections
- Complete internal network encryption unlikely due to latency introduced?
- Likely that one of the reasons failover between DCs is problematic is the latency introduced due to the expense of Wide Area Encryption (essential)

Google Future Architecture

- 99%ile latency for all data <50ms is a key speed metric
- Single global namespace
- Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers?
- Spanner
  - Dynamic load balancing of upwards of 10M servers between Data Centers
  - Automatic, dynamic world-wide placement of data & computation to minimize latency or cost
  - Allegedly used to reduce heat issues at DCs by moving the load when heat becomes a problem at the new chillerless DCs (i.e. Belgium DC); not using chillers introduces significant savings
- Translation Servers (automatic translation of documents)
- GDrive Servers
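
A 99th percentile target is stricter than an average: a handful of slow requests can break it even when the mean looks healthy. The sketch below uses the nearest-rank definition of a percentile; the sample latencies are invented for illustration, and only the 50ms target comes from the slide.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(latencies_ms, target_ms=50, pct=99):
    return percentile(latencies_ms, pct) < target_ms


# Hypothetical samples: one very slow request out of 100 is tolerated...
good = [10] * 98 + [40, 400]
# ...but three requests at 60ms push the 99th percentile over the target.
bad = [10] * 97 + [60, 60, 60]

print(meets_target(good), meets_target(bad))
```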

Odds n Sods

- borg: google technology/architecture (is a cluster..)
- "Borg: a hybrid protocol for scalable application-level multicast in peer-to-peer networks" (WAN multimedia streaming)
- data cube: google technology
- Have a global loadbalancer; assume it load balances across a unified namespace, probably worldwide
- gmail designers implemented application level failover to move your session to an alternate DC in a seamless fashion to the end user. Probably all Google Apps will be able to migrate to an alternate DC cell (the application, and its GFS data if need be)
- MySQL is used for back-end sys admin stuff (high availability master-slave implementations) and post Bigtable processing
- Remote employee access is via VPN
- Sys Admins maintain 5 and 30 minute SLAs so on the ball
- Has its own internal archive.org equiv.

BUILD YOUR OWN GOOGLE


The Basic Open Source Tools

The Google Stack (vs Yahooish/Open Source)

Conceptual Overview: Google vs. Open Source

Google Architecture                                 | Open Source (Yahooish) Architecture
APP ENGINE / GOOGLE APPS (SEARCH, INDEX, CRAWL, GMAIL...) | CLIENT APPLICATION
Python, Java, C++, Sawzall, other                   | Pig Latin, Python, PHP, Java .... anything
GWQ, Task Queue                                     | Job Tracker
Mapreduce, BigTable, Chubby Lock (Google's Secret Sauce) | Hadoop Framework: Mapreduce, Hbase (Bigtable equiv.) (Hadoop Open Source)
GFS / GFS II                                        | HDFS (hadoop)
INTERIOR NETWORK IPv6                               | INTERIOR NETWORK IPv6
RHEL 2.6.X PAE                                      | CentOS 2.6.X PAE
SERVER HARDWARE, RACK, DC, Exterior Network         | SERVER HARDWARE, RACK, DC, Exterior Network

(Other tools such as crawlers, indexers readily available)

END
(Thankyou)

DIY GOOGLE

[Open Source (Yahooish) Architecture stack, top to bottom: CLIENT APPLICATION (Python, PHP, Java .... anything); Job Tracker (Work Queue equiv.); Hadoop Framework (Mapreduce, Hbase (Bigtable equiv.)); HDFS (hadoop); INTERIOR NETWORK IPv6; CentOS 2.6.X PAE; SERVER HARDWARE, RACK, DC, Exterior Network]

What you require:
- Preferably 2 machines + 100BT
- CentOS/RHEL
- (squid)
- Apache Hadoop (HDFS, Mapreduce, Pig, HBase)
- HDFS bmdiff/zippy compression library
- Google glibc/tcmalloc perftools
- Supporting stuff: JRE etc
- Browser with search box: pig mr call to scan a few files, print results

- Install Hadoop and Pig on cluster
- Install eclipse and dependencies
- Install PigPen for eclipse and configure to cluster (NFS)

TEMPLATE

- IPv6 enablement started 2008 (2009 finished?)
- IRP OSPF: Google authored RFC points towards OSPF

DEVELOPMENT ENVIRONMENT bits & bobs

[Figure: a rare shot of some concrete google internal stuff (GFS Master Server code execution, found as a perftools profiling example)]

- Agile methodologies used (development iterations, teamwork, collaboration, and process adaptability throughout the life-cycle of the project)
- Libraries are the predominant way of building programs
- An infrastructure handles versioning of applications so they can be released without fear of breaking things = roll out with minimal QA
- Internal code uses replacement libraries
- Google, as you'd expect, rewrites everything!
- Hungarian Notation?
- Work in small teams of 3-5 people; likely few scutters know the big picture

Internal Linux development and deployment Served as technical lead of team responsible for customizing and deploying Linux to internal systems and workstations. Fixed bugs and added enterprise features to several Linux components, including NFS, Kerberos, CUPS. All relevant patches were pushed to upstream maintainers, and most are in current released distributions. Developed and maintained systems to automate installation, updates, and upgrades of Linux systems. Developed IPv6 support for Linux load-balancing (ipvs).

Managed several interns and contractors.

loadbalancing user accounts within a datacenter, and coordinating with the global loadbalancer, which uses linear programming to optimally allocate users. In particular, this avoids "shared fate" risks and reduces latency and costs incurred due to excessive transatlantic data traffic. Learned Sketchup so as to document the four dimensional data structures effectively

The testing, evaluation, deployment, operations, and maintenance of Netscaler load balancers.

- automated Apache configuration reloader
- gPXE: open-source network booting software
- GWS custom C++ webserver = not apache?

Google 02/09 talk: the example was that a Cluster is 30 racks (I believe this refers to Google). With 40U racks, 40U x 30 racks = 1200, approximately an MDC, so can assume each MDC is a Cluster/cell at the architectural level.

A Google engineer stated a DC is a collection of Modular Units (MDCs?); the picture illustrated (not shown above) suggested this.

Some Pre Presentation Information

1 Million GB = 1000TB = 1 PB (x 1000 = 1 EXABYTE)


- Internet Archive is around 3PB (2009)
- CLEAN UP BEFORE: all the poorly sourced stuff
- Add lock service to bt to all slides
- Google rack server on rack page
- SSTable
- Google PROFITS US $16M A DAY

Pre Presentation Disclaimer


- Put together in a week from knowing zero about Google
- I am not associated with Google
- Numbers are approximate but certainly are ball-park
- Google often delivers contradictory figures and uses many terms for some items: cell/cluster, scheduler/workqueue (obfuscation?)
- Google's philosophy/paranoia of "tell as little as possible" (pausing presenters and sideways answers) makes it hard to fill in some (significant) gaps; inferences are sometimes drawn (in red)
- Google seems to design absolutely EVERYTHING themselves, from HW MB build, racks, switches(?), software... so it's hard to find sources of information beyond broad concepts

Bigtable VI

Latest (or at least since 2006..)
- Increased scalability (across Namespace/Datacenters), i.e. tablets spread over DCs for a table, but expensive (both computationally and financially!)
- Service Clusters (?)
- Multiple Bigtable clusters replicated throughout DC

Current Status
- Many hundreds, maybe thousands, of Bigtable cells
- Late 2009 stated 500 Bigtable clusters
- At minimum scaled to many thousands of machines per cell in production
- Cells manage 3-figure TB data (0.X PB)
