Google Distributed System
Safia Zadran
Google
What will we cover?
The task of the crawler is to locate and retrieve the contents of the Web and pass the
contents onto the indexing subsystem
At the root of every search engine is software known as a crawler.
Crawlers are also known as bots, robots, or spiders.
What Crawlers Do?
After the crawler copies a website, this data must be stored on the search engine's servers.
You can access this copy from the cache.
The Google search engine does not work on your live site,
but on a copy of the site held on its servers.
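A minimal sketch of this crawl-and-cache idea, assuming hypothetical names (LinkExtractor, crawl) and a simple in-memory dictionary as the stored copy; a real crawler also handles robots.txt, politeness, deduplication and massive scale.

```python
# Minimal crawler sketch: fetch pages, keep a cached copy, and hand the
# contents on to an indexing step. Names and structure are illustrative only.
from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    cache = {}                      # url -> stored copy of the page
    frontier = list(seed_urls)      # URLs still to visit
    while frontier and len(cache) < max_pages:
        url = frontier.pop(0)
        if url in cache:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                # unreachable page: skip it
        cache[url] = html           # the search engine works on this copy
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return cache                    # passed on to the indexing subsystem
```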
Indexing
This index will allow us to discover web pages that include the search terms ‘distributed’,
‘systems’ and ‘book’ and, by careful analysis, we will be able to discover pages that
include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found in amazon.com, www.cdk5.net and indeed many other web
sites. Using the index, it is therefore possible to narrow down the set of candidate web
pages from billions to perhaps tens of thousands, depending on the level of discrimination
in the keywords chosen.
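A toy sketch of the index just described: each term maps to the set of pages containing it, and a multi-term query is answered by intersecting those sets. The page contents below are invented examples, not real site data.

```python
# Toy inverted index: term -> set of pages containing that term.
def build_index(pages):
    index = {}
    for url, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index

def search(index, query):
    """Return pages containing every term of the query (set intersection)."""
    results = None
    for term in query.lower().split():
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()

pages = {
    "amazon.com":   "distributed systems book for sale",
    "www.cdk5.net": "companion site for the distributed systems book",
    "example.org":  "holiday photos",
}
index = build_index(pages)
print(search(index, "distributed systems book"))
# -> {'amazon.com', 'www.cdk5.net'}
```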
Ranking
The search engine ranks (orders) all results relevant to the search query.
Ranking is based on several factors,
such as:
Past searches
Location
Ranking:
A higher rank is an indication of the importance of a page; it is used to ensure
that important pages are returned nearer to the top of the list of results than lower-ranked
pages.
In PageRank, a page is viewed as important if it is linked to by a large number of
other pages.
For example, a link from bbc.co.uk will be viewed as more important than a link from
Gordon Blair’s personal web page
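A small sketch of the PageRank intuition above, where a page's rank is fed by the rank of the pages linking to it. The link graph, damping factor and iteration count below are illustrative assumptions, not Google's actual algorithm or parameters.

```python
# Simplified PageRank: iteratively redistribute rank along outgoing links.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue            # dangling page: ignored for simplicity
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Illustrative link graph: many pages point at bbc.co.uk, so a link from it
# carries more weight than a link from a rarely cited personal page.
links = {
    "bbc.co.uk":      ["news.example"],
    "personal.page":  ["news.example"],
    "site-a.example": ["bbc.co.uk"],
    "site-b.example": ["bbc.co.uk"],
    "news.example":   [],
}
print(pagerank(links))
```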
Overall architecture and design philosophy
Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large
numbers of commodity PCs to produce a cost-effective environment for distributed storage
and computation.
A PC will typically have around 2 terabytes of disk storage and around 16 gigabytes of
DRAM.
The philosophy of building the system from commodity PCs comes from the original research
project (Sergey Brin and Larry Page at Stanford University).
Google has recognized that parts of its infrastructure will fail and
has designed the infrastructure using a range of strategies to tolerate such failures.
By far the most common source of failure is due to software, with about 20 machines
needing to be rebooted per day due to software failures. (Interestingly, the rebooting
process is entirely manual.)
Hardware failures represent about 1/10 of the failures due to software, with around 2–3% of
PCs failing per annum (per year) due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
This vindicates the decision to procure commodity PCs: given that the vast majority of
failures are due to software, it is not worthwhile to invest in more expensive, more reliable
hardware.
The physical architecture is constructed as follows [Hennessy and Patterson 2006]:
Between 40 and 80 PCs are housed in a given rack (double-sided); each rack has an Ethernet switch.
The switch inside the rack is modular, supporting either eight 100-Mbps network interfaces or a
single 1-Gbps interface.
Racks are organized into clusters
A cluster typically consists of 30 or more racks and two high-bandwidth switches
providing connectivity to the outside world (the Internet and other Google centres).
Clusters are housed in Google data centres that are spread around the world.
In 2000, Google relied on key data centres in Silicon Valley (two centres) and in Virginia.
There are now centres in many geographical locations across the US and in Dublin (Ireland), Saint-
Ghislain (Belgium), Zurich (Switzerland), Tokyo (Japan) and Beijing (China).
This spread of centres helps Google to build fault-tolerant, large-scale systems.
If each PC offers 2 terabytes of storage, then a rack of 80 PCs will provide 160 terabytes,
with a cluster of 30 racks offering 4.8 petabytes.
To avoid clutter, the Ethernet connections are shown from only one of the clusters to the external links.
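A quick check of the storage figures above (2 terabytes per PC, 80 PCs per rack, 30 racks per cluster), using decimal units as in the text:

```python
# Storage capacity arithmetic from the figures quoted above (decimal units).
tb_per_pc = 2
pcs_per_rack = 80
racks_per_cluster = 30

rack_tb = tb_per_pc * pcs_per_rack               # 160 TB per rack
cluster_pb = rack_tb * racks_per_cluster / 1000  # 4.8 PB per cluster
print(rack_tb, "TB per rack,", cluster_pb, "PB per cluster")
```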
Overall system architecture
Key requirements:
Scalability: the first and most important requirement, with three dimensions:
More queries
Better results
More data
Together these dimensions create the scalability problem.
Reliability
Particularly important for Google Apps (Gmail, Google Calendar, Google Maps)
Performance:
To achieve low latency of user interactions,
e.g. completing web search operations in 0.2 seconds.
Better performance means users return with more queries.
Performance is an end-to-end property requiring all associated underlying resources to work together,
including network, storage and computational resources.
Openness: It is well known that Google as an organization encourages and nurtures
innovation, and this is most evident in the development of new web applications. This is
only possible with an infrastructure that is extensible and provides support for the
development of new applications
Cluster:
Google organized GFS into clusters of computers. A cluster is simply a network of computers.
Within a GFS cluster there are three kinds of entities:
Clients
Master servers
Chunk servers
Client:
Any entity that makes a file request. Clients can be other computers or computer
applications. Think of a client as the customer of the GFS.
Master:
The master server acts as the coordinator for the cluster.
The master's duties include maintaining an operation log, which keeps track of the activities
of the master's cluster. The master maintains a historical record of critical metadata changes,
namespaces and mappings.
The operation log helps keep service interruptions to a minimum—
if the master server crashes, a replacement server can take its place.
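A minimal sketch of how an operation log supports this recovery: metadata mutations are appended to a durable log before being applied, so a replacement master can rebuild its state by replaying the log. The record format and class names here are assumptions for illustration, not the real GFS implementation.

```python
# Sketch of an operation log: every metadata mutation is appended to a log,
# so a replacement master can rebuild its state by replaying the log.
import json

class MasterState:
    def __init__(self, log_path):
        self.log_path = log_path
        self.namespace = {}          # file name -> list of chunk handles

    def apply(self, record):
        if record["op"] == "create":
            self.namespace[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.namespace[record["file"]].append(record["chunk"])

    def mutate(self, record):
        """Log the mutation durably before applying it in memory."""
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
        self.apply(record)

    @classmethod
    def recover(cls, log_path):
        """Replacement master: replay the log to rebuild the metadata."""
        state = cls(log_path)
        with open(log_path) as log:
            for line in log:
                state.apply(json.loads(line))
        return state
```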
Chunkserver:
Chunkservers are the workhorses of GFS. They are responsible for storing the 64 MB file
chunks.
Chunkservers don't send chunks through the master server; instead they send requested
chunks directly to the client.
GFS copies every chunk multiple times and stores the copies on different chunkservers. Each
copy is called a replica.
By default GFS makes three replicas per chunk, but users can change the setting and make
more or fewer replicas if desired.
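A simplified sketch of the read path implied by these descriptions: the client asks the master only for metadata (which chunk handle, which replicas), then fetches the data directly from one of the chunkservers. All class and method names are hypothetical, not the real GFS API.

```python
# Simplified GFS-style read path: metadata from the master, data from a
# chunkserver chosen among the replicas. Names are illustrative only.
import random

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks, as described above

class Master:
    def __init__(self):
        # file name -> list of chunk handles
        self.namespace = {"/web/page.html": ["chunk-0001", "chunk-0002"]}
        # chunk handle -> chunkservers holding a replica (3 by default)
        self.locations = {
            "chunk-0001": ["cs-a", "cs-b", "cs-c"],
            "chunk-0002": ["cs-b", "cs-c", "cs-d"],
        }

    def lookup(self, filename, chunk_index):
        handle = self.namespace[filename][chunk_index]
        return handle, self.locations[handle]

class Chunkserver:
    def __init__(self, name, chunks):
        self.name = name
        self.chunks = chunks          # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, chunkservers, filename, offset, length):
    """Client: translate a byte offset to a chunk, then read from a replica."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(filename, chunk_index)
    server = chunkservers[random.choice(replicas)]     # pick any replica
    return server.read(handle, offset % CHUNK_SIZE, length)

chunkservers = {name: Chunkserver(name, {"chunk-0001": b"hello world",
                                         "chunk-0002": b"more data"})
                for name in ["cs-a", "cs-b", "cs-c", "cs-d"]}
print(client_read(Master(), chunkservers, "/web/page.html", 0, 5))  # b'hello'
```

Note that the master is only on the metadata path; the chunk data itself flows directly between client and chunkserver, which keeps the master from becoming a bottleneck.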
Chubby