0% found this document useful (0 votes)
128 views

Google Distributed System

The document provides an overview of Google's search engine architecture and design philosophy. It discusses how Google uses large numbers of commodity servers to provide scalable, reliable, and high-performance services like search and Google Apps. It describes key components like the Google File System for data storage, the Chubby coordination service, and BigTable for structured data storage. The architecture is designed to scale horizontally by adding more servers while maintaining reliability even with frequent hardware failures.

Uploaded by

sebghat aslamzai
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
128 views

Google Distributed System

The document provides an overview of Google's search engine architecture and design philosophy. It discusses how Google uses large numbers of commodity servers to provide scalable, reliable, and high-performance services like search and Google Apps. It describes key components like the Google File System for data storage, the Chubby coordination service, and BigTable for structured data storage. The architecture is designed to scale horizontally by adding more servers while maintaining reliability even with frequent hardware failures.

Uploaded by

sebghat aslamzai
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 40

‫بسم هللا الرمحن الرحمی‬

‫‪Safia Zadran‬‬
Google
What we will cover?

 Introducing the case study: Google


 Overall architecture and design philosophy
 Underlying communication paradigms
 Data storage and coordination services
 Distributed computation
Introduction to the Case Study (Google)

 The ability to create an effective design is an important skill in distributed systems


 We illustrate distributed design through a substantial case study, examining in detail the
design of the Google infrastructure a platform and associated middleware that supports
both Google search and a set of associated web services and applications including Google
Apps.
 Google is one of the largest distributed systems in use today, and the Google infrastructure
has successfully dealt with a variety of demanding requirements.
 Google [www.google.com III] is a US-based corporation with its headquarters in Mountain
View, California (the Googleplex), offering Internet search
 Google was born out of a research project at Stanford University, with the company
launched in 1998.
 providing a search engine is now a major player in cloud computing.
 with the growth of the company from its initial production system in 1998 to dealing with
over 88 billion queries a month by the end of 2010, that the main search engine has never
experienced an outage in all that time and that users can expect query results in around 0.2
seconds
The Google search engine

 Google Search engine is complex but the general process is simple.


 to take a given query and return an ordered list of the most relevant results that match that
query by searching the content of the Web.
 The underlying search engine consists of a set of services for
*crawling the Web
*indexing
* ranking
Crawling

 The task of the crawler is to locate and retrieve the contents of the Web and pass the
contents onto the indexing subsystem
 At the root of every search engines are Software known as Crawler
 Crawler also known as bots, robots, or spider.
What Crawlers Do?
What Crawlers Do?
 After the crawler copied the websites this data must be stored on search engine server.
 You can accesss to this copy from the cache.
 Google search engine is not working on your live site,
But on a copy of the site in its server.
Indexing

 More complex phase.


 Occurs in many sub_phases. Sth that happens in many datacenters of the world.
 Search engine algorithms extract signals to find the best information.
 This process is hidden to the public.
Indexing:

 This index will allow us to discover web pages that include the search terms ‘distributed’,
‘systems’ and ‘book’ and, by careful analysis, we will be able to discover pages that
include all of these terms. Forexample, the search engine will be able to identify that the
three terms can all be found. in amazon.com, www.cdk5.net and indeed many other web
sites. Using the index, it is therefore possible to narrow down the set of candidate web
pages from billions to perhaps tens of thousands, depending on the level of discrimination
in the keywords chosen.
Ranking

 The search engine ranks(order) all possible results relevant to the search query.
 Ranking is based on some factors
 Such as:
 Past Researches
 Location
Ranking:

 whereby a higher rank is an indication of the importance of a page and it is used to ensure
that important pages are returned nearer to the top of the list of results than lower-ranked
pages
 in PageRank, a page will be viewed as important if it is linked to by a large number of
other pages
 For example, a link from bbc.co.uk will be viewed as more important than a link from
Gordon Blair’s personal web page
Overall architecture and design philosophy

Physical Model
 The key philosophy of Google in terms of physical infrastructure is to use very large
numbers of commodity PCs to produce a cost-effective environment for distributed storage
and computation.
 PC will typically have around 2 terabytes of disk storage and around 16 gigabytes of
DRAM
 The philosophy of building system from commodity pcs come from the original research
project (Sergey Brin & Larry Page at Stanford university)
 Google has recognized that parts of its infrastructure will fail
 has designed the infrastructure using a range of strategies to tolerate such failures.
 By far the most common source of failure is due to software, with about 20 machines
needing to be rebooted per day due to software failures. (Interestingly, the rebooting
process is entirely manual.)
 Hardware failures represent about 1/10 of the failures due to software with around 2–3% of
PCs failing per annum(year) due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
 This indicates the decision to procure commodity PCs; given that the vast majority of
failures are due to software, it is not worthwhile to invest in more expensive, more reliable
hardware.
 The physical architecture is constructed as follows [Hennessy and Patterson2006]:
 between 40 and 80 PCs in a given rack, double-sided, has an Ethernet switch
 Switch inside the rack is modular , supporting either 8 100-Mbps network interfaces or a
single 1-Gbps interface.
 Racks are organized into clusters
 A cluster typically consists of 30 or more racks and two high-bandwidth switches
providing connectivity to the outside world( internet & other google centers)
 Clusters are housed in Google data centres that are spread around the world.
 2000, Google relied on key data centres in Silicon Valley (two centres) and in Virginia.
 now centres in many geographical locations across the US and in Dublin (Ireland), Saint-
Ghislain (Belgium), Zurich (Switzerland), Tokyo (Japan) and Beijing (China).
 to build fault-tolerant, large-scale systems
 If each PC offers 2 terabytes of storage, then a rack of 80 PCs will provide 160 terabytes,
with a cluster of 30 racks offering 4.8 petabytes.
To avoid clutter the Ethernet connections are shown from only of the clusters to the external links.
Overall system architecture

 Key requirements:
 Scalability: first , most important

More queries
Better Results

More Data

Scalability problem.
Reliability
Google Apps (Gmail, Google Calender, Google map)

Performance:
To achieve low latency of user interactions.
Better performance = better user return with more queries
 completing web search operations in 0.2 seconds
 is an end-to-end property requiring all associated underlying resources to work together,
including network, storage and computational resources.
 Openness: It is well known that Google as an organization encourages and nurtures
innovation, and this is most evident in the development of new web applications. This is
only possible with an infrastructure that is extensible and provides support for the
development of new applications

 Google has responded to these needs by developing the overall system


architecture
Data storage and coordination services

 complementary services in the Google infrastructure:


 Google File System
 Chubby
 BigTable
The Google File System (GFS)

 Google file system is designed to solve the problem of bigData.


 GFS is a distributed file system
 What is DFS?
 is any file system that allow access to file from multiple hosts sharing via a computer network.
 May include facilities for replication and fault tolerant.
 A DFS manages files and folders across multiple computers.
 What is google file system?
 Google file system is a scalabel Distributed File System created by Google and developed to
accommodate Google’s expanding data processing requirements.
 GFS is formed from many storage systems designed from low-cost commodity hardware
elements.
Google file system Architecture

 Cluster:
 Google organized the GFS clusters of computers. A cluster is simply a network of computers.
 Within GFS clusters there are 3 kinds of entities
 Clients
 Master servers
 Chunk servers
 Client:
 Any entity that that makes a file request. Clients can be other computers or computer
applications. Think of client as the customer of the GFS.
 Master:
 Master server acts as the coordinator for the culsters.
 The master’s duties include maintaining an operation log, which keep track of the activities
of the masters cluster. Masters maintains historical record of critical metadata changes,
namespace and mapping
 The operation log helps keep service interruptions to a minimum—
if the master server crashes, a replacement server can take its place.
 Chunkserver:
 Chunkservers are the horsepower of GFS. They are responsible for storing the 64 mb file
chunks.
 The chunkserver don’t send chunks to the master server. Instead they send requested
chunks directly to the client.
 The GFS copies every chunk multiple times and store it on different chunkservers. Each
copy is called a replica.
 By default the GFS makes 3 replica per chunks but user can change the setting and make
more or fewer replicas if desired.
Chubby

 is a crucial service at the heart of the Google infrastructure


 Chubby is a self described lock service
 offering storage and coordination services for other infrastructure services, including GFS
and Bigtable.
 It provides coarse-grained distributed locks to synchronize distributed activities
 In the role of a lock-management tool, the main operations provided are:
 Acquire,
 TryAcquire
 Release
BigTable

 Nosql database developed by google.


 Very large dataset.
 Highly distributed
 Row/column/timestamp indexing
 No joins.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy