Google Distributed System
Safia Zadran
Google
What will we cover?
The task of the crawler is to locate and retrieve the contents of the Web and pass the
contents onto the indexing subsystem
At the root of every search engine is software known as a crawler.
Crawlers are also known as bots, robots, or spiders.
What Crawlers Do?
After the crawler copies a website, this data must be stored on the search engine's servers.
You can access this copy from the cache.
The Google search engine does not work on your live site,
but on a copy of the site held on its servers.
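A minimal sketch of this crawl-and-cache idea, assuming hypothetical names (LinkExtractor, crawl) and a simple in-memory dictionary as the stored copy; a real crawler also handles robots.txt, politeness, deduplication and massive scale.

```python
# Minimal crawler sketch: fetch pages, keep a cached copy, and hand the
# contents on to an indexing step. Names and structure are illustrative only.
from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    cache = {}                      # url -> stored copy of the page
    frontier = list(seed_urls)      # URLs still to visit
    while frontier and len(cache) < max_pages:
        url = frontier.pop(0)
        if url in cache:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                # unreachable page: skip it
        cache[url] = html           # the search engine works on this copy
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return cache                    # passed on to the indexing subsystem
```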
Indexing
This index will allow us to discover web pages that include the search terms ‘distributed’,
‘systems’ and ‘book’ and, by careful analysis, we will be able to discover pages that
include all of these terms. For example, the search engine will be able to identify that the
three terms can all be found in amazon.com, www.cdk5.net and indeed many other web
sites. Using the index, it is therefore possible to narrow down the set of candidate web
pages from billions to perhaps tens of thousands, depending on the level of discrimination
in the keywords chosen.
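A toy sketch of the index just described: each term maps to the set of pages containing it, and a multi-term query is answered by intersecting those sets. The page contents below are invented examples, not real site data.

```python
# Toy inverted index: term -> set of pages containing that term.
def build_index(pages):
    index = {}
    for url, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index

def search(index, query):
    """Return pages containing every term of the query (set intersection)."""
    results = None
    for term in query.lower().split():
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()

pages = {
    "amazon.com":   "distributed systems book for sale",
    "www.cdk5.net": "companion site for the distributed systems book",
    "example.org":  "holiday photos",
}
index = build_index(pages)
print(search(index, "distributed systems book"))
# -> {'amazon.com', 'www.cdk5.net'}
```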
Ranking
The search engine ranks (orders) all results relevant to the search query.
Ranking is based on several factors,
such as:
Past searches
Location
Ranking:
A higher rank is an indication of the importance of a page; it is used to ensure
that important pages are returned nearer to the top of the list of results than lower-ranked
pages.
In PageRank, a page is viewed as important if it is linked to by a large number of
other pages.
For example, a link from bbc.co.uk will be viewed as more important than a link from
Gordon Blair’s personal web page
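A small sketch of the PageRank intuition above, where a page's rank is fed by the rank of the pages linking to it. The link graph, damping factor and iteration count below are illustrative assumptions, not Google's actual algorithm or parameters.

```python
# Simplified PageRank: iteratively redistribute rank along outgoing links.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue            # dangling page: ignored for simplicity
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Illustrative link graph: many pages point at bbc.co.uk, so a link from it
# carries more weight than a link from a rarely cited personal page.
links = {
    "bbc.co.uk":      ["news.example"],
    "personal.page":  ["news.example"],
    "site-a.example": ["bbc.co.uk"],
    "site-b.example": ["bbc.co.uk"],
    "news.example":   [],
}
print(pagerank(links))
```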
Overall architecture and design philosophy
Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large
numbers of commodity PCs to produce a cost-effective environment for distributed storage
and computation.
A PC will typically have around 2 terabytes of disk storage and around 16 gigabytes of
DRAM.
The philosophy of building the system from commodity PCs comes from the original research
project (Sergey Brin and Larry Page at Stanford University).
Google has recognized that parts of its infrastructure will fail and
has designed the infrastructure using a range of strategies to tolerate such failures.
By far the most common source of failure is due to software, with about 20 machines
needing to be rebooted per day due to software failures. (Interestingly, the rebooting
process is entirely manual.)
Hardware failures represent about 1/10 of the failures due to software, with around 2–3% of
PCs failing per annum (per year) due to hardware faults. Of these, 95% are due to faults in
disks or DRAM.
This vindicates the decision to procure commodity PCs: given that the vast majority of
failures are due to software, it is not worthwhile to invest in more expensive, more reliable
hardware.
The physical architecture is constructed as follows [Hennessy and Patterson 2006]:
Between 40 and 80 PCs are housed in a given rack (double-sided); each rack has an Ethernet switch.
The switch inside the rack is modular, supporting either eight 100-Mbps network interfaces or a
single 1-Gbps interface.
Racks are organized into clusters
A cluster typically consists of 30 or more racks and two high-bandwidth switches
providing connectivity to the outside world (the Internet and other Google centres).
Clusters are housed in Google data centres that are spread around the world.
In 2000, Google relied on key data centres in Silicon Valley (two centres) and in Virginia.
There are now centres in many geographical locations across the US and in Dublin (Ireland), Saint-
Ghislain (Belgium), Zurich (Switzerland), Tokyo (Japan) and Beijing (China).
This spread of centres helps Google to build fault-tolerant, large-scale systems.
If each PC offers 2 terabytes of storage, then a rack of 80 PCs will provide 160 terabytes,
with a cluster of 30 racks offering 4.8 petabytes.
To avoid clutter, the Ethernet connections are shown from only one of the clusters to the external links.
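A quick check of the storage figures above (2 terabytes per PC, 80 PCs per rack, 30 racks per cluster), using decimal units as in the text:

```python
# Storage capacity arithmetic from the figures quoted above (decimal units).
tb_per_pc = 2
pcs_per_rack = 80
racks_per_cluster = 30

rack_tb = tb_per_pc * pcs_per_rack               # 160 TB per rack
cluster_pb = rack_tb * racks_per_cluster / 1000  # 4.8 PB per cluster
print(rack_tb, "TB per rack,", cluster_pb, "PB per cluster")
```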
Overall system architecture
Key requirements:
Scalability: the first and most important requirement, with three dimensions:
More queries
Better results
More data
Together these dimensions create the scalability problem.
Reliability
Particularly important for Google Apps (Gmail, Google Calendar, Google Maps)
Performance:
To achieve low latency of user interactions,
e.g. completing web search operations in 0.2 seconds.
Better performance means users return with more queries.
Performance is an end-to-end property requiring all associated underlying resources to work together,
including network, storage and computational resources.
Openness: It is well known that Google as an organization encourages and nurtures
innovation, and this is most evident in the development of new web applications. This is
only possible with an infrastructure that is extensible and provides support for the
development of new applications
Cluster:
Google organized GFS into clusters of computers. A cluster is simply a network of computers.
Within a GFS cluster there are three kinds of entities:
Clients
Master servers
Chunk servers
Client:
Any entity that makes a file request. Clients can be other computers or computer
applications. Think of a client as the customer of the GFS.
Master:
The master server acts as the coordinator for the cluster.
The master's duties include maintaining an operation log, which keeps track of the activities
of the master's cluster. The master maintains a historical record of critical metadata changes,
namespaces and mappings.
The operation log helps keep service interruptions to a minimum—
if the master server crashes, a replacement server can take its place.
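A minimal sketch of how an operation log supports this recovery: metadata mutations are appended to a durable log before being applied, so a replacement master can rebuild its state by replaying the log. The record format and class names here are assumptions for illustration, not the real GFS implementation.

```python
# Sketch of an operation log: every metadata mutation is appended to a log,
# so a replacement master can rebuild its state by replaying the log.
import json

class MasterState:
    def __init__(self, log_path):
        self.log_path = log_path
        self.namespace = {}          # file name -> list of chunk handles

    def apply(self, record):
        if record["op"] == "create":
            self.namespace[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.namespace[record["file"]].append(record["chunk"])

    def mutate(self, record):
        """Log the mutation durably before applying it in memory."""
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
        self.apply(record)

    @classmethod
    def recover(cls, log_path):
        """Replacement master: replay the log to rebuild the metadata."""
        state = cls(log_path)
        with open(log_path) as log:
            for line in log:
                state.apply(json.loads(line))
        return state
```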
Chunkserver:
Chunkservers are the workhorses of GFS. They are responsible for storing the 64 MB file
chunks.
Chunkservers don't send chunks through the master server; instead they send requested
chunks directly to the client.
GFS copies every chunk multiple times and stores the copies on different chunkservers. Each
copy is called a replica.
By default GFS makes three replicas per chunk, but users can change the setting and make
more or fewer replicas if desired.
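A simplified sketch of the read path implied by these descriptions: the client asks the master only for metadata (which chunk handle, which replicas), then fetches the data directly from one of the chunkservers. All class and method names are hypothetical, not the real GFS API.

```python
# Simplified GFS-style read path: metadata from the master, data from a
# chunkserver chosen among the replicas. Names are illustrative only.
import random

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks, as described above

class Master:
    def __init__(self):
        # file name -> list of chunk handles
        self.namespace = {"/web/page.html": ["chunk-0001", "chunk-0002"]}
        # chunk handle -> chunkservers holding a replica (3 by default)
        self.locations = {
            "chunk-0001": ["cs-a", "cs-b", "cs-c"],
            "chunk-0002": ["cs-b", "cs-c", "cs-d"],
        }

    def lookup(self, filename, chunk_index):
        handle = self.namespace[filename][chunk_index]
        return handle, self.locations[handle]

class Chunkserver:
    def __init__(self, name, chunks):
        self.name = name
        self.chunks = chunks          # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, chunkservers, filename, offset, length):
    """Client: translate a byte offset to a chunk, then read from a replica."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(filename, chunk_index)
    server = chunkservers[random.choice(replicas)]     # pick any replica
    return server.read(handle, offset % CHUNK_SIZE, length)

chunkservers = {name: Chunkserver(name, {"chunk-0001": b"hello world",
                                         "chunk-0002": b"more data"})
                for name in ["cs-a", "cs-b", "cs-c", "cs-d"]}
print(client_read(Master(), chunkservers, "/web/page.html", 0, 5))  # b'hello'
```

Note that the master is only on the metadata path; the chunk data itself flows directly between client and chunkserver, which keeps the master from becoming a bottleneck.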
Chubby