Chirag Modi Data Science Report
“Data Science”
Submitted to the University College Of Engineering Banswara
In Partial Fulfilment of the requirement for the degree of
Bachelor of Technology
Affiliated to
Govind Guru Tribal University, Banswara
Submitted To: Mrs. Kamna Agarwal Shakyavanshi (Assistant Professor, CSE Dept., GEC Banswara)
Submitted By: Mr. Chirag Modi (Roll No.: 500009)
(December, 2024)
B. Tech. (2021-25)
ACKNOWLEDGEMENT
I would like to acknowledge the contributions of the following people, without whose help and guidance this report would not have been completed. I am thankful to Mrs. Kamna Agarwal Shakyavanshi, University College of Engineering, Banswara, Rajasthan, for her constant encouragement, valuable suggestions, moral support and blessings. Although it is not possible to name everyone individually, I shall ever remain indebted to the faculty members of University College of Engineering, Banswara, Rajasthan for their persistent support and cooperation extended during this work. This acknowledgement would remain incomplete if I failed to express my deep sense of obligation to my parents and God for their consistent blessings and encouragement.
Contents
S. No. Contents Page no.
1 Chapter 1 (Introduction to Data Science) 1-7
2 Chapter 2 (Technologies implemented) 8-12
3 Chapter 3 (Implementation) 12-21
4 Chapter 4 (Conclusion and results) 22
5 Chapter 5 (References) 23
List of Figures
S. No. Figure Page no.
1 Data Science. 6
2 Pre-processing the datasets. 13
3 Total match wins by teams 13
4 Total finals played and won 13
5 CSK vs MI head to head 14
6 Most Man of the Match winners 14
7 Highest run getters in the league 15
8 Excluding irrelevant wickets for bowlers 15
9 Excluding irrelevant wickets for bowlers 16
10 Top wicket taking bowlers in the league. 16
11 Top all-rounders in the league 17
Chapter 1
Introduction to Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and to apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyse actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
Early Usage
The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. In his 1985 lecture at the Chinese Academy of Sciences in Beijing, and again in 1997, C. F. Jeff Wu suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting, or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.
During the 1990s, popular terms for the process of finding patterns in datasets (which were
increasingly large) included "knowledge discovery" and "data mining".
Modern Usage
In 2001, William S. Cleveland proposed establishing data science as an independent discipline, extending the field of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched Data Science Journal. In 2003, Columbia University launched The Journal of Data Science. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.
The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in
2008. Though it was used by the National Science Board in their 2005 report, "Long-Lived Digital
Data Collections: Enabling Research and Education in the 21st Century," it referred broadly to any
key role in managing a digital data collection.
There is still no consensus on the definition of data science and it is considered by some to be a
buzzword.
Impact
Big data is very quickly becoming a vital tool for businesses and companies of all sizes. The availability and interpretation of big data have altered the business models of old industries and enabled the creation of new ones. Data-driven businesses were worth $1.2 trillion collectively in 2020, an increase from $333 billion in 2015. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations. As big data continues to have a major impact on the world, so does data science, due to the close relationship between the two.
Techniques
• Dimensionality reduction is used to reduce the complexity of data computation so that it can be performed more quickly.
• Machine learning is a technique used to perform tasks by inferring patterns from data.
• Naive Bayes classifiers are used to classify by applying Bayes' theorem. They are mainly used on datasets with large amounts of data, and can aptly generate accurate results.
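As a rough illustration of the Naive Bayes idea above, here is a small from-scratch classifier with add-one smoothing; the weather data and labels are made up for illustration and are not tied to this report's datasets:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate class counts and per-feature value counts for each class."""
    class_counts = Counter(labels)
    # feature_counts[cls][i][value] = how often feature i took `value` in class cls
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            feature_counts[y][i][v] += 1
    return class_counts, feature_counts

def predict_nb(x, class_counts, feature_counts):
    """Pick the class maximizing log P(class) + sum of log P(feature|class)."""
    total = sum(class_counts.values())
    best_cls, best_score = None, float("-inf")
    for cls, c_count in class_counts.items():
        score = math.log(c_count / total)
        for i, v in enumerate(x):
            counts = feature_counts[cls][i]
            # add-one (Laplace) smoothing so unseen values get non-zero probability
            score += math.log((counts[v] + 1) / (sum(counts.values()) + len(counts) + 1))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels  = ["play", "play", "stay", "stay"]
model = train_nb(samples, labels)
print(predict_nb(("sunny", "cool"), *model))  # -> play
```

This is only a sketch of the technique; production code would typically use a library implementation instead.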
Challenges Faced by Data Scientists
1. Data Preparation
Data scientists spend nearly 80% of their time cleaning and preparing data to improve its quality, i.e., to make it accurate and consistent, before using it for analysis. However, 57% of them consider it the worst part of their job, labelling it time-consuming and highly mundane. They are required to go through terabytes of data, across multiple formats, sources, functions, and platforms, on a day-to-day basis, whilst keeping a log of their activities to prevent duplication.
One way to solve this challenge is by adopting emerging AI-enabled data science technologies like
Augmented Analytics and Auto feature engineering. Augmented Analytics automates manual data
cleansing and preparation tasks and enables data scientists to be more productive.
2. Multiple Data Sources
As organizations continue to utilize different types of apps and tools and to generate different formats of data, there will be more data sources that data scientists need to access to produce meaningful decisions. This process requires manual entry of data and time-consuming data searching, which leads to errors and repetitions, and eventually to poor decisions.
Organizations need a centralized platform integrated with multiple data sources to instantly access information from all of them. Data in this centralized platform can be aggregated and controlled effectively and in real time, improving its utilization and saving the data scientists a huge amount of time and effort.
3. Data Security
As organizations transition to cloud data management, cyberattacks have become increasingly common. This has caused two major problems:
a) Confidential data stored in the cloud has become vulnerable to attacks and leaks.
b) In response to repeated cyberattacks, regulatory standards have evolved that extend the data consent and utilization processes, adding to the frustration of data scientists.
Organizations should utilize advanced machine learning enabled security platforms and instill
additional security checks to safeguard their data. At the same time, they must ensure strict adherence
to the data protection norms to avoid time-consuming audits and expensive fines.
4. Understanding the Business Problem
Before performing data analysis and building solutions, data scientists must first thoroughly understand the business problem. Most data scientists follow a mechanical approach and start analysing data sets without clearly defining the business problem and objective. Therefore, data scientists must follow a proper workflow before starting any analysis. The workflow must be built in collaboration with the business stakeholders and consist of well-defined checklists to improve understanding and problem identification.
5. Communicating Results to Non-technical Stakeholders
It is imperative for data scientists to communicate effectively with business executives who may not understand the complexities and technical jargon of their work. If the executive, stakeholder, or client cannot understand their models, then their solutions will most likely not be executed. This is something that data scientists can practise: they can adopt concepts like "data storytelling" to give a structured approach to their communication and a powerful narrative to their analysis and visualizations.
6. Collaboration Between Data Scientists and Data Engineers
Organizations usually have data scientists and data engineers working on the same projects, so there must be effective communication between them to ensure the best output. However, the two usually have different priorities and workflows, which causes misunderstanding and stifles knowledge sharing.
Management should take active steps to enhance collaboration between data scientists and data engineers. It can foster open communication by setting up a common coding language and a real-time collaboration tool. Moreover, appointing a Chief Data Officer to oversee both departments has also been shown to improve collaboration between the two teams.
7. Misconceptions About the Role
In big organizations, a data scientist is expected to be a jack of all trades: they are required to clean data, retrieve data, build models, and conduct analysis. However, this is a big ask for any data scientist. For a data science team to function effectively, tasks need to be distributed among individuals specializing in data visualization, data preparation, model building and so on.
It is critical for data scientists to have a clear understanding of their roles and responsibilities before
they start working with any organization.
8. Unrealistic Expectations
The lack of understanding of data science among management teams leads to unrealistic expectations of data scientists, which affects their performance. Data scientists are expected to produce a silver bullet and solve all the business problems; this is very counterproductive. To set realistic expectations, organizations should establish:
A) Well-defined metrics to measure the accuracy of the analysis generated by the data scientists
B) Proper business KPIs to analyse the business impact generated by the analysis
Fig. 1: Data Science
Future Scope
Health Care Sector
There is a huge requirement for data scientists in the healthcare sector because hospitals generate a large amount of data on a daily basis. Tackling such massive amounts of data is not possible for untrained staff. Hospitals need to keep records of patients' medical histories, bills, staff personal history, and much other information. Data scientists are being hired in the medical sector to enhance the quality and safety of this data.
Transport Sector
The transport sector requires data scientists to analyse the data collected through passenger counting systems, asset management, location systems, fare collection, and ticketing.
E-commerce
The e-commerce industry is booming partly because of data scientists, who analyse the data and create customized recommendation lists that provide great results to end-users.
2. Python programming.
3. Data science concepts.
4. Analysing the data.
5. Plotting charts.
6. Visualising the models.
7. ML libraries: Scikit-learn, NumPy, Matplotlib, Pandas.
Methodologies
The trainer used several facilitation techniques, including question and answer, brainstorming, group discussions, case study discussions, and practical implementation of some of the topics by the trainees. This mix of training methodologies was used to make sure all the participants grasped the concepts fully and practised what they learned: what is only heard from the trainers can be forgotten, but what the trainees do themselves is retained. After the post-tests were administered and the final course evaluation forms were filled in by the participants, the trainer delivered his closing remarks and reiterated the importance of the training for the trainees in their daily activities and their readiness to apply the learnt concepts in their assigned tasks. Certificates of completion were distributed among the participants at the end.
Chapter 2
Technology Implemented
Python is a widely used general-purpose, high-level programming language. It was initially designed by Guido van Rossum, first released in 1991, and is developed by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.
Features
• Interpreted
In Python there are no separate compilation and execution steps as in C/C++. Programs run directly from the source code. Internally, Python converts the source code into an intermediate form called bytecode, which is then executed by the Python virtual machine.
• Platform Independent
Python programs can be developed and executed on multiple operating system platforms. Python can be used on Linux, Windows, macOS, Solaris and many more.
• Multi-Paradigm
Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional
programming and aspect-oriented programming.
• Simple
Python is a very simple language. It is very easy to learn as its syntax is close to the English language. In Python, more emphasis is on the solution to the problem rather than on the syntax.
• Rich Library Support
The Python standard library is very vast. It can help with various tasks involving regular expressions, documentation generation, unit testing, threading, databases, web browsers, CGI, email, XML, HTML, WAV files, cryptography, GUIs and many more.
• Free and Open Source
Firstly, Python is freely available. Secondly, it is open source. This means that its source code is available to the public. We can download it, change it, use it, and distribute it. This is called FLOSS (Free/Libre and Open-Source Software). As the Python community, we are all headed toward one goal: an ever-better Python.
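To make the "Interpreted" point above concrete, the standard-library dis module can show the intermediate bytecode that Python compiles source code to before the virtual machine executes it; a minimal sketch:

```python
import dis

def add(a, b):
    return a + b

# compile() turns source text into a code object: the bytecode form.
code = compile("x = 1 + 2", "<example>", "exec")
print(type(code).__name__)  # -> code

# dis.dis() disassembles a function into a readable instruction listing
# (exact opcode names vary between Python versions).
dis.dis(add)
```

Nothing here is specific to this report; it simply demonstrates the compilation step the text describes.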
Libraries used:
1. Pandas
Pandas is a Python library used for working with data sets. It has functions for analysing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas allows us to analyse big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant. Relevant data is very important in data science.
Syntax:
import pandas as pd
Example:
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
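Since the text notes that Pandas can clean messy data sets, here is a minimal sketch of dropping and filling missing values; the data frame below is made up purely for illustration:

```python
import pandas as pd

# A toy "messy" dataset with missing values (illustrative only).
df = pd.DataFrame({
    "player": ["Rohit", "Dhoni", None, "Kohli"],
    "runs":   [45, None, 12, 73],
})

cleaned = df.dropna(subset=["player"])  # drop rows with no player name
cleaned = cleaned.fillna({"runs": 0})   # treat missing runs as 0
print(cleaned)
```

dropna and fillna are the two basic Pandas cleaning operations; real cleaning pipelines combine many such steps.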
2. Matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility. Matplotlib was created by John D. Hunter. Matplotlib is open source and we can use it freely. Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript for platform compatibility.
Syntax:
import matplotlib.pyplot as plt
Example:
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
3. NumPy
NumPy is a Python library. NumPy is used for working with arrays. NumPy is short for
"Numerical Python".
Syntax:
import numpy as np
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
4. Plotly
Plotly is a Python library for building interactive charts, such as line plots, scatter plots and pie charts, which can be viewed in a web browser or embedded in notebooks.
Chapter 3
Implementation
The next task, as in any data-related task, was pre-processing the datasets and performing data cleaning in order to make the analysis easier. For instance, a few franchises have changed the names of their teams; these needed to be updated. There were also some errors with the names of stadiums: there were two different names for the same stadium, and so on. I also replaced the team names with their initials to make their representation less complex.
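The renaming step described above can be sketched with DataFrame.replace; the column names, the rename mappings and the toy rows below are illustrative assumptions, since the actual dataset is not shown in the report:

```python
import pandas as pd

# Toy matches table (column names assumed).
matches = pd.DataFrame({"team1": ["Delhi Daredevils", "Chennai Super Kings"],
                        "team2": ["Mumbai Indians", "Deccan Chargers"]})

# Franchises that changed names are mapped to the current name,
# then every team name is shortened to its initials.
renames  = {"Delhi Daredevils": "Delhi Capitals",
            "Deccan Chargers": "Sunrisers Hyderabad"}
initials = {"Delhi Capitals": "DC", "Chennai Super Kings": "CSK",
            "Mumbai Indians": "MI", "Sunrisers Hyderabad": "SRH"}

matches = matches.replace(renames).replace(initials)
print(matches)
```

Chaining the two replace calls keeps the "fix old names first, then abbreviate" order explicit.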
On finding that the two most successful teams in the league have been Chennai Super Kings (CSK) and Mumbai Indians (MI), a head-to-head analysis of the teams was also done to identify who actually dominated the battle. MI turned out to be on top of CSK.
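The head-to-head count described above can be sketched in pandas as follows; the column names (team1, team2, winner) and the toy rows are assumptions, not the report's actual data:

```python
import pandas as pd

# Toy matches table; real IPL datasets have similar columns.
matches = pd.DataFrame({
    "team1":  ["CSK", "MI", "CSK", "MI", "CSK"],
    "team2":  ["MI", "CSK", "MI", "CSK", "MI"],
    "winner": ["MI", "MI", "CSK", "MI", "CSK"],
})

# Keep only CSK-vs-MI games, then count wins per side.
h2h = matches[matches[["team1", "team2"]].isin(["CSK", "MI"]).all(axis=1)]
wins = h2h["winner"].value_counts()
print(wins)
```

The isin/all filter keeps rows where both sides are one of the two teams, so the same pattern works on a full season table containing other fixtures.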
Player analysis:
The process of identifying the most successful player turned out to be the most complex and time-consuming. There were several grounds on which the most successful player could be identified: for example, most Man of the Match awards, highest run scorers, highest wicket takers, best all-rounders, best death bowlers, best death batsmen, most economical bowlers, etc.
Most Man of the Match winners:
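A hedged sketch of counting Man of the Match awards with value_counts; the column name player_of_match and the toy values are assumptions standing in for the real dataset:

```python
import pandas as pd

# Toy data; one row per match, with the award winner for that match.
matches = pd.DataFrame({"player_of_match":
    ["AB de Villiers", "CH Gayle", "AB de Villiers",
     "RG Sharma", "AB de Villiers"]})

# value_counts() tallies awards per player, sorted most-to-least.
top_mom = matches["player_of_match"].value_counts().head(3)
print(top_mom)
```

On a full dataset the same one-liner yields the ranking plotted in the corresponding figure.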
Fig. 7: Highest run getters in the league
Fig. 9: Excluding irrelevant wickets for bowlers
Fig. 11: Top all-rounders in the league
Death over analysis of players:
Overs 16-20 are considered the death overs of the innings. This is generally considered one of the most challenging phases of the game for any team, batsman or bowler. Hence, we have performed a separate death-over analysis for batsmen and bowlers, to find out which batsman has scored the most runs (or runs at the highest strike rate) and which bowler has taken the maximum wickets in the death overs.
Fig. 12: Highest death runs scorers in both innings of the game.
Fig. 13: Highest wicket takers of the league in death overs.
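The death-over aggregation behind the two figures above can be sketched as follows, assuming a ball-by-ball table with over, batsman and batsman_runs columns (the values shown are toy data, not the real deliveries dataset):

```python
import pandas as pd

# Toy ball-by-ball data (columns assumed to mirror a deliveries dataset).
deliveries = pd.DataFrame({
    "over":         [3, 16, 18, 19, 20, 17],
    "batsman":      ["MS Dhoni", "MS Dhoni", "KA Pollard",
                     "MS Dhoni", "KA Pollard", "MS Dhoni"],
    "batsman_runs": [1, 6, 4, 6, 2, 1],
})

# Keep only overs 16-20, then total runs per batsman, highest first.
death = deliveries[deliveries["over"].between(16, 20)]
death_runs = (death.groupby("batsman")["batsman_runs"]
                   .sum().sort_values(ascending=False))
print(death_runs)
```

The same filter-then-groupby pattern, with a wicket column instead of runs, gives the death-over bowling ranking.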
Toss decision:
Fig. 17: Toss winning decision Fig. 18: Win % batting second
It is quite visible from the pie chart that the majority of toss-winning teams have decided to field first and chase the score on the board. The reason for this decision can be inferred from the adjoining pie chart.
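A pie chart like the one described can be produced with a sketch along these lines; the column name toss_decision and the toy values are assumptions rather than the actual dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: one toss decision per match.
matches = pd.DataFrame({"toss_decision":
    ["field", "field", "bat", "field", "bat", "field"]})

# Count decisions and draw them as a pie chart with percentage labels.
counts = matches["toss_decision"].value_counts()
counts.plot.pie(autopct="%1.0f%%")
plt.ylabel("")
plt.title("Toss decision")
plt.savefig("toss_decision.png")
```

value_counts feeds directly into Series.plot.pie, so the figure stays in sync with the underlying tally.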
Fig. 20: Team wise toss decision analysis.
Chapter 4
Conclusion
The given datasets needed to be pre-processed for several reasons. We were then able to analyse the data properly and draw proper conclusions. Top teams were identified and their head-to-head clashes were counted. Top run scorers, highest wicket takers, best all-rounders, best death batsmen and best death bowlers were also listed. With the help of another dataset, we identified which factors led to winning matches, i.e., venue and toss decision.
Learning Experience
With the completion of this project, I take away innumerable learnings. This project motivated me to explore and learn new things, and to push myself beyond my own limits and expectations. It also helped me improve my presentation and communication skills.
Chapter 5
References
Python - https://developers.google.com/edu/python/
Pandas - https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
Data visualization - https://matplotlib.org/gallery/index.html
Matplotlib - https://www.youtube.com/watch?v=UO98lJQ3QGI&list=PLosiE80TeTvipOqomVEeZ1HRcEvtZB_