Grade 10 Unit 4 - Data Science
• Data science is the study of data to extract meaningful insights for business.
• Data science mainly involves analysing data. When it comes to AI, this analysis helps in
making the machine intelligent enough to perform tasks by itself.
• Data science combines statistics, data analysis, machine learning and their related
methods in order to understand and analyse real-world phenomena using data.
• Data Science employs techniques and theories drawn from many fields within the context of
Mathematics, Statistics, Computer Science, and Information Science to analyze large amounts of data.
Applications of Data Science
Data Science is a branch of computer science where we study how to store, use and analyze data for deriving information from it.
There exist various applications of Data Science in today’s world. Some of them are:
1. Fraud and Risk Detection:
Fraud and risk detection is crucial for protecting businesses, customers, and individuals from
financial losses and other negative impacts.
By using data analysis and intelligent algorithms, organizations can identify and respond to
fraudulent activities and potential risks more effectively, enhancing security and trust.
Eg: When a customer applies for a bank loan, data science is used to analyse the customer's data,
such as their profile, their past debts, and whether they settled those debts properly or failed to do so.
The earliest applications of data science were in finance. Companies were fed up with bad debts and
losses every year. However, they had a lot of data that used to be collected during the initial
paperwork while sanctioning loans. They decided to bring in data scientists to rescue
them from losses.
Sources of Data
There are various sources from which we can collect the data we need. The data
collection process can be categorised in two ways:
Offline Data Collection - Sensors, Surveys, Interviews, Observations
Online Data Collection - Open-sourced Government Portals, Reliable Websites (Kaggle), World
Organisations’ open-sourced statistical websites
While accessing data from any of the data sources, the following points should be kept in mind:
1. Only data that is available for public use should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone’s privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be
wrong or unusable.
5. Reliable sources of data ensure the authenticity of data which helps in proper training of the AI
model.
Types of Data
For Data Science, usually the data is collected in the form of tables. These tabular datasets can be stored in
different formats. Some of the commonly used formats are:
1. CSV: CSV stands for comma separated values. It is a simple file format used to store tabular data. Each
line of the file is a data record, and each record consists of one or more fields which are separated by
commas. Since the values in each record are separated by commas, these files are known as CSV files.
2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and
recording data, with rows and columns into which information can be entered. Microsoft Excel is a
program which helps in creating spreadsheets.
3. SQL: SQL stands for Structured Query Language. It is a domain-specific
language designed for managing data held in different kinds of DBMS
(Database Management System). It is particularly useful in handling structured data.
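As a rough illustration of the CSV and SQL formats described above, the sketch below parses a small CSV text and then queries the same records with SQL. It uses only Python's built-in `csv`, `io` and `sqlite3` modules. The student names, marks and the `students` table are made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical CSV text: the first line is the header,
# and every later line is one data record.
csv_text = """name,marks
Asha,82
Ravi,74
Meena,91
"""

# csv.DictReader splits each line on commas and pairs the values
# with the header names. All values are read as strings.
records = list(csv.DictReader(io.StringIO(csv_text)))

# Store the same records in an in-memory SQLite database (a small DBMS)
# so that they can be queried using SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students (name, marks) VALUES (?, ?)",
    [(r["name"], int(r["marks"])) for r in records],
)

# SQL query: names of students who scored more than 80 marks.
top_scorers = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY name"
).fetchall()
conn.close()

print(records[0]["name"], records[0]["marks"])
print(top_scorers)
```

In practice the CSV text would usually come from a file opened with `open(...)` rather than from a string.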
Data Access
After collecting the data, we should be able to use it for programming purposes, so we should know how to
access it in Python code. To make our lives easier, there are various Python packages which help
us in accessing structured data (in tabular form) inside the code. Some of these Python packages are:
1. NumPy 2. Matplotlib 3. Pandas 4. Statistics
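Of the packages listed, `statistics` is part of Python's standard library, so it needs no installation. The short sketch below shows three of its functions on a made-up list of marks. The numbers are assumptions for illustration only.

```python
import statistics

# Hypothetical marks of five students.
marks = [72, 85, 85, 90, 68]

mean_marks = statistics.mean(marks)      # average of all values
median_marks = statistics.median(marks)  # middle value after sorting
mode_marks = statistics.mode(marks)      # most frequently occurring value

print(mean_marks, median_marks, mode_marks)
```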
1. NumPy
• NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical
operations on arrays in Python.
• NumPy works with numbers and offers a wide range of arithmetic operations on them, giving
us an easier approach to working with them.
• NumPy also works with arrays. An array is a set of multiple values of the same
datatype, that is, it is a homogeneous collection of data.
• In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy comes with a
feature for creating n-dimensional arrays in Python.
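The points above can be sketched with a few lines of NumPy. This is a minimal example that assumes the `numpy` package is installed. The values are made up for illustration.

```python
import numpy as np

# A one-dimensional array of numbers.
scores = np.array([10, 20, 30, 40])

# Arithmetic is applied to every element at once.
doubled = scores * 2
total = scores.sum()

# A two-dimensional array (2 rows, 3 columns).
grid = np.array([[1, 2, 3], [4, 5, 6]])

# Arrays are homogeneous: mixing an integer and a decimal number
# makes NumPy store every value as a floating-point number.
mixed = np.array([1, 2.5, 3])

print(doubled, total)
print(grid.ndim, grid.shape)
print(mixed.dtype)
```

Here `ndim` gives the number of dimensions of the array and `shape` gives the size along each dimension.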
Difference Between Arrays and Lists
NumPy Arrays | Lists
Homogeneous collection of data. It can contain only one type of data. | Heterogeneous collection of data. It can contain multiple types of data.
Cannot be directly initialized. Can be created and operated on with the NumPy package only. | Can be directly initialized, as it is a part of Python syntax.
Widely used for arithmetic operations. | Widely used for data management.
Arrays take less memory space. | Lists take more memory space.
Concatenation, appending etc. are not built into the array itself (+ adds the elements instead of joining, and there is no append() method); NumPy functions such as numpy.concatenate and numpy.append are needed. | Concatenation, appending etc. are possible with lists directly.
Can be accessed only through package support. | Can be used directly in Python without any package support.
Example: import numpy ; A = numpy.array([1,2,3,4]) | Example: A = [1,2,3,4,5,6,7,8,9,0]
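The differences in the table can be tried out with a short sketch like the one below. It assumes `numpy` is installed, and the variable names are chosen only for this example.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [4, 5, 6]
arr_a = np.array(list_a)
arr_b = np.array(list_b)

# With lists, + joins the two lists end to end.
list_sum = list_a + list_b

# With arrays, + adds the values position by position.
arr_sum = arr_a + arr_b

# To join two arrays, a NumPy function is needed.
joined = np.concatenate([arr_a, arr_b])

# A list can hold values of different types.
mixed_list = [1, "two", 3.0]

print(list_sum)
print(arr_sum)
print(joined)
```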
2. Matplotlib
Matplotlib is a visualization library in Python for 2D plots of arrays (such as NumPy arrays).
One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in
easily digestible visuals.
Matplotlib comes with a wide variety of plots which help us to understand trends and patterns, and to
make correlations.
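A minimal sketch of a line plot is given below. It assumes `matplotlib` and `numpy` are installed. The sales figures and the file name `sales_trend.png` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly sales data.
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([12, 15, 14, 18, 21, 25])

# Create a figure with one set of axes and draw a line plot on it.
fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly sales trend")

# Save the plot as an image file.
fig.savefig("sales_trend.png")
```

Looking at the saved image, the upward trend in the numbers is easier to spot than in the raw list of values.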
3. Pandas [panel data]
Pandas is a software library written for the Python programming language for data manipulation and
analysis.
Pandas offers data structures and operations for manipulating numerical tables and time series.
The name Pandas is derived from "panel data", an econometrics term for data sets that include
observations over multiple time periods for the same individuals.
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL
values. This is called cleaning the data.
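Cleaning can be sketched as follows. The example assumes `pandas` is installed, and the small table of names and marks is made up. It uses `dropna()`, which returns a copy of the table without the rows that contain missing values.

```python
import pandas as pd

# Hypothetical table in which one student's marks are missing (None).
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "Kabir"],
    "marks": [82, None, 91, 74],
})

# dropna() removes every row that has an empty (missing) value.
cleaned = data.dropna()

print(len(data), len(cleaned))
print(cleaned["name"].tolist())
```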
Pandas is a Python library used for working with data sets. It is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
The data need not be labelled at all to be placed into a Pandas data structure.
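The kinds of data listed above can be sketched like this, assuming `pandas` is installed. All names, dates and values are made up for illustration.

```python
import pandas as pd

# Tabular data with columns of different types (text, whole numbers, decimals).
table = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "age": [15, 16],
    "height_m": [1.58, 1.65],
})

# Time series data: the dates do not need to be evenly spaced.
dates = pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"])
temperatures = pd.Series([21.5, 19.0, 23.2], index=dates)

# Data without any labels: Pandas numbers the values 0, 1, 2 by default.
unlabelled = pd.Series([5, 10, 15])

print(table.dtypes)
print(temperatures)
print(list(unlabelled.index))
```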