Data Modeling and Data Engineering
Uploaded by: Sakshi Jain

Data Modeling and Data Engineering

Content Overview:
• Data Modeling
o Conceptual Data Modeling
o Logical Data Modeling
o Physical Data Modeling
o Identifying and non-identifying relationships
o Relationship cardinalities (one-to-one, one-to-many, many-to-many)
o How to resolve the many-to-many relationship problem

• Data Engineering
o Data Engineering with SQL
o SQL Basics
o Understanding OLTP and OLAP
o Understanding Joins
o Aggregate Functions
o Analytical Functions
Data Modeling
Data Model
A data model is a diagram that displays a set of tables and the relationships between
them. We can understand a lot more by looking at a data model diagram than at a list of
tables; it helps us understand the purpose of each table as well as its dependencies. A data
model applies to any software development that involves creating database objects to
store and manipulate data, which includes transactional systems as well as data warehouse
systems. When a data model is designed, we progress through three main stages:
➢ Conceptual data model.
➢ Logical data model.
➢ Physical data model.

Conceptual data model


A conceptual data model is just a set of square shapes connected by lines. Each square
shape represents an entity, and each line represents a relationship between entities. A
conceptual data model can be easily drawn on a whiteboard or a piece of paper; it need not
be a digital document, which makes it quick and easy to change and rapidly update.
So, what are some of the attributes of a conceptual data model? First, it is highly abstract.
When we say abstract, we refer to the fact that it does not carry much detail. It sits
at a very high level; hence we call it highly abstract.
Second, it is easily understood. Whether the user is a technical or a non-technical person,
it is easy for anyone to understand what the model is about.

Looking at this diagram, it is easy to see that there are four main entities: time,
product, sales, and store. The three entities time, product, and store each have a direct
relationship with the sales entity. A lot of information can be obtained just by looking at
the conceptual data model, and since it is not a digital document it can be easily enhanced.
Notice that only the entities are visible; the attributes are not, but we will talk about
them in just a bit. Even the relationships are quite abstract: we know that product is
connected to sales, but the column on which the relationship is established is not yet
clear. This is a way of hiding complexity at the very initial stages, and since a
conceptual model can be written on a piece of paper or a whiteboard, you do not need a
software tool to create one. That makes it a whole lot easier. Once the conceptual data
model is finalized, we elaborate it into a logical data model. So, let's look at a logical
data model.

Logical data model


The logical data model expands the conceptual data model by adding more detail. What
are those details? First, you will notice the presence of attributes: what used to be a
simple square shape now has a list of attributes.
These attributes are further identified as key attributes and non-key attributes. Key
attributes are the attributes that define the uniqueness of the entity; in the time entity
it is the date. That is a key attribute. Similarly, we have Product ID for product and
Store ID for store. In the logical data model, a line is drawn within each entity.

All the attributes displayed above the line form the key attributes, and all the other
attributes below the line are called non-key attributes, meaning they do not help in
uniquely identifying a record. An example is category in the product entity: category is
something that can repeat across a number of records, hence it is a non-key attribute, and
that is why it is listed below the line. Then we have the primary key-foreign key
relationships clearly defined. The key attributes of each entity can be used as primary
keys, and these primary keys are referenced as foreign keys in the sales entity, as is
apparent from "FK" enclosed in parentheses. This is detail that was not available in the
conceptual data model. The other thing to notice is the user-friendly attribute names: any
technical or non-technical person can easily understand what each entity means, and it
doesn't take much time to understand what each column means because the names are
self-explanatory, which helps readability. All these additions make the logical model more
detailed than the conceptual model. At this stage, the logical model is not dependent on
any specific database, meaning you can take this logical model and implement it in any
database. It may be Oracle, it may be SQL Server, or it could even be an OLAP tool such as
SQL Server Analysis Services. These additional properties also make a logical data model
slightly more difficult to update than a conceptual model. Once you have finalized the
logical data model, we go into the last step of data model design, which is the physical
data model.

Physical data model


A physical data model looks a little similar to a logical data model; however, there are
some significant changes. Here we no longer refer to entities as entities; instead, we
refer to them as tables, and what we used to call attributes in the logical data model we
now refer to as columns. Tables and columns are words specific to a database, whereas
entities and attributes are specific to logical data model design. So, when we create a
physical data model, we should clearly refer to tables and columns. The other thing you
notice is the column names. These column names are no longer user-friendly; instead, they
are database-compatible names. If you have worked on a database, you know that as a rule
you do not use a space when naming a table or a column. Although you can use a space, it
becomes very difficult when you are writing queries against those tables and columns.
Hence you avoid using any special characters or spaces between words.

One other thing we do is try to keep the column names as short as possible. As is evident
here, the short form for product is "prod", so product description is replaced with
prod_desc. These names are database compatible, which makes the life of a DBA a lot
easier, both in the database itself and in any queries we are going to write. The same
applies to table names as well as column names. We have also introduced the concept of a
data type. Data types state what type of data is going to be stored in each column; here
we have VARCHAR, INTEGER, and FLOAT. These data types are specific to a database.
In this example, the physical data model is created for a Microsoft SQL Server database,
so these data types are specific to SQL Server. If you were creating a physical data model
for a different database, such as Oracle or MySQL, the data types would be different.
Hence, a physical data model is specific to a certain database, and this makes it
difficult for users to understand. Non-technical users will have a hard time understanding
what these tables mean, what these columns mean, and what these data types are for. So it
is usually not recommended to share the physical data model with users; you share only the
logical data model. A physical data model has more detail than a logical data model, which
makes it even more difficult to enhance in comparison to a logical model.
So, let's assume that you got sign-off on the logical data model and went ahead and
created a physical data model for a specific database. Now if there are any changes, you
first need to apply those changes to the logical data model and then to the physical data
model, and that kind of change takes time. Another kind of change is when the database
itself changes: suppose you now decide to implement this entire model in a different
database instead of SQL Server, which means a lot of effort must go into converting these
data types to something specific to the new database. Tables, columns, and data types are
the objects required to implement a physical data model.
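The physical model described above can be sketched as DDL. The table names, column names (prod_id, prod_desc, and so on), and data types below are illustrative assumptions based on the diagram the text describes; SQLite (via Python) is used here as a convenient runnable stand-in for SQL Server:

```python
import sqlite3

# In-memory database for demonstration; the text targets SQL Server,
# but the DDL below is kept portable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical physical model derived from the diagram: short,
# database-compatible names (prod_desc, not "Product Description").
cur.executescript("""
CREATE TABLE product (
    prod_id    INTEGER PRIMARY KEY,
    prod_desc  VARCHAR(100),
    category   VARCHAR(50)
);
CREATE TABLE store (
    store_id   INTEGER PRIMARY KEY,
    store_name VARCHAR(100)
);
CREATE TABLE sales (
    sale_date  VARCHAR(10),
    prod_id    INTEGER REFERENCES product(prod_id),  -- FK to product
    store_id   INTEGER REFERENCES store(store_id),   -- FK to store
    amount     FLOAT
);
""")

# List the tables that were created.
cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
tables = [r[0] for r in cur.fetchall()]
print(tables)
```

Running the script confirms the three tables exist; in SQL Server the same statements would simply use its native types (e.g. DATETIME for the date column).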

Identifying and Non-identifying Relationships


Identifying Relationship
As you can see, there are two tables: VEHICLE and VEHICLE OWNER. In the VEHICLE table,
Vehicle ID is the primary key, and in the VEHICLE OWNER table the primary key is a
composite of three columns: Vehicle ID, Vehicle Owner ID, and Vehicle Ownership Start Date.

In a data model, parent tables and child tables are present, connected by a relationship
line. Here VEHICLE is the parent table and VEHICLE OWNER is the child table. If the
referenced column in the child table is part of the child table's primary key, the
relationship is drawn as a thick line connecting the parent table and the child table.
Vehicle_ID is the column referenced in the child table VEHICLE OWNER from the parent table
VEHICLE, and Vehicle_ID is part of the primary key in the VEHICLE OWNER child table, so
the VEHICLE and VEHICLE OWNER tables are connected by a thick relationship line. This is
called an identifying relationship.
Non-identifying Relationship

Here the parent table is VEHICLE MANUFACTURER and the child table is VEHICLE. In a
non-identifying relationship, the referenced column in the child table is not part of the
primary key; it is a standalone column in the child table. Here you can see
Vehicle_ManufacturerID, which is the primary key in the VEHICLE MANUFACTURER table. But in
the VEHICLE table, Vehicle_ManufacturerID is a foreign key, not part of the primary key.
The relationship is therefore drawn as a dotted line connecting the two tables, which is
called a non-identifying relationship.
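Both relationship types can be sketched as DDL. The column names below are patterned on the VEHICLE example in the text but are otherwise hypothetical; SQLite (via Python) is used as a runnable stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE vehicle_manufacturer (
    vehicle_manufacturer_id INTEGER PRIMARY KEY
);

-- Non-identifying relationship: the foreign key is a standalone
-- column, NOT part of the child's primary key (dotted line).
CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY,
    vehicle_manufacturer_id INTEGER
        REFERENCES vehicle_manufacturer(vehicle_manufacturer_id)
);

-- Identifying relationship: the foreign key (vehicle_id) is PART
-- of the child's composite primary key (thick line).
CREATE TABLE vehicle_owner (
    vehicle_id INTEGER REFERENCES vehicle(vehicle_id),
    vehicle_owner_id INTEGER,
    vehicle_ownership_start_date TEXT,
    PRIMARY KEY (vehicle_id, vehicle_owner_id, vehicle_ownership_start_date)
);
""")

# PRAGMA table_info reports, for each column, whether it belongs to
# the primary key (the last field is its 1-based position in the PK).
cur.execute("PRAGMA table_info(vehicle_owner)")
owner_pk = [row[1] for row in cur.fetchall() if row[5] > 0]
cur.execute("PRAGMA table_info(vehicle)")
vehicle_pk = [row[1] for row in cur.fetchall() if row[5] > 0]
print(owner_pk)    # FK is inside the PK  -> identifying
print(vehicle_pk)  # FK is outside the PK -> non-identifying
```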

Relationship Cardinality
Cardinality is a mathematical term that refers to the number of elements in a given set.
Database administrators may use cardinality to count tables and values. In a database,
cardinality usually represents the relationship between the data in two different tables by
highlighting how many times a specific entity occurs compared to another. For example, the
database of an auto repair shop may show that a mechanic works with multiple customers
every day. This means that the relationship between the mechanic entity and the customer
entity is one mechanic to many customers.
However, each customer has exactly one vehicle that they bring to the auto repair shop
during their visit. This means the relationship between the customer entity and the car
entity is a one-to-one relationship. Using cardinality can help database administrators
automatically establish these relationships in a software program or database. This can
make it easy for users to see the correlation between mechanics, customers and cars when
searching for specific data or files.

Importance of Cardinality
Cardinality is important because it creates links from one table or entity to another in a
structured manner. This has a significant impact on the query execution plan. A query
execution plan is a sequence of steps users can take to search for and access data stored in
a database system. Having a well-structured query execution plan can make it easier for
users to locate the data they need quickly. Cardinality can be applied to databases for a
variety of reasons, but businesses typically use the cardinality model to analyze information
about their customers or their inventory.
For example, an online retailer may have a database table that lists each one of its unique
customers. They may also have another database table that lists all the purchases
customers have made from their store. Since it's likely that each customer purchased
multiple items from the store, the database administrator may represent this by using a
one-to-many cardinality relationship that links each customer in the first table to all the
purchases they made in the second table.

Types of Cardinality


➢ One-to-One
➢ One-to-Many
➢ Many-to-Many

One-to-One Relationship Cardinality


The ONE-TO-ONE (1:1) RELATIONSHIP means that one row in a database table relates to
exactly one row in a second table. In an ER diagram, 1:1 means that one occurrence of an
entity is related to exactly one occurrence of a second entity.
Examples of the 1:1 relationship include student to student contact details, country or
state to capital city, and person to social security or identity number.

The 1 to 1 relationship is notated in an ER diagram with a single line connecting the two
entities. In our scenario, the line connects the Student entity to the Student Contact Details
entity. The two perpendicular lines (|) indicate a mandatory relationship between the two
entities. In other words, the student must have contact details, and the contact details must
have a related student.
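One common way to enforce a 1:1 relationship in SQL is a UNIQUE, NOT NULL foreign key. The sketch below uses hypothetical student tables (SQLite via Python); the second contact row for the same student is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name TEXT
);
-- The UNIQUE, NOT NULL foreign key means each student can have at
-- most one contact-details row: a one-to-one relationship.
CREATE TABLE student_contact_details (
    contact_id INTEGER PRIMARY KEY,
    student_id INTEGER UNIQUE NOT NULL REFERENCES student(student_id),
    phone TEXT
);
""")
cur.execute("INSERT INTO student VALUES (1, 'Asha')")
cur.execute("INSERT INTO student_contact_details VALUES (1, 1, '555-0100')")
try:
    # A second contact row for student 1 violates the UNIQUE constraint.
    cur.execute("INSERT INTO student_contact_details VALUES (2, 1, '555-0101')")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)
```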

One-To-Many Relationship Cardinality


The ONE-TO-MANY (1: N) RELATIONSHIP is the most common database relationship. It is
used to indicate the relationship between the majority of tables found in a relational
database.
In summary, the one-to-many relationship means that one row in a database table relates
to many rows in a second table. It is also known as a Primary Key-Foreign Key relationship
because it uses primary keys and foreign keys to enforce this relationship.
There are innumerable instances of a 1 to N relationship, including a student to subjects,
courses or degrees to a student, and a sales invoice to invoice transactions.
In an ER diagram, the cardinality of one-to-many is notated with a line joining the two
entities. The connectors reflect the different characteristics of this relationship. The
single vertical line (on the left side of the relationship line) indicates that this
connector has only one row affected by the relationship.
The crow’s foot with an open circle indicates that this connector has many rows influenced
by this relationship. The open circle indicates optionality. In other words, there does not
have to be a student enrollment record linked to a course.

Many-to-Many Relationship Cardinality


The MANY-TO-MANY RELATIONSHIP means that many rows in one table are related to many
rows in a second table. In other words, many instances in one entity correlate with many
instances in a second entity. For example, a student can sign up for many classes, and a class
can have many students signed up.
It is slightly more difficult to model a cardinality of many-to-many. A direct many-to-many
relationship between these two example entities is not possible. A cross-reference table is
required to convert this relationship into two one-to-many relationships.

As with the one-to-many relationship described above, the relationship between two
entities is indicated by a line between them. The connectors on each end describe the
nature of this relationship.
The single vertical line (|) on the Students entity side indicates that the connector only has
one row affected by this relationship. And the crow’s foot on the other side of the line
shows that this relationship influences multiple rows.
The middle table (Class Student) consists of two primary/foreign keys, one of which is the
primary key for the Students table and the other the primary key for the Classes table.
Therefore, there must be a StudentID and a ClassID for each row in the Class-Student table.
Because these elements of the Class Student table are also primary keys of the entity on
each side of it, each element has to exist in the Students and Classes tables, respectively.
How to Resolve the Many-to-Many Relationship Problem
Many-to-many (M:N) relationships add complexity and confusion to your model and to the
application development process. The key to resolving M:N relationships is to separate the
two entities and create two one-to-many (1:N) relationships between them with a third
intersect entity. The intersect entity usually contains attributes from both connecting
entities.

The telephone directory example has an M:N relationship between the name and fax entities,
as shown in the figure. The business rules say, "One person can have zero, one, or many
fax numbers; a fax number can be for several people." Based on what we selected earlier as
the primary key for the voice entity, an M:N relationship exists.
A problem exists in the fax entity because the telephone number, which is designated as
the primary key, can appear more than once in the fax entity; this violates the
qualification of a primary key. Remember, the primary key must be unique.
To resolve this M:N relationship, you can add an intersect entity between the name and fax
entities, as shown in the figure. The new intersect entity, fax name, contains two
attributes, fax_num and rec_num. The primary key for the entity is a composite of both
attributes. Individually, each attribute is a foreign key that references the table from
which it came. The relationship between the name and fax name tables is 1:N because one
name can be associated with many fax numbers; in the other direction, each fax name
combination can be associated with one rec_num. The relationship between the fax and fax
name tables is 1:N because each number can be associated with many fax name combinations.
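The resolution described above can be sketched as DDL. The entity and key names (fax_num, rec_num, fax_name) come from the text; the lname column and the sample data are hypothetical, and SQLite (via Python) is used as a runnable stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE name (
    rec_num INTEGER PRIMARY KEY,
    lname   TEXT
);
CREATE TABLE fax (
    fax_num TEXT PRIMARY KEY
);
-- Intersect entity: a composite primary key of the two foreign keys
-- turns one M:N relationship into two 1:N relationships.
CREATE TABLE fax_name (
    fax_num TEXT REFERENCES fax(fax_num),
    rec_num INTEGER REFERENCES name(rec_num),
    PRIMARY KEY (fax_num, rec_num)
);
""")
cur.executemany("INSERT INTO name VALUES (?, ?)", [(1, "Patel"), (2, "Rao")])
cur.executemany("INSERT INTO fax VALUES (?)", [("555-0100",), ("555-0200",)])
# One person has many fax numbers, and one fax number serves two people.
cur.executemany("INSERT INTO fax_name VALUES (?, ?)",
                [("555-0100", 1), ("555-0200", 1), ("555-0100", 2)])
cur.execute("""
    SELECT n.lname, COUNT(*) AS faxes
    FROM name n JOIN fax_name fn ON n.rec_num = fn.rec_num
    GROUP BY n.lname ORDER BY n.lname
""")
rows = cur.fetchall()
print(rows)
```

Each fax_name row is unique, so the "repeated primary key" problem in the fax entity disappears.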
Data Engineering
Businesses produce a lot of data. Everything from customer feedback to sales performance
and stock price influences how a company operates. But understanding what stories the
data tells isn’t always easy or intuitive, which is why many businesses rely on data
engineering.
Data engineering is the process of designing and building systems that let people collect and
analyze raw data from multiple sources and formats. These systems empower people to find
practical applications of the data, which businesses can use to thrive.

Importance
Companies of all sizes have huge amounts of disparate data to comb through to answer
critical business questions. Data engineering is designed to support the process, making it
possible for consumers of data, such as analysts, data scientists and executives, to inspect
all the data available reliably, quickly and securely.
Data analysis is challenging because the data is managed by different technologies and
stored in various structures. Yet, the tools used for analysis assume the data is managed by
the same technology and stored in the same structure. This rift can cause headaches for
anybody trying to answer questions about business performance.
For example, consider all the data a brand collects about its customers:

• One system contains information about billing and shipping


• Another system maintains order history
• And other systems store customer support, behavioral information, and third-party data
Together, this data provides a comprehensive view of the customer. However, these
different datasets are independent, which makes answering certain questions — like what
types of orders result in the highest customer support costs — very difficult.
Data engineering unifies these data sets and lets you find answers to your questions quickly
and efficiently.
Data engineers play a crucial role in designing, operating, and supporting the increasingly
complex environments that power modern data analytics. Once a data set has been fully
cleaned and formatted through data engineering, it is easier and faster to read and
understand.

Data Engineering Tools


Data engineers use many different tools to work with data. They use a specialized skill set to
create end-to-end data pipelines that move data from source systems to target
destinations.
Data engineers work with a variety of tools and technologies, including:
• ETL Tools: ETL (extract, transform, load) tools move data between systems. They
access data, then apply rules to “transform” the data through steps that make it
more suitable for analysis.
• SQL: Structured Query Language (SQL) is the standard language for querying
relational databases.
• Python: Python is a general programming language. Data engineers may choose to
use Python for ETL tasks.
• Cloud Data Storage: Including Amazon S3, Azure Data Lake Storage (ADLS), Google
Cloud Storage, etc.

• Query Engines: Engines run queries against data to return answers. Data engineers
may work with engines like Spark, Flink, and others.

Data Engineering with SQL


Table
A table is a collection of related data organized in rows and columns. Here is a simple
example of a table containing data about different students: their ID, name, age, and
course. Each type of information arranged vertically is known as a column or field, and
the data for each student is known as a row or record.

Each record holds the complete data for a specific student. In this way we can make as
many tables as needed with different combinations of data.
A table is an organized arrangement of data and information in tabular form, containing
rows and columns, which makes it easier to understand and compare data.
Database
A database is a collection of multiple tables in a single container. One type of database
is the relational database: when a database contains multiple tables that are related to
each other in a specific manner, that database is known as a relational database.

A database is stored on a computer and so can be easily modified. A database is a large
collection of data or information, specifically a large number of tables, organized in a
way in which it can be easily updated or accessed with a computer system.

DBMS (Data Base Management System)


We all know that we need software, basically an app, in which we can do things. Similarly,
in order to make a database, edit a database, or do anything related to a database, we
need software. The type of software that deals with databases is known as a database
management system, or DBMS.

A DBMS, or database management system, is a software package designed to define,
manipulate, retrieve, and manage data in a database. It helps users store and retrieve
data. Retrieving data means accessing the data that is stored so that you can display or
copy it.

Structured Query Language (SQL)


A DBMS, or database management system, is software that works with databases. But we know
that software cannot work on its own; we have to give it instructions about what is to be
done and how it is to be done. Consider the command prompt: it is software on our computer
in which we write commands and execute them to do a specific task, written in a specific
way and in a specific syntax.
Similarly, in a DBMS, when we need to perform a task on a database, we give commands, and
these commands are written in SQL, or Structured Query Language.
So SQL is simply a programming language, a set of syntax and rules, which helps us give
instructions to the DBMS software to work with databases. Indirectly, SQL helps us
communicate with a database through a database management system. There are many database
management systems that can be used to manage databases, but they all use a common
language: Structured Query Language, or SQL.
I hope you are now clear on the concept of SQL; let's take a look at its definition, which
will help you understand it in a better way. Structured Query Language, or SQL, is a
computer language for the management of databases and data manipulation. SQL is used to
query, insert, update, and modify data in a database. It contains many commands that a
user can execute to perform operations on a database.

Basic Queries in SQL


➢ Data Definition Language
➢ Data Manipulation Language
➢ Data Control Language
➢ Transaction Control Language

Data Definition Language (DDL)


DDL, or Data Definition Language, consists of the SQL commands that can be used to define
the database schema. Let's see the DDL commands.
Create: - The Create command is used to create the database or its objects, such as
tables, indexes, functions, views, stored procedures, and triggers.
Drop: - The Drop command is used to delete objects from the database.
Truncate: - The Truncate command is used to remove all records from a table.
Alter: - The Alter command is used to add, delete, or modify the structure of objects in
the database.
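A minimal sketch of the DDL commands in action, using SQLite via Python. The table and column names are hypothetical, and since SQLite has no TRUNCATE statement, an unqualified DELETE stands in for it here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a new table.
cur.execute("CREATE TABLE student (id INTEGER, name TEXT)")
# ALTER: modify the structure by adding a column.
cur.execute("ALTER TABLE student ADD COLUMN course TEXT")
cur.execute("INSERT INTO student VALUES (1, 'Asha', 'SQL')")
# SQLite has no TRUNCATE; an unqualified DELETE removes all rows.
# On SQL Server you would write: TRUNCATE TABLE student;
cur.execute("DELETE FROM student")
cur.execute("SELECT COUNT(*) FROM student")
remaining = cur.fetchone()[0]
# DROP: remove the table object itself from the database.
cur.execute("DROP TABLE student")
cur.execute("SELECT COUNT(*) FROM sqlite_master WHERE name = 'student'")
table_exists = cur.fetchone()[0]
print(remaining, table_exists)
```

Note the difference: DELETE/TRUNCATE empties the table but the table still exists, while DROP removes the table itself.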

Data Manipulation Language (DML)


DML, or Data Manipulation Language, consists of the SQL commands that deal with the
manipulation of data present in the database.
Insert into: - This command is used to insert data into a table.
Update: - The Update command is used to update existing data within a table.
Delete: - The Delete command is used to delete records from a database table.
Select: - This command retrieves rows, optionally filtered by the condition described in
the Where clause.
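A minimal sketch of the four DML commands in sequence, on a hypothetical student table (SQLite via Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# INSERT INTO: add new rows.
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Asha", 21), (2, "Ravi", 23)])
# UPDATE: change existing data.
cur.execute("UPDATE student SET age = 22 WHERE id = 1")
# DELETE: remove rows that match the WHERE clause.
cur.execute("DELETE FROM student WHERE id = 2")
# SELECT ... WHERE: read back rows matching a condition.
cur.execute("SELECT name, age FROM student WHERE age >= 22")
rows = cur.fetchall()
print(rows)
```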
Transaction Control Language (TCL)
TCL, or Transaction Control Language, consists of the commands that deal with transactions
within the database.
Commit: - The Commit command commits a transaction.
Rollback: - The Rollback command rolls back a transaction in case an error occurs.
Save point: - The Save point command sets a savepoint within a transaction.
Set transaction: - The Set transaction command specifies characteristics for the
transaction.
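A minimal sketch of Commit, Save point, and Rollback (to a savepoint) on a hypothetical account table, using SQLite via Python. SET TRANSACTION is omitted because SQLite does not support it:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we
# can issue BEGIN / COMMIT / ROLLBACK statements ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 100)")

cur.execute("BEGIN")
cur.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
cur.execute("SAVEPOINT after_debit")  # mark a point inside the transaction
cur.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
# Something went wrong: undo only the work done after the savepoint.
cur.execute("ROLLBACK TO after_debit")
cur.execute("COMMIT")  # the first debit (-30) is made permanent

cur.execute("SELECT balance FROM account WHERE id = 1")
balance = cur.fetchone()[0]
print(balance)
```

The second update is undone by the rollback to the savepoint, while the first update survives the commit, leaving a balance of 70.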

Data Control Language (DCL)


DCL, or Data Control Language, consists of commands such as Grant and Revoke, which mainly
deal with rights, permissions, and other controls of the database system.
Grant: - The Grant command gives a user access privileges to the database.
Revoke: - This command withdraws a user's access privileges given by using the Grant
command.

Understanding OLTP and OLAP


OLAP
OLAP, or Online Analytical Processing, is a category of software tools that provides
analysis of data for business decisions. OLAP systems allow users to analyze database
information from multiple database systems at one time. We must keep one thing in mind:
the primary objective of OLAP is data analysis, not just data processing; it goes beyond
that.
Now let's consider some basic examples of OLAP systems. Any data warehouse is an example
of an OLAP system. The uses of an OLAP system are as follows: a company might compare its
sales in the month of January with the month of February, then compare those results with
another location, which may be stored in a separate database.
Amazon analyzes purchases made by its customers to come up with a personalized home page
showing products its customers are likely to be interested in, so this is one of the good
examples of OLAP systems.

Advantages of OLAP
• OLAP creates a single platform for all type of business analytical needs.
• The main benefit of OLAP is the consistency of information and calculations.
• Easily apply security restrictions on users and objects to comply with regulations
and protect sensitive data.
Disadvantages of OLAP
• Implementation and maintenance are dependent on IT professionals because the
traditional OLAP tools require a complicated modeling procedure.
• OLAP needs cooperation between people of various departments to be effective,
which might not always be possible.

OLTP
OLTP, or Online Transaction Processing, supports transaction-oriented applications in a
three-tier architecture. OLTP administers the day-to-day transactions of an organization.
Here we need to keep one major point in mind: the primary objective of OLTP systems is
data processing, not data analysis.
An example of an OLTP system is an ATM center. Assume that a couple has a joint account
with a bank. One day both simultaneously reach different ATM centers at precisely the same
time and want to withdraw the total amount present in their bank account. However, the
person who completes the authentication process first will be able to get the money. In
this case, the OLTP system makes sure that the withdrawn amount will never be more than
the amount present in the bank.
The key point to note here is that OLTP systems are optimized for transactional integrity
rather than data analysis.

Advantages of OLTP
• OLTP method administers daily transactions for an organization
• OLTP widens the customer base of an organization by simplifying individual
processes.

Disadvantages of OLTP
• If an OLTP system faces a hardware failure, online transactions get severely
affected.
• OLTP systems allow multiple users to access and change the same data at the same
time, which many times creates unprecedented situations.
SQL Joins
The information you want to retrieve is often stored in various tables. In such scenarios
you will need to join these tables to view the data in a much better way. This is where
the SQL join comes into the picture. The SQL join is a widely used clause in SQL, used
essentially to combine and retrieve data from two or more tables based on related columns,
or common fields, between them.

Now consider two tables. Here, table 1 has three columns, A, B, and C, and three records,
which for reference we will call one, two, and three. Similarly, table 2 also has three
columns, B, C, and D, and three records: three, four, and five. Here, different color
combinations represent the values present in the various columns. Now, instead of querying
each table every time to retrieve data, we will simply join these two tables, and the
result will be table 3.
Also make sure that when you are joining two tables, they must have a common column. Here
C is the common field, which forms the basis for joining these two tables.

Types of SQL joins


➢ Inner join
➢ Outer join
➢ Left join
➢ Right join

Inner Join
SQL inner join joins two tables based on a common column and selects the records that
have matching values in these columns.

Now when the condition is applied for these columns the query checks all the rows of table
1 and table 2. Only the rows that satisfy the join predicate are included in the resultant
table.
Syntax
SELECT Table1.column1, Table1.column2, Table2.column1, Table2.column2, and so on
FROM Table1
INNER JOIN Table2
ON Table1.column = Table2.column

The inner join syntax compares the rows of table 1 with table 2 to check whether anything
matches based on the condition provided in the ON clause; when the condition is met, it
returns the matched rows from both tables, with the columns selected in the SELECT clause.
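A minimal runnable sketch of an inner join, using hypothetical student and enrollment tables (SQLite via Python). The student without a matching enrollment row drops out of the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE student (student_id INTEGER, name TEXT);
CREATE TABLE enrollment (student_id INTEGER, course TEXT);
""")
cur.executemany("INSERT INTO student VALUES (?, ?)",
                [(1, "Asha"), (2, "Ravi"), (3, "Mei")])
cur.executemany("INSERT INTO enrollment VALUES (?, ?)",
                [(1, "SQL"), (2, "Python")])
# INNER JOIN keeps only rows with a match in BOTH tables:
# Mei has no enrollment, so she is dropped from the result.
cur.execute("""
    SELECT s.name, e.course
    FROM student s
    INNER JOIN enrollment e ON s.student_id = e.student_id
    ORDER BY s.name
""")
rows = cur.fetchall()
print(rows)
```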

Outer Join
The SQL outer join, also called the SQL full join or full outer join, is used to get all
the rows present in both tables. That means it returns all the records present in either
the left table (table 1) or the right table (table 2), even if there are no matching
records in the other table.

The syntax remains the same:

SELECT Table1.column1, Table1.column2, and so on up to Table2.column2
FROM Table1
FULL OUTER JOIN Table2
ON Table1.column = Table2.column

Here you must mention the same or a similar column name after the ON keyword.

Left Join
A left join, or left outer join, results in a table containing all the rows from the table
on the left side of the join (the first table) and only the rows that satisfy the join
condition from the table on the right side of the join (the second table). Any missing
values for rows from the right table in the result of the join are represented by NULL
values.
Syntax
SELECT column_list
FROM Table1
LEFT JOIN Table2
ON Table1.column = Table2.column;

So, in this way you can use the left join to display the records.
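A minimal sqlite3 sketch of the left join, again with hypothetical data; note how unmatched left-table rows carry NULL (None) for the right-table column.

```python
import sqlite3

# Hypothetical data: only c = 2 has a match in table2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (c INTEGER, a TEXT)")
conn.execute("CREATE TABLE table2 (c INTEGER, b TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(2, "p")])

# Left join: every row of table1 survives; unmatched rows get NULL
# for the columns coming from table2.
rows = conn.execute("""
    SELECT table1.c, table1.a, table2.b
    FROM table1
    LEFT JOIN table2 ON table1.c = table2.c
    ORDER BY table1.c
""").fetchall()
print(rows)  # [(1, 'x', None), (2, 'y', 'p'), (3, 'z', None)]
```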

Right Join
A right join (or right outer join) is the opposite of the left outer join. It follows the same rules, except that all the rows from the right table, and only the rows from the left table that satisfy the join condition, appear in the resultant table. That is, it returns every row from the right table together with the matching records from the left table.

Syntax
SELECT column_list
FROM Table1
RIGHT JOIN Table2
ON Table1.column = Table2.column;
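Older SQLite versions lack RIGHT JOIN, but because "Table1 RIGHT JOIN Table2" is equivalent to "Table2 LEFT JOIN Table1", the behaviour can be sketched in sqlite3 by swapping the tables; the data below is hypothetical.

```python
import sqlite3

# Hypothetical data: table2 (the "right" table) has an unmatched row c = 3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (c INTEGER, a TEXT)")
conn.execute("CREATE TABLE table2 (c INTEGER, b TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(2, "y")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(2, "p"), (3, "q")])

# Emulated "table1 RIGHT JOIN table2": every row of table2 survives,
# unmatched table1 columns become NULL.
rows = conn.execute("""
    SELECT table1.a, table2.c, table2.b
    FROM table2
    LEFT JOIN table1 ON table1.c = table2.c
    ORDER BY table2.c
""").fetchall()
print(rows)  # [('y', 2, 'p'), (None, 3, 'q')]
```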
Aggregate Function
• An aggregate function takes the values of multiple rows as input and, based on certain criteria, groups them into a single value of more significant meaning.
• It returns a single value.
• Aggregate functions are also used to summarize the data.
Some Important Aggregate Functions

COUNT
The COUNT function counts the total number of rows of a particular column of a table. COUNT works on both numeric and non-numeric data types; for example, you can count a Salary column, which is numeric, or a Name column, which is non-numeric.
SYNTAX:

SELECT COUNT(Column_Name)
FROM Table_Name;

SUM
SUM calculates the sum of the non-null values in the selected column. Note that we can only sum numeric values, not non-numeric ones. The query has the same shape as for COUNT; the only difference is that we use SUM in place of COUNT.
SYNTAX:

SELECT SUM(Column_Name)
FROM Table_Name;

AVG
We can use this function to calculate the average of a particular numeric column. Like SUM, AVG considers only non-null values. The AVG function is effectively the SUM function divided by the COUNT function.
SYNTAX:

SELECT AVG(Column_Name)
FROM Table_Name;
MIN
• The MIN function is used to find the minimum value of a certain column.
• It determines the smallest of all selected values of a column.
• It works on both numeric and non-numeric data types.
SYNTAX:

SELECT MIN(Column_Name)
FROM Table_Name;

MAX
• The MAX function is used to find the maximum value of a certain column.
• It determines the largest of all selected values of a column.
• It works on both numeric and non-numeric data types.
SYNTAX:

SELECT MAX(Column_Name)
FROM Table_Name;
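All five aggregate functions above can be demonstrated in one query. The sketch below uses Python's sqlite3 with a hypothetical employee table; it also shows the NULL-handling rules mentioned earlier (SUM, AVG, MIN, MAX, and COUNT(column) ignore NULLs, while COUNT(*) counts every row).

```python
import sqlite3

# Hypothetical data; Dee's salary is NULL to show NULL handling.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Ann", 100), ("Bob", 200), ("Cal", 300), ("Dee", None)])

# Each aggregate collapses the whole column into a single value.
row = conn.execute("""
    SELECT COUNT(*), COUNT(salary), SUM(salary), AVG(salary),
           MIN(salary), MAX(salary)
    FROM employee
""").fetchone()
print(row)  # (4, 3, 600, 200.0, 100, 300)
```

Note that AVG is 600 / 3 = 200, not 600 / 4: the NULL salary is excluded from both the sum and the count.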

Analytical Functions in SQL


An analytical function computes values over a group of rows and returns a single result for each row. This differs from an aggregate function, which returns a single result for an entire group of rows.

When you apply an analytical function to a group of rows in a table, it returns a single result for each row. On the other hand, when you apply an aggregate function to a group of rows, it returns a single row for each group. This is the key difference between analytical and aggregate functions. Some common analytical functions are described below.
RANK
The RANK function in SQL Server is a kind of ranking function. It assigns a number to each row within the partition of a result set: each row's rank is one plus the number of rows that precede it. When the RANK function finds identical values within the same partition, it assigns them the same rank number, and the rank of the next distinct value skips ahead by the number of tied rows. Therefore, this function does not always assign ranks in consecutive order.

Suppose we have a Demo_table with a Name column. Let us use the RANK function to assign ranks to the rows of Demo_table. The query for the desired output is:
SYNTAX:

SELECT Name, RANK() OVER (ORDER BY Name) AS Rank_no
FROM Demo_table;

In the output you can see that identical names are given the same rank, while the next distinct name receives a rank equal to its row number, leaving a gap after the tie.
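The gap left by RANK after a tie can be shown with a small sqlite3 sketch (window functions require SQLite 3.25+); the names are hypothetical sample data.

```python
import sqlite3  # bundled SQLite must be 3.25+ for window functions

# Hypothetical Demo_table with a duplicated name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Demo_table (Name TEXT)")
conn.executemany("INSERT INTO Demo_table VALUES (?)",
                 [("Alice",), ("Alice",), ("Bob",), ("Carol",)])

# Tied names get the same rank; the next name skips ahead, leaving a gap.
rows = conn.execute("""
    SELECT Name, RANK() OVER (ORDER BY Name) AS Rank_no
    FROM Demo_table
    ORDER BY Name
""").fetchall()
print(rows)  # [('Alice', 1), ('Alice', 1), ('Bob', 3), ('Carol', 4)]
```

Both Alices share rank 1, so Bob gets rank 3 rather than 2: ranks are not consecutive.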

DENSE RANK

The DENSE_RANK function assigns a rank to each row within a partition, according to the specified column value, without any gaps: it always assigns ranks in consecutive order. If we get a duplicate value, this function assigns it the same rank, and the next rank is the next sequential number. This characteristic distinguishes the DENSE_RANK() function from the RANK() function.
Consider the following employee table. Suppose we have to calculate the row number, rank, and dense rank of employees in the employee table according to salary within each department. The query for this calculation is:
SYNTAX:
SELECT
ROW_NUMBER() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_row_no,
Name, Dept, Salary,
RANK() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_rank,
DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_dense_rank
FROM employee;

The output table is the result of the above query. We can see that row numbers are consecutive integers within each partition. We can also see the difference between rank and dense rank: with dense rank there is no gap between rank values, whereas with rank a gap appears after a repeated rank.
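The three functions can be compared side by side in sqlite3; the employee rows below are hypothetical. With a tie on salary, ROW_NUMBER stays consecutive, RANK leaves a gap, and DENSE_RANK does not.

```python
import sqlite3  # requires SQLite 3.25+ for window functions

# Hypothetical employees; Ann and Bob tie on salary within IT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (Name TEXT, Dept TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("Ann", "IT", 300), ("Bob", "IT", 300),
                  ("Cal", "IT", 200), ("Dee", "HR", 500)])

# Row number, rank, and dense rank by salary within each department.
rows = conn.execute("""
    SELECT Name, Dept, Salary,
           ROW_NUMBER() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS rn,
           RANK()       OVER (PARTITION BY Dept ORDER BY Salary DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS drnk
    FROM employee
    ORDER BY Dept, Salary DESC, Name
""").fetchall()
for r in rows:
    print(r)
# e.g. Cal's row is ('Cal', 'IT', 200, 3, 3, 2):
# rank skips to 3 after the tie, dense rank continues with 2.
```

Note that among tied rows (Ann and Bob) the assignment of row numbers 1 and 2 is arbitrary, since the window ORDER BY cannot break the tie.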

Row Number
The ROW_NUMBER function returns a unique sequential number for each row within its partition. Numbering begins at one and increases by one until the partition's total number of rows is reached. Unlike RANK(), it returns different numbers even for rows with identical values.

Consider this employee table, and suppose we want to display the employees with the top five highest salaries. The query is as follows:
SYNTAX:
SELECT Emp_No, Name, Salary
FROM (SELECT Emp_No, Name, Salary,
      ROW_NUMBER() OVER (ORDER BY Salary DESC) AS Row_number
      FROM employee) ranked
WHERE Row_number <= 5;

In the output table you can see that row numbers are assigned in sequence even for rows with the same salary value.
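A runnable version of this top-N pattern in sqlite3 follows, with hypothetical employees; Emp_No is added as a tie-breaker in the window ORDER BY so the result is deterministic even when salaries repeat.

```python
import sqlite3  # requires SQLite 3.25+ for window functions

# Hypothetical employees; Bob and Cal tie on salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (Emp_No INTEGER, Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Ann", 900), (2, "Bob", 800), (3, "Cal", 800),
                  (4, "Dee", 700), (5, "Eve", 600), (6, "Fay", 500),
                  (7, "Gus", 400)])

# Number all rows by descending salary, then keep only the first five.
rows = conn.execute("""
    SELECT Emp_No, Name, Salary
    FROM (SELECT Emp_No, Name, Salary,
                 ROW_NUMBER() OVER (ORDER BY Salary DESC, Emp_No) AS rn
          FROM employee) ranked
    WHERE rn <= 5
    ORDER BY rn
""").fetchall()
print([r[1] for r in rows])  # ['Ann', 'Bob', 'Cal', 'Dee', 'Eve']
```

Both 800-salary employees receive distinct row numbers, so exactly five rows come back.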
LAG
The LAG function returns the previous row's data alongside the current row. If no previous row exists, it displays NULL with the current row. LAG() allows access to a value stored in a row above the current row; that row may be adjacent, or some number of rows above, as sorted by a specified column or set of columns.

Let’s consider the sale table for example and following query with a LAG () function:
SYNTAX:
SELECT Seller_name, Sale_value,
LAG(Sale_value) OVER (ORDER BY Sale_value) as previous_sale_value
FROM sale;

The result of this query is the output table. This simplest use of LAG () displays the value
from the adjacent row above. For example, the second record displays Alice’s sale amount
($12,000) with Stef’s ($7,000) from the row above, in
columns Sale_value and previous_sale_value, respectively. Notice that the first row does
not have an adjacent row above, and consequently the previous_sale_value field is empty
(NULL) since the row from which the value of Sale_value should be obtained does not exist.
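The sale-table example described above can be reproduced in sqlite3; the seller names and amounts below follow the text's narrative but are otherwise hypothetical.

```python
import sqlite3  # requires SQLite 3.25+ for window functions

# Hypothetical sale table matching the narrative: Stef $7,000, Alice $12,000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (Seller_name TEXT, Sale_value INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [("Stef", 7000), ("Alice", 12000), ("Bob", 15000)])

# LAG pairs each row with the Sale_value of the row just above it
# (ascending Sale_value order); the first row has no row above, so NULL.
rows = conn.execute("""
    SELECT Seller_name, Sale_value,
           LAG(Sale_value) OVER (ORDER BY Sale_value) AS previous_sale_value
    FROM sale
    ORDER BY Sale_value
""").fetchall()
print(rows)
# [('Stef', 7000, None), ('Alice', 12000, 7000), ('Bob', 15000, 12000)]
```

Alice's row carries Stef's $7,000 as previous_sale_value, and Stef's own previous_sale_value is NULL.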
LEAD
This function displays the next row's data alongside the current row. If no next row is available, LEAD() displays NULL with the current row by default.
LEAD () is similar to LAG (). Whereas LAG () accesses a value stored in a row above, LEAD
accesses a value stored in a row below.

SYNTAX:
SELECT Seller_name, Sale_value,
LEAD(Sale_value) OVER (ORDER BY Sale_value) as next_sale_value
FROM sale;
The rows are sorted by the column specified in ORDER BY (Sale_value). The LEAD () function
grabs the sale amount from the row below. For example, Stef’s own sale amount is $7,000
in the column Sale_value, and the column next_sale_value in the same record contains
$12,000. The latter comes from the Sale_value column for Alice, the seller in the next row.
Note that the last row does not have a next row, so the next_sale_value field is empty
(NULL) for the last row.
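The mirror-image LEAD behaviour can be sketched the same way with the same hypothetical sale data.

```python
import sqlite3  # requires SQLite 3.25+ for window functions

# Hypothetical sale table as before.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (Seller_name TEXT, Sale_value INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [("Stef", 7000), ("Alice", 12000), ("Bob", 15000)])

# LEAD pairs each row with the Sale_value of the row just below it;
# the last row has no row below, so NULL.
rows = conn.execute("""
    SELECT Seller_name, Sale_value,
           LEAD(Sale_value) OVER (ORDER BY Sale_value) AS next_sale_value
    FROM sale
    ORDER BY Sale_value
""").fetchall()
print(rows)
# [('Stef', 7000, 12000), ('Alice', 12000, 15000), ('Bob', 15000, None)]
```

Stef's next_sale_value is Alice's $12,000, and the last row's next_sale_value is NULL.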
