Data Modeling and Data Engineering
Content Overview:
• Data Modeling
o Conceptual Data Modeling
o Logical Data Modeling
o Physical Data Modeling
o Identifying and non-identifying relationships
o Relationship Cardinalities (one-to-one, one-to-many, many-to-many)
o How to resolve the many-to-many relationship problem
• Data Engineering
o Data Engineering with SQL
o SQL Basics
o Understanding OLTP and OLAP
o Understanding Joins
o Aggregate Functions
o Analytical Functions
Data Modeling
Data Model
A data model is a diagram that displays a set of tables and the relationships between
them. We can understand far more by looking at a data model diagram than by looking at
a plain list of tables: the diagram conveys the purpose of each table as well as its
dependencies. A data model is applicable to any software development effort that involves
creating database objects to store and manipulate data, which includes transactional
systems as well as data warehouse systems. When a data model is designed, we progress
through three main stages:
➢ Conceptual data model
➢ Logical data model
➢ Physical data model
Looking at this diagram, it is easy to see that there are four main entities: time,
product, sales, and store. The three entities time, product, and store each have a direct
relationship with the sales entity. So, a lot of information can be obtained simply by
looking at the conceptual data model, and since it is not a digital document it can be
easily enhanced. Notice that only the entities are visible here; there is something else
called attributes, which are not visible, but we will talk about them in just a bit. Even
the relationships are quite abstract: we know only that product is connected to sales,
but the column on which the relationship is established is not yet clear. This is a way
of hiding complexity at the very initial stages, and since a conceptual model can be
written on a piece of paper or a whiteboard, you do not need a software tool to create
one. That makes it a whole lot easier. Once the conceptual data model is finalized, we
can elaborate it into a logical data model. So, let's look at a logical data model.
All the attributes displayed above the line form the key attributes, and all the
attributes below the line are called non-key attributes, meaning they do not help in
uniquely identifying a record. An example is the category in the product entity: a
category value can repeat across a number of records, hence it is a non-key attribute,
and that is why it is listed below the line in this entity. We also have the primary
key/foreign key relationships clearly defined. The key attributes mentioned for each
entity can be used as primary keys, and these primary keys are referenced as foreign
keys in the sales entity, as is apparent from the abbreviation FK enclosed in
parentheses. This is a detail that was not available in the conceptual data model. The
other thing to notice is the user-friendly attribute names: any technical or
non-technical person can easily understand what each of these entities means, and
readability improves because the column names are self-explanatory. All these additions
make the logical model more detailed than the conceptual model. At this stage, the
logical model is not dependent on any specific database, meaning you can take it and
implement it in any database: Oracle, SQL Server, or even an OLAP tool such as SQL
Server Analysis Services. All these additional properties also make a logical data model
slightly more difficult to update than a conceptual model. Once the logical data model
is finalized, we move to the last step of data model design, which is the physical data
model.
One other thing we do is keep column names as short as possible. As is evident here, the
short form of product is PROD, so the product description column is now named PROD_DESC.
These names are database compatible, which makes the life of a DBA a lot easier, both
for the database objects themselves and for any queries we are going to write. The same
applies to table names as well as column names. We have also introduced the concept of a
data type: data types specify what type of data will be stored in each column. Here we
have VARCHAR, INTEGER, and FLOAT. These data types are specific to a database.
In this example, the physical data model is created for a Microsoft SQL Server database,
so these data types are specific to SQL Server. If you were creating a physical data
model for a different database, such as Oracle or MySQL, these data types would be
different. Hence, a physical data model is specific to a particular database, and this
makes it difficult for users to understand. Non-technical users will have a hard time
understanding what each of these tables means, what the columns mean, and what the data
types are for. So, it is usually not recommended to share the physical data model with
users; you share only the logical data model. Since the physical model has more detail
than the logical model, it is also more difficult to enhance.
So, let's assume that you got sign-off on the logical data model and went ahead and
created a physical data model for a specific database. If there are any changes, you
first need to apply them to the logical data model and then to the physical data model;
that is one kind of change that takes time. Another kind of change is the database
itself changing: suppose you now plan to implement this design in a database other than
SQL Server, which means significant effort must go into converting these data types to
something specific to the new database. These are the objects required to implement a
physical data model.
Identifying Relationship
In a data model, there are parent tables and child tables, connected by a relationship
line. Here, VEHICLE is the parent table and VEHICLE OWNER is the child table. If the
referenced column in the child table is part of the child table's primary key, the
relationship is drawn as a solid (thick) line connecting the parent and child tables.
Here, Vehicle_ID is the column referenced in the child table VEHICLE OWNER from the
parent table VEHICLE, and Vehicle_ID is part of the primary key of VEHICLE OWNER. So,
the VEHICLE and VEHICLE OWNER tables are connected by a thick relationship line, which
is called an identifying relationship.
Non-identifying Relationship
Here, the parent table is VEHICLE MANUFACTURER and the child table is VEHICLE. If the
referenced column in the child table is not part of the primary key but a standalone
column, the relationship is drawn as a dotted line. Here, Vehicle_ManufacturerID is the
primary key of the VEHICLE MANUFACTURER table, but in the VEHICLE table it is a foreign
key that is not part of the primary key. So these two tables are connected by a dotted
line, which is called a non-identifying relationship.
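To make the two relationship types concrete, here is a minimal sketch in DDL, executed with SQLite through Python's built-in sqlite3 module. The table and column names follow the VEHICLE examples above, but the exact schemas are illustrative assumptions, not a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VEHICLE_MANUFACTURER (
    Vehicle_ManufacturerID INTEGER PRIMARY KEY
);

-- Non-identifying relationship: the foreign key is a standalone column,
-- NOT part of the child's primary key (drawn as a dotted line).
CREATE TABLE VEHICLE (
    Vehicle_ID INTEGER PRIMARY KEY,
    Vehicle_ManufacturerID INTEGER
        REFERENCES VEHICLE_MANUFACTURER(Vehicle_ManufacturerID)
);

-- Identifying relationship: the foreign key Vehicle_ID IS part of the
-- child's primary key (drawn as a solid/thick line).
CREATE TABLE VEHICLE_OWNER (
    Vehicle_ID INTEGER REFERENCES VEHICLE(Vehicle_ID),
    Owner_ID   INTEGER,
    PRIMARY KEY (Vehicle_ID, Owner_ID)
);
""")

# PRAGMA table_info: field 5 is the column's 1-based position within the
# primary key, or 0 if the column is not part of the key.
def pk_columns(table):
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")
            if row[5] > 0]

owner_pk   = pk_columns("VEHICLE_OWNER")   # FK is inside the PK (identifying)
vehicle_pk = pk_columns("VEHICLE")         # FK stands alone (non-identifying)
```

Inspecting the primary keys this way shows the structural difference directly: the identifying child's key contains the parent's key, while the non-identifying child's key does not.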
Relationship Cardinality
Cardinality is a mathematical term that refers to the number of elements in a given set.
Database administrators may use cardinality to count tables and values. In a database,
cardinality usually represents the relationship between the data in two different tables by
highlighting how many times a specific entity occurs compared to another. For example, the
database of an auto repair shop may show that a mechanic works with multiple customers
every day. This means that the relationship between the mechanic entity and the customer
entity is one mechanic to many customers.
However, each customer has exactly one vehicle that they bring to the auto repair shop
during their visit. This means the relationship between the customer entity and the car
entity is a one-to-one relationship. Using cardinality can help database administrators
automatically establish these relationships in a software program or database. This can
make it easy for users to see the correlation between mechanics, customers and cars when
searching for specific data or files.
Importance of Cardinality
Cardinality is important because it creates links from one table or entity to another in a
structured manner. This has a significant impact on the query execution plan. A query
execution plan is a sequence of steps users can take to search for and access data stored in
a database system. Having a well-structured query execution plan can make it easier for
users to locate the data they need quickly. Cardinality can be applied to databases for a
variety of reasons, but businesses typically use the cardinality model to analyze information
about their customers or their inventory.
For example, an online retailer may have a database table that lists each one of its unique
customers. They may also have another database table that lists all the purchases
customers have made from their store. Since it's likely that each customer purchased
multiple items from the store, the database administrator may represent this by using a
one-to-many cardinality relationship that links each customer in the first table to all the
purchases they made in the second table.
The 1 to 1 relationship is notated in an ER diagram with a single line connecting the two
entities. In our scenario, the line connects the Student entity to the Student Contact Details
entity. The two perpendicular lines (|) indicate a mandatory relationship between the two
entities. In other words, the student must have contact details, and the contact details must
have a related student.
As with the one-to-many relationship described above, the relationship between two
entities is indicated by a line between them. The connectors on each end describe the
nature of this relationship.
The single vertical line (|) on the Students entity side indicates that the connector only has
one row affected by this relationship. And the crow’s foot on the other side of the line
shows that this relationship influences multiple rows.
The middle table (Class Student) consists of two primary/foreign keys, one of which is the
primary key for the Students table and the other the primary key for the Classes table.
Therefore, there must be a StudentID and a ClassID for each row in the Class Student table.
Because these elements of the Class Student table are also primary keys of the entity on
each side of it, each element has to exist in the Students and Classes tables, respectively.
How to Resolve the Many-to-Many Relationship Problem
Many-to-many (M:N) relationships add complexity and confusion to your model and to the
application development process. The key to resolving an M:N relationship is to separate
the two entities and create two one-to-many (1:N) relationships between them through a
third intersect entity. The intersect entity usually contains attributes from both
connecting entities.
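As a sketch of the intersect-entity idea, the following uses SQLite through Python's sqlite3 module. The Students/Classes/Class_Student names follow the earlier example, but the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Classes  (ClassID   INTEGER PRIMARY KEY, Title TEXT);

-- Intersect (junction) entity: the M:N link between Students and Classes
-- becomes two 1:N relationships through Class_Student.
CREATE TABLE Class_Student (
    StudentID INTEGER REFERENCES Students(StudentID),
    ClassID   INTEGER REFERENCES Classes(ClassID),
    PRIMARY KEY (StudentID, ClassID)
);
""")
conn.executemany("INSERT INTO Students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany("INSERT INTO Classes VALUES (?, ?)", [(10, "Math"), (20, "Art")])
conn.executemany("INSERT INTO Class_Student VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10)])  # both students take Math

# Count students per class, resolved through the intersect table
rows = conn.execute("""
    SELECT c.Title, COUNT(*) AS n_students
    FROM Classes c
    JOIN Class_Student cs ON cs.ClassID = c.ClassID
    GROUP BY c.Title
    ORDER BY c.Title
""").fetchall()
```

A student can appear in many Class_Student rows and a class can appear in many Class_Student rows, yet each individual row links exactly one student to exactly one class, which is what resolves the M:N relationship.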
Importance
Companies of all sizes have huge amounts of disparate data to comb through to answer
critical business questions. Data engineering is designed to support the process, making it
possible for consumers of data, such as analysts, data scientists and executives, to inspect
all the data available reliably, quickly and securely.
Data analysis is challenging because the data is managed by different technologies and
stored in various structures. Yet, the tools used for analysis assume the data is managed by
the same technology and stored in the same structure. This rift can cause headaches for
anybody trying to answer questions about business performance.
For example, consider all the data a brand collects about its customers: it may be spread
across many systems and formats. Data engineers rely on a range of tools to make such
data usable:
• Query Engines: engines run queries against data to return answers. Data engineers
may work with engines like Spark, Flink, and others.
Table
A table is an organized arrangement of data and information in tabular form, containing
rows and columns, making it easier to understand and compare data. Each record holds the
complete data for a specific student, and in this way we can make as many tables as
needed with different combinations of data.
Database
A database is a collection of multiple tables in a single container. We have a type of
database known as relational database. So, when a database contains multiple tables which
are related to each other in a specific manner then that database is known as a relational
database.
A database is stored on a computer and so can be easily modified. It is a large
collection of data or information, specifically a large number of tables, organized so
that it can be easily updated or accessed with a computer system.
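As a minimal illustration of a table inside a database, the following sketch uses Python's built-in sqlite3 module. The student table and its sample rows are invented for the example.

```python
import sqlite3

# An in-memory database: a container that can hold multiple tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Ana"), (2, "Ben")])

# Each row (record) holds the data for one student
names = [r[0] for r in conn.execute("SELECT name FROM student ORDER BY roll_no")]
```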
Advantages of OLAP
• OLAP creates a single platform for all types of business analytical needs.
• The main benefit of OLAP is the consistency of information and calculations.
• Easily apply security restrictions on users and objects to comply with regulations
and protect sensitive data.
Disadvantages of OLAP
• Implementation and maintenance are dependent on IT professionals because the
traditional OLAP tools require a complicated modeling procedure.
• OLAP needs cooperation between people in various departments to be effective,
which might not always be possible.
OLTP
OLTP, or online transaction processing, supports transaction-oriented applications in a
three-tier architecture. OLTP administers the day-to-day transactions of an organization.
One major point to consider here is that the primary objective of OLTP systems is data
processing, not data analysis.
An example of an OLTP system is an ATM network. Assume that a couple has a joint account
with a bank. One day both reach different ATMs at precisely the same time and want to
withdraw the total amount present in their bank account. However, only the person who
completes the authentication process first will be able to get the money. In this case,
the OLTP system makes sure that the withdrawn amount is never more than the amount
present in the bank account.
The key to note here is that OLTP systems are optimized for transactional superiority instead
of data analysis.
Advantages of OLTP
• OLTP method administers daily transactions for an organization
• OLTP widens the customer base of an organization by simplifying individual
processes.
Disadvantages of OLTP
• If OLTP system faces hardware failures, then online transactions get severely
affected.
• OLTP systems allow multiple users to access and change the same data at the
same time, which can sometimes lead to conflicting updates.
SQL Joins
The information you want to retrieve is often stored in multiple tables. In such
scenarios you need to join these tables to view the data in a meaningful way. This is
where the SQL join comes into the picture. The SQL join is a widely used clause, used
essentially to combine and retrieve data from two or more tables based on related
columns, or common fields, between them.
Now consider two tables. Table 1 has three columns, A, B, and C, and three records,
which for reference we will call one, two, and three. Similarly, Table 2 also has three
columns, B, C, and D, and three records: three, four, and five. Here, a different color
combination is used to represent the values present in the various columns. Now, instead
of querying each table every time to retrieve data, we simply join the two tables, and
the result is Table 3.
Also, make sure that when you are joining two tables they have a common column. Here C
is the common field, which forms the basis for joining the two tables.
Inner Join
SQL inner join joins two tables based on a common column and selects the records that
have matching values in these columns.
Now when the condition is applied for these columns the query checks all the rows of table
1 and table 2. Only the rows that satisfy the join predicate are included in the resultant
table.
Syntax
SELECT
Table1.column1, Table1.column2, Table2.column1, Table2.column2 and so on
From Table1
INNER JOIN Table2
ON Table1.column = Table2.column
The inner join syntax compares the rows of Table1 with Table2 to check whether anything
matches based on the condition provided in the ON clause; when the condition is met, it
returns the matched rows from both tables with the columns listed in the SELECT clause.
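The behavior can be sketched with SQLite through Python's sqlite3 module. The two small tables below (sharing the common column C) are invented for illustration; only the rows whose C value exists in both tables survive the inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (A INTEGER, C INTEGER);
    CREATE TABLE Table2 (C INTEGER, D INTEGER);
    INSERT INTO Table1 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO Table2 VALUES (200, 20), (300, 30), (400, 40);
""")
# Only rows whose C value appears in BOTH tables are returned
rows = conn.execute("""
    SELECT Table1.A, Table1.C, Table2.D
    FROM Table1
    INNER JOIN Table2 ON Table1.C = Table2.C
    ORDER BY Table1.A
""").fetchall()
```

Row (1, 100) from Table1 and row (400, 40) from Table2 have no match in the other table, so neither appears in the result.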
Outer Join
A SQL outer join, also called a full join or full outer join, is used to get all the
rows present in both tables. That means it returns all the records present in either the
left table (Table 1) or the right table (Table 2), even if there are no matching records
in the other table.
Here you must mention the same or a similar column name after the ON clause.
Left Join
Left join or left outer join results in a table containing all the rows from the table on the left
side of the join, that is the first table and only the rows that satisfy the join condition, from
the table on the right side of the join, that is the second table. Any missing values for the
rows from the right table in the result of the join tables are represented by null values.
Syntax
SELECT column_lists
FROM Table1
LEFT JOIN Table2
ON Table1.column = Table2.column
So, in this way you can use the left join to display the records.
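A sketch of the left join, again using SQLite via Python's sqlite3 module with invented tables sharing the common column C. Note how the unmatched row from the left table is kept, with NULL (Python None) filling in the missing right-table value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (A INTEGER, C INTEGER);
    CREATE TABLE Table2 (C INTEGER, D INTEGER);
    INSERT INTO Table1 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO Table2 VALUES (200, 20), (300, 30), (400, 40);
""")
# All rows from the left table; NULL where the right table has no match
rows = conn.execute("""
    SELECT Table1.A, Table1.C, Table2.D
    FROM Table1
    LEFT JOIN Table2 ON Table1.C = Table2.C
    ORDER BY Table1.A
""").fetchall()
```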
Right Join
A right join, or right outer join, is the opposite of the left outer join. It follows
the same rules as the left join; the only difference is that all the rows from the right
table, and only the rows satisfying the join condition from the left table, are present
in the resultant table. That means it returns all the rows from the right table plus all
the matching records present in the left table.
Syntax
SELECT column_lists
FROM Table1
RIGHT JOIN Table2
ON Table1.column = Table2.column
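A sketch of the right join, using the same invented tables as before. One caveat: SQLite only added a native RIGHT JOIN in version 3.39, so to stay portable this example emulates it by swapping the operands of a LEFT JOIN, which produces the same result. The unmatched row from the right table is kept, with NULL (Python None) filling in the missing left-table value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (A INTEGER, C INTEGER);
    CREATE TABLE Table2 (C INTEGER, D INTEGER);
    INSERT INTO Table1 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO Table2 VALUES (200, 20), (300, 30), (400, 40);
""")
# "Table1 RIGHT JOIN Table2" is equivalent to "Table2 LEFT JOIN Table1":
# all rows from the right table, NULL where the left table has no match
rows = conn.execute("""
    SELECT Table1.A, Table2.C, Table2.D
    FROM Table2
    LEFT JOIN Table1 ON Table1.C = Table2.C
    ORDER BY Table2.C
""").fetchall()
```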
Aggregate Function
• Aggregate function is a function where values of multiple rows are grouped
together as input on certain criteria to form a single value of more significant
meaning.
• It returns a single value.
• Aggregate functions are also used to summarize the data.
Some Important Aggregate Functions
COUNT
We use the COUNT function to count the total number of rows of a particular column of a
table. COUNT works on both numeric and non-numeric data types: for example, you can
count a salary column, which is numeric, and you can also count a name column, which is
non-numeric.
SYNTAX:
SELECT COUNT(Column_Name)
FROM Table_Name;
SUM
SUM is used to calculate the sum of the non-null values in the selected column. We
cannot sum non-numeric values; we can sum only numeric values. The query is the same as
for the COUNT function; the only difference is that we use SUM in place of COUNT.
SYNTAX:
SELECT SUM(Column_Name)
FROM Table_Name;
AVG
We can use this function to calculate the average of a particular column of numeric
type. Like the SUM function, AVG considers only non-null values. The AVG function is
essentially the SUM function divided by the COUNT function.
SYNTAX:
SELECT AVG(Column_Name)
FROM Table_Name;
MIN
• The MIN function is used to find the minimum value of a certain column.
• It determines the smallest of all the selected values of a column.
• It works on both numeric and non-numeric data types.
SYNTAX:
SELECT MIN(Column_Name)
FROM Table_Name;
MAX
• The MAX function is used to find the maximum value of a certain column.
• It determines the largest of all the selected values of a column.
• It works on both numeric and non-numeric data types.
SYNTAX:
SELECT MAX(Column_Name)
FROM Table_Name;
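The five aggregate functions above can be exercised together in one query. The sketch below runs them with SQLite via Python's sqlite3 module; the employee table and salary values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Ana", 3000.0), ("Ben", 5000.0), ("Cal", 4000.0)])
# Each aggregate collapses the three input rows into a single value
n, total, average, lowest, highest = conn.execute("""
    SELECT COUNT(salary), SUM(salary), AVG(salary),
           MIN(salary), MAX(salary)
    FROM employee
""").fetchone()
```

Note that AVG equals SUM divided by COUNT, as stated above: 12000 / 3 = 4000.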
As the diagram shows, when you apply an analytical function to a group of rows in a
table, it returns a single result for each row. On the other hand, when you apply an
aggregate function to a group of rows, it returns a single row for each group. This is
the key difference between analytical functions and aggregate functions. Some analytical
functions are described below.
RANK
The RANK function in SQL Server is a kind of ranking function. It assigns a number to
each row within the partition of a result set: each row's rank is one plus the previous
row's rank. When the RANK function finds two values that are identical within the same
partition, it assigns them the same rank number, and the next number in the ranking is
the previous rank plus the number of duplicates. Therefore, this function does not
always assign ranks in consecutive order.
Here we have a Demo table with a Name column. Let us use the RANK function to assign
ranks to the rows in the Demo table. The query for the desired output is:
SYNTAX:
SELECT Name,
RANK () OVER (ORDER BY Name) AS Name_rank
FROM Demo;
In the output you can see that identical names receive the same rank, and the next
distinct name receives a rank equal to its row-number position, leaving a gap after the
ties.
DENSE RANK
The DENSE_RANK function assigns a rank to each row within a partition according to the
specified column value, without any gaps: it always assigns ranks in consecutive order.
If there is a duplicate value, this function assigns it the same rank, and the next rank
is the next sequential number. This characteristic distinguishes the DENSE_RANK()
function from the RANK() function.
Consider the following employee table. For example, we have to calculate the row number,
rank, and dense rank of employees in the employee table according to salary within each
department. The query for this calculation is:
SYNTAX:
SELECT
ROW_NUMBER () OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_row_no,
Name, Dept, Salary,
RANK () OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_rank,
DENSE_RANK () OVER (PARTITION BY Dept ORDER BY Salary DESC) AS emp_dense_rank
FROM employee
The output table is the result of the above query. We can see that the row numbers are
consecutive integers within each partition. We can also see the difference between rank
and dense rank: with dense rank there is no gap between rank values, while with rank
there is a gap after a repeated rank.
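The query above can be run end to end with SQLite (window functions require SQLite 3.25+, which ships with modern Python) via the sqlite3 module. The employee data is invented, with Ben and Cal deliberately tied on salary; Name is added as a tiebreaker in the ROW_NUMBER ordering only, so the output is deterministic without affecting the tie behavior of RANK and DENSE_RANK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (Name TEXT, Dept TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    ("Ana", "IT", 900), ("Ben", "IT", 800),
    ("Cal", "IT", 800), ("Dee", "IT", 700),  # Ben and Cal tie on salary
])
rows = conn.execute("""
    SELECT Name,
           ROW_NUMBER() OVER (PARTITION BY Dept ORDER BY Salary DESC, Name)
               AS emp_row_no,
           RANK()       OVER (PARTITION BY Dept ORDER BY Salary DESC)
               AS emp_rank,
           DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Salary DESC)
               AS emp_dense_rank
    FROM employee
    ORDER BY Salary DESC, Name
""").fetchall()
# Ben and Cal share rank 2; RANK then jumps to 4, DENSE_RANK continues at 3
```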
Row Number
The ROW_NUMBER function returns a unique sequential number for each row within its
partition. The numbering begins at one and increases by one until the partition's total
number of rows is reached. It returns different numbers for rows with identical values,
which distinguishes it from the RANK () function.
Consider this employee table. To display the employees with the top five highest
salaries, the query is as follows:
SYNTAX:
SELECT Emp_No, Name, Salary
FROM (SELECT Emp_No, Name, Salary,
ROW_NUMBER () OVER (ORDER BY Salary DESC) AS Row_number
FROM employee)
WHERE Row_number <= 5;
In the output table you can see that row numbers are assigned in sequence even for rows
with the same salary value.
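The same top-N pattern can be sketched with SQLite through Python's sqlite3 module. The employee data is invented (only four rows, so the filter keeps the top two instead of five), and Name is added as a tiebreaker in the ordering so the result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (Emp_No INTEGER, Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, "Ana", 900), (2, "Ben", 800), (3, "Cal", 800), (4, "Dee", 700),
])
# Number the rows by descending salary in a subquery, then keep the top N
top2 = conn.execute("""
    SELECT Emp_No, Name, Salary
    FROM (SELECT Emp_No, Name, Salary,
                 ROW_NUMBER() OVER (ORDER BY Salary DESC, Name) AS rn
          FROM employee)
    WHERE rn <= 2
    ORDER BY Salary DESC, Name
""").fetchall()
```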
LAG
The LAG function returns the previous row's data alongside the current row. If no
previous row exists, it displays NULL with the current row. The LAG () function allows
access to a value stored in a row above the current row. The row above may be adjacent,
or some number of rows above, as sorted by a specified column or set of columns.
Let’s consider the sale table for example and following query with a LAG () function:
SYNTAX:
SELECT Seller_name, Sale_value,
LAG(Sale_value) OVER (ORDER BY Sale_value) as previous_sale_value
FROM sale;
The result of this query is the output table. This simplest use of LAG () displays the value
from the adjacent row above. For example, the second record displays Alice’s sale amount
($12,000) with Stef’s ($7,000) from the row above, in
columns Sale_value and previous_sale_value, respectively. Notice that the first row does
not have an adjacent row above, and consequently the previous_sale_value field is empty
(NULL) since the row from which the value of Sale_value should be obtained does not exist.
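The sale-table example above can be reproduced with SQLite via Python's sqlite3 module (window functions need SQLite 3.25+); the seller names and amounts follow the text, and the rows are inserted in arbitrary order since ORDER BY Sale_value determines which row is "above".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (Seller_name TEXT, Sale_value INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [("Stef", 7000), ("Alice", 12000), ("Bob", 15000)])
rows = conn.execute("""
    SELECT Seller_name, Sale_value,
           LAG(Sale_value) OVER (ORDER BY Sale_value) AS previous_sale_value
    FROM sale
    ORDER BY Sale_value
""").fetchall()
# The first row has no row above it, so previous_sale_value is NULL (None)
```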
LEAD
This function displays the next row's data alongside the current row. If no next row is
available, the LEAD () function displays NULL with the current row by default.
LEAD () is similar to LAG (): whereas LAG () accesses a value stored in a row above,
LEAD () accesses a value stored in a row below.
SYNTAX:
SELECT Seller_name, Sale_value,
LEAD(Sale_value) OVER (ORDER BY Sale_value) as next_sale_value
FROM sale;
The rows are sorted by the column specified in ORDER BY (Sale_value). The LEAD () function
grabs the sale amount from the row below. For example, Stef’s own sale amount is $7,000
in the column Sale_value, and the column next_sale_value in the same record contains
$12,000. The latter comes from the Sale_value column for Alice, the seller in the next row.
Note that the last row does not have a next row, so the next_sale_value field is empty
(NULL) for the last row.
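The LEAD () behavior described above can be verified the same way with SQLite via Python's sqlite3 module (SQLite 3.25+), using the same invented sale data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (Seller_name TEXT, Sale_value INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?)",
                 [("Stef", 7000), ("Alice", 12000), ("Bob", 15000)])
rows = conn.execute("""
    SELECT Seller_name, Sale_value,
           LEAD(Sale_value) OVER (ORDER BY Sale_value) AS next_sale_value
    FROM sale
    ORDER BY Sale_value
""").fetchall()
# The last row has no row below it, so next_sale_value is NULL (None)
```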