LP 3 - UCP Lab Manual
Lavale, Pune.
Group A
Assignment No: 1
Title of the Assignment: Write non-recursive and recursive programs to calculate Fibonacci
numbers and analyse their time and space complexity.
Objective of the Assignment: Students should be able to write non-recursive and recursive
programs to calculate Fibonacci numbers and analyse their time and space complexity.
Prerequisite:
1. Basics of Python or Java programming
2. Concept of recursive and non-recursive functions
3. Execution flow of calculating Fibonacci numbers
4. Basics of time and space complexity
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Fibonacci numbers
2. Time and Space complexity
● The Fibonacci series is the sequence of numbers (also called Fibonacci numbers), where
every number is the sum of the preceding two numbers, such that the first two terms are '0' and '1'.
● In some older versions of the series, the term '0' might be omitted. A Fibonacci series can thus
be given as 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . . It can thus be observed that every term can be
calculated by adding the two terms before it.
● Given the first term, F0 and second term, F1 as '0' and '1', the third term here can be given as,
F2 = 0 + 1 = 1
Similarly,
F3 = 1 + 1 = 2
F4 = 2 + 1 = 3
Fn = Fn-1+Fn-2
Here, the sequence is defined using two different parts, such as kick-off and recursive relation.
It is noted that the sequence starts with 0 rather than 1. So, F5 should be the 6th term of the sequence.
Examples:
Input : n = 2
Output : 1
Input : n = 9
Output : 34
n    Fibonacci number Fn
0    0
1    1
2    1
3    2
4    3
5    5
6    8
7    13
8    21
9    34
Next, we'll iterate through array positions 2 to n-1. At each position i, we store the sum of the two
preceding array values in F[i].
Finally, we return the value of F[n-1], giving us the number at position n in the sequence.
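The array-based version described above can be sketched as follows; this is a minimal illustrative sketch (the function name fib_array is not part of the original manual):

def fib_array(n):
    # returns the first n Fibonacci numbers using an array F
    if n <= 0:
        return []
    F = [0] * n
    if n > 1:
        F[1] = 1
    # each position stores the sum of the two preceding array values
    for i in range(2, n):
        F[i] = F[i - 1] + F[i - 2]
    return F

print(fib_array(10))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]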
The following program prints the Fibonacci sequence up to nterms, keeping only the last two values:

nterms = int(input("How many terms? "))
# first two terms
n1, n2 = 0, 1
count = 0
# check if the number of terms is valid
if nterms <= 0:
    print("Please enter a positive integer")
# if there is only one term, print n1
elif nterms == 1:
    print("Fibonacci sequence:")
    print(n1)
# generate the Fibonacci sequence
else:
    print("Fibonacci sequence:")
    while count < nterms:
        print(n1)
        nth = n1 + n2
        # update values
        n1 = n2
        n2 = nth
        count += 1
Output
Fibonacci sequence:
● The time complexity of the above iterative program is O(n), i.e., linear: the sum of two terms is computed
once per loop iteration, and the loop runs n times depending on the value of n.
● The space complexity of this iterative (dynamic-programming style) approach is O(1), since only the last
two terms are stored at any time.
Let's start by defining F(n) as the function that returns the value of Fn.
To evaluate F(n) for n > 1, we can reduce our problem into two smaller problems of the
same kind: F(n-1) and F(n-2). We can further reduce F(n-1) and F(n-2) to F((n-1)-1) and
F((n-1)-2); and F((n-2)-1) and F((n-2)-2), respectively.
If we repeat this reduction, we'll eventually reach our known base cases and, thereby, obtain
a solution to F(n).
Employing this logic, our algorithm for F(n) will have two steps:
1. If n <= 1, return n (the base cases F0 = 0 and F1 = 1).
2. Otherwise, return F(n-1) + F(n-2).
def recur_fibo(n):
    if n <= 1:
        return n
    else:
        return recur_fibo(n-1) + recur_fibo(n-2)

nterms = 7
if nterms <= 0:
    print("Please enter a positive integer")
else:
    print("Fibonacci sequence:")
    for i in range(nterms):
        print(recur_fibo(i))
Output
Fibonacci sequence:
0
1
1
2
3
5
8
The Fibonacci series finds application in different fields in our day-to-day lives. The different
patterns found in a varied number of fields from nature, to music, and to the human body follow the
Fibonacci series. Some of the applications of the series are given as,
● It is used in the grouping of numbers and used to study different other special mathematical
sequences.
● It finds application in Coding (computer algorithms, distributed systems, etc). For example,
Fibonacci series are important in the computational run-time analysis of Euclid's algorithm, used for
determining the GCF of two integers.
● It is applied in numerous fields of science like quantum mechanics, cryptography, etc.
● In finance market trading, Fibonacci retracement levels are widely used in technical analysis.
Conclusion- In this way we have explored the concept of the Fibonacci series using recursive and
non-recursive methods and also analysed their time and space complexity.
Reference link
● https://www.scaler.com/topics/fibonacci-series-in-c/
● https://www.baeldung.com/cs/fibonacci-computational-complexity
Group A
Assignment No: 2
Title of the Assignment: Write a program to implement Huffman Encoding using a greedy strategy.
Objective of the Assignment: Students should be able to understand and solve Huffman Encoding
using greedy method
Prerequisite:
1. Basics of Python or Java programming
2. Concept of Greedy method
3. Huffman Encoding concept
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Greedy Method
2. Huffman Encoding
3. Example solved using Huffman encoding
---------------------------------------------------------------------------------------------------------------
Greedy Method
● A greedy algorithm is an approach for solving a problem by selecting the best option available at the moment.
● This algorithm may not produce the best result for all the problems. That is because it always goes
for the local best choice to produce the global best result.
● This algorithm can perform better than other algorithms (but, not in all cases).
● As mentioned earlier, the greedy algorithm doesn't always produce the optimal solution. This is
the major disadvantage of the algorithm
● For example, suppose we want to find the longest path in the graph below from root to leaf.
Greedy Algorithm
1. To begin with, the solution set (containing answers) is empty.
2. At each step, an item is added to the solution set until a solution is reached.
3. If the solution set is feasible, the current item is kept.
4. Else, the item is rejected and never considered again.
Huffman Encoding
● Huffman Coding is a technique of compressing data to reduce its size without losing any of the
details. It was first developed by David Huffman.
● Huffman Coding is generally useful to compress the data in which there are frequently occurring
characters.
● Huffman Coding is a famous Greedy Algorithm.
● It is used for the lossless compression of data.
● It uses variable length encoding.
● It assigns variable length code to all the characters.
● The code length of a character depends on how frequently it occurs in the given text.
● The character which occurs most frequently gets the smallest code.
● The character which occurs least frequently gets the largest code.
● It is also known as Huffman Encoding.
Prefix Rule-
● Huffman coding uses the prefix rule: the code assigned to a character is never a prefix of the code assigned
to any other character, which keeps decoding unambiguous.
● Suppose the text to be transmitted contains 15 characters and each character occupies 8 bits. Then a total of
8 x 15 = 120 bits are required to send this string.
● Using the Huffman Coding technique, we can compress the string to a smaller size.
● Huffman coding first creates a tree using the frequencies of the character and then generates code for
each character.
● Once the data is encoded, it has to be decoded. Decoding is done using the same tree.
● Huffman Coding prevents any ambiguity in the decoding process using the concept of prefix code
ie. a code associated with a character should not be present in the prefix of any other code. The tree
created above helps in maintaining the property.
● Huffman coding is done with the help of the following steps.
1. Calculate the frequency of each character in the string.
2. Sort the characters in increasing order of the frequency. These are stored in a priority queue Q.
3. Make each unique character a leaf node.
4. Create an empty node z. Assign the minimum frequency to the left child of z and assign the
second minimum frequency to the right child of z. Set the value of z as the sum of the above two
minimum frequencies.
5. Remove these two minimum frequencies from Q and add the sum into the list of frequencies (*
denotes the internal nodes in the figure above).
6. Insert node z into the tree.
7. Repeat steps 3 to 5 for all the characters.
8. For each non-leaf node, assign 0 to the left edge and 1 to the right edge.
For sending the above string over a network, we have to send the tree as well as the above
compressed-code. The total size is given by the table below.
Without encoding, the total size of the string was 120 bits. After encoding the size is reduced to 32
+ 15 + 28 = 75.
Example:
A file contains the following characters with the frequencies as shown. If Huffman Coding is used for data
compression, determine-
After assigning weight to all the edges, the modified Huffman Tree is-
To write Huffman Code for any character, traverse the Huffman Tree from root node to the leaf node of that character.
Following this rule, the Huffman Code for each character is-
a = 111
e = 10
i = 00
o = 11001
u = 1101
s = 01
t = 11000
Time Complexity-
Encoding each unique character based on its frequency takes O(n log n) time: extracting the two
minimum-frequency nodes from the priority queue (min-heap) and inserting their sum costs O(log n), and this
is repeated for all n characters.
Code :-
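The program for this assignment is not reproduced in this manual copy, so the following is a minimal illustrative sketch (not the original program) that builds the Huffman tree with Python's heapq module and prints the code of each character; the example string is arbitrary:

import heapq
from collections import Counter

class Node:
    def __init__(self, freq, symbol, left=None, right=None):
        self.freq = freq
        self.symbol = symbol
        self.left = left
        self.right = right

    def __lt__(self, other):            # lets heapq order nodes by frequency
        return self.freq < other.freq

def huffman_codes(text):
    # steps 1-2: frequency of each character, kept in a min-heap
    heap = [Node(f, s) for s, f in Counter(text).items()]
    heapq.heapify(heap)
    # steps 4-7: repeatedly merge the two nodes of minimum frequency
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        heapq.heappush(heap, Node(left.freq + right.freq, None, left, right))
    # step 8: assign 0 to left edges and 1 to right edges
    codes = {}
    def assign(node, code=""):
        if node is None:
            return
        if node.symbol is not None:
            codes[node.symbol] = code or "0"
            return
        assign(node.left, code + "0")
        assign(node.right, code + "1")
    assign(heap[0])
    return codes

if __name__ == "__main__":
    text = "BCAADDDCCACACAC"            # arbitrary 15-character example string
    for ch, code in huffman_codes(text).items():
        print(ch, "->", code)

Each printed code satisfies the prefix rule described above, so the encoded bit stream can be decoded unambiguously.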
Output
Conclusion- In this way we have explored the concept of Huffman Encoding using the greedy method.
Assignment Question
Reference link
● https://towardsdatascience.com/huffman-encoding-python-implementation-8448c3654328
● https://www.programiz.com/dsa/huffman-coding#cpp-code
● https://www.gatevidyalay.com/tag/huffman-coding-example-ppt/
Group A
Assignment No: 3
Title of the Assignment: Write a program to solve a fractional Knapsack problem using a greedy
method.
Objective of the Assignment: Students should be able to understand and solve fractional Knapsack
problems using a greedy method.
Prerequisite:
1. Basics of Python or Java programming
2. Concept of Greedy method
3. Fractional Knapsack problem
---------------------------------------------------------------------------------------------------------------
Greedy Method
● A greedy algorithm is an approach for solving a problem by selecting the best option available at the moment.
● This algorithm may not produce the best result for all the problems. That is because it always goes
for the local best choice to produce the global best result.
● This algorithm can perform better than other algorithms (but, not in all cases).
● As mentioned earlier, the greedy algorithm doesn't always produce the optimal solution. This is
the major disadvantage of the algorithm
● For example, suppose we want to find the longest path in the graph below from root to leaf.
Greedy Algorithm
1. To begin with, the solution set (containing answers) is empty.
2. At each step, an item is added to the solution set until a solution is reached.
3. If the solution set is feasible, the current item is kept.
4. Else, the item is rejected and never considered again.
Knapsack Problem
Given a set of items with their weights and values, and a knapsack with a weight limit, the items have to be put
into the knapsack such that-
● The value or profit obtained by putting the items into the knapsack is maximum.
● And the weight limit of the knapsack is not exceeded.
In the fractional knapsack problem, items are divisible: we can put a fraction of an item into the knapsack if
taking the complete item is not possible.
● It is solved using the Greedy Method.
Step-01:
For each item, compute its value / weight ratio.
Step-02:
Arrange all the items in decreasing order of their value / weight ratio.
Step-03:
Start putting the items into the knapsack beginning from the item with the highest ratio.
Problem-
For the given set of items and knapsack capacity = 60 kg, find the optimal solution for the fractional
knapsack problem making use of greedy approach.
Now, the items taken completely contribute 160 units of value, and the knapsack can hold 20 kg more, while
the item with the next highest value / weight ratio weighs 22 kg; so only the fraction 20/22 of it is added.
Total value obtained
= 160 + (20/22) x 77
= 160 + 70
= 230 units
Time Complexity-
● The main time taking step is the sorting of all items in decreasing order of their value / weight ratio.
● If the items are already arranged in the required order, then while loop takes O(n) time.
● The average time complexity of Quick Sort is O(nlogn).
● Therefore, total time taken including the sort is O(nlogn).
Code:-
class Item:
    def __init__(self, value, weight):
        self.value = value
        self.weight = weight

def fractionalKnapsack(W, arr):
    # sort items by value / weight ratio in decreasing order
    arr.sort(key=lambda item: item.value / item.weight, reverse=True)
    # Result (value in Knapsack)
    finalvalue = 0.0
    for item in arr:
        if item.weight <= W:
            W -= item.weight
            finalvalue += item.value
        else:
            # take only a fraction of the last item
            finalvalue += item.value * W / item.weight
            break
    return finalvalue

# Driver Code
if __name__ == "__main__":
    W = 50
    arr = [Item(60, 10), Item(100, 20), Item(120, 30)]
    # Function call
    max_val = fractionalKnapsack(W, arr)
    print("Maximum value we can obtain =", max_val)
Output
Maximum value we can obtain = 240.0
Conclusion- In this way we have explored the concept of the Fractional Knapsack problem using the greedy method.
Assignment Question
Reference link
● https://www.gatevidyalay.com/fractional-knapsack-problem-using-greedy-approach/
Group A
Assignment No: 4
Title of the Assignment: Write a program to solve a 0-1 Knapsack problem using dynamic
programming or branch and bound strategy.
Objective of the Assignment: Students should be able to understand and solve 0-1 Knapsack
problem using dynamic programming
Prerequisite:
1. Basics of Python or Java programming
2. Concept of Dynamic Programming
3. 0/1 Knapsack problem
---------------------------------------------------------------------------------------------------------------
● Dynamic Programming algorithm solves each sub-problem just once and then saves its answer in a
table, thereby avoiding the work of re-computing the answer every time.
● Two main properties of a problem suggest that the given problem can be solved using Dynamic
Programming. These properties are overlapping sub-problems and optimal substructure.
● Dynamic Programming also combines solutions to sub-problems. It is mainly used where the
solution of one sub-problem is needed repeatedly. The computed solutions are stored in a table, so
that these don't have to be re-computed. Hence, this technique is needed where overlapping sub-problems
exist.
● For example, Binary Search does not have overlapping sub-problems, whereas the recursive program for
Fibonacci numbers has many overlapping sub-problems.
Knapsack Problem
Given a set of items with their weights and values, and a knapsack with a weight limit, the items have to be put
into the knapsack such that-
● The value or profit obtained by putting the items into the knapsack is maximum.
● And the weight limit of the knapsack is not exceeded.
In the 0/1 knapsack problem, items are indivisible: an item is either taken completely or not taken at all.
0/1 knapsack problem is solved using dynamic programming in the following steps-
Step-01:
● Draw a table, say 'T', with (n+1) number of rows and (w+1) number of columns.
● Fill all the boxes of 0th row and 0th column with zeroes as shown-
Step-02:
Start filling the table row wise top to bottom from left to right.
Here, T(i , j) = maximum value of the selected items if we can take items 1 to i and have weight restrictions
of j.
Step-03:
● To identify the items that must be put into the knapsack to obtain that maximum profit,
● Consider the last column of the table.
● Start scanning the entries from bottom to top.
● On encountering an entry whose value is not the same as the value stored in the entry immediately
above it, mark the row label of that entry.
● After all the entries are scanned, the marked labels represent the items that must be put into the
knapsack
Problem-.
For the given set of items and knapsack capacity = 5 kg, find the optimal solution for the 0/1 knapsack
problem making use of a dynamic programming approach.
Solution-
Given
● Knapsack capacity (w) = 5 kg
● Number of items (n) = 4
Step-01:
● Draw a table, say 'T', with (n+1) = 4 + 1 = 5 rows and (w+1) = 5 + 1 = 6 columns.
● Fill all the boxes of 0th row and 0th column with 0.
Step-02:
Start filling the table row wise, top to bottom, from left to right, using the formula-
T(i , j) = max { T(i-1 , j) , value(i) + T(i-1 , j - weight(i)) }
After all the entries are computed and filled in the table, we get the following table-
● The last entry represents the maximum possible value that can be put into the knapsack.
● So, maximum possible value that can be put into the knapsack = 7.
Following the tracing procedure of Step-03, the items that give this maximum value can be identified from the table.
Time Complexity-
● Each entry of the table requires constant time θ(1) for its computation.
● It takes θ(nw) time to fill (n+1)(w+1) table entries.
● It takes θ(n) time for tracing the solution since tracing process traces the n rows.
● Thus, overall θ(nw) time is taken to solve 0/1 knapsack problem using dynamic programming
Code :-
# 0/1 Knapsack problem solved using dynamic programming
def knapSack(W, wt, val, n):
    K = [[0 for _ in range(W + 1)] for _ in range(n + 1)]
    # build the table K[][] in bottom-up manner
    for i in range(1, n + 1):
        for w in range(1, W + 1):
            if wt[i-1] <= w:
                K[i][w] = max(val[i-1] + K[i-1][w - wt[i-1]], K[i-1][w])
            else:
                K[i][w] = K[i-1][w]
    return K[n][W]

# Driver code (example values consistent with the output shown below)
val = [60, 100, 120]
wt = [10, 20, 30]
W = 50
n = len(val)
print(knapSack(W, wt, val, n))
Output
220
Conclusion- In this way we have explored the concept of the 0/1 Knapsack problem using the dynamic programming approach.
Assignment Question
Reference link
● https://www.gatevidyalay.com/0-1-knapsack-problem-using-dynamic-programming-appr
oach/
● https://www.youtube.com/watch?v=mMhC9vuA-70
● https://www.tutorialspoint.com/design_and_analysis_of_algorithms/design_and_analysi
s_of_algorithms_fractional_knapsack.htm
Group A
Assignment No: 5
Title of the Assignment: Design an n-Queens matrix having the first Queen placed. Use backtracking to
place the remaining Queens to generate the final n-Queens matrix.
Objective of the Assignment: Students should be able to understand and solve the n-Queens
problem, and understand the basics of backtracking.
Prerequisite:
1. Basics of Python or Java programming
2. Concept of backtracking method
3. N-Queen Problem
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Backtracking
2. N-Queen Problem
---------------------------------------------------------------------------------------------------------------
Introduction to Backtracking
● Many problems are difficult to solve algorithmically. Backtracking makes it possible to solve at
least some large instances of difficult combinatorial problems.
What is backtracking?
Backtracking is finding the solution of a problem whereby the solution depends on the previous steps taken.
For example, in a maze problem, the solution depends on all the steps you take one-by-one. If any of those
steps is wrong, then it will not lead us to the solution. In a maze problem, we first choose a path and continue
moving along it. But once we understand that the particular path is incorrect, then we just come back and
change it. This is what backtracking basically is.
In backtracking, we first take a step and then we see if this step taken is correct or not i.e., whether it will give
a correct answer or not. And if it doesn't, then we just come back and change our first step. In general, this is
accomplished by recursion. Thus, in backtracking, we first start with a partial sub-solution of the problem
(which may or may not lead us to the solution) and then check if we can proceed further with this sub-solution
or not. If not, then we just come back and change it.
Applications of Backtracking:
● N Queens Problem
● Sum of subsets problem
● Graph coloring
● Hamiltonian cycles.
One of the most common examples of the backtracking is to arrange N queens on an NxN chessboard such
that no queen can strike down any other queen. A queen can attack horizontally, vertically, or diagonally. The
solution to this problem is also attempted in a similar way. We first place the first queen anywhere arbitrarily
and then place the next queen in any of the safe places. We continue this process until the number of unplaced
queens becomes zero (a solution is found) or no safe place is left. If no safe place is left, then we change the
position of the previously placed queen.
N-Queens Problem:
A classic combinatorial problem is to place n queens on an n x n chessboard so that no two queens attack each other.
The N-Queen problem is the classical example of backtracking. The N-Queen problem is defined as,
"Given an N x N chessboard, arrange N queens in such a way that no two queens attack each other
by being in the same row, column or diagonal."
● For N = 1, this is a trivial case. For N = 2 and N = 3, a solution is not possible. So we start with N = 4
and we will generalize it for N queens.
Algorithm
1) Start in the leftmost column.
2) If all queens are placed, return true.
3) Try all rows in the current column. Do the following for every tried row:
a) If the queen can be placed safely in this row then mark this [row, column] as part of the
solution and recursively check if placing the queen here leads to a solution.
b) If placing the queen in [row, column] leads to a solution then return true.
c) If placing the queen doesn't lead to a solution then unmark this [row, column] (Backtrack) and go
to step (a) to try other rows.
4) If all rows have been tried and nothing worked, return false to trigger backtracking.
4- Queen Problem
Problem 1 : Given 4 x 4 chessboard, arrange four queens in a way, such that no two queens attack each other.
That is, no two queens are placed in the same row, column, or diagonal.
● We have to arrange four queens, Q1, Q2, Q3 and Q4, on a 4 x 4 chessboard. We will put the ith queen in
the ith row. Let us start with position (1, 1). Q1 is the only queen, so there is no issue. The partial solution is
<1>.
● We cannot place Q2 at positions (2, 1) or (2, 2). Position (2, 3) is acceptable. The partial solution is <1,
3>.
● Next, Q3 cannot be placed in position (3, 1) as Q1 attacks her. And it cannot be placed at (3, 2), (3, 3)
or (3, 4) as Q2 attacks her. There is no way to put Q3 in the third row. Hence, the algorithm backtracks
and goes back to the previous solution and readjusts the position of queen Q2. Q2 is moved from
positions (2, 3) to
(2, 4). Partial solution is <1, 4>
● Now, Q3 can be placed at position (3, 2). Partial solution is <1, 4, 3>.
● Queen Q4 cannot be placed anywhere in row four. So again, backtrack to the previous solution and
readjust the position of Q3. Q3 cannot be placed on (3, 3) or (3, 4). So the algorithm backtracks even
further.
● All possible choices for Q2 are already explored, hence the algorithm goes back to partial solution
<1> and moves the queen Q1 from (1, 1) to (1, 2). And this process continues until a solution is found.
All possible solutions for 4-queen are shown in fig (a) & fig. (b)
Fig. (d) describes the backtracking sequence for the 4-queen problem.
The solution of the 4-queen problem can be seen as four tuples (x1, x2, x3, x4), where xi represents the
column number of queen Qi. Two possible solutions for the 4-queen problem are (2, 4, 1, 3) and (3, 1, 4,
2).
Explanation :
The above picture shows an NxN chessboard and we have to place N queens on it. So, we will start by
placing the first queen.
Now, the second step is to place the second queen in a safe position and then the third queen.
Now, you can see that there is no safe place where we can put the last queen. So, we will just change the
position of the previous queen. And this is backtracking.
Also, there is no other position where we can place the third queen so we will go back one more step and
change the position of the second queen.
And now we will place the third queen again in a safe position until we find a solution.
We will continue this process and finally, we will get the solution as shown below.
We need to check if a cell (i, j) is under attack or not. For that, we will pass these two in our function along
with the chessboard and its size - IS-ATTACK(i, j, board, N).
If there is a queen in a cell of the chessboard, then its value will be 1, otherwise, 0.
The cell (i, j) will be under attack in three conditions - if there is any other queen in row i, if there is any other
queen in column j, or if there is any queen in the diagonals.
We are already proceeding row-wise, so we know that all the rows above the current row(i) are filled but not
the current row and thus, there is no need to check for row i.
We can check for the column j by changing k from 1 to i-1 in board[k][j] because only the rows from 1 to i-1
are filled.
for k in 1 to i-1
    if board[k][j]==1
        return TRUE
Now, we need to check for the diagonals. We know that all the rows below row i are empty, so we need to
check only those diagonal elements which are above row i.
If we are on the cell (i, j), then decreasing the value of i and increasing the value of j will make us traverse
over the diagonal on the right side, above the row i.
k = i-1
l = j+1
while k >= 1 and l <= N
    if board[k][l] == 1
        return TRUE
    k = k-1
    l = l+1
Also if we reduce both the values of i and j of cell (i, j) by 1, we will traverse over the left diagonal, above the
row i.
k = i-1
l = j-1
while k >= 1 and l >= 1
    if board[k][l] == 1
        return TRUE
    k = k-1
    l = l-1
At last, we will return FALSE: if TRUE was not returned by any of the above checks, the cell (i, j) is not under
attack and is safe.
IS-ATTACK(i, j, board, N)
    // check column j in the rows that are already filled
    for k in 1 to i-1
        if board[k][j]==1
            return TRUE
    // check the right diagonal above row i
    k = i-1
    l = j+1
    while k >= 1 and l <= N
        if board[k][l] == 1
            return TRUE
        k = k-1
        l = l+1
    // check the left diagonal above row i
    k = i-1
    l = j-1
    while k >= 1 and l >= 1
        if board[k][l] == 1
            return TRUE
        k = k-1
        l = l-1
    return FALSE
Now, let's write the real code involving backtracking to solve the N Queen problem.
Our function will take the row, number of queens, size of the board and the board itself - N-QUEEN(row, n, N,
board).
If the number of queens is 0, then we have already placed all the queens.
if n==0
    return TRUE
Otherwise, we will iterate over each cell of the board in the row passed to the function and for each cell, we will
check if we can place the queen in that cell or not. We can't place the queen in a cell if it is under attack.
for j in 1 to N
    if !IS-ATTACK(row, j, board, N)
        board[row][j] = 1
After placing the queen in the cell, we will check if we are able to place the next queen with this arrangement or
not. If not, then we will choose a different position for the current queen.
for j in 1 to N
    ...
    if N-QUEEN(row+1, n-1, N, board)
        return TRUE
    board[row][j] = 0
if N-QUEEN(row+1, n-1, N, board) - We are placing the rest of the queens with the current arrangement. Also,
since all the rows up to 'row' are occupied, we will start from 'row+1'. If this returns true, then we are successful
in placing all the queens; if not, then we have to change the position of our current queen. So, we leave the
current cell (board[row][j] = 0) and the iteration will find another place for the queen, and this is backtracking.
Take a note that we have already covered the base case - if n==0 → return TRUE. It means when all queens will
be placed correctly, then N-QUEEN(row, 0, N, board) will be called and this will return true.
At last, if true is not returned, then we didn't find any way, so we will return false.
N-QUEEN(row, n, N, board)
...
return FALSE
N-QUEEN(row, n, N, board)
    if n==0
        return TRUE
    for j in 1 to N
        if !IS-ATTACK(row, j, board, N)
            board[row][j] = 1
            if N-QUEEN(row+1, n-1, N, board)
                return TRUE
            board[row][j] = 0    // backtrack
    return FALSE
Code :-
# Python3 program to solve N Queen
# Problem using backtracking
N = 4

def printSolution(board):
    for i in range(N):
        for j in range(N):
            print(board[i][j], end=" ")
        print()

def isSafe(board, row, col):
    # check this row on the left side
    for i in range(col):
        if board[row][i] == 1:
            return False
    # check the upper-left and lower-left diagonals
    for i, j in zip(range(row, -1, -1), range(col, -1, -1)):
        if board[i][j] == 1:
            return False
    for i, j in zip(range(row, N, 1), range(col, -1, -1)):
        if board[i][j] == 1:
            return False
    return True

def solveNQUtil(board, col):
    if col >= N:          # all queens are placed
        return True
    for i in range(N):
        if isSafe(board, i, col):
            board[i][col] = 1
            if solveNQUtil(board, col + 1):
                return True
            board[i][col] = 0   # backtrack
    return False

def solveNQ():
    board = [[0] * N for _ in range(N)]
    if solveNQUtil(board, 0) == False:
        print("Solution does not exist")
        return False
    printSolution(board)
    return True

# Driver Code
solveNQ()
Output:-
0 0 1 0
1 0 0 0
0 0 0 1
0 1 0 0
Conclusion- In this way we have explored the concept of the backtracking method and solved the n-Queens
problem using backtracking.
Assignment Question
Reference link
● https://www.codesdope.com/blog/article/backtracking-explanation-and-n-queens-problem/
● https://www.codesdope.com/course/algorithms-backtracking/
● https://codecrucks.com/n-queen-problem/
ASSIGNMENT NO: 6
Title: Write a program for analysis of quick sort by using deterministic and randomized variant.
Objective: Students should learn to implement Quick Sort and know the difference in performance
with deterministic and randomized variant.
Prerequisite:
Theory:
Quick Sort is a popular sorting algorithm that efficiently sorts an array by selecting a "pivot" element
and partitioning the other elements into two sub-arrays, according to whether they are less than or
greater than the pivot. The process is then recursively applied to the sub-arrays.
To understand the working of quick sort, let's take an unsorted array. It will make the concept more
clear and understandable.
In the given array, we consider the leftmost element as the pivot. So, in this case, a[left] = 24, a[right] =
27 and a[pivot] = 24.
Since the pivot is at the left, the algorithm starts from the right and moves towards the left.
Now, a[pivot] < a[right], so the algorithm moves forward one position towards the left, i.e. -
Because a[pivot] > a[right], the algorithm swaps a[pivot] with a[right], and the pivot moves to the right,
as -
Now, a[left] = 19, a[right] = 24, and a[pivot] = 24. Since the pivot is at the right, the algorithm starts from
the left and moves to the right.
Now, a[left] = 9, a[right] = 24, and a[pivot] = 24. As a[pivot] > a[left], the algorithm moves one position
to the right, as -
Now, a[left] = 29, a[right] = 24, and a[pivot] = 24. As a[pivot] < a[left], swap a[pivot] and a[left]; now the
pivot is at the left, i.e. -
Since the pivot is at the left, the algorithm starts from the right and moves to the left. Now, a[left] = 24,
a[right] = 29, and a[pivot] = 24. As a[pivot] < a[right], the algorithm moves one position to the left, as -
Now, a[pivot] = 24, a[left] = 24, and a[right] = 14. As a[pivot] > a[right], swap a[pivot] and a[right]; now
the pivot is at the right, i.e. -
Now, a[pivot] = 24, a[left] = 14, and a[right] = 24. The pivot is at the right, so the algorithm starts from
the left and moves to the right.
Now, a[pivot] = 24, a[left] = 24, and a[right] = 24. So pivot, left and right all point to the same element.
This marks the termination of the partitioning procedure.
Element 24, which is the pivot element, is placed at its exact sorted position.
Elements on the right side of element 24 are greater than it, and the elements on the left side of
element 24 are smaller than it.
Now, in a similar manner, quick sort algorithm is separately applied to the left and right sub-arrays.
After sorting gets done, the array will be -
There are two primary variants of Quick Sort: deterministic and randomized. Let's analyze both
variants.
Deterministic Quick Sort: In the deterministic variant, a fixed strategy is used to select the pivot
element. The most common strategies are to select the first element, the last element, or the middle
element as the pivot. Here is the analysis:
Time Complexity:
Best and average case: O(n log n), when the chosen pivot splits the array into reasonably balanced parts.
Worst case: O(n^2). When the array is already sorted or mostly sorted, a fixed pivot selection
strategy (first, last or middle element) produces highly unbalanced partitions and significant performance
degradation. This worst-case behavior is a notable drawback of deterministic Quick Sort.
Randomized Quick Sort: In the randomized variant, the pivot element is chosen randomly. This
randomness mitigates the worst-case scenario that deterministic Quick Sort can encounter. Here is
the analysis:
Time Complexity:
The expected time complexity of randomized Quick Sort is O(n log n) for all cases. This is because the
random choice of pivot reduces the likelihood of encountering the worst-case scenarios.
Comparative Analysis:
Deterministic Quick Sort gives reproducible behavior but can degrade to O(n^2) on sorted or adversarial
inputs. Randomized Quick Sort is not suitable for cases where deterministic, repeatable behavior is needed,
but it is often preferred for practical applications because its expected O(n log n) running time holds
regardless of the input order.
In practice, randomized Quick Sort is often favored due to its predictable and efficient expected
performance across various input distributions. However, in situations where deterministic behavior
is necessary or for educational purposes, deterministic variants can be used with appropriate pivot
selection strategies.
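A minimal sketch of both variants, with a simple timing comparison on an already-sorted input (the worst case for the deterministic variant), is given below. The input size and the use of time.perf_counter are illustrative choices, not requirements of the assignment.

import random
import sys
import time

sys.setrecursionlimit(10000)   # the deterministic variant recurses ~n deep on sorted input

def partition(a, low, high):
    # Lomuto partition: the last element is the pivot
    pivot = a[high]
    i = low - 1
    for j in range(low, high):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[high] = a[high], a[i + 1]
    return i + 1

def quick_sort(a, low, high):
    # deterministic variant: always uses the last element as the pivot
    if low < high:
        p = partition(a, low, high)
        quick_sort(a, low, p - 1)
        quick_sort(a, p + 1, high)

def randomized_quick_sort(a, low, high):
    # randomized variant: swap a randomly chosen element into the pivot position first
    if low < high:
        r = random.randint(low, high)
        a[r], a[high] = a[high], a[r]
        p = partition(a, low, high)
        randomized_quick_sort(a, low, p - 1)
        randomized_quick_sort(a, p + 1, high)

if __name__ == "__main__":
    data = list(range(2000))           # already sorted: worst case for the deterministic variant

    a = data[:]
    t0 = time.perf_counter()
    quick_sort(a, 0, len(a) - 1)
    t1 = time.perf_counter()

    b = data[:]
    t2 = time.perf_counter()
    randomized_quick_sort(b, 0, len(b) - 1)
    t3 = time.perf_counter()

    print("Deterministic :", round(t1 - t0, 4), "s")
    print("Randomized    :", round(t3 - t2, 4), "s")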
Conclusion:
Hence, in this way we have successfully implemented and analysed Quick Sort using its deterministic
and randomized variants.
MINI PROJECT 1
Title: Write a program to implement matrix multiplication. Also implement multithreaded matrix
multiplication with either one thread per row or one thread per cell. Analyze and compare their
performance.
Objective of the Assignment: Students will be able to implement and analyze matrix multiplication and
multithreaded matrix multiplication with either one thread per row or one thread per cell.
Prerequisite:
1. Basic knowledge of Python or Java
2. Concept of Matrix Multiplication
Theory
Multiplication of matrices certainly takes time. The time complexity of matrix multiplication is O(n^3)
using the normal (naive) method, and Strassen's algorithm improves it to about O(n^2.8074).
But is there any way to improve the performance of matrix multiplication using the normal method?
Multi-threading can be used to improve it. With multi-threading, instead of utilizing a single core of
the processor, we utilize all (or several) cores to solve the problem.
We create different threads, each thread evaluating some part of the matrix multiplication.
Depending upon the number of cores your processor has, you can create the number of threads
required. Although you can create as many threads as you need, a better way is to create one
thread per core.
In the second approach, we create a separate thread for each element of the resultant matrix. Using
pthread_exit() we return the computed value from each thread, which is collected by pthread_join(). This
approach does not make use of any global variables.
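The following is a minimal Python sketch of the two decompositions using the standard threading module. Because of CPython's global interpreter lock, pure-Python threads mainly illustrate the structure and overhead of the two approaches rather than a real speed-up; the matrix size N = 20 is an arbitrary choice to keep the per-cell variant manageable.

import random
import threading
import time

N = 20
A = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]

def row_worker(C, i):
    # one thread computes one complete row of the result
    for j in range(N):
        C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))

def cell_worker(C, i, j):
    # one thread computes a single cell of the result
    C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))

def run(make_threads):
    C = [[0] * N for _ in range(N)]
    threads = make_threads(C)
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, time.perf_counter() - start

per_row = lambda C: [threading.Thread(target=row_worker, args=(C, i)) for i in range(N)]
per_cell = lambda C: [threading.Thread(target=cell_worker, args=(C, i, j))
                      for i in range(N) for j in range(N)]

C1, t_row = run(per_row)
C2, t_cell = run(per_cell)
print("one thread per row :", round(t_row, 4), "s")
print("one thread per cell:", round(t_cell, 4), "s")
print("results identical  :", C1 == C2)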
To compare the performance of matrix multiplication with one thread per row and one thread per cell, we'll
analyze the execution times and consider factors like matrix size and the number of CPU threads available.
In general, the choice of which method is more efficient will depend on the specific use case and hardware
configuration. Let's analyze the performance:
1. Matrix Size (m, n, p): In this example, we used matrices of size 1000x1000 for A and B. Smaller
matrices may not demonstrate significant performance differences, while larger matrices may show
more noticeable speedups from parallelization.
2. Multithreading Overhead: The use of multiple threads introduces overhead in terms of thread
creation, synchronization, and context switching. For smaller matrices, this overhead can dominate,
and one thread per cell may perform better. For larger matrices, the actual matrix multiplication
work may dominate, and one thread per row may perform better.
3. Number of CPU Threads: The number of available CPU threads can affect the performance. If
there are fewer threads than rows/columns, it might not make sense to use one thread per row or
cell. On a multi-core CPU, if you have many threads available, parallelization can be more
beneficial.
4. Cache and Memory: The efficiency of cache usage can play a role in performance. One thread per
row may perform better when it utilizes the cache more effectively.
5. Optimized Libraries: Specialized libraries like NumPy have highly optimized matrix
multiplication routines that can outperform custom multithreaded implementations.
6. Hardware: The performance can vary based on the specific hardware and CPU architecture.
8. Load Balancing: Load balancing is important when using one thread per row. Uneven workloads
can lead to inefficient use of CPU resources.
9. Scaling: The performance may not scale linearly with the number of threads. Beyond a certain
point, adding more threads may not lead to significant improvements due to the limitations of the
hardware
Code :-
// CPP Program to multiply two matrix using pthreads (one thread per row)
#include <bits/stdc++.h>
#include <pthread.h>
using namespace std;

// maximum size of matrix
#define MAX 4

// number of threads (one per row of the result)
#define MAX_THREAD 4

int matA[MAX][MAX];
int matB[MAX][MAX];
int matC[MAX][MAX];

// each thread computes one row of the result matrix;
// the row index is passed as the thread argument
void* multi(void* arg)
{
    int i = (int)(intptr_t)arg;

    for (int j = 0; j < MAX; j++)
        for (int k = 0; k < MAX; k++)
            matC[i][j] += matA[i][k] * matB[k][j];

    return NULL;
}

// Driver Code
int main()
{
    // Generating random values in matA and matB
    for (int i = 0; i < MAX; i++) {
        for (int j = 0; j < MAX; j++) {
            matA[i][j] = rand() % 10;
            matB[i][j] = rand() % 10;
        }
    }

    // Displaying matA
    cout << endl << "Matrix A" << endl;
    for (int i = 0; i < MAX; i++) {
        for (int j = 0; j < MAX; j++)
            cout << matA[i][j] << " ";
        cout << endl;
    }

    // Displaying matB
    cout << endl << "Matrix B" << endl;
    for (int i = 0; i < MAX; i++) {
        for (int j = 0; j < MAX; j++)
            cout << matB[i][j] << " ";
        cout << endl;
    }

    // declaring one thread per row
    pthread_t threads[MAX_THREAD];

    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, multi, (void*)(intptr_t)i);

    // waiting for all threads to finish
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);

    // Displaying the result matrix
    cout << endl << "Multiplication of A and B" << endl;
    for (int i = 0; i < MAX; i++) {
        for (int j = 0; j < MAX; j++)
            cout << matC[i][j] << " ";
        cout << endl;
    }
    return 0;
}
Conclusion
Hence we have successfully completed the implementation of this Mini Project.
—--------------------------------------------------------------------------------------
Group B
Assignment No : 1
—---------------------------------------------------------------------------------------
Title of the Assignment:Predict the price of the Uber ride from a given pickup point to the
agreed drop-off location. Perform following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
Dataset Description: The project is about the world's largest taxi company, Uber Inc. In this
project, we're looking to predict the fare for their future transactional cases. Uber delivers its
service to lakhs of customers daily, so it becomes really important to manage their data
properly to come up with new business ideas and get the best results. Eventually, it becomes
really important to estimate the fare prices accurately.
Prerequisite:
1. Basic knowledge of Python
2. Concept of preprocessing data
3. Basic knowledge of Data Science and Big Data Analytics.
1. Data Preprocessing
2. Linear regression
3. Random forest regression models
4. Box Plot
5. Outliers
6. Haversine
7. Matplotlib
8. Mean Squared Error
Data Preprocessing:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always the case that we come across clean and
formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted
way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format which
cannot be directly used for machine learning models. Data preprocessing is the required task for cleaning the
data and making it suitable for a machine learning model, which also increases the accuracy and efficiency
of the model. Typical steps include:
● Importing libraries
● Importing datasets
● Feature scaling
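A minimal preprocessing sketch for this assignment is given below. The file name uber.csv and the column names (fare_amount, pickup_datetime, the pickup/dropoff coordinates and passenger_count) are assumed from the public Kaggle Uber-fares dataset and may need to be adjusted:

import pandas as pd

# assumed file/column names from the Kaggle Uber fares dataset
df = pd.read_csv("uber.csv")

# drop identifier columns (if present) and rows with missing values
df = df.drop(columns=["Unnamed: 0", "key"], errors="ignore")
df = df.dropna()

# parse the timestamp and derive simple features
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
df = df.dropna(subset=["pickup_datetime"])
df["hour"] = df["pickup_datetime"].dt.hour
df["weekday"] = df["pickup_datetime"].dt.weekday

# remove obviously invalid records (non-positive fares, impossible coordinates)
df = df[(df["fare_amount"] > 0) &
        (df["pickup_latitude"].between(-90, 90)) &
        (df["pickup_longitude"].between(-180, 180))]

print(df.describe())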
Linear Regression:
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence it is called linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables.
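A minimal sketch of fitting a linear regression model with scikit-learn is shown below; the synthetic data stands in for the preprocessed Uber features and fares and is purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# synthetic stand-in for preprocessed features (e.g. distance, hour) and fares
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 2.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print("R2  :", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)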
Random Forest Regression:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on
the majority votes of predictions (or their average, for regression), it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
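A corresponding sketch with a random forest regressor (same synthetic stand-in data, illustrative hyperparameters) is:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# synthetic stand-in for the preprocessed Uber features and fares
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 2.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees; the average of the trees' predictions is the model output
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("R2  :", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)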
Boxplot:
Boxplots are a measure of how well data is distributed across a data set. A boxplot divides the data set into
three quartiles. The graph represents the minimum, maximum, median, first quartile and third quartile of the
data set. A boxplot is also useful for comparing the distribution of data across data sets by drawing a boxplot
for each of them.
R provides a boxplot() function to create a boxplot. The syntax of the boxplot() function is:
boxplot(x, data, notch, varwidth, names, main)
Here,
1. x - It is a vector or a formula.
2. data - It is the data frame.
3. notch - It is a logical value set as true to draw a notch.
4. varwidth - It is also a logical value set as true to draw the width of the box proportionate
to the sample size.
5. names - It is the group of labels that will be printed under each boxplot.
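The syntax above is for R; since the assignments in this manual are implemented in Python, an equivalent sketch with matplotlib (using an arbitrary sample list containing one obvious outlier) is:

import matplotlib.pyplot as plt

# arbitrary sample data with one obvious outlier (120)
fares = [4.5, 7.0, 8.2, 9.1, 10.5, 11.0, 12.3, 13.8, 15.0, 120.0]

plt.boxplot(fares)                 # points beyond the whiskers are drawn as outliers
plt.title("Boxplot of sample fare values")
plt.ylabel("fare")
plt.show()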
Outliers:
As the name suggests, "outliers" refer to the data points that exist outside of what is to be expected. The
major thing about outliers is what you do with them. If you are going to analyze any data set, you will
always have some assumptions about how this data is generated. If you find some data points that are likely
to contain some form of error, then these are definitely outliers, and depending on the context, you want to
overcome those errors. The data mining process involves the analysis and prediction of the data that the
data set holds. In 1969, Grubbs introduced the first definition of outliers.
Global Outliers
Global outliers are also called point outliers. Global outliers are taken as the simplest form of outliers.
When a data point deviates from all the rest of the data points in a given data set, it is known as a global
outlier. In most cases, all the outlier detection procedures are targeted at determining the global outliers.
The green data point is the global outlier.
Collective Outliers
In a given set of data, when a group of data points deviates from the rest of the data set, it is called a collective
outlier. Here, the particular set of data objects may not be outliers individually, but when you consider the data
objects as a whole, they may behave as outliers. To identify the types of different outliers, you need to go
through background information about the relationship between the behavior of outliers shown by
different data objects. For example, in an Intrusion Detection System, a DOS packet from one system
to another is taken as normal behavior. However, if this happens with various computers
simultaneously, it is considered abnormal behavior, and as a whole, they are called collective outliers.
The green data points as a whole represent the collective outlier.
Contextual Outliers
As the name suggests, "contextual" means this outlier is introduced within a context, for example, a single
background noise in the speech recognition technique. Contextual outliers are also known as
conditional outliers. These types of outliers occur if a data object deviates from the other data points
because of a specific condition in a given data set. As we know, there are two types of attributes of
data objects: contextual attributes and behavioral attributes. Contextual outlier analysis enables the
users to examine outliers in different contexts and conditions, which can be useful in various applications.
For example, a temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season,
but it will behave like a normal data point in the context of a summer season. In the given diagram, a
green dot representing a low-temperature value in June is a contextual outlier, since the same value in
December is not an outlier.
Haversine:
The Haversine formula calculates the shortest distance between two points on a sphere using
their latitudes and longitudes measured along the surface. It is important for use in navigation.
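A minimal implementation of the formula is sketched below; the Earth radius of 6371 km is the usual mean value and the sample coordinates are arbitrary:

from math import radians, sin, cos, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # convert degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # haversine formula
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371     # distance in kilometres

# e.g. distance between two arbitrary points in New York City
print(haversine(-73.9857, 40.7484, -73.9772, 40.7527), "km")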
Matplotlib:
One of the greatest benefits of visualization is that it allows us visual access to huge amounts
of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter,
histogram etc.
Code :- https://www.kaggle.com/code/proxzima/uber-fare-price-prediction
Conclusion:
In this way we have explored the concept of correlation and implemented linear regression and
random forest regression models.
Assignment Questions:
—--------------------------------------------------------------------------------------
Group B
Assignment No : 2
—---------------------------------------------------------------------------------------
Title of the Assignment:Classify the email using the binary classification method. Email
Spam detection has two states:
a) Normal State – Not Spam,
b) Abnormal State – Spam.
Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their
performance.
Dataset Description: The csv file contains 5172 rows, one row for each email. There are
3002 columns. The first column indicates the email name. The name has been set with numbers
and not recipients' names, to protect privacy. The last column has the labels for prediction: 1 for
spam, 0 for not spam. The remaining 3000 columns are the 3000 most common words in all the
emails, after excluding the non-alphabetical characters/words. For each row, the count of
each word (column) in that email (row) is stored in the respective cell. Thus, information
regarding all 5172 emails is stored in a compact dataframe rather than as separate text
files.
Link:https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
Objective of the Assignment: Students should be able to classify emails using binary classification and
implement an email spam detection technique using the K-Nearest Neighbors and Support Vector Machine
algorithms.
Prerequisite:
1. Basic knowledge of Python
1. Data Preprocessing
2. Binary Classification
3. K-Nearest Neighbours
4. Support Vector Machine
5. Train, Test and Split Procedure
Data Preprocessing:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always the case that we come across clean and
formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted
way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format which
cannot be directly used for machine learning models. Data preprocessing is the required task for cleaning the
data and making it suitable for a machine learning model, which also increases the accuracy and efficiency
of the model. Typical steps include:
● Importing libraries
● Importing datasets
● Feature scaling
Code :- https://www.kaggle.com/code/mfaisalqureshi/email-spam-detection-98-accuracy/notebook
Binary classification
Binary classification is a supervised machine learning task where the goal is to assign each
data point to one of two classes or categories. Classification into one of two classes is a
common machine learning problem. You might want to predict whether or not a customer is
likely to make a purchase, whether or not a credit card transaction was fraudulent, whether
deep space signals show evidence of a new planet, or a medical test evidence of a disease.
These are all binary classification problems.
Two common algorithms used for binary classification are K-Nearest Neighbors (K-NN) and
Support Vector Machine (SVM). Here's an overview of these algorithms and the train-test
split procedure:
K-Nearest Neighbors (K-NN):
K-NN is a lazy learner because it doesn't build a specific model during the
training phase; it memorizes the training data.
The choice of 'k' is a hyperparameter that can significantly affect the model's
performance. A small 'k' may make the model sensitive to noise, while a large
'k' might lead to oversmoothing. K-NN can be computationally expensive for
large datasets, as it requires calculating distances to all data points.
The steps for using K-NN for binary classification are as follows:
1. Choose the number of neighbors 'k'.
2. Compute the distance between the new data point and every point in the training set.
3. Select the 'k' training points closest to the new point.
4. Assign the class label (spam / not spam) by majority vote among these 'k' neighbors.
Support Vector Machine (SVM):
SVM can handle both balanced and imbalanced datasets. SVM can be
computationally intensive, especially for large datasets. Techniques like
stochastic gradient descent and linear SVMs may be used for efficiency.
Support vectors are essential for defining the decision boundary, so they play a
crucial role in the SVM's performance. Support Vector Machines are a
versatile algorithm that works well in a wide range of applications, including
image classification, text classification, and anomaly detection, among others.
Train, Test and Split Procedure:
The train-test split is a critical step in machine learning to assess the model's
performance. It involves dividing the dataset into two parts: a training set and
a test set.
The training set is used to train the machine learning model, while the test set
is used to evaluate the model's performance and generalization.
a. Randomly shuffle the dataset to ensure that the data is evenly distributed
between the training and test sets.
b. Split the dataset into two parts, typically with a ratio like 70-30 or 80-20,
where the training set gets the larger portion.
c. Ensure that the two sets are mutually exclusive, meaning that no data point
appears in both sets.
d. Train the model on the training set only.
e. Use the test set to evaluate the model's performance by making predictions
and comparing them to the actual labels.
The train-test split procedure helps you estimate how well your model is likely to perform on
new, unseen data and avoid overfitting, where the model performs well on the training data
but poorly on new data.
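A minimal sketch tying the split, K-NN and SVM steps together with scikit-learn is given below. The file name emails.csv, the column names 'Email No.' and 'Prediction', and the 80-20 split ratio are assumptions based on the dataset description above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# assumed file/column names for the Kaggle email spam dataset
df = pd.read_csv("emails.csv")
X = df.drop(columns=["Email No.", "Prediction"])
y = df["Prediction"]                      # 1 = spam, 0 = not spam

# train-test split (80-20) followed by feature scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="linear").fit(X_train, y_train)

print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))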
Conclusion
Hence in this way we have studied the Concept of K-Nearest Neighbors and Support Vector
Machine for classification.
—--------------------------------------------------------------------------------------
Group B
Assignment No : 3
—---------------------------------------------------------------------------------------
Dataset Description:The case study is from an open-source dataset from Kaggle. The
dataset contains 10,000 sample points with 14 distinct features such as CustomerId,
CreditScore, Geography, Gender, Age, Tenure, Balance, etc.
Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
The term "Artificial Neural Network" is derived from biological neural networks that develop the
structure of a human brain. Similar to the human brain, which has neurons interconnected with one another,
artificial neural networks also have neurons that are interconnected with one another in various layers of the
network. These neurons are known as nodes.
The given figure illustrates the typical diagram of a Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from the Biological Neural Network represent inputs in Artificial Neural Networks, the cell nucleus
represents nodes, synapses represent weights, and the axon represents the output.
Relationship between the Biological Neural Network and the Artificial Neural Network:
Dendrites - Inputs
Cell nucleus - Nodes
Synapse - Weights
Axon - Output
An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network
of neurons that makes up a human brain so that computers have an option to understand things and make
decisions in a human-like manner. The artificial neural network is designed by programming computers to
behave simply like interconnected brain cells.
There are around 1000 billion neurons in the human brain. Each neuron has an association point
somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in such a manner as to
be distributed, and we can extract more than one piece of this data, when necessary, from our memory
in parallel. We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic
gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or both
of the inputs are "On," then we get "On" in the output. If both the inputs are "Off," then we get "Off" in the
output. Here the output depends upon the input. Our brain does not perform the same task: the
output-to-input relationship keeps changing because of the neurons in our brain, which are "learning."
To understand the architecture of an artificial neural network, we have to understand what
a neural network consists of. A neural network consists of a large number of
artificial neurons, which are termed units, arranged in a sequence of layers. Let us look at the various types
of layers available in an artificial neural network.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer is present in between the input and output layers. It performs all the calculations to find
hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in the output
that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs and includes a bias.
This computation is represented in the form of a transfer function.
The weighted total is passed as an input to an activation function to produce the output.
Activation functions choose whether a node should fire or not. Only those which are
fired make it to the output layer. There are distinct activation functions available that can be applied
depending upon the sort of task we are performing.
Keras:
Keras is an open-source, high-level neural network library, written in Python, which is capable enough
to run on Theano, TensorFlow, or CNTK. It was developed by one of the Google engineers, Francois
Chollet. It is made user-friendly, extensible, and modular for facilitating faster experimentation with deep
neural networks. It not only supports Convolutional Networks and Recurrent Networks individually but
also their combination.
It cannot handle low-level computations, so it makes use of a Backend library to resolve them. The backend
library acts as a high-level API wrapper for the low-level API, which lets it run on TensorFlow, CNTK, or
Theano.
Initially, it had over 4800 contributors during its launch, which has now gone up to 250,000 developers,
roughly doubling every year since. Big companies like Microsoft, Google, NVIDIA, and
Amazon have actively contributed to the development of Keras. It has amazing industry interaction,
and it is used in the development of popular firms like Netflix, Uber, Google, Expedia, etc.
TensorFlow:
TensorFlow is a Google product and one of the most famous deep learning tools, widely used in the
research areas of machine learning and deep neural networks. It came onto the
market on 9th November 2015 under the Apache License 2.0. It is built in such a way that it
can easily run on multiple CPUs and GPUs as well as on mobile operating systems. It provides wrappers in
distinct languages such as Java, C++, and Python.
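Since this assignment builds a neural-network classifier on the bank-customer dataset described above, a minimal Keras sketch is given below. The synthetic data, the choice of 11 input features, the layer sizes and the training settings are illustrative assumptions, not values prescribed by the manual:

import numpy as np
from tensorflow import keras

# synthetic stand-in: 1000 customers, 11 preprocessed numeric features, binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(11,)),   # hidden layer 1
    keras.layers.Dense(8, activation="relu"),                       # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),                    # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print("training accuracy:", round(acc, 3))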
Normalization:
Normalization is a scaling technique in machine learning applied during data preparation to change the
values of numeric columns in the dataset to use a common scale. It is not
necessary for all datasets in a model. It is required only when the features of machine learning
models have different ranges.
Min-Max normalization is computed as:
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Where,
● Xn = value of normalization
● X = current value of the feature
● Xmaximum, Xminimum = maximum and minimum value of the feature
Example: Let's assume we have a model dataset having maximum and minimum values of a feature as
mentioned above. To normalize the machine learning model, values are shifted and rescaled so that their
range can vary between 0 and 1. This technique is also known as Min-Max scaling. In this scaling
technique, we change the feature values as follows:
Case 1 - If the value of X is the minimum, the value of the numerator will be 0; hence the normalization will
also be 0.
Case 2 - If the value of X is the maximum, then the value of the numerator is equal to the
denominator; hence the normalization will be 1.
Case 3 - On the other hand, if the value of X is neither maximum nor minimum, then the value of the
normalization will be between 0 and 1.
Hence, normalization can be defined as a scaling method where values are shifted and rescaled to
maintain their range between 0 and 1; in other words, it can be referred to as the Min-Max scaling
technique.
Although there are many feature normalization techniques in machine learning, a few of them are most
frequently used. These are as follows:
● Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values are centered around the
mean with a unit standard deviation, which means the mean of the attribute becomes zero and the resultant
distribution has a unit standard deviation. Mathematically, we can calculate the standardization by
subtracting the mean from the feature value and dividing it by the standard deviation:
X' = (X - µ) / σ
Here, µ represents the mean of the feature values, and σ represents the standard deviation of the feature values.
However, unlike the Min-Max scaling technique, feature values are not restricted to a specific range in the
standardization technique.
This technique is helpful for various machine learning algorithms that use distance measures such
as KNN, K-means clustering, and Principal Component Analysis. Further, it is also important that the
model is built on the assumption that the data is normally distributed.
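Both techniques are available in scikit-learn; a minimal sketch on a small illustrative age/salary-style array is:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two features on very different scales (e.g. age and salary)
data = np.array([[25, 20000.0],
                 [40, 45000.0],
                 [60, 70000.0],
                 [80, 75000.0]])

print("Min-Max normalization (values rescaled to [0, 1]):")
print(MinMaxScaler().fit_transform(data))

print("Standardization (zero mean, unit standard deviation):")
print(StandardScaler().fit_transform(data))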
Which is suitable for our machine learning model, normalization or standardization? This is probably a
big confusion among all data scientists as well as machine learning engineers. Although both terms have
almost the same meaning, the choice of using normalization or standardization will depend on your problem
and the algorithm you are using in your models.
1. Normalization is a transformation technique that helps to improve the performance as well as the
accuracy of your model. Normalization of a machine learning model is useful when you don't know the
feature distribution exactly, in other words, when the feature distribution of the data does not follow a
Gaussian (bell curve) distribution. Normalization must have a bounded range, so if you have outliers in the
data, they will be affected by normalization. Further, it is also useful for data with variable scales and for
techniques such as KNN and artificial neural networks, since you can't use assumptions about the
distribution of the data.
2. Standardization in the machine learning model is useful when you are exactly aware of the feature
distribution of the data or, in other words, your data follows a Gaussian distribution (however, this does not
have to be necessarily true). Unlike normalization, standardization does not necessarily have a bounding
range, so if you have outliers in your data, they will not be affected by standardization.
Further, it is also useful when data has variable dimensions and for techniques such as linear regression,
logistic regression, and linear discriminant analysis.
Example: Let's understand an experiment where we have a dataset having two attributes, i.e., age and
salary. The age ranges from 0 to 80 years old, and the income varies from 0 to 75,000 dollars or
more. Income is assumed to be 1,000 times that of age. As a result, the ranges of these two attributes are
very different from one another.
Because of its bigger values, the income attribute will organically influence the conclusion more when
we undertake further analysis, such as multivariate linear regression. However, this does not necessarily
imply that it is a better predictor. As a result, we normalize the data so that all of the variables are in the
same range.
Further, it is also helpful for the prediction of credit risk scores, where normalization is applied to
all numeric data except the class column. It uses the tanh transformation technique, which converts
all numeric features into values in the range between 0 and 1.
Confusion Matrix:
The confusion matrix is a matrix used to determine the performance of classification models for a given set of test data. It can only be determined if the true values for the test data are known. The matrix itself is easy to understand, but the related terminology may be confusing. Since it shows the errors in the model's performance in the form of a matrix, it is also known as an error matrix. Some features of the confusion matrix are given below:
● For 2 prediction classes, the matrix is a 2*2 table; for 3 classes, it is a 3*3 table, and so on.
● Predicted values are the values predicted by the model, and actual values are the true values for the given observations.
● True Negative: the model predicted No, and the actual value was also No.
● True Positive: the model predicted Yes, and the actual value was also Yes.
● False Negative: the model predicted No, but the actual value was Yes. It is also called a Type-II error.
● False Positive: the model predicted Yes, but the actual value was No. It is also called a Type-I error.
● It evaluates the performance of classification models when they make predictions on test data, and tells how good our classification model is.
● It tells not only the errors made by the classifier but also the type of error, i.e., whether it is a Type-I or Type-II error.
● With the help of the confusion matrix, we can calculate different parameters for the model, such as accuracy, precision, etc.
Suppose we are trying to create a model that can predict whether or not a person has a particular disease. The confusion matrix for this is given as:
● The table is given for a two-class classifier, which has two predictions, "Yes" and "No". Here, Yes defines that the patient has the disease, and No defines that the patient does not have the disease.
● The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions and 11 are incorrect predictions.
● The model has given the prediction "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" occurred 73 times.
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These calculations are given below:
● Classification Accuracy: It is one of the important parameters to determine the accuracy of classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made by the classifier:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
● Misclassification rate: It is also termed the error rate, and it defines how often the model gives wrong predictions. The error rate can be calculated as the number of incorrect predictions divided by the total number of predictions made by the classifier:
Error rate = (FP + FN) / (TP + TN + FP + FN)
● Precision: It can be defined as, out of all the instances predicted as positive by the model, how many of them were actually positive:
Precision = TP / (TP + FP)
● Recall: It is defined as, out of the total positive classes, how many our model predicted correctly. The recall should be as high as possible:
Recall = TP / (TP + FN)
● F-measure: If two models have low precision and high recall or vice versa, it is difficult to compare them. So, for this purpose, we can use the F-score. This score helps us to evaluate recall and precision at the same time. The F-score is maximum if the recall is equal to the precision:
F-measure = (2 * Recall * Precision) / (Recall + Precision)
● Null Error rate: It defines how often our model would be incorrect if it always predicted the majority class. As per the accuracy paradox, it is said that "the best classifier has a higher error rate than the null error rate."
● ROC Curve: The ROC is a graph displaying a classifier's performance for all possible thresholds. The graph is plotted between the true positive rate (on the Y-axis) and the false positive rate (on the X-axis).
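As a small illustration of these parameters, here is a minimal sketch using scikit-learn; the true and predicted labels are made up purely for illustration.

# Computing the confusion matrix and the related metrics (illustrative labels)
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # actual values
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # predicted values

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))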
Code :- https://www.kaggle.com/code/jaysadguru00/starter-bank-customer-churn-modeling-
6dbfe05e-a
Conclusion:
In this way, we built a neural-network-based classifier that can determine whether a customer will leave the bank in the next 6 months.
Assignment Questions:
1) What is Normalization?
2) What is Standardization?
3) Explain the Confusion Matrix.
4) Define the following: Classification Accuracy, Misclassification Rate, Precision.
5) Give one example of a Confusion Matrix.
Group B
ASSIGNMENT NO: 4
Title: Implement Gradient Descent Algorithm to find the local minima of a function. For example,
find the local minima of the function y=(x+3)² starting from the point x=2.
Prerequisite:
Theory:
Gradient descent was initially proposed by Augustin-Louis Cauchy in 1847. Gradient Descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding the local minimum of a function.
The best way to describe finding the local minimum or local maximum of a function using gradients is as follows:
If we move in the direction of the negative gradient of the function at the current point (away from the gradient), we will reach the local minimum of that function. This procedure is Gradient Descent, also known as steepest descent.
Whenever we move in the direction of the positive gradient of the function at the current point, we will reach the local maximum of that function; that procedure is known as Gradient Ascent.
The main objective of using a gradient descent algorithm is to minimize the cost function using iteration.
Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and update
the model after evaluating all training examples. This procedure is known as the training epoch. In
simple words, it is a greedy approach where we have to sum over all examples for each update.
o It is computationally efficient, as all resources are used to process all training samples together.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration. In other words, it updates the parameters for each training example within the dataset, one at a time. As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency in comparison to batch gradient descent, because its frequent updates require more processing. Further, due to the frequent updates, the gradient is noisy. Sometimes, however, this noise can be helpful in escaping local minima and finding the global minimum.
In Stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over other gradient descent variants.
Mini-batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs the updates on those batches separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a gradient descent variant with higher computational efficiency and a less noisy gradient.
o It is computationally efficient.
Input:
Output:
Procedure:
1. Initialize parameters: Set θ to an initial value, often randomly. Initialize a variable to keep
track of the previous cost, J_prev, with a large value (e.g., +∞).
2. Loop until convergence:
a. Compute the gradient of the cost function with respect to the parameters: ∇J(θ).
b. Update the parameters: θ = θ - α * ∇J(θ). This update is performed for each parameter θ_i: θ_i = θ_i - α * ∂J(θ) / ∂θ_i.
3. Compute the new cost J_new = J(θ) using the updated parameters.
4. Check for convergence: If |J_new - J_prev| < ε, where ε is the convergence threshold, stop the algorithm. The algorithm has converged, and θ is the optimized parameter set. Otherwise, set J_prev = J_new and go back to step 2.
Hyperparameters:
Learning Rate (α): The learning rate controls the step size in each update. Choosing an
appropriate learning rate is crucial for the convergence and stability of the algorithm. It may
require some experimentation.
Convergence Threshold (ε): The convergence threshold determines when to stop the
algorithm. Setting it too low may result in longer training times, while setting it too high may
result in suboptimal solutions.
Additional Considerations:
Gradient Descent can be sensitive to the choice of the learning rate, and various techniques
like learning rate schedules or adaptive learning rates (e.g., Adam, RMSprop) can be used to
improve convergence.
Mini-batch Gradient Descent and Stochastic Gradient Descent are variants of this algorithm
that use subsets of the training data in each iteration, which can be more computationally
efficient.
Make sure the cost function is differentiable with respect to the parameters θ for this
algorithm to work.
Gradient Descent is a foundational optimization algorithm and serves as the basis for many
advanced optimization techniques used in machine learning and deep learning.
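As a concrete illustration of the procedure above, here is a minimal sketch for the function in the title, y = (x + 3)², starting from x = 2; the learning rate, stopping threshold, and iteration cap are illustrative choices, not prescribed by the assignment.

# Gradient descent sketch for y = (x + 3)^2, starting at x = 2.
# The derivative is dy/dx = 2 * (x + 3); the true minimum is at x = -3.
def gradient_descent(start_x=2.0, learning_rate=0.1, epsilon=1e-6, max_iters=1000):
    x = start_x
    for i in range(max_iters):
        grad = 2 * (x + 3)            # gradient of (x + 3)^2
        new_x = x - learning_rate * grad
        if abs(new_x - x) < epsilon:  # convergence check on the step size
            break
        x = new_x
    return x

local_min = gradient_descent()
print("Local minimum occurs at x =", round(local_min, 4))  # approaches -3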
Conclusion:
—--------------------------------------------------------------------------------------
Group B
Assignment No : 5
—---------------------------------------------------------------------------------------
Dataset Description: We will try to build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Link for Dataset: Diabetes predication system with KNN algorithm | Kaggle
Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
3. Concept of roc_auc curve.
4. Concept of Random Forest and KNN algorithms
k-Nearest-Neighbors (k-NN) is a supervised machine learning model. Supervised learning is when a model
learns from data that is already labeled. A supervised learning model takes in a set of input objects and output
values. The model then trains on that data to learn how to map the inputs to the desired output so it can learn to
make predictions on unseen data.
k-NN models work by taking a data point and looking at the 'k' closest labeled data points. The data point is then assigned the label of the majority of the 'k' closest points.
For example, if k = 5, and 3 of the points are 'green' and 2 are 'red', then the data point in question would be labeled 'green', since 'green' is the majority (as shown in the above graph).
Scikit-learn is a machine learning library for Python. In this tutorial, we will build a k-NN model using Scikit-
learn to predict whether or not a patient has diabetes.
Reading in the training data
For our k-NN model, the first step is to read in the data we will use as input. For this example, we are using the diabetes dataset. To start, we will use Pandas to read in the data. I will not go into detail on Pandas, but it is a library you should become familiar with if you're looking to dive further into data science.
Next, let's see how much data we have. We will call the 'shape' function on our dataframe to see how many rows and columns there are in our data. The rows indicate the number of patients and the columns indicate the number of features (age, weight, etc.) in the dataset for each patient.
Output: (768, 9)
We can see that we have 768 rows of data (potential diabetes patients) and 9 columns (8 input
features and 1 target output).
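A minimal sketch of the reading step; the CSV file name and the 'diabetes' column name are assumptions chosen to match the snippets that follow.

# Reading the diabetes dataset (file name assumed)
import pandas as pd

df = pd.read_csv('diabetes.csv')
print(df.shape)    # (rows, columns) -> e.g. (768, 9)
df.head()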
Now let's split up our dataset into inputs (X) and our target (y). Our input will be every column except 'diabetes', because 'diabetes' is what we will be attempting to predict. Therefore, 'diabetes' will be our target.
We will use the pandas 'drop' function to drop the column 'diabetes' from our dataframe and store the remaining feature columns in our input variable (X).
We will insert the 'diabetes' column of our dataset into our target variable (y).
#create the input values
X = df.drop('diabetes', axis=1).values
#separate target values
y = df['diabetes'].values
#view target values
y[0:5]
Now we will split the dataset into training data and testing data. The training data is the data that the model will learn from. The testing data is the data we will use to see how well the model performs on unseen data.
Scikit-learn has a function we can use called 'train_test_split' that makes it easy for us to split our dataset into training and testing data.
'train_test_split' takes in 5 parameters. The first two parameters are the input and target data we split up earlier. Next, we will set 'test_size' to 0.2. This means that 20% of all the data will be used for testing, which leaves 80% of the data as training data for the model to learn from. Setting 'random_state' to 1 ensures that we get the same split each time, so we can reproduce our results. Setting 'stratify' to y makes our training split represent the proportion of each value in the y variable. For example, in our dataset, if 25% of patients have diabetes and 75% don't have diabetes, setting 'stratify' to y will ensure that the random split has 25% of patients with diabetes and 75% of patients without diabetes.
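A minimal sketch of this split, using the parameters described above.

# Splitting into train and test sets (80/20), stratified on the target
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)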
First, we will create a new k-NN classifier and set 'n_neighbors' to 3. To recap, this means that if at least 2 out of the 3 nearest points to a new data point are patients without diabetes, then the new data point will be labeled as 'no diabetes', and vice versa. In other words, a new data point is labeled according to the majority of its 3 nearest neighbors.
We have set 'n_neighbors' to 3 as a starting point. We will go into more detail below on how to better select a value for 'n_neighbors' so that the model can improve its performance.
Next, we need to train the model. In order to train our new model, we will use the ‗fit‘ function and
pass in our training data as parameters to fit our model to the training data.
Once the model is trained, we can use the 'predict' function on our model to make predictions on our test data. As seen when inspecting 'y' earlier, 0 indicates that the patient does not have diabetes and 1 indicates that the patient does have diabetes. To save space, we will only print the first 5 predictions.
We can see that the model predicted 'no diabetes' for the first 4 patients in the test set and 'has diabetes' for the 5th patient.
Now let's see how accurate our model is on the full test set. To do this, we will use the 'score' function and pass in our test input and target data to see how well our model's predictions match up to the true values.
Our model has an accuracy of approximately 66.88%. It's a good start, but we will see how we can improve it below.
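A minimal sketch of the training and scoring steps described above, assuming the train/test variables from the earlier split.

# Creating, training and scoring the k-NN classifier (n_neighbors=3 as a starting point)
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict(X_test)[0:5])    # first 5 predictions (0 = no diabetes, 1 = diabetes)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set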
k-Fold Cross-Validation
Cross-validation is when the dataset is randomly split up into 'k' groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.
For example, for 5-fold cross-validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times, so each group would get a chance to be the test set.
The train-test split method we used earlier is called 'holdout'. Cross-validation is better than using the holdout method because the holdout score depends on how the data happens to be split into train and test sets. Cross-validation gives the model an opportunity to be tested on multiple splits, so we get a better sense of how it will perform on unseen data.
In order to train and test our model using cross-validation, we will use the 'cross_val_score' function with a cross-validation value of 5. 'cross_val_score' takes in our k-NN model and our data as parameters. Then it splits our data into 5 groups and fits and scores our data 5 separate times, recording the accuracy score in an array each time. We will save the accuracy scores in the 'cv_scores' variable.
To find the average of the 5 scores, we will use numpy's mean function, passing in 'cv_scores'. Using the mean of these scores gives us a more accurate idea of how our model will perform on unseen data than our earlier testing using the holdout method.
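A minimal sketch of this cross-validation step, reusing the X and y arrays defined earlier.

# 5-fold cross-validation of the k-NN model
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_cv = KNeighborsClassifier(n_neighbors=3)
cv_scores = cross_val_score(knn_cv, X, y, cv=5)   # returns 5 accuracy scores
print(cv_scores)
print('cv_scores mean: {}'.format(np.mean(cv_scores)))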
When we built our initial k-NN model, we set the parameter 'n_neighbors' to 3 as a starting point, with no real reasoning behind that choice.
Hyperparameter tuning is when you go through a process to find the optimal parameters for your model to improve accuracy. In our case, we will use GridSearchCV to find the optimal value for 'n_neighbors'.
GridSearchCV works by training our model multiple times on a range of parameters that we specify.
That way, we can test our model with each parameter and figure out the optimal values to get the best
accuracy results.
For our model, we will specify a range of values for 'n_neighbors' in order to see which value works best for our model. To do this, we will create a dictionary, setting 'n_neighbors' as the key and an array of candidate values as the values.
Our new model using grid search will take in a new k-NN classifier, our param_grid, and a cross-validation value of 5.
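A minimal sketch of this grid search; the candidate range 1 to 24 is an illustrative assumption.

# Grid search over n_neighbors with 5-fold cross-validation
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': np.arange(1, 25)}
knn_gscv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_gscv.fit(X, y)
print(knn_gscv.best_params_)   # e.g. {'n_neighbors': 14}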
After training, we can check which of the 'n_neighbors' values we tested performed the best. We can see that 14 is the optimal value for 'n_neighbors'. We can use the 'best_score_' attribute to check the accuracy of our model when 'n_neighbors' is 14. 'best_score_' outputs the mean accuracy of the scores obtained through cross-validation.
#check mean score for the top performing value of n_neighbors
knn_gscv.best_score_
By using grid search to find the optimal parameter for our model, we have improved our model's accuracy.
Code :- https://www.kaggle.com/code/shrutimechlearn/step-by-step-diabetes-classification-knn-detailed
Conclusion:
In this way, we built a K-Nearest Neighbors classifier that predicts whether or not a patient in the dataset has diabetes.
—--------------------------------------------------------------------------------------
Group B
Assignment No : 6
—---------------------------------------------------------------------------------------
Prerequisite:
1. Knowledge of Python
2. Unsupervised learning
3. Clustering
4. Elbow method
Clustering algorithms try to find natural clusters in data; the various aspects of how the algorithms cluster the data can be tuned and modified. Clustering is based
on the principle that items within the same cluster must be similar to each other.
The data is grouped in such a way that related elements are close to each other.
Diverse and different types of data are subdivided into smaller groups.
Uses of Clustering
Marketing:
Real Estate:
Clustering can be used to understand and divide various property locations based
on value and importance. Clustering algorithms can process through the data and
identify various groups of property on the basis of probable price.
Libraries and Bookstores can use Clustering to better manage the book database.
With proper book ordering, better operations can be implemented.
Document Analysis:
K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm that divides
the given data into the given number of clusters. Here, the ―K‖ is the given number
of predefined clusters, that need to be created.
The algorithm takes raw unlabelled data as an input and divides the dataset into
clusters and the process is repeated until the best clusters are found.
The data is read. I will share a link to the entire code and excel data at the end of
the article.
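The code snippets below assume the usual imports and a CSV file of mall-customer data; the file name Mall_Customers.csv is an assumption, not stated in the text.

# Reading the customer data and importing the libraries used below (file name assumed)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

data = pd.read_csv('Mall_Customers.csv')
data.head()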
data.corr()
Age Distribution:
#Distribution of age
plt.figure(figsize=(10, 6))
sns.set(style = 'whitegrid')
sns.distplot(data['Age'])
plt.title('Distribution of Age', fontsize = 20)
plt.xlabel('Range of Age')
plt.ylabel('Count')
Gender Analysis:
genders = data.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()
I have made more visualizations. Do have a look at the GitHub link at the end to
understand the data analysis and overall data exploration.
First, we work with two features only, annual income and spending score.
#We take just the Annual Income and Spending score
df1=data[["CustomerID","Gender","Age","Annual Income (k$)","Spending Score (1-100)"]]
X=df1[["Annual Income (k$)","Spending Score (1-100)"]]
#The input data
X.head()
Now we calculate the Within-Cluster Sum of Squared errors (WCSS) for different values of k. We then choose the k at which the decrease in WCSS starts to level off (the "elbow"). This value of k gives us the best number of clusters to make from the raw data.
wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
#The elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()
The plot:
This is known as the elbow graph; the x-axis is the number of clusters, and the number of clusters is taken at the elbow joint point. This is the point where making clusters is most relevant, as here the value of WCSS suddenly stops decreasing rapidly. In the graph, the drop after 5 is minimal, so we take 5 to be the number of clusters.
#Taking 5 clusters
km1=KMeans(n_clusters=5)
#Fitting the input data
km1.fit(X)
#predicting the labels of the input data
y=km1.predict(X)
#adding the labels to a column named label
df1["label"] = y
#The new dataframe with the clustering done
df1.head()
Now we shall work with three features. Apart from the spending score and annual income of the customers, we shall also take in their age.
#Taking the features (df2 is assumed to be a copy of the original dataframe)
df2 = data.copy()
X2 = df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]
#Now we calculate the Within Cluster Sum of Squared Errors (WSS) for different values of k.
wcss = []
for k in range(1,11):
kmeans = KMeans(n_clusters=k, init="k-means++")
kmeans.fit(X2)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()
What we get is a 3D plot. Now, if we want to know the customer IDs, we can do
that too.
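The fitting and plotting code for the three-feature model is not shown above; a minimal sketch, assuming 5 clusters from the elbow plot and using a 3D scatter as one possible visualization.

# Fitting K-Means on the three features, attaching the labels, and plotting in 3D (sketch)
from mpl_toolkits.mplot3d import Axes3D  # enables the 3D projection

km2 = KMeans(n_clusters=5)
df2["label"] = km2.fit_predict(X2)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df2["Age"], df2["Annual Income (k$)"], df2["Spending Score (1-100)"],
           c=df2["label"], cmap='rainbow')
ax.set_xlabel("Age")
ax.set_ylabel("Annual Income (k$)")
ax.set_zlabel("Spending Score (1-100)")
plt.show()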
cust1=df2[df2["label"]==1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print(" ")
cust2=df2[df2["label"]==2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print(" ")
cust3=df2[df2["label"]==0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print(" ")
cust4=df2[df2["label"]==3]
print('Number of customer in 4th group=', len(cust4))
print('They are -', cust4["CustomerID"].values)
print(" ")
cust5=df2[df2["label"]==4]
print('Number of customer in 5th group=', len(cust5))
print('They are -', cust5["CustomerID"].values)
print(" ")
——————————————–
Number of the customer in 2nd group= 29
They are - [ 47 51 55 56 57 60 67 72 77 78 80 82 84 86 90 93 94 97
99 102 105 108 113 118 119 120 122 123 127]
——————————————–
Number of the customer in 3rd group= 28
They are - [124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158
160 162 164 166 168 170 172 174 176 178]
——————————————–
Number of the customer in 4th group= 22
They are - [ 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 46]
——————————————–
Problem Statement: - Build a machine learning model that predicts the type of people who survived
the Titanic shipwreck using passenger data (i.e. name, age, gender, socio-economic class,etc.).
Theory:
Here's a step-by-step guide on how to approach this problem using Python and some popular libraries:
1. Data Collection: Start by obtaining the Titanic dataset, which contains passenger information and survival labels. You can find datasets on websites like Kaggle.
2. Data Pre-processing:
Clean the data by handling missing values, outliers, and redundant features.
Encode categorical variables into numerical format using techniques like one-hot encoding.
3. Data Splitting:
Split your dataset into a training set and a test set. This allows you to evaluate your model's
performance on unseen data.
4. Model Selection: Choose a classification algorithm suitable for this problem. Common choices include Decision Trees, Random Forests, Logistic Regression, Support Vector Machines, or Gradient Boosting.
5. Model Training:
Fit your chosen algorithm to the training data. The model learns patterns from the data.
6. Model Evaluation:
Evaluate your model's performance using metrics like accuracy, precision, recall, F1-score,
and the ROC-AUC score. Cross-validation can help in assessing how well the model
generalizes to new data.
7. Hyperparameter Tuning: Tune the model's hyperparameters (for example, with grid search) to improve performance.
8. Model Interpretation:
Understand the feature importance or coefficients of your model to interpret how different
features affect survival.
9. Prediction:
Use your trained model to make predictions on new, unseen data or the test set.
10. Post-processing:
You may need to further process the model's output, such as setting a threshold for classification.
# data processing
import pandas as pd
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
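The later snippets reference train_df, test_df and np, which are not defined above; a minimal sketch of the missing loading step, assuming the standard Kaggle file names.

import numpy as np

# Load the Kaggle Titanic CSVs (file names assumed)
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")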
Data Exploration/Analysis
train_df.info()
The training set has 891 examples and 11 features + the target variable (survived). 2 of the features are floats, 5 are integers and 5 are objects. Below I have listed the features with a short description:
survival: Survival
PassengerId: Unique Id of a passenger.
pclass: Ticket class
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation
train_df.describe()
Above we can see that 38% of the training set survived the Titanic. We can also see that the passenger ages range from 0.4 to 80. On top of that, we can already detect some features that contain missing values, like the 'Age' feature.
train_df.head(8)
From the table above, we can note a few things. First of all, we need to convert a lot of features into numeric ones later on, so that the machine learning algorithms can process them. Furthermore, we can see that the features have widely different ranges, which we will need to convert into roughly the same scale. We can also spot some more features that contain missing values (NaN = not a number), which we need to deal with.
The Embarked feature has only 2 missing values, which can easily be filled. It will be much more tricky to deal with the 'Age' feature, which has 177 missing values. The 'Cabin' feature needs further investigation, but it looks like we might want to drop it from the dataset, since 77% of it is missing.
train_df.columns.values
Above you can see the 11 features + the target variable (survived). What features could contribute to a high survival rate?
You can see that men have a high probability of survival when they are between 18 and 30 years old, which is also a little bit true for women, but not fully. For women the survival chances are higher between 14 and 40.
For men the probability of survival is very low between the ages of 5 and 18, but that isn't true for women. Another thing to note is that infants also have a slightly higher probability of survival.
Since there seem to be certain ages which have increased odds of survival, and because I want every feature to be roughly on the same scale, I will create age groups later on.
Women on port Q and on port S have a higher chance of survival. The inverse is true if they are at port C. Men have a high survival probability if they are on port C, but a low probability if they are on port Q or S.
Pclass also seems to be correlated with survival. We will generate another plot of it below.
4. Pclass:
sns.barplot(x='Pclass', y='Survived', data=train_df)
Here we see clearly that Pclass is contributing to a person's chance of survival, especially if this person is in class 1. We will create another pclass plot below.
The plot above confirms our assumption about pclass 1, but we can also spot a high probability that a person in pclass 3 will not survive.
SibSp and Parch would make more sense as a combined feature that shows the total number of relatives a person has on the Titanic. I will create it below, and also a feature that shows if someone is not alone.
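The 'relatives' column used in the plot below is not created anywhere above; a minimal sketch of that step (the not_alone flag is 1 when a passenger has no relatives aboard, matching the naming used later).

# Creating the combined 'relatives' feature and a 'not_alone' flag (sketch)
data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset['not_alone'] = (dataset['relatives'] == 0).astype(int)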
axes = sns.factorplot('relatives','Survived',
data=train_df, aspect = 2.5, )
Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had no relatives or more than 3 (except for some cases with 6 relatives).
Data Preprocessing
First, I will drop 'PassengerId' from the train set, because it does not contribute to a person's survival probability. I will not drop it from the test set, since it is required there for the submission.
Cabin:
As a reminder, we have to deal with Cabin (687 missing values), Embarked (2) and Age (177). At first I thought we would have to delete the 'Cabin' variable, but then I found something interesting: a cabin number like 'C123' contains the deck letter, which we can extract as a new 'Deck' feature.
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df, test_df]
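The extraction loop that uses the deck mapping is not shown above; a minimal sketch of one way to do it.

# Extracting the deck letter from the Cabin value and mapping it to a number (sketch)
for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(
        lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0).astype(int)

# The raw Cabin column can then be dropped
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)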
Age:
Now we can tackle the issue of the missing values in the Age feature. I will create an array that contains random numbers, computed based on the mean age value with regard to the standard deviation and the number of missing values.
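A minimal sketch of that filling step, drawing random ages between (mean - std) and (mean + std).

# Filling missing ages with random values drawn around the mean (sketch)
for dataset in data:
    mean = train_df["Age"].mean()
    std = train_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice.astype(int)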
Since the Embarked feature has only 2 missing values, we will just fill these with the most common one.
train_df['Embarked'].describe()
common_value = 'S'
data = [train_df, test_df]
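The actual fill step is missing above; a minimal sketch.

# Filling the two missing Embarked values with the most common port (sketch)
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)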
Converting Features:
train_df.info()
Above you can see that 'Fare' is a float and we have to deal with 4 categorical features: Name, Sex, Ticket and Embarked. Let's investigate and transform them one after another.
Fare:
Converting 'Fare' from float to int64, using the 'astype()' function pandas provides:
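The conversion code itself is not shown; a minimal sketch, filling missing fares with 0 first.

# Converting Fare to integers (missing values filled with 0 first) - sketch
for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)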
Sex:
Convert the 'Sex' feature into numeric.
genders = {"male": 0, "female": 1}
data = [train_df, test_df]
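The mapping step is missing above; a minimal sketch.

# Mapping the Sex feature to numeric values (sketch)
for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)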
Ticket:
train_df['Ticket'].describe()
Since the Ticket attribute has 681 unique tickets, it would be a bit tricky to convert them into useful categories. So we will drop it from the dataset.
train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)
Embarked:
Convert the 'Embarked' feature into numeric.
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)
Creating Categories:
Age:
Now we need to convert the 'Age' feature. First we will convert it from float into integer. Then we will create the new 'AgeGroup' variable by categorizing every age into a group. Note that it is important to pay attention to how you form these groups, since you don't want, for example, 80% of your data to fall into group 1.
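A minimal sketch of one way to bin the ages; the bin edges below are illustrative, not prescribed by the text.

# Converting Age to int and binning it into groups (bin edges are illustrative)
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 27), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 40), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 4
    dataset.loc[dataset['Age'] > 66, 'Age'] = 5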
Fare:
For the 'Fare' feature, we need to do the same as with the 'Age' feature. But it isn't that easy, because if we cut the range of the fare values into a few equally big categories, 80% of the values would fall into the first category. Fortunately, we can use the pandas 'qcut()' function to see how we can form the categories.
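A minimal sketch of using qcut to inspect quantile-based fare bins (purely for inspection; no new column is added here).

# Inspecting quantile-based fare bins with pandas qcut (sketch)
print(pd.qcut(train_df['Fare'], 4).value_counts())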
train_df.head(10)
I will add two new features to the dataset that I compute out of other features.
Now we will train several machine learning models and compare their results. Note that because the dataset does not provide labels for its testing set, we need to use the predictions on the training set to compare the algorithms with each other. Later on, we will use cross-validation.
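The training matrices and the SGD classifier referenced in the next line are never built above; a minimal sketch of those steps (it assumes 'PassengerId' was already dropped from train_df as described earlier).

# Building the train/test matrices and the SGD classifier used below (sketch)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()

sgd = SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)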
sgd.score(X_train, Y_train)
Random Forest:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100,
2)
Logistic Regression:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
K Nearest Neighbor:
# KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
Perceptron:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
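The prediction line below references a linear SVC that is never fit above; a minimal sketch of the missing step.

# Linear Support Vector Machine (sketch for the linear_svc used below)
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)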
Y_pred = linear_svc.predict(X_test)
Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
As we can see, the Random Forest classifier takes first place. But first, let us check how the random forest performs when we use cross-validation.
K-Fold Cross Validation randomly splits the training data into K subsets called folds. Let's imagine we split our data into 4 folds (K = 4). Our random forest model would be trained and evaluated 4 times, using a different fold for evaluation every time, while being trained on the remaining 3 folds.
The image below shows the process, using 4 folds (K = 4). Every row represents one training + evaluation process. In the first row, the model gets trained on the first, second and third subset and evaluated on the fourth. In the second row, the model gets trained on the second, third and fourth subset and evaluated on the first. K-Fold Cross Validation repeats this process until every fold has acted once as an evaluation fold.
The result of our K-Fold Cross Validation example would be an array that contains 4 different scores. We then need to compute the mean and the standard deviation of these scores.
The code below performs K-Fold Cross Validation on our random forest model, using 10 folds (K = 10). Therefore it outputs an array with 10 different scores.
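The cross-validation code itself is not shown; a minimal sketch.

# 10-fold cross-validation of the random forest (sketch)
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring="accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())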
This looks much more realistic than before. Our model has an average accuracy of 82% with a standard deviation of 4%. The standard deviation shows us how precise the estimates are.
This means, in our case, that the accuracy of our model can differ by about ±4%.
I think the accuracy is still really good, and since random forest is an easy-to-use model, we will try to increase its performance even further in the following section.
Random Forest
Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The "forest" it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.
To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. With a few exceptions, a random-forest classifier has all the hyperparameters of a decision-tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself.
The random-forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates a wide diversity, which generally results in a better model. Therefore, when you are growing a tree in a random forest, only a random subset of the features is considered for splitting a node. You can even make trees more random by using random thresholds for each feature, rather than searching for the best possible thresholds (like a normal decision tree does).
Below you can see what a random forest would look like with two trees:
Feature Importance
Another great quality of random forest is that it makes it very easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will access this below:
importances = pd.DataFrame({'feature': X_train.columns,
                            'importance': np.round(random_forest.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
importances.head(15)
importances.plot.bar()
Conclusion:
not_alone and Parch don't play a significant role in our random forest classifier's prediction process. Because of that, I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but that would need a more detailed investigation of the features' effect on our model. For now I think it is fine to remove only not_alone and Parch.
random_forest.score(X_train, Y_train)
92.82%
Our random forest model predicts as well as it did before. A general rule is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa. But I think our data looks fine for now and doesn't have too many features.
There is also another way to evaluate a random-forest classifier, which is probably more accurate than the score we used before: the out-of-bag (OOB) samples, used to estimate the generalization accuracy. I will not go into details here about how it works. Just note that the out-of-bag estimate is about as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
Hyperparameter Tuning
Below you can see the code of the hyperparameter tuning for the parameters criterion, min_samples_leaf, min_samples_split and n_estimators.
I put this code into a markdown cell and not into a code cell, because it takes a long time to run. Directly underneath it, I put a screenshot of the grid search's output.
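The grid-search code referred to above is not included in this manual; a minimal sketch, with an illustrative parameter grid.

# Hyperparameter grid search for the random forest (slow; the parameter grid is illustrative)
from sklearn.model_selection import GridSearchCV

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_leaf": [1, 5, 10],
              "min_samples_split": [2, 4, 10, 12, 16],
              "n_estimators": [100, 400, 700]}
rf = RandomForestClassifier(oob_score=True, random_state=1, n_jobs=-1)
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
clf.fit(X_train, Y_train)
print(clf.best_params_)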
# refit the random forest with oob_score=True so the out-of-bag estimate is available
random_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
print("oob score:", round(random_forest.oob_score_, 4)*100, "%")
Now that we have a proper model, we can start evaluating its performance in a more accurate way. Previously we only used accuracy and the OOB score, which is just another form of accuracy. The problem is that it is more complicated to evaluate a classification model than a regression model. We will talk about this in the following section.
Further Evaluation
Confusion Matrix:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)
The first row is about the not-survived predictions: 493 passengers were correctly classified as not survived (true negatives) and 56 were wrongly classified as survived (false positives).
The second row is about the survived predictions: 93 passengers were wrongly classified as not survived (false negatives) and 249 were correctly classified as survived (true positives).
A confusion matrix gives you a lot of information about how well your model does, but there is a way to get even more, like computing the classifier's precision.
Our model correctly predicts a passenger's survival 81% of the time (precision). The recall tells us that it predicted the survival of 73% of the people who actually survived.
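A minimal sketch of computing these scores from the cross-validated predictions above.

# Precision, recall and F1 computed from the cross-validated predictions (sketch)
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(Y_train, predictions))
print("Recall:", recall_score(Y_train, predictions))
print("F1-Score:", f1_score(Y_train, predictions))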
F-Score
You can combine precision and recall into one score, which is called the F-score. The F-score is computed as the harmonic mean of precision and recall. Note that it assigns much more weight to low values. As a result, the classifier will only get a high F-score if both recall and precision are high.
There we have it, a 77% F-score. The score is not that high, because we have a recall of 73%. Unfortunately, the F-score is not perfect, because it favors classifiers that have a similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall. The thing is that increasing precision sometimes results in decreasing recall, and vice versa (depending on the threshold). This is called the precision/recall trade-off. We will discuss this in the following section.
For each person the Random Forest algorithm has to classify, it computes a probability based on a function, and it classifies the person as survived (when the score is bigger than the threshold) or as not survived (when the score is smaller than the threshold). That is why the threshold plays an important part.
We will plot the precision and recall against the threshold using matplotlib:
from sklearn.metrics import precision_recall_curve
plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()
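The call above assumes probability scores and a plotting helper that are not defined anywhere in this manual; a minimal sketch of both.

# Getting probability scores and the precision/recall-vs-threshold helper (sketch)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(random_forest, X_train, Y_train, cv=3,
                             method="predict_proba")[:, 1]
precision, recall, threshold = precision_recall_curve(Y_train, y_scores)

def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=19)
    plt.legend(loc="upper right", fontsize=19)
    plt.ylim([0, 1])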
Above you can clearly see that the recall falls off rapidly at a precision of around 85%. Because of that, you may want to select the precision/recall trade-off before that point, maybe at around 75%.
You are now able to choose a threshold that gives you the best precision/recall trade-off for your current machine learning problem. If you want, for example, a precision of 80%, you can easily look at the plot and see that you would need a threshold of around 0.4. Then you could train a model with exactly that threshold and would get the desired accuracy.
Another way is to plot the precision and recall against each other:
def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("precision", fontsize=19)   # precision is plotted on the y-axis
    plt.xlabel("recall", fontsize=19)      # recall is plotted on the x-axis
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()
Another way to evaluate and compare your binary classifier is provided by the ROC AUC curve. This curve plots the true positive rate (also called recall) against the false positive rate (the ratio of incorrectly classified negative instances), instead of plotting precision versus recall.
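The rates and the plotting helper used below are not computed anywhere above; a minimal sketch.

# Computing the ROC curve and a small plotting helper (sketch)
from sklearn.metrics import roc_curve

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_train, y_scores)

def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)   # reference line for a random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)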
plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()
The red line in the middle represents a purely random classifier (e.g., a coin flip), and therefore your classifier should be as far away from it as possible. Our Random Forest model seems to do a good job.
Of course, we also have a trade-off here, because the classifier produces more false positives the higher the true positive rate is.
The ROC AUC score is the corresponding score to the ROC AUC curve. It is simply computed by measuring the area under the curve, which is called the AUC.
A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5.
from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)
ROC_AUC_SCORE: 0.945067587
Group C
Assignment No : 1
Title of the Assignment: Installation of MetaMask and study spending Ether per transaction
Objective of the Assignment: Students should be able to learn a new technology such as MetaMask, its applications and implementations.
Prerequisite:
1. Basic knowledge of cryptocurrency
2. Basic knowledge of distributed computing concept
3. Working of blockchain
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction Blockchain
2. Cryptocurrency
3. Transaction Wallets
4. Ether transaction
5. Installation Process of Metamask
---------------------------------------------------------------------------------------------------------------
Introduction to Blockchain
● Blockchain can be described as a data structure that holds transactional records while ensuring security, transparency, and decentralization. You can also think of it as a chain of records stored in the form of blocks which are controlled by no single authority.
● A blockchain is a distributed ledger that is completely open to anyone and everyone on the network. Once information is stored on a blockchain, it is extremely difficult to change or alter it.
● Each transaction on a blockchain is secured with a digital signature that proves its
authenticity. Due to the use of encryption and digital signatures, the data stored on the
blockchain is tamper-proof and cannot be changed.
● Blockchain technology allows all the network participants to reach an agreement, commonly known as consensus. All the data stored on a blockchain is recorded digitally and has a common history which is available to all the network participants. This way, the chances of any fraudulent activity or duplication of transactions are eliminated without the need for a third party.
Blockchain Features
The following features make the revolutionary technology of blockchain stand out:
● Decentralized
Blockchains are decentralized in nature meaning that no single person or group holds
the authority of the overall network. While everybody in the network has the copy of
the distributed ledger with them, no one can modify it on his or her own. This unique
feature of blockchain allows transparency and security while giving power to the
users.
● Peer-to-Peer Network
With the use of Blockchain, the interaction between two parties through a peer-to-peer model is easily accomplished without the requirement of any third party. Blockchain uses a P2P protocol which allows all the network participants to hold an identical copy of the transactions, enabling approval through machine consensus. For example, if you wish to make a transaction from one part of the world to another, you can do that with blockchain all by yourself within a few seconds. Moreover, no intermediary interruptions or extra charges will be involved in the transfer.
● Immutable
The immutability property of a blockchain refers to the fact that any data once written
on the blockchain cannot be changed. To understand immutability, consider sending
email as an example. Once you send an email to a bunch of people, you cannot take it
back. In order to find a way around, you‘ll have to ask all the recipients to delete your
email which is pretty tedious. This is how immutability works.
● Tamper-Proof
With the property of immutability embedded in blockchains, it becomes easier to
detect tampering of any data. Blockchains are considered tamper-proof as any change
in even one single block can be detected and addressed smoothly. There are two key
ways of detecting tampering namely, hashes and blocks.
MetaMask is one of the most popular browser extensions that serves as a way of storing your
Ethereum and other ERC-20 Tokens. The extension is free and secure, allowing web
applications to read and interact with Ethereum‘s blockchain.
To create a new wallet, you have to install the extension first. Depending on your browser,
there are different marketplaces to find it. Most browsers have MetaMask on their stores, so
it‘s not that hard to see it, but either way, here they are Chrome, Firefox, and Opera.
And it's as easy as that to install the extension on your browser. Continue reading the next step to figure out how to create an account.
Click Reveal Secret Words. There you will see a 12 words seed phrase. This is really
important and usually not a good idea to store digitally, so take your time and write it down
● Verify your secret phrase by selecting the previously generated phrase in order. Click
Confirm.
And that‘s it; now you have created your MetaMask account successfully. A new Ethereum wallet
address has just been created for you. It‘s waiting for you to deposit funds, and if you want to learn
how to do that, look at the next step below.
You can now see your public address and share it with other people. There are some
methods to buy coins offered by MetaMask, but you can do it differently as well; you just
need your address.
If you ever get logged out, you‘ll be able to log back in again by clicking the MetaMask
icon, which will have been added to your web browser (usually found next to the URL bar).
You can now access your list of assets in the ‗Assets‘ tab and view your transaction history in the
‗Activity‘ tab.
● Sending crypto is as simple as clicking the ‗Send‘ button, entering the recipient address and
amount to send, and selecting a transaction fee. You can also manually adjust the transaction
fee using the ‗Advanced Options‘ button, using information from ETH Gas Station or similar
platforms to choose a more acceptable gas price.
● After clicking ‗Next‘, you will then be able to either confirm or reject the transaction on the
subsequent page.
● To use MetaMask to interact with a dapp or smart contract, you‘ll usually need to find a
‗Connect to Wallet‘ button or similar element on the platform you are trying to use. After
clicking this, you should then see a prompt asking whether you want to let the dapp
connect to your wallet.
What advantages does MetaMask have?
● Popular - It is commonly used, so users only need one plugin to access a wide range of
dapps.
● Simple - Instead of managing private keys, users just need to remember a list of words, and
transactions are signed on their behalf.
● Saves space - Users don‘t have to download the Ethereum blockchain, as MetaMask sends
requests to nodes outside of the user‘s computer.
● Integrated - Dapps are designed to work with MetaMask, so it becomes much easier to send
Ether in and out.
Conclusion - In this way, we have explored the concept of Blockchain and the MetaMask wallet for transactions of digital currency.
Assignment Question
Reference link
● https://hackernoon.com/blockchain-technology-explained-introduction-meaning-and-applications-edb
d6759a2b2
● https://levelup.gitconnected.com/how-to-use-metamask-a-step-by-step-guide-f380a3943fb1
● https://decrypt.co/resources/metamask
Group C
Assignment No:2
Title: Create your own wallet using MetaMask for crypto transactions.
Objective: Understand and explore the working of Blockchain technology and its applications.
Course Outcome:
Description:
A wallet is your personal key to interact with the cryptographic world. It powers you to buy, sell
or transfer assets on the Blockchain.
And MetaMask is a wallet for the most diverse Blockchain in existence: Ethereum. It's your gateway to its DeFi ecosystem, non-fungible tokens (NFTs), ERC-20 tokens, and practically everything Ethereum.
It's available as an app for iOS and Android. In addition, you can use it as an extension with a few web browsers: Chrome, Firefox, Brave, and Edge.
Ease of use
Starting with MetaMask is easy, quick, and anonymous. You don't even need an email address. Just set up a password and remember (and store) the secret recovery phrase, and you're done.
Security
Your information is encrypted in your browser so that nobody else has access to it. In the event of a lost password, you have the 12-word secret recovery phrase (also called a seed phrase) for recovery. Notably, it's essential to keep the seed phrase safe, as even MetaMask has no information about it. Once lost, it can't be retrieved.
If you're wondering, no, you can't buy Bitcoin with MetaMask. It only supports Ether and other Ether-related tokens, including the famous ERC-20 tokens. Cryptocurrencies (excluding Ether) on Ethereum are built as ERC-20 tokens.
MetaMask stores your information locally. So, in case you switch browsers or machines, you can restore your MetaMask wallet with your secret recovery phrase.
Community Support
As of August 2021, MetaMask was home to 10 million monthly active users around the world. Its simple and intuitive user interface keeps pushing these numbers, with a recorded 1800% increase from July 2020.
Conclusively, try MetaMask if hot wallets are your pick. Let's begin with the installation before moving to its use cases. The further sections illustrate the Chrome web browser and the Android mobile platform.
Step 2: Create a password for your wallet. This password must be entered every time the browser is launched and you want to use MetaMask. A new password needs to be created if Chrome is uninstalled or if you switch browsers. In that case, go through the Import Wallet button, because MetaMask stores the keys in the browser. Agree to the Terms of Use.
Step 3:Click on the dark area which says Click here to reveal secret words to get your secret
phrase.
Step 4: This is the most important step. Back up your secret phrase properly. Do not store your
secret phrase on your computer. Please read everything on this screen until you understand it
completely before proceeding. The secret phrase is the only way to access your wallet if you
forget your password. Once done click the Next button.
Step 5:
MetaMask requires that you store your seed phrase in a safe place. It is the only way to recover your funds should your device crash or your browser reset. We recommend you write it down. The most common method is to write your 12-word phrase on a piece of paper and store it safely in a place where only you have access. Note: if you lose your seed phrase, MetaMask can't help you recover your wallet, and your funds will be lost forever.
Never share your seed phrase or your private key with anyone or any site, unless you want them to have full control over your funds.
Step 6: Click the Confirm button. Please follow the tips mentioned.
Step 7: One can see the balance and copy the address of the account by clicking on
the Account 1 area.
Assignment No. 3
Title: Write a smart contract on a test network for Bank account of a customer for following
operations.
• Deposit Money
• Withdraw Money
• Show Balance
Objective: Understand and explore the working of Blockchain technology and its applications.
Course Outcome:
Description:
First of all, we need to understand the differences between a paper contract and a smart contract, and the reason why smart contracts have become increasingly popular and important in recent years. A contract, by definition, is a written or spoken (mostly written) law-enforced agreement containing the rights and duties of the parties. Because most business contracts are complicated and tricky, the parties need to hire professional agents or lawyers to protect their own rights. However, if we hired those professionals every time we signed a contract, it would be extremely costly and inefficient. Smart contracts solve this by working on the 'If-Then' principle and also acting as escrow services. All participants need to put their money, ownership rights or other tradable assets into the smart contract before any successful transaction. As long as all participating parties meet the requirements, the smart contract simultaneously distributes the stored assets to the recipients, and the distribution process is witnessed and verified by the nodes on the Ethereum network.
There are a couple of languages we can use to program smart contracts. Solidity, an object-
oriented, high-level language, is by far the most popular and best-maintained one. We can use
Solidity to create various smart contracts for different scenarios, including
voting, blind auctions and safe remote purchases. In this lab, we will discuss the semantics and
syntax of Solidity with specific explanations, examples and practice.
After deciding on the coding language, we need to pick an appropriate development environment.
Among the various options, such as Visual Studio Code, we will use Remix IDE in this and the following labs
because it can be accessed directly from a browser, where we can test, debug and deploy smart
contracts without any installation.
Remix IDE is generally used to compile and run Solidity smart contracts. Below are the steps for
the compilation, execution, and debugging of the smart contract.
Step 1: Open Remix IDE in any browser, click on New File and select Solidity to
choose the environment.
Step 2: Write the Smart contract in the code section, and click the Compile button under the
Compiler window to compile the contract.
Step 3: To execute the code, click on the Deploy button under Deploy and Run Transactions
window. After deploying the code click on the drop-down on the console.
Code
//SPDX-License-Identifier: MIT
pragma solidity ^0.6.0;

contract banking
{
    // balance of each user, indexed by address
    mapping(address=>uint) public user_account;
    // whether an account has been created for an address
    mapping(address=>bool) public user_exists;

    // withdraw the given amount from the caller's account
    function withdraw(uint amount) public returns(string memory)
    {
        require(user_exists[msg.sender]==true,"Account not created");
        require(user_account[msg.sender]>amount,"Insufficient Balance");
        require(amount>0,"Amount should be more than zero");
        user_account[msg.sender]=user_account[msg.sender]-amount;
        msg.sender.transfer(amount);
        return "Withdrawal Successful";
    }

    // show the balance of the caller's account
    function show_balance() public view returns(uint)
    {
        return user_account[msg.sender];
    }
}
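The sample output below also lists create_account, deposit and other buttons that the code above does not define. A minimal sketch of the create_account and deposit functions, to be placed inside the same banking contract, is given here; the exact names and signatures are assumptions based on the buttons listed in the sample output.

    // create an account for the caller; any Ether sent becomes the opening balance
    function create_account() public payable returns(string memory)
    {
        require(user_exists[msg.sender]==false,"Account already created");
        user_account[msg.sender]=msg.value;
        user_exists[msg.sender]=true;
        return "Account created";
    }

    // deposit Ether into the caller's account
    function deposit() public payable returns(string memory)
    {
        require(user_exists[msg.sender]==true,"Account not created");
        require(msg.value>0,"Amount should be more than zero");
        user_account[msg.sender]=user_account[msg.sender]+msg.value;
        return "Deposit Successful";
    }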
Sample Output
After deploying the contract successfully, you can observe the following buttons: create_account,
deposit, send_amt, transfer, account_exist, user_account, user_balance and user_exists.
• Create account
• Deposit Amount
• Send Amount
Conclusion: Hence, we have studied a smart contract on a test network for the bank account of a
customer.
Assignment No. 4
Title: Write a program in Solidity to create student data. Use the following constructs:
• Structures
• Arrays
• Fallback
Deploy this as a smart contract on Ethereum and observe the transaction fee and gas value.
Objective: Understand and explore the working of Blockchain technology and its applications.
Course Outcome:
Description:
Step 1: Open Remix IDE in any browser, click on New File and select Solidity to
choose the environment.
Step 2: Write the Student Management code in the code section, and click the Compile
button under the Compiler window to compile the contract.
Step 3: To execute the code, click on the Deploy button under Deploy and Run Transactions
window. After deploying the code click on the drop-down on the console.
Code
// structure holding one student record
struct Student {
    int stud_id;
    string name;
    string department;
}
Student[] Students;    // dynamic array of all student records
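The fragment above only declares the structure and the array. A minimal sketch of a complete contract is shown below; the contract name is assumed, the add_stud and getStudents functions follow the buttons mentioned in the sample output, and a simple fallback function is included to cover the Fallback construct required by the title.

// SPDX-License-Identifier: MIT
pragma solidity ^0.6.0;

contract StudentRegister {
    // structure holding one student record
    struct Student {
        int stud_id;
        string name;
        string department;
    }

    // dynamic array of all student records
    Student[] Students;

    // add a new student record to the array
    function add_stud(int stud_id, string memory name, string memory department) public {
        Students.push(Student(stud_id, name, department));
    }

    // return the name and department of the student with the given id
    function getStudents(int stud_id) public view returns (string memory, string memory) {
        for (uint i = 0; i < Students.length; i++) {
            if (Students[i].stud_id == stud_id) {
                return (Students[i].name, Students[i].department);
            }
        }
        return ("Not found", "Not found");
    }

    // fallback function: executed when no other function matches the call
    fallback() external {}
}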
Sample Output
After deploying the contract successfully, you can observe two buttons: add_stud and
getStudents. Give the inputs stud_id, name and department and click add_stud; then click the getStudents button, enter the stud_id
which you gave as input, and get the student's name and department.
Assignment No : 5
Title of the Assignment: Write a survey report on types of Blockchains and its real time use cases.
Objective of the Assignment: Students should be able to learn about the types of blockchains and their
applications and implementations.
Prerequisite:
1. Basic knowledge of cryptocurrency
2. Basic knowledge of distributed computing concept
3. Working of blockchain
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
There are 4 types of blockchain:
Public Blockchain.
Private Blockchain.
Hybrid Blockchain.
Consortium Blockchain
---------------------------------------------------------------------------------------------------------------
1. Public Blockchain
These blockchains are completely open, following the idea of decentralization. They don't have any
restrictions; anyone with a computer and an internet connection can participate in the network.
As the name suggests, this blockchain is open to the public, which means it is not owned by anyone.
Anyone with an internet connection and a computer with good hardware can participate in this public blockchain.
All the computers in the network hold a copy of the blocks present in the network.
In a public blockchain, anyone can also verify transactions or records.
Advantages:
Trustable: There are consensus algorithms to detect fraud, so participants need not worry about the other nodes in the
network.
Secure: This blockchain is large in size as it is open to the public; the larger the network, the greater the distribution
of records.
Anonymous Nature: It is a secure platform for making transactions, and at the same time you are not
required to reveal your name or identity in order to participate.
Decentralized: There is no single platform that maintains the network, instead every user has a copy of the
ledger.
Disadvantages:
Processing: The rate of transaction processing is very slow due to the network's large size; verification of each node is
a very time-consuming process.
Energy Consumption: Proof of work is highly energy-consuming, and it requires good computer hardware to
participate in the network.
Acceptance: Since there is no central authority, governments find it difficult to adopt the technology
quickly.
Use Cases: Public blockchains are secured with proof of work or proof of stake and can be used to displace
traditional financial systems. The more advanced side of this blockchain is the smart contract, which enables
it to support decentralized applications. Examples of public blockchains are Bitcoin and Ethereum.
2. Private Blockchain
These blockchains are not as decentralized as public blockchains; only selected nodes can participate in
the process, which makes them more secure than the others.
Advantages:
Speed: The rate of the transaction is high, due to its small size. Verification of each node is less time-
consuming.
Scalability: We can modify the scalability. The size of the network can be decided manually.
Privacy: It has increased the level of privacy for confidentiality reasons as the businesses required.
Balanced: It is more balanced, as only some users have access to the transactions, which improves the
performance of the network.
Disadvantages:
Security: The number of nodes in this type is limited, so there are chances of manipulation; these
blockchains are more vulnerable.
Centralized: Trust building is one of the main disadvantages due to its central nature. Organizations can use
this for malpractice.
Count: Since there are only a few nodes, if some nodes go offline the entire blockchain system can be endangered.
Use Cases: With proper security and maintenance, this blockchain is a great asset to secure information
without exposing it to the public eye. Therefore, companies use them for internal auditing, voting, and asset
management. Examples of private blockchains are Hyperledger and Corda.
3. Hybrid Blockchain
It is a mix of the private and public blockchain, where some part is controlled by an
organization and the other part is made visible as a public blockchain.
Advantages:
Ecosystem: The most advantageous thing about this blockchain is its hybrid nature. It cannot be hacked, as 51%
of the users don't have access to the network.
Cost: Transactions are cheap as only a few nodes verify the transaction. All the nodes don't carry out the
verification, hence less computational cost.
Architecture: It is highly customizable and still maintains integrity, security, and transparency.
Operations: It can choose the participants in the blockchain and decide which transaction can be made
public.
Disadvantages:
Efficiency: Not everyone is in a position to implement a hybrid blockchain, and the organization can also face
challenges in running it efficiently.
4. Consortium Blockchain
It is a creative approach that solves the needs of organizations. This blockchain validates transactions
and also initiates or receives transactions.
Advantages:
Speed: A limited number of users make verification fast. The high speed makes this more usable for
organizations.
Authority: Multiple organizations can take part, making it decentralized at every level. Decentralized
authority makes it more secure.
Privacy: The information in the checked blocks is hidden from public view, but any member belonging to
the blockchain can access it.
Flexible: There is much divergence in the flexibility of the blockchain. Since it is not very large, decisions can
be taken faster.
Disadvantages:
Approval: All the members must approve the protocol, making it less flexible. Since one or more organizations are
involved, there can be differences in vision and interest.
Transparency: It can be hacked if the organization becomes corrupt. Organizations may hide information
from the users.
Vulnerability: If a few nodes are compromised, there is a greater chance of vulnerability in this
blockchain.
Use Cases: It has high potential for businesses, banks, and other payment processors. Organizations involved in
food tracking frequently collaborate within their sectors, making a federated solution ideal for their use.
Examples of consortium Blockchain are Tendermint and Multichain.
Conclusion: In this way, we have explored the types of blockchain and their real-time applications.
Assignment No : 6
Title of the Assignment: Write a program to create a Business Network using Hyperledger.
Objective of the Assignment: Students should be able to learn Hyperledger, its applications and
implementations.
Prerequisite:
1. Basic knowledge of cryptocurrency
2. Basic knowledge of distributed computing concept
3. Working of blockchain
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
Hyperledger Composer is an extensive, open development toolset and framework that makes developing
blockchain applications easier. Its primary goal is to accelerate time to value and make it easier to integrate
blockchain applications with existing business systems.
You can use Composer to rapidly develop use cases and deploy a blockchain solution in days.
Composer allows you to model your business network and integrate existing systems and data with your
blockchain applications.
Hyperledger Composer supports the existing Hyperledger Fabric blockchain infrastructure and runtime.
Hyperledger Composer generates a business network archive (.bna) file which you can deploy on an existing
Hyperledger Fabric network.
You can use Hyperledger Composer to model your business network, containing your existing assets and the
transactions related to them.
Key Concepts of Hyperledger Composer
1. Blockchain State Storage: It stores all transactions that happen in your Hyperledger Composer application.
The transactions are stored on the Hyperledger Fabric network.
2. Connection Profiles: A connection profile is a configuration JSON file which helps Composer connect to
Hyperledger Fabric. You can find the connection profile JSON file in the user's home directory.
3. Assets: Assets are tangible or intangible goods, services, or property, and are stored in registries. Assets
can represent almost anything in a business network, for example, a house for sale, the sale listing, the
land registry certificate for that house. Assets must have a unique identifier, but other than that, they can
contain whatever properties you define.
4. Participants: Participants are members of a business network. They may own assets and submit
transactions. A participant must have an identifier and can have any other properties.
5. Identities and ID cards: Participants can be associated with an identity. ID cards are a combination of an
identity, a connection profile, and metadata. ID cards simplify the process of connecting to a business
network.
6. Transactions: Transactions are the mechanism by which participants interact with assets. You can define
transaction processing logic in JavaScript, and you can also emit events for transactions.
7. Queries: Queries are used to return data about the blockchain world-state. Queries are defined within a
business network, and can include variable parameters for simple customisation. By using queries, data
can be easily extracted from your blockchain network. Queries are sent by using the Hyperledger
Composer API.
8. Events: Events are defined in the model file. Once events have been defined, they can be emitted by
transaction processor functions to indicate to external systems that something of importance has
happened to the ledger.
9. Access Control: Hyperledger is an enterprise blockchain, and access control is a core feature of any business
blockchain. Using access control rules, you can define who can do what in a business network. The
access control language is rich enough to capture sophisticated conditions.
10. Historian registry: The historian is a specialised registry which records successful transactions, including
the participants and identities that submitted them. The historian stores transactions as HistorianRecord
assets, which are defined in the Hyperledger Composer system namespace.
Step 1: Start Hyperledger Composer, either the online version or a local installation. Click on Deploy a new business network.
Step 3: Fill in the basic information, select "empty business network" and click the "deploy" button in the right panel.
Step 5: Click on "+ Add a file…" in the left panel and select "Model file (.cto)".
Write the following code in the model file. The model file contains an asset (in our case Hardware), a participant
(in our case Employee, i.e. the employees of the organisation) and a transaction (Allocate, which allocates hardware to an employee). Each model has
its own properties. Make sure you have a proper and unique namespace; in this example I am using "com.kpbird"
as the namespace. You can access all models using this namespace, i.e. com.kpbird.Hardware,
com.kpbird.Employee.
/**
 * Hardware model
 */
namespace com.kpbird

asset Hardware identified by hardwareId {
  o String hardwareId
  o String name
  o String type
  o String description
  o Double quantity
  --> Employee owner
}

participant Employee identified by employeeId {
  o String employeeId
  o String firstName
  o String lastName
}

transaction Allocate {
  --> Hardware hardware
  --> Employee newOwner
}
reference: https://hyperledger.github.io/composer/reference/cto_language.html
Step 6: Click on "+ Add a file…" in the left panel and select "Script file (*.js)".
Write the following code in the script file. In the script we define the transaction processing logic; in our case we want to
allocate hardware to an employee, so we will update the owner of the hardware. Make sure the annotations above the
transaction processor function are correct (see the sketch below), as Composer uses them to identify the function.
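The script itself is not reproduced in this manual. A minimal sketch of such a transaction processor, based on the standard Hyperledger Composer JavaScript API and the Allocate transaction from the model file above, might look like this (the function name allocateHardware is an assumption):

/**
 * Allocate a hardware asset to a new owner.
 * @param {com.kpbird.Allocate} allocate - the Allocate transaction
 * @transaction
 */
async function allocateHardware(allocate) {
    // point the hardware asset at the new owner taken from the transaction
    allocate.hardware.owner = allocate.newOwner;
    // persist the updated asset back into the Hardware asset registry
    const assetRegistry = await getAssetRegistry('com.kpbird.Hardware');
    await assetRegistry.update(allocate.hardware);
}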
Step 7: A sample permissions.acl file is already available. Add the following code to the permissions.acl file.
/**
 * New access control file
 */
rule AllAccess {
    description: "AllAccess - grant everything to everybody."
    participant: "ANY"
    operation: ALL
    resource: "com.kpbird.**"
    action: ALLOW
}

rule SystemACL {
    description: "System ACL to permit all access"
    participant: "org.hyperledger.composer.system.Participant"
    operation: ALL
    resource: "org.hyperledger.composer.system.**"
    action: ALLOW
}
reference: https://hyperledger.github.io/composer/reference/acl_language.html
Step 8: Now it's time to test our hardware-assets business network. Hyperledger Composer provides a "Test"
facility in the Composer panel itself. Click on the "Test" tab in the top panel.
Step 9: Create assets. Click on "Hardware" in the left panel, click "+ Create New Asset" in the top right
corner and add the following code. We will create Employee#01 in the next step. Click on the "Create New" button.
{
  "$class": "com.kpbird.Hardware",
  "hardwareId": "MAC01",
  "name": "MAC Book Pro 2015",
  "type": "Laptop",
  "description": "Mac Book Pro",
  "quantity": 1,
  "owner": "resource:com.kpbird.Employee#01"
}
Step 10: Let's create the participants. Click "Employee", click "+ Create New Participant" and add the following participant JSON (a sketch is shown below).
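The participant JSON itself is not shown in the manual. Following the pattern of the asset above, an entry for Employee#01 might look like this (the first and last names are placeholder values):

{
  "$class": "com.kpbird.Employee",
  "employeeId": "01",
  "firstName": "Ketan",
  "lastName": "K"
}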
Step 11: It's time to submit a transaction. We will allocate the MacBook Pro from Ketan (Employee#01) to Nirja
(Employee#02). Click on the "Submit Transaction" button in the left panel. In the transaction dialog, we can see all
the transactions defined in our model. Now we are allocating MAC01 to Employee#02. Click the Submit button after updating the JSON in the transaction
dialog (a sketch of this JSON is shown below). As soon as you hit the Submit button, the transaction is processed and a transaction ID is generated.
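The transaction JSON referred to above is not reproduced in the manual. Based on the Allocate transaction defined in the model file and the identifiers created in the earlier steps, it would look roughly like this:

{
  "$class": "com.kpbird.Allocate",
  "hardware": "resource:com.kpbird.Hardware#MAC01",
  "newOwner": "resource:com.kpbird.Employee#02"
}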
Step 12: Click on "All Transactions" in the left panel to verify all transactions. In the following screenshots you can
see that adding assets, adding participants and the allocation are all considered transactions; "view records" will show the details of each transaction.
All Transactions
Step 13: Now it's time to deploy the "hardware-assets" business network to Hyperledger Fabric. Click on the "Define"
tab in the top panel and click the "Export" button in the left panel. Export will create a hardware-assets.bna file.
A .bna file is a Business Network Archive, which contains the model, script, access control and query files.
source: https://hyperledger.github.io/composer/introduction/introduction
Step 14: Start Docker and run the following commands from the ~/fabric-tools directory.
Install the business network on Hyperledger Fabric. If the business network is already installed, you can use "update"
instead of "install":
$composer runtime install -c PeerAdmin@hlfv1 -n hardware-assets
The following command will deploy and start the hardware-assets.bna file. Change the path to the hardware-assets.bna file before
you execute it. A networkadmin.card file will be generated in the ~/fabric-tools
directory.
$composer network start --card PeerAdmin@hlfv1 --networkAdmin admin --networkAdminEnrollSecret adminpw --archiveFile /Users/ketan/Downloads/hardware-assets.bna --file networkadmin.card
To connect to the business network you need a connection card, so we import networkadmin.card using the following
command:
$composer card import -f networkadmin.card
To make sure networkadmin.card was imported successfully, you can list the cards using the following command:
$composer card list
The following command confirms that our hardware-assets business network is running successfully on
Hyperledger Fabric:
$composer network ping --card admin@hardware-assets
Now it's time to interact with the REST API. To develop a web or mobile application we require a REST API. You can
run the following command to generate a REST API for the hardware-assets business network:
$composer-rest-server
The REST server will ask for some basic information before generating the REST API.
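Once the REST server is running (on port 3000 by default), the generated endpoints can be explored through the Swagger page in the browser or called from the command line. As an illustration only, assuming the default port and the route naming that composer-rest-server derives from the model class names, listing all hardware assets might look like this:

$ curl -X GET http://localhost:3000/api/Hardware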
Conclusion: In this way, we have learnt about Hyperledger and its use cases in the business world.
---------------------------------------------------------------------------------------------------------------
MINI PROJECT-3
Theory:
https://github.com/sherwyn11/E-Voting-App
Conclusion