ML Lab Observation
Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data
from a .CSV file.
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
For each attribute constraint ai in h
If the constraint ai is satisfied by x
Then do nothing
Else replace ai in h by the next more general constraint that is
satisfied by x
3. Output hypothesis h
Training Examples:
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
Program:
import csv

num_attributes = 6
a = []

print("\n The Given Training Data Set \n")
with open('enjoysport.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append(row)
        print(row)

print("\n The initial value of hypothesis: ")
hypothesis = ['0'] * num_attributes
print(hypothesis)

# Initialise the hypothesis with the first training instance
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]

# Generalise the hypothesis over every positive training example
for i in range(0, len(a)):
    if a[i][num_attributes] == 'yes':
        for j in range(0, num_attributes):
            if a[i][j] != hypothesis[j]:
                hypothesis[j] = '?'
            else:
                hypothesis[j] = a[i][j]
    print(" For Training instance No:{0} the hypothesis is ".format(i), hypothesis)

print("\n The Maximally Specific Hypothesis for the given training examples:\n")
print(hypothesis)
Candidate-Elimination Algorithm (negative training example case):
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
Program:
import numpy as np
import pandas as pd

data = pd.DataFrame(data=pd.read_csv('enjoysport.csv'))
concepts = np.array(data.iloc[:, 0:-1])
print(concepts)
target = np.array(data.iloc[:, -1])
print(target)
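The listing above only loads the data; the training routine itself is not shown in the source. A minimal sketch of a learn() function that maintains the specific boundary S and the general boundary G as described above is given below (the function and variable names are illustrative, not from the source listing).

def learn(concepts, target):
    # Start with the first instance as the specific boundary
    specific_h = concepts[0].copy()
    # The general boundary starts as the most general hypothesis
    general_h = [['?' for _ in range(len(specific_h))] for _ in range(len(specific_h))]

    for i, h in enumerate(concepts):
        if target[i] == "yes":            # positive example: generalise S
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        else:                             # negative example: specialise G
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'

    # Drop hypotheses in G that remained fully general
    general_h = [g for g in general_h if g != ['?' for _ in range(len(specific_h))]]
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final)
print("Final General_h:", g_final)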
Data Set:
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Output:
Final Specific_h:
['sunny' 'warm' '?' 'strong' '?' '?']
Final General_h:
[['sunny', '?', '?', '?', '?', '?'],
['?', 'warm', '?', '?', '?', '?']]
3. Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.
ID3 Algorithm
Examples are the training examples. Target_attribute is the attribute whose value is
to be predicted by the tree. Attributes is a list of other attributes that may be tested
by the learned decision tree. Returns a decision tree that correctly classifies the
given Examples.
∙ Create a Root node for the tree
∙ If all Examples are positive, Return the single-node tree Root with label = +
∙ If all Examples are negative, Return the single-node tree Root with label = –
∙ If Attributes is empty, Return the single-node tree Root with label = most common value of Target_attribute in Examples
∙ Otherwise Begin
∙ A ← the attribute from Attributes that best* classifies Examples
∙ The decision attribute for Root ← A
∙ For each possible value, vi, of A,
∙ Add a new tree branch below Root, corresponding to the test A = vi
∙ Let Examples_vi be the subset of Examples that have value vi for A
∙ If Examples_vi is empty
∙ Then below this new branch add a leaf node with label = most common value
of Target_attribute in Examples
∙ Else below this new branch add the subtree
ID3(Examples_vi, Target_attribute, Attributes – {A})
∙ End
∙ Return Root
ENTROPY:
Entropy measures the impurity of a collection of examples.
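For a collection of examples S containing c distinct target-class values, with p_i the proportion of S belonging to class i, entropy is computed as

Entropy(S) = Σ (i = 1..c) −p_i log2(p_i)

For a boolean classification this reduces to Entropy(S) = −p+ log2(p+) − p− log2(p−).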
INFORMATION GAIN:
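Information gain measures the expected reduction in entropy caused by partitioning the examples according to an attribute A:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute A has value v. At each node, ID3 selects the attribute with the highest information gain; the program below implements these two measures in entropy() and compute_gain().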
import math
import csv

def load_csv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    headers = dataset.pop(0)
    return dataset, headers

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

def subtables(data, col, delete):
    dic = {}
    coldata = [row[col] for row in data]
    attr = list(set(coldata))
    counts = [0] * len(attr)
    r = len(data)
    c = len(data[0])
    # Count how many rows take each value of the attribute in column col
    for x in range(len(attr)):
        for y in range(r):
            if data[y][col] == attr[x]:
                counts[x] += 1
    # Build a subtable (list of rows) for each attribute value
    for x in range(len(attr)):
        dic[attr[x]] = [[0 for i in range(c)] for j in range(counts[x])]
        pos = 0
        for y in range(r):
            if data[y][col] == attr[x]:
                if delete:
                    del data[y][col]
                dic[attr[x]][pos] = data[y]
                pos += 1
    return attr, dic
def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:   # all examples belong to one class: no impurity
        return 0
    # Proportion of examples in each class value
    counts = [0] * len(attr)
    for i in range(len(attr)):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums
def compute_gain(data, col):
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    # Gain(S, A) = Entropy(S) - weighted sum of the subtable entropies
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy
def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if (len(set(lastcol))) == 1:
        # All examples have the same class: return a leaf node
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    # Split on the attribute with the highest information gain
    split = gains.index(max(gains))
    node = Node(features[split])
    fea = features[:split] + features[split + 1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node
def print_tree(node, level):
    if node.answer != "":
        print(" " * level, node.answer)
        return
    print(" " * level, node.attribute)
    for value, n in node.children:
        print(" " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)
'''Main program'''
dataset, features = load_csv("data3.csv")
node1 = build_tree(dataset, features)
print_tree(node1, 0)
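To classify a new sample, as the experiment statement requires, the classify() function defined above can be applied to a test instance. The file name data3_test.csv below is illustrative, not part of the source listing; any row with the same attribute order as the training file works.

# Hypothetical test file with the same attribute columns (no target column needed)
testdata, features_test = load_csv("data3_test.csv")
for xtest in testdata:
    print("The test instance:", xtest)
    print("The label for the test instance:", end=" ")
    classify(node1, xtest, features)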
Output:
Outlook
 rain
  Wind
   strong
    no
   weak
    yes
 overcast
  yes
 sunny
  Humidity
   normal
    yes
   high
    no
Program (Artificial Neural Network using Backpropagation):
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X / np.amax(X, axis=0)  # normalise X column-wise by its maximum
y = y / 100                 # scale target values to the range 0-1
#Sigmoid Function
def sigmoid (x):
return 1/(1 + np.exp(-x))
#Variable initialization
epoch=5000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2   #number of features in the data set
hiddenlayer_neurons = 3  #number of neurons in the hidden layer
output_neurons = 1       #number of neurons in the output layer
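# NOTE: the original listing omits the sigmoid derivative and the weight/bias
# initialisation that the forward and backward passes below rely on; a typical
# completion is sketched here (random uniform initialisation is an assumption).
def derivatives_sigmoid(x):
    return x * (1 - x)

# Weight and bias initialisation
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))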
#Forward Propagation
hinp1=np.dot(X,wh)
hinp=hinp1 + bh
hlayer_act = sigmoid(hinp)
outinp1=np.dot(hlayer_act,wout)
outinp= outinp1+ bout
output = sigmoid(outinp)
#Backpropagation
EO = y-output
outgrad = derivatives_sigmoid(output)
d_output = EO* outgrad
EH = d_output.dot(wout.T)
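# NOTE: the remaining backpropagation and update steps are not shown in the
# source listing; in the full program the forward, backward and update steps
# run inside a training loop such as `for i in range(epoch):`.
hiddengrad = derivatives_sigmoid(hlayer_act)   # gradient at the hidden layer
d_hiddenlayer = EH * hiddengrad
# Update the weights using the learning rate
wout += hlayer_act.T.dot(d_output) * lr
wh += X.T.dot(d_hiddenlayer) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)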
Output:
Input:
[[0.66666667 1.        ]
 [0.33333333 0.55555556]
 [1.         0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.89726759]
[0.87196896]
[0.9000671]]
8. Write a program to implement k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes
can be used for this problem.
Training algorithm:
∙ For each training example (x, f(x)), add the example to the list training_examples
Classification algorithm:
∙ Given a query instance xq to be classified,
∙ Let x1 . . . xk denote the k instances from training_examples that are nearest to xq
∙ Return f^(xq) = the most common value of f among x1 . . . xk (for a discrete-valued
target such as the iris class); for a real-valued target function, return the mean of
f(x1) . . . f(xk).
Data Set:
Iris Plants Dataset: the dataset contains 150 instances (50 in each of three classes).
Number of Attributes: 4 numeric, predictive attributes and the class.
Program:
""" Splits the dataset into 70% train data and 30% test data.
This means that out of total 150 records, the training set will
contain 105 records and the test set contains 45 of those
records """
x_train, x_test, y_train, y_test =
train_test_split(x,y,test_size=0.3)
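The fragment above shows only the train/test split; the rest of the program is not included in the source. A minimal sketch using scikit-learn that produces the kind of confusion matrix and accuracy metrics shown below is given here (k = 1 and the exact print formatting are assumptions).

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn import metrics

# Load the iris data set: x holds the 4 numeric attributes, y the class labels
iris = datasets.load_iris()
x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Train the k-Nearest Neighbour classifier and predict the test set
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Print correct and wrong predictions
for i in range(len(x_test)):
    result = "Correct" if y_pred[i] == y_test[i] else "Wrong"
    print(result, "- predicted:", iris.target_names[y_pred[i]],
          "actual:", iris.target_names[y_test[i]])

print("Confusion Matrix")
print(metrics.confusion_matrix(y_test, y_pred))
print("Accuracy Metrics")
print(metrics.classification_report(y_test, y_pred))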
Output:
Confusion Matrix
[[20 0 0]
[ 0 10 0]
[ 0 1 14]]
Accuracy Metrics
Basic knowledge
Confusion Matrix
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Accuracy: how often the classifier is correct; Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-Score: the harmonic mean of precision and recall; F1 = 2 × (Precision × Recall) / (Precision + Recall)
Support: the number of actual occurrences of a class in the data set.
Support = TP + FN
Example:
∙ Support_A = TP_A + FN_A = 30 + (20 + 10) = 60
9. Implement the non-parametric Locally Weighted Regression algorithm in
order to fit data points. Select appropriate data set for your experiment and
draw graphs.
Regression:
∙ Regression is a technique from statistics that is used to predict values of a
desired target quantity when the target quantity is continuous.
∙ In regression, we seek to identify (or estimate) a continuous variable y associated
with a given input vector x.
∙ y is called the dependent variable.
∙ x is called the independent variable.
Loess/Lowess Regression:
Loess regression is a nonparametric technique that uses local weighted regression
to fit a smooth curve through points in a scatter plot.
Lowess Algorithm:
∙ Locally weighted regression is a very powerful nonparametric model used in
statistical learning.
∙ Given a dataset X, y, we attempt to find a model parameter β(x) that
minimizes residual sum of weighted squared errors.
∙ The weights are given by a kernel function (k or w) which can be chosen arbitrarily
Algorithm
1. Read the given data sample into X and the curve (linear or non-linear) into Y.
2. Set the value of the smoothening (free) parameter τ.
3. Set the point of interest x0 (a point, or set of points, in the domain of X).
4. Determine the weight of each training point using the Gaussian kernel:
   w(x, x0) = exp( −(x − x0)² / (2τ²) )
5. Determine the local model parameter β by weighted least squares:
   β = (XᵀWX)⁻¹ XᵀWy, where W is the diagonal matrix of the weights
6. Prediction at the point of interest: ŷ = x0 · β
Program
import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
from bokeh.io import push_notebook

def local_regression(x0, X, Y, tau):
    # add a bias (intercept) term to the query point and the data
    x0 = np.r_[1, x0]
    X = np.c_[np.ones(len(X)), X]
    # fit the locally weighted model by solving the weighted normal equations
    xw = X.T * radial_kernel(x0, X, tau)
    beta = np.linalg.pinv(xw @ X) @ xw @ Y  # @ is matrix multiplication (dot product)
    # predict the value at the query point
    return x0 @ beta

# Weight (radial kernel) bias function
def radial_kernel(x0, X, tau):
    return np.exp(np.sum((X - x0) ** 2, axis=1) / (-2 * tau * tau))
n = 1000
# generate dataset
X = np.linspace(-3, 3, num=n)
print("The Data Set (10 Samples) X :\n", X[1:10])
Y = np.log(np.abs(X ** 2 - 1) + .5)
print("The Fitting Curve Data Set (10 Samples) Y :\n", Y[1:10])
# jitter X
X += np.random.normal(scale=.1, size=n)
print("Normalised (10 Samples) X :\n", X[1:10])
show(gridplot([
[plot_lwr(10.), plot_lwr(1.)],
[plot_lwr(0.1), plot_lwr(0.01)]]))
Output: the printed data samples and a grid of scatter plots of the generated data with the locally weighted regression fit for τ = 10, 1, 0.1 and 0.01.

Alternative Program (locally weighted regression on the tips data set):
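This second listing is incomplete in the source: the imports and the kernel() function used by localWeight() are missing. A typical completion (the alias np1 and the Gaussian kernel form are assumptions consistent with the calls below) is:

import numpy as np1
import pandas as pd
import matplotlib.pyplot as plt

def kernel(point, xmat, k):
    # Gaussian weights: nearby points get weight close to 1, distant points close to 0
    m, n = np1.shape(xmat)
    weights = np1.mat(np1.eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np1.exp(diff * diff.T / (-2.0 * k ** 2))
    return weights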
def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    # weighted least-squares solution for the local model parameters
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np1.shape(xmat)
    ypred = np1.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred
# load data points
data = pd.read_csv('tips.csv')
bill = np1.array(data.total_bill)
tip = np1.array(data.tip)
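# NOTE: the source listing jumps from loading the data to plotting; the matrix
# construction, the regression call and the sort used by ax.plot() below are
# reconstructed here (the smoothing parameter k = 0.5 is an assumption).
mbill = np1.mat(bill)
mtip = np1.mat(tip)
m = np1.shape(mbill)[1]
one = np1.mat(np1.ones(m))
X = np1.hstack((one.T, mbill.T))      # add a bias column of ones
ypred = localWeightRegression(X, mtip, 0.5)
SortIndex = X[:, 1].argsort(0)
xsort = X[SortIndex][:, 0]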
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(bill,tip, color='green')
ax.plot(xsort[:,1],ypred[SortIndex], color = 'red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show();
10. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.
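The source provides no program for this experiment. A minimal Python sketch using scikit-learn is given below; the file name naivetext.csv and its two columns (message text and a pos/neg label) are assumptions, not part of the source.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Assumed CSV layout: column 0 = document text, column 1 = label ('pos'/'neg')
msg = pd.read_csv('naivetext.csv', names=['message', 'label'])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum

xtrain, xtest, ytrain, ytest = train_test_split(X, y)

# Convert the documents into a bag-of-words count matrix
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)

# Train the naive Bayes classifier and predict the test documents
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

print('Accuracy:', metrics.accuracy_score(ytest, predicted))
print('Precision:', metrics.precision_score(ytest, predicted))
print('Recall:', metrics.recall_score(ytest, predicted))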
Experiment-11: Apply EM algorithm to cluster a Heart Disease Data Set. Use the same
data set for clustering using k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.
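The listing below works on the iris data set and omits its imports; a typical set of imports and setup, assumed from the calls that follow, is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn import preprocessing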
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
model = KMeans(n_clusters=3)
model.fit(X)
plt.figure(figsize=(14,7))
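# NOTE: the source listing omits the steps between plt.figure() and the GMM
# prediction; a typical completion (colour choices and subplot layout assumed):
colormap = np.array(['red', 'lime', 'black'])

# Plot the real classes and the K-Means clusters for comparison
plt.subplot(2, 2, 1)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

plt.subplot(2, 2, 2)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Means Classification')

# Standardise the data and fit a Gaussian mixture model (EM algorithm)
scaler = preprocessing.StandardScaler()
scaler.fit(X)
xs = pd.DataFrame(scaler.transform(X), columns=X.columns)
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)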
y_gmm = gmm.predict(xs)
#y_cluster_gmm
plt.subplot(2, 2, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_gmm], s=40)
plt.title('GMM Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
Theory
A Bayesian network is a directed acyclic graph in which each edge corresponds to a
conditional dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of
conditional probability distributions.
• The nodes of the directed acyclic graph represent the random variables.
• The conditional probability distribution of a node (random variable) is defined for
every possible outcome of its preceding causal node(s).
For illustration, consider the following example. Suppose we attempt to turn on our
computer, but the computer does not start (observation/evidence). We would like to
know which of the possible causes of computer failure is more likely. In this
simplified illustration, we assume only two possible causes of this misfortune:
electricity failure and computer malfunction.
The corresponding directed acyclic graph is depicted in the figure below.
Fig: Directed acyclic graph representing two independent possible causes of a computer failure.
The goal is to calculate the posterior conditional probability distribution of each of
the possible unobserved causes given the observed evidence, i.e. P [Cause |
Evidence].
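By Bayes' theorem, this posterior is computed as

P(Cause | Evidence) = P(Evidence | Cause) · P(Cause) / P(Evidence),

with the required probabilities read off the conditional probability distributions attached to the network's nodes.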
Data Set:
Title: Heart Disease Databases
The Cleveland database contains 76 attributes, but all published experiments refer to
using a subset of 14 of them. In particular, the Cleveland database is the only one
that has been used by ML researchers to this date. The "Heartdisease" field refers to
the presence of heart disease in the patient. It is integer valued from 0 (no presence)
to 4.
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
• Value 1: typical angina
• Value 2: atypical angina
• Value 3: non-anginal pain
• Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
• Value 0: normal
• Value 1: having ST-T wave abnormality (T wave inversions and/or ST
elevation or depression of > 0.05 mV)
• Value 2: showing probable or definite left ventricular hypertrophy by Estes'
criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
• Value 1: upsloping
• Value 2: flat
• Value 3: downsloping
12. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
13. Heartdisease: integer valued from 0 (no presence) to 4.
Program:
import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
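The source listing stops after the imports. A minimal sketch of the remaining steps (reading the data, defining the network structure, fitting the conditional probability distributions by maximum likelihood, and querying with variable elimination) is given below; the file name heart.csv, the column names, the chosen edges, and the evidence value are illustrative assumptions.

# Read the Cleveland heart disease data (column names assumed to match the
# attribute list above, with '?' marking missing values)
heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)

# Define an assumed network structure over a few of the attributes
model = BayesianModel([
    ('age', 'heartdisease'),
    ('sex', 'heartdisease'),
    ('exang', 'heartdisease'),
    ('cp', 'heartdisease'),
    ('heartdisease', 'restecg'),
    ('heartdisease', 'chol'),
])

# Learn the conditional probability distributions by maximum likelihood
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Infer the probability of heart disease given some observed evidence
infer = VariableElimination(model)
q = infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q)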
Output: