CV Unit 2
Syllabus content
Unit 4: Basic Image and Digital Image Processing
Camera:
Light:
Color:
Real World Computer Vision Applications
OpenCV, short for Open Source Computer Vision Library, is an open-source computer vision and machine
learning software library. Originally developed by Intel, it is now maintained by a community of developers
under the OpenCV Foundation.
OpenCV is one of the most popular computer vision libraries. If you want to start your journey in the field
of computer vision, then a thorough understanding of the concepts of OpenCV is of paramount importance.
To understand the basic functionality of the Python OpenCV module, we will intuitively cover its most basic and
important concepts:
1. Reading an image
2. Extracting the RGB values of a pixel
3. Extracting the Region of Interest (ROI)
4. Resizing the Image
5. Rotating the Image
6. Drawing a Rectangle
7. Displaying text, etc.
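A minimal sketch of these basic operations, assuming an image file named input.jpg exists in the working directory (the filename, pixel positions and coordinates below are placeholders):

import cv2

# Reading an image (OpenCV loads it as a BGR array)
img = cv2.imread("input.jpg")

# Extracting the (B, G, R) values of the pixel at row 100, column 50
(b, g, r) = img[100, 50]

# Extracting a Region of Interest (ROI) by slicing the array
roi = img[60:160, 320:420]

# Resizing the image to 800 x 600 pixels
resized = cv2.resize(img, (800, 600))

# Rotating the image by 45 degrees about its centre
(h, w) = img.shape[:2]
M = cv2.getRotationMatrix2D((w // 2, h // 2), 45, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))

# Drawing a rectangle and displaying text
cv2.rectangle(img, (320, 60), (420, 160), (0, 255, 0), 2)
cv2.putText(img, "OpenCV", (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)

cv2.imshow("Result", img)
cv2.waitKey(0)
cv2.destroyAllWindows()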
What is an Edge?
In computer vision, an edge in an image is a significant local change in the image's
brightness, hue, or intensity. Edges are often associated with discontinuities in the image's
intensity or its first derivative. They can also be defined as a set of connected pixels that form a
boundary between two different regions.
Edges can be distinguished from noise by their long-range structure. They also have properties
such as gradient and orientation. Discontinuities in an image's brightness can be caused by
changes in depth, surface orientation, scene illumination, or material properties.
Edge detection is an important task in object recognition. When two non-parallel edges meet,
they form a corner.
Line Detection
An image in a photograph is called a raw image, and in order to extract useful information
from it, it must be put in a certain form. The first step in preparing the picture for higher-
level processing is called pre-processing. The purpose of pre-processing is two-fold: to
eliminate undesirable features that will hinder further processing and to extract the
desirable features that represent useful information in the image. Unwanted image
attributes include noise (insignificant lines and contours) and the presence of featureless
space. The important features include surface details and boundaries such as lines, edges,
and vertices.
Edge Detection
The first step in object pre-processing is edge detection. To isolate an image from its
background and neighboring images, you must first recognize its edges. An edge in an
image is an image contour across which the image's brightness or hue changes abruptly,
perhaps in the magnitude or in the rate of change of the magnitude. These edges are
modeled as mathematical discontinuities.
Two methods of edge detection are Thresholding using Histograms and Gaussian
Convolution.
Thresholding
Thresholding is a process by which the intensity resolution of a picture is reduced
(to be displayed, for example, on a computer that does not support as high an
intensity resolution as the picture). The threshold should be between the average
intensity of the object and the average intensity of the background or other object.
Histograms aid in determining this value.
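As a quick sketch, global thresholding in OpenCV might look like this (the value 127 is an arbitrary placeholder; in practice the threshold would be chosen from the histogram, as described below):

import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# pixels above 127 become white (255), the rest become black (0)
ret, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

cv2.imshow("Thresholded", binary)
cv2.waitKey(0)
cv2.destroyAllWindows()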
Histograms
Histograms are used to separate an image from its background and to separate objects of
different colors, by locating the changes in intensity in the picture. A histogram is used in
one of two ways. The first is as a graph of the frequency of occurrence of each level of
intensity in an image. It indicates the picture's changes in intensity by finding the values of
all the pixels in the picture and then plotting the number of pixels that have each value. If
the image is a high-contrast image, where the difference between the object and the
background is large, the graph will contain several well-defined peaks.
The threshold lies between these peaks, in one of the troughs. The second way in which
histograms are utilized is to record the value of every pixel as it appears in the
picture. This method is employed more for edge detection: as the resolution of the
graph increases, a more defined edge or threshold can be located by closer
examination of the change in the pixels' intensities. Notice the abrupt intensity
changes in this close-up:
One issue that may arise in edge detection with histograms is that of noise. In the
histogram shown, there is more than one peak; some of the peaks could be
mistaken for an edge. These subsidiary peaks occur because of the presence of
noise in the image. Below is an example of a noisy image, and its results after edge
detection:
To try out edge detection for yourself, check out the Edge Detector Demo at
Carnegie Mellon University.
Edges provide strong visual clues that can help the recognition process.
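A short sketch of computing and plotting such an intensity histogram with OpenCV and matplotlib; a threshold can then be picked by inspecting the troughs between peaks (the filename is a placeholder):

import cv2
import matplotlib.pyplot as plt

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# frequency of occurrence of each of the 256 intensity levels
hist = cv2.calcHist([img], [0], None, [256], [0, 256])

plt.plot(hist)
plt.xlabel("Intensity level")
plt.ylabel("Number of pixels")
plt.show()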
Dilation:
Dilation expands the image pixels, i.e., it is used for expanding an element A by
using a structuring element B.
Dilation adds pixels to object boundaries.
The value of the output pixel is the maximum value of all the pixels in the
neighborhood. A pixel is set to 1 if any of the neighboring pixels have the value 1.
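A minimal sketch of dilation (and, for comparison, erosion) in OpenCV, assuming a binary input image; the 5x5 kernel plays the role of the structuring element B:

import cv2
import numpy as np

img = cv2.imread("binary_input.jpg", cv2.IMREAD_GRAYSCALE)

# 5x5 square structuring element
kernel = np.ones((5, 5), np.uint8)

# dilation: each output pixel is the maximum over its 5x5 neighborhood
dilated = cv2.dilate(img, kernel, iterations=1)

# erosion: each output pixel is the minimum over its 5x5 neighborhood
eroded = cv2.erode(img, kernel, iterations=1)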
Dilation vs. Erosion:
Dilation increases the size of the objects, whereas erosion decreases the size of the objects.
Dilation fills holes and broken areas, whereas erosion removes small anomalies.
Dilation connects areas that are separated by spaces smaller than the structuring element, whereas erosion reduces the brightness of bright objects.
Dilation increases the brightness of the objects, whereas erosion removes objects smaller than the structuring element.
Dilation follows the distributive, duality, translation and decomposition properties, and erosion likewise follows properties such as duality.
Dilation of A by B is the Minkowski sum, written A ⊕ B; erosion, written A ⊖ B, is the dual of dilation.
Dilation is performed first in the Closing operation and second in the Opening operation, whereas erosion is performed second in the Closing operation and first in the Opening operation.
Opening vs. Closing:
Opening is a process in which an erosion operation is performed first and then a dilation operation; closing is a process in which a dilation operation is performed first and then an erosion operation.
Opening eliminates the thin protrusions of the obtained image; closing eliminates the small holes in the obtained image.
Opening is used for smoothing the contour and breaking narrow isthmuses; closing is used for fusing narrow breaks and removing internal noise from the obtained image.
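A minimal sketch of opening and closing with OpenCV's morphologyEx, again assuming a binary input and a 5x5 structuring element:

import cv2
import numpy as np

img = cv2.imread("binary_input.jpg", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((5, 5), np.uint8)

# opening: erosion followed by dilation (removes small protrusions and noise)
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

# closing: dilation followed by erosion (fills small holes and breaks)
closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)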
Perspective Transformation
When human eyes see nearby objects, they look bigger compared with those that are far away. This is called
perspective in a general way, whereas a transformation is the transfer of an object, etc., from one state to
another. So overall, perspective transformation deals with the conversion of the 3d world into a 2d image. This is the
same principle on which human vision works and the same principle on which a camera works.
We will see in detail why this happens: objects which are near to you look bigger, while those that are far
away look smaller, even though they look bigger when you reach them.
Frame of reference:
Object
World
Camera
Image
Pixel
Object coordinate frame
Object coordinate frame is used for modeling objects. For example, checking if a particular object is in a
proper place with respect to the other object. It is a 3d coordinate system.
World coordinate frame is used for co-relating objects in a 3 dimensional world. It is a 3d coordinate system.
Camera co-ordinate frame is used to relate objects with respect of the camera. It is a 3d coordinate system.
Image coordinate frame is not a 3d coordinate system; rather, it is a 2d system. It is used to describe how 3d points are mapped onto a
2d image plane. The pixel coordinate frame is likewise a 2d system, in which image positions are indexed by discrete pixel coordinates.
In this projection,
Y = size of the 3d object
y = size of the 2d image
f = focal length of the camera
Z = distance of the object from the camera
Now there are two different angles formed in this transform, both represented by Q.
The first angle satisfies

tan(Q) = -y / f

where the minus sign denotes that the image is inverted. The second angle that is formed satisfies:

tan(Q) = Y / Z

Equating the two gives -y / f = Y / Z, which rearranges to the perspective projection equation y = -(f * Y) / Z.
From this equation, we can see that when the rays of light reflected from the object pass through the camera, an inverted image is formed.
For example
Suppose an image has been taken of a person 5 m tall, standing at a distance of 50 m from the camera,
and we have to find the size of the image of the person, given a camera with a focal length of 50 mm.
Solution:
Since the focal length is in millimetres, we have to convert everything to millimetres in order to calculate it.
So,
Y = 5000 mm
f = 50 mm
Z = 50000 mm

y = -(f * Y) / Z = -(50 * 5000) / 50000 = -5 mm

The minus sign again indicates that the image is inverted.
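The same calculation as a quick sketch in Python:

# perspective projection: y = -(f * Y) / Z, all lengths in millimetres
Y = 5000.0   # object height
f = 50.0     # focal length
Z = 50000.0  # object distance

y = -(f * Y) / Z
print(y)  # -5.0 mm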
Image Pyramids
Image pyramids are one of the most beautiful concepts in image processing. Normally, we
work with images at their default resolution, but many times we need to change the resolution
(lower it) or resize the original image; in that case image pyramids come in handy.
The pyrUp() function increases the size to double the original size and the pyrDown() function
decreases the size to half. If we keep the original image as a base image and go on
applying the pyrDown function on it and keep the images in a vertical stack, it will look like a
pyramid. The same is true for upscaling the original image with the pyrUp function.
Once we scale down, if we rescale the image back to the original size, we lose some information and the resolution
of the new image is much lower than the original one.
Below is an example of Image Pyramiding –
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("images/input.jpg")
layer = img.copy()
for i in range(4):
    plt.subplot(2, 2, i + 1)
    # matplotlib expects RGB, but OpenCV loads images as BGR
    plt.imshow(cv2.cvtColor(layer, cv2.COLOR_BGR2RGB))
    cv2.imshow(str(i), layer)
    # generate the next (half-sized) pyramid level
    layer = cv2.pyrDown(layer)
plt.show()
cv2.waitKey(0)
cv2.destroyAllWindows()
Output:
Reading an Image
Images are represented as arrays consisting of pixel values. 8-bit images have pixel values ranging from 0
(black) to 255 (white). Depending on the color scale there are various channels in an image, each channel
representing the pixel values for one particular color. RGB (Red, green, blue) is the most commonly used
color scale and all images I’ve used in my examples are RGB images.
We can easily read the image array using the imread function from OpenCV. One thing to remember here is that OpenCV reads images in BGR (blue, green, red) channel order rather than RGB.
Cropping
Cropping is a widely used augmentation technique. However, be careful not to crop important parts of the
image (pretty obvious, but easy to miss when you have too many images of various different sizes). Since
images are represented using arrays, cropping is equivalent to taking a slice out of an array:
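A minimal sketch (the variable name, filename and slice bounds are placeholders):

import cv2
import matplotlib.pyplot as plt

im = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

# keep rows 50-199 and columns 100-299 (a rectangular region of interest)
cropped = im[50:200, 100:300]
plt.imshow(cropped)
plt.show()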
Resizing
Most deep learning model architectures expect all input images to be of the same dimensions.
resized = cv2.resize(im, (120,90))
plt.imshow(resized)
Flipping image
This is another very popular image augmentation technique. The only thing to remember here is that the
flipping should make sense for your use case. For example, if you’re classifying building types, you wouldn’t
encounter any inverted buildings in your test set so it doesn’t make sense to do a vertical flip in this case.
import numpy as np

# flip vertically (axis 0 flips rows); use axis 1 for a horizontal flip
flip_v = np.flip(im, 0)
plt.imshow(flip_v)
Rotate Image:
In most cases, it is okay to rotate the image by a small angle. The naive way of doing this might change the
image dimensions or cut off the corners of the image. Hence, a better way of rotating is by doing an affine transform using OpenCV. An affine transformation
preserves collinearity and ratios of distances (e.g., the midpoint of a line segment continues to remain the
midpoint even after transformation). You can also fill the borders by using the BORDER_REFLECT flag.
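A minimal sketch of such a rotation (the 10-degree angle and filename are arbitrary choices):

import cv2

im = cv2.imread("input.jpg")
(h, w) = im.shape[:2]

# rotate 10 degrees about the image centre, keeping the same output size
M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)
rotated = cv2.warpAffine(im, M, (w, h), borderMode=cv2.BORDER_REFLECT)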
Brightness and Contrast:
A simple way to adjust brightness and contrast is the linear transform new_pixel = alpha * old_pixel + beta. Here alpha (>0) is called gain and beta is called bias; these parameters are said to control contrast and brightness
respectively. Since we represent images using arrays, this function can be applied to each pixel by traversing through
the array.
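A quick sketch using OpenCV's convertScaleAbs, which applies exactly this per-pixel linear transform and clips the result to the 0-255 range (the alpha and beta values are arbitrary):

import cv2

im = cv2.imread("input.jpg")

# increase contrast by 30% and brightness by 40 intensity levels
adjusted = cv2.convertScaleAbs(im, alpha=1.3, beta=40)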
Object detection is a very popular computer vision problem that involves finding a bounding box enclosing
the object of interest. Displaying the bounding box on the picture can help us visually inspect the problem
and requirements. One thing to remember while dealing with these problems is that if you’re planning to flip
the image, make sure you flip the box coordinates accordingly too. Here’s an easy way to display a bounding box on an image:
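A minimal sketch with cv2.rectangle (the box coordinates are placeholders for a real detection):

import cv2
import matplotlib.pyplot as plt

im = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

# box given as (x_min, y_min) and (x_max, y_max) corners
cv2.rectangle(im, (100, 50), (300, 250), (255, 0, 0), 2)
plt.imshow(im)
plt.show()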
Often we want to inspect multiple images in one go. It can be easily done using subplots in matplotlib.
Although not widely used in computer vision, it’s nice to know how to convert color images to greyscale.
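A quick sketch covering both points: converting to greyscale and inspecting the two versions side by side with subplots:

import cv2
import matplotlib.pyplot as plt

im = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
gray = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)

plt.subplot(1, 2, 1)
plt.imshow(im)
plt.subplot(1, 2, 2)
plt.imshow(gray, cmap="gray")
plt.show()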
Blur:
This technique can be useful in making your model more robust to image quality issues. If a model can
perform well on blurred images, it may indicate the model is doing well in general.
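A minimal sketch of Gaussian blurring (the 15x15 kernel size is an arbitrary choice; larger kernels blur more):

import cv2

im = cv2.imread("input.jpg")

# kernel size must be odd; sigma is derived from the kernel size when set to 0
blurred = cv2.GaussianBlur(im, (15, 15), 0)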
Thresholding
Thresholding is one of the segmentation techniques that generates a binary image (a binary image is one
whose pixels have only two values – 0 and 1 and thus requires only one bit to store pixel intensity) from a
given grayscale image by separating it into two regions based on a threshold value. Hence pixels having
intensity values greater than the said threshold will be treated as white or 1 in the output image and the
others will be black or 0.
Adaptive thresholding can be used to convert grayscale images to binary, separate objects from their
backgrounds, and improve segmentation.
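A minimal sketch contrasting global and adaptive thresholding (the block size of 11 and constant of 2 are illustrative values, not ones prescribed by the text):

import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# global threshold: one value for the whole image
_, global_bin = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# adaptive threshold: value computed per 11x11 neighbourhood, minus the constant 2
adaptive_bin = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 11, 2)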
To understand an object in an image, we first need to find its shape, determined by its contour. The
boundary or contour marks the outline of an object in an image. So, detecting contours plays a vital role in
applications for identifying and segmenting objects in an image.
A contour consists of the pixels in an object’s boundary:
These pixels are usually of the same color, differentiating them from the rest.
3. Contour Representation
We represent contours with chain codes and shape numbers. These parameters help in clear representation
and a better understanding of object contour. However, computing these parameters is purely optional in the
process of finding contours.
We can trace contours with chain codes. A chain code indicates the directions of tracing along the
boundary. Tracing starts from the selected initial point and proceeds clockwise.
There are two types of chain codes, 4-directional and 8-directional:
They differ in the number of directions along which we can trace a contour and specify a unique number for
each direction. Let’s see how an object can be represented with chain codes:
The image after sampling shows the boundary pixels. These are the contour points. Assuming the starting
position is top-left, we get the chain code moving clockwise. In our example, the 4-directional and 8-
directional chain codes are and .
3.2. The Shape Numbers and First Differences
A shape number represents a normalized version of the corresponding chain code’s first difference.
The first difference shows how many counterclockwise directional changes were made in the chain code.
We compute the first difference by considering adjacent pairs one at a time. For example, the 4-
directional chain code pair needs no (counterclockwise) directional changes, so their first difference is .
However, the chain code pair takes changes, so the first difference is .
We concatenate the differences of the consecutive pairs to get the first difference of a chain code. For
example, the first difference of the four-directional code is .
A shape number is the same as the first difference, except that it starts from the lowest number in the first
difference. For instance, assuming the object’s first difference is , its shape number is . The differences
before the first lowest number are cyclically shifted to the left:
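As a quick sketch of the computation, assuming a 4-directional chain code stored as a list of integers (the example code below is illustrative, not the one from the figure):

def first_difference(chain, directions=4):
    # counterclockwise steps between consecutive codes, treating the chain as circular
    n = len(chain)
    return [(chain[(i + 1) % n] - chain[i]) % directions for i in range(n)]

def shape_number(chain, directions=4):
    # the shape number is the cyclic rotation of the first difference
    # that forms the smallest sequence
    diff = first_difference(chain, directions)
    rotations = [diff[i:] + diff[:i] for i in range(len(diff))]
    return min(rotations)

example = [0, 0, 3, 3, 2, 2, 1, 1]   # a small square traced clockwise
print(first_difference(example))
print(shape_number(example))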
OpenCV (Open Source Computer Vision) is a computer vision library that contains various functions to
perform operations on Images or videos. OpenCV library can be used to perform multiple operations on
videos.
Let’s see how to detect the corner in the image.
The cv2.goodFeaturesToTrack() method finds the N strongest corners in the image by the Shi-Tomasi method. Note that
the image should be a grayscale image. Specify the number of corners you want to find and the quality level
(a value between 0 and 1), which denotes the minimum corner quality below which candidates are rejected.
Then provide the minimum Euclidean distance between detected corners.
Syntax : cv2.goodFeaturesToTrack(image, maxCorners, qualityLevel, minDistance[, corners[, mask[,
blockSize[, useHarrisDetector[, k]]]]])
Image before corner detection:
plt.imshow(img), plt.show()
Finding and Drawing Contours with OpenCV
When we join all the points on the boundary of an object, we get a contour. Typically, a specific contour
refers to boundary pixels that have the same color and intensity. OpenCV makes it really easy to find and
draw contours in images. It provides two simple functions:
1. findContours()
2. drawContours()
findContours() can record contour points in one of two approximation modes:
1. CHAIN_APPROX_SIMPLE, which compresses horizontal, vertical and diagonal segments and stores only their end points.
2. CHAIN_APPROX_NONE, which stores all of the boundary points.
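A minimal sketch of finding and drawing contours, assuming OpenCV 4.x (where findContours returns two values) and a placeholder filename:

import cv2

img = cv2.imread("input.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# retrieve all contours, keeping only the end points of straight segments
contours, hierarchy = cv2.findContours(binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# draw every contour (-1) on a copy of the original image in green
output = img.copy()
cv2.drawContours(output, contours, -1, (0, 255, 0), 2)
cv2.imshow("Contours", output)
cv2.waitKey(0)
cv2.destroyAllWindows()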
Hough transform is a feature extraction method for detecting simple shapes such as circles and lines in an image. A “simple”
shape is one that can be represented by only a few parameters. For example, a line can be represented by two parameters (slope,
intercept) and a circle by three parameters: the coordinates of the center and the radius (x, y, r). The Hough transform does an
excellent job of finding such shapes in an image.
The main advantage of using the Hough transform is that it is insensitive to occlusion. Let’s see how Hough transform works
by way of an example.
From high school math class we know the polar form of a line is represented as:

ρ = x cos(θ) + y sin(θ)    ... (1)

Here ρ represents the perpendicular distance of the line from the origin in pixels, and θ is the angle, measured in radians,
which this perpendicular makes with the x-axis, as shown in the figure above.
You may be tempted to ask why we did not use the familiar equation of the line given below:

y = mx + c

The reason is that the slope, m, can take values between -∞ and +∞. For the Hough transform, the parameters need to be
bounded.
You may also have a follow-up question. In the (ρ, θ) form, θ is bounded, but can’t ρ take a value between 0 and +∞? That
may be true in theory, but in practice, ρ is also bounded because the image itself is finite.
Accumulator
When we say that a line in 2D space is parameterized by ρ and θ, it means that if we pick any (ρ, θ), it corresponds to a
line. Imagine a 2D array where the x-axis has all possible θ values and the y-axis has all possible ρ values. Any bin in this 2D
array corresponds to one line.
Fig2 Accumulator
This 2D array is called an accumulator because we will use the bins of this array to collect evidence about which lines exist in
the image. The top left cell corresponds to (-R, 0) and the bottom right corresponds to (R, π).
We will see in a moment that the value inside the bin (ρ, θ) will increase as more evidence is gathered about the presence of a
line with parameters ρ and θ.
First, we need to create an accumulator array. The number of cells you choose to have is a design decision. Let’s say you chose
a 10×10 accumulator. It means that ρ can take only 10 distinct values and θ can take only 10 distinct values, and therefore you
will be able to detect 100 different kinds of lines. The size of the accumulator will also depend on the resolution of the image.
But if you are just starting, don’t worry about getting it perfectly right. Pick a number like 20×20 and see what results you get.
Now that we have set up the accumulator, we want to collect evidence for every cell of the accumulator because every cell of
the accumulator corresponds to one line.
How do we collect evidence?
The idea is that if there is a visible line in the image, an edge detector should fire at the boundaries of the line. These edge
pixels provide evidence for the presence of a line.
For every edge pixel (x, y) in the above array, we vary the values of θ from 0 to π and plug them into equation 1 to obtain a value
for ρ.
In the figure below we vary θ for three pixels (represented by the three colored curves) and obtain the values for ρ using
equation 1.
As you can see, these curves intersect at a point, indicating that a line with those parameters ρ and θ is passing
through them.
Typically, we have hundreds of edge pixels and the accumulator is used to find the intersection of all the curves generated by
the edge pixels.
Let’s say our accumulator is 20×20 in size. So, there are 20 distinct values of θ, and for every edge pixel (x, y) we can
calculate 20 (ρ, θ) pairs by using equation 1. The bin of the accumulator corresponding to each of these 20 pairs is
incremented. We do this for every edge pixel, and now we have an accumulator that has all the evidence about all possible lines
in the image. We can simply select the bins in the accumulator above a certain threshold to find the lines in the image. If the
threshold is higher, you will find fewer strong lines, and if it is lower, you will find a large number of lines including some
weak ones.

HoughLine: How to Detect Lines using OpenCV
Syntax:
lines = cv2.HoughLines(image, rho, theta, threshold)
Parameters:
image – 8-bit, single-channel binary source image (usually the output of an edge detector such as Canny)
rho – distance resolution of the accumulator, in pixels
theta – angle resolution of the accumulator, in radians
threshold – accumulator threshold; only lines receiving more than this number of votes are returned
Code:
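A minimal sketch, assuming an input file road.png and a 200-vote threshold (both placeholders):

import cv2
import numpy as np

# Read image and compute an edge map first
img = cv2.imread("road.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

# 1-pixel and 1-degree accumulator resolution, 200-vote threshold
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)

# Draw each detected (rho, theta) line on the original image
if lines is not None:
    for line in lines:
        rho, theta = line[0]
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho
        pt1 = (int(x0 + 1000 * (-b)), int(y0 + 1000 * a))
        pt2 = (int(x0 - 1000 * (-b)), int(y0 - 1000 * a))
        cv2.line(img, pt1, pt2, (0, 0, 255), 2)

# Show result
cv2.imshow("Detected lines", img)
cv2.waitKey(0)
cv2.destroyAllWindows()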
Syntax :
cv2.goodFeaturesToTrack(image, maxCorners, qualityLevel, minDistance[, corners[, mask[,
blockSize[, useHarrisDetector[, k]]]]])
import numpy as np
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('corner1.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# detect the 25 strongest corners with the Shi-Tomasi method
corners = cv2.goodFeaturesToTrack(gray, 25, 0.01, 10)
corners = corners.astype(int)
for i in corners:
    x, y = i.ravel()
    cv2.circle(img, (int(x), int(y)), 3, (0, 0, 255), -1)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)), plt.show()
Parameters :
gray_img – Grayscale image with integral values
maxc – Maximum number of corners we want (give a negative value to get all the corners)
Q – Quality level parameter (preferred value = 0.01)
maxD – Maximum distance (preferred value = 10)
Image Transformations
Consider, for example, remote sensing images. These images are captured using satellites, and different operations are
applied to them. These operations aim at image transformations that are helpful in further analysis of the
image. No matter which method we adopt, we get a new image generated from one or more source images.
These are called image transformations.
In basic image transformation, we apply arithmetic operations to our image data. For example, image
subtraction is performed to detect the changes between two images of the same area captured at different
times. Let’s explore more about the types of image transformation.
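A minimal sketch of change detection by image subtraction, assuming two registered images of the same area (the filenames are placeholders):

import cv2

img1 = cv2.imread("scene_t1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene_t2.jpg", cv2.IMREAD_GRAYSCALE)

# absolute per-pixel difference highlights regions that changed between the two dates
change = cv2.absdiff(img1, img2)

cv2.imshow("Change map", change)
cv2.waitKey(0)
cv2.destroyAllWindows()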
A generative adversarial network (GAN) is a machine learning (ML) model in which two neural
networks compete with each other, using deep learning methods to become more accurate in their
predictions. GANs typically run unsupervised and learn through a zero-sum game framework,
where one player's gain equals the other player's loss.
The two neural networks that make up a GAN are referred to as the generator and the discriminator. Typically, the
generator is a deconvolutional (upsampling) neural network and the discriminator is a convolutional neural network.
The goal of the generator is to artificially manufacture outputs that could easily be mistaken for real data.
The goal of the discriminator is to identify which of the outputs it receives have been artificially created.
Essentially, generative models create their own training data. While the generator is trained to produce
false data, the discriminator network is taught to distinguish between the generator's manufactured data
and true examples. If the discriminator rapidly recognizes the fake data that the generator produces -- such
as an image that isn't a human face -- the generator suffers a penalty. As the feedback loop between the
adversarial networks continues, the generator will begin to produce higher-quality and more believable
output and the discriminator will become better at flagging data that has been artificially created. For
instance, a generative adversarial network can be trained to create realistic-looking images of human faces
that don't belong to any real person.
The first step in establishing a GAN is to identify the desired end output and gather an initial training data
set based on those parameters. This data is then randomized and input into the generator until it acquires
basic accuracy in producing outputs.
Next, the generated samples or images are fed into the discriminator along with actual data points from the
original concept. After the generator and discriminator models have processed the data, optimization
with backpropagation starts. The discriminator filters through the information and returns a probability
between 0 and 1 to represent each image's authenticity -- 1 correlates with real images and 0 correlates
with fake. These values are then checked for success, and the process is repeated until the desired outcome is
reached.
This creates a double feedback loop where the discriminator is in a feedback loop with the ground truth of
the images and the generator is in a feedback loop with the discriminator.
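The training loop described above can be sketched as follows. This is a minimal illustration that assumes PyTorch; the tiny multilayer-perceptron networks, the learning rates and the random placeholder batch standing in for real images are all illustrative choices:

import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, img_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.rand(32, img_dim) * 2 - 1      # placeholder "real" batch
    noise = torch.randn(32, latent_dim)
    fake = generator(noise)

    # Discriminator: push real samples toward 1 and fake samples toward 0
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator output 1 for fakes
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()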
Types of GANs
GANs come in a variety of forms and can be used for various tasks. The following are the most common
GAN types:
Vanilla GAN. This is the simplest of all GANs, and its algorithm tries to optimize the objective
using stochastic gradient descent, a method that learns from a data set by going
through one example (or small batch) at a time. It consists of a generator and a discriminator, both
implemented as straightforward multi-layer perceptrons. The discriminator seeks to determine the likelihood that
the input belongs to a particular class, while the generator learns the distribution of the data.
Conditional GAN. By applying class labels, this kind of GAN enables the conditioning of the
network with new and specific information. As a result, during GAN training, the network receives
the images with their actual labels, such as "rose," "sunflower" or "tulip" to help it learn how to
distinguish between them.
Deep convolutional GAN. This GAN uses deep convolutional neural networks to produce
higher-resolution images. Convolutions are a technique for
drawing out important information from the data. They work particularly well with
images, enabling the network to quickly absorb the essential details.
CycleGAN. This is the most common GAN architecture and is generally used to learn how to
transform between images of various styles. For instance, a network can be taught how to alter an
image from winter to summer or from an image of a horse to a zebra. One of the most well-known
applications of CycleGAN is FaceApp, which alters human faces into various age groups.
StyleGAN. Researchers from Nvidia released StyleGAN in December 2018 and proposed
significant improvements to the original generator architecture models. StyleGAN can produce
photorealistic, high-quality photos of faces, but users can modify the model to alter the appearance
of the images that are produced.
Super resolution GAN. With this type of GAN, a low-resolution image can be changed into a more
detailed one. Super-resolution GANs increase the image resolution by filling in blurry spots.
GANs are becoming a popular ML model for online retail sales because of their ability to understand and
recreate visual content with increasingly remarkable accuracy. They can be used for a variety of tasks,
including anomaly detection, data augmentation, picture synthesis, and text-to-image and image-to-image
translation.
They can also translate photos from image sketches or semantic images, which is especially useful in the
healthcare industry for diagnoses.
GAN examples
GANs are used to generate a wide range of data types, including images, music and text. The following
are popular real-world examples of GAN:
Generating human faces. GANs can produce accurate representations of human faces. For
example, StyleGAN2 from Nvidia can produce excellent, photorealistic images of people that don't
exist. These pictures are so lifelike that many people believe they're actual individuals.
Developing new fashion designs. GANs can be used to create new fashion designs that reflect
existing ones. For instance, clothing retailer H&M used GANs to create new apparel designs for its
merchandise.
Creating realistic animal images. GANs can also generate realistic images of animals. For
example, BigGAN, a GAN model developed by Google researchers, can produce high-quality
images of animals such as birds and dogs.
Video game character creation. GANs can be used to create new characters for video games. For
example, Nvidia created new characters using GANs for the well-known video game Final
Fantasy XV.
Generating realistic three-dimensional (3D) objects. GANs are also capable of producing
realistic 3D objects. For example, researchers at the Massachusetts Institute of Technology have used GANs to create
3D models of chairs and other furniture that appear to have been designed by people.
These models can be applied to architectural visualization or video games.