CV2019
C. Rasche
compvis12 [at] gmail [dot] com
January 1, 2019
This is a dense introduction to the field of computer vision. It covers all topics including Deep Neural Networks (TensorFlow and PyTorch). It provides plenty of code snippets and copy-paste examples for Matlab and Python. It aims at the practitioner who tries to understand the working principles of the algorithms.
We first sketch some of the basic feature extraction methods, as they also help to understand the architecture of Deep Neural Networks. Then we proceed with feature extraction and matching based on histograms of gradients - they form the basis of many tasks such as object instance detection and image retrieval. We then introduce object detection based on the sliding-window technique, suitable for face and pedestrian detection, for instance. This is followed by a treatment of image processing techniques - segmentation and morphological processing - and of shape and contour recognition techniques. We overview the essential tracking methods - for regions and moving objects. We close with a survey of video surveillance, in-vehicle vision systems and remote sensing.
Contents
1 Introduction
1.1 Related Fields
1.2 Recognition - An Overview
1.3 Areas of Application (Examples)
1.4 Developing a Computer Vision System
1.4.1 Historical Note
1.5 From Development to Implementation
1.6 Reading
5 Classification with Deep Neural Networks
5.1 A Convolutional Neural Network (CNN)
5.2 Transfer Learning (PyTorch)
5.2.1 Fixed Feature Extraction
5.2.2 Fine Tuning
7 Feature Quantization
7.1 Building a Dictionary
7.1.1 Vector-Quantization using the k-Means Algorithm
7.2 Applying the Dictionary to an Image
7.3 Classification
8 Object Detection
8.1 Face Detection
8.1.1 Rectangles
8.2 Pedestrian Detection
8.3 Improvement by Knowing Context
11 Shape
11.1 Compact Description
11.1.1 Simple Measures
11.1.2 Radial Description (Centroidal Profiles)
11.2 Point-Wise
11.2.1 Boundaries
11.2.2 Sets of Points
11.3 Toward Parts: Distance Transform & Skeleton
11.3.1 Distance Transform
11.3.2 Symmetric Axes (Medial Axes), Skeleton
11.4 Classification
12 Contour
12.1 Straight Lines, Circles
12.2 Edge Following (Curve Tracing)
12.3 Description
12.3.1 Curvature
12.3.2 Amplitude
12.4 Other Contour Types
14 Tracking
14.1 Tracking by Detection
14.2 Tracking Translations by Matching
14.3 Optimizing and Increasing Precision
14.3.1 Kalman Filters
14.3.2 Particle Filters
16 3D Vision
16.1 Stereo Correspondence
16.2 Shape from X
16.2.1 Shape from Shading
16.2.2 Shape from Texture
18 Classifying Motions
18.1 Gesture Recognition
18.2 Body Motion Classification with Kinect
18.2.1 Features
18.2.2 Labeling Pixels
18.2.3 Computing Joint Positions
19 More Systems
19.1 Video Surveillance
19.2 Autonomous Vehicles
19.3 Remote Sensing
A Image Acquisition
B Convolution [Signal Processing]
B.1 In One Dimension
B.2 In Two Dimensions [Image Processing]
G Resources
J.12 Tracking
J.13 2D Transformations
J.14 RanSAC
J.15 Posture Estimation
1 Introduction
Computer Vision is the field of interpreting image content. It may be concerned with the classification of an entire image, as in a system classifying photos uploaded to the internet (Facebook, Instagram); with the recognition of objects in an image, such as detecting faces or car license plates (Facebook, GoogleStreetView); or with the detection of specific aspects of an image, such as cancer detection in biomedical images.
Origin Computer Vision was originally founded as a sub-discipline of the field of Artificial Intelligence in the 1970s. The founding goal was to create a system with the same perceptual capabilities as the human visual system - your eyes and most of your brain. The human visual system interprets any scene with little effort: it discriminates between thousands of categories, it finds objects in scenes within a time span of only several hundred milliseconds, and it switches between several types of recognition processes with a flexibility and swiftness whose complexity and dynamics are still not well understood. It quickly turned out that this goal was rather ambitious.
Instead, Computer Vision has focused on a set of specific recognition challenges, to be introduced in Section 1.2. Those challenges can often be implemented in different ways, with each implementation having advantages and disadvantages. Throughout the decades, many applications have been created (Section 1.3), and some of those implemented tasks now begin to outperform a human observer - such as face identification, letter recognition, or the ability to maneuver through traffic (autonomous vehicles). That in itself is astounding, even though the original goal of an omni-vision system has not been achieved yet. Today, Computer Vision is considered a field of its own.
Frontier Computer Vision is still considered a frontier, despite its evolution over almost 50 years. The success of modern Computer Vision is less the result of truly novel algorithms than of increasing computer speed and memory. Shape recognition in particular - despite sounding like a simple task - is still not properly understood. And even though Google has a system that can recognize thousands of classes, the system occasionally fails so bluntly that one may wonder what other algorithms need to be invented in order to achieve a flawless recognition process. If those algorithms are not invented, then household robots may always make some nerve-wracking errors, such as mistaking the laundry basket for the trash bin, confusing the microwave with the glass cabinet, etc. Thus, despite all the progress that has been made, the field still requires innovative algorithms.
In the past several years in particular, Computer Vision has received new impetus from the use of so-called Deep Learning algorithms, with which one can classify decently large image collections. Google and Facebook are competing to provide the best interface to those learning algorithms: Google offers TensorFlow, Facebook provides PyTorch, both of which are Python-based libraries to run classification algorithms. That is the reason why we treat that topic relatively early (Section 5), after a quick warm-up with the classical methods. We then proceed with a method that had been popular just before the arrival of Deep Learning algorithms, namely feature extraction and matching (Sections 6 and 7). Later on, we continue with traditional techniques (Section 9), and we also mention approaches to the most enigmatic challenge of Computer Vision, namely shape recognition (Section 11) and contour description (Section 12).
1.1 Related Fields
Image Processing is concerned with the transformation or other manipulation of the image with the goal of emphasizing certain image aspects, e.g. contrast enhancement, or extracting low-level features such as edges, blobs, etc.; in comparison, Computer Vision is rather concerned with higher-level feature extraction and the interpretation of those features for recognition purposes.
Machine Vision is concerned with applying a range of technologies and methods to provide imaging-based automatic inspection, process control and robot guidance in industrial applications. A machine-vision system typically has three characteristics:
1) objects are seen against a uniform background, which represents a 'controlled situation';
2) objects possess limited structural variability - sometimes only one object needs to be identified;
3) the exact orientation in 3D is of interest.
An example is car license plate detection and reading at toll gates, which is a relatively controlled situation. In comparison, computer vision systems often deal with objects of larger variability and objects that are situated in varying backgrounds. Car license plate detection in GoogleStreetView is an example of an object with limited variability but varying context.
There exist two other fields that overlap with Computer Vision:
Pattern Recognition (Machine Learning) is the art of classification (or categorization). Building a good computer vision system requires substantial knowledge of classification methodology. Sometimes it is even the more significant part of the computer-vision system, as in the case of image classification, for which so-called Deep Neural Networks have produced the best classification accuracy so far (Section 5). Clearly, we cannot treat classification in depth in this course; we will merely point out how to use some of the classifiers (see also Appendices D and E).
Computer Graphics is sometimes considered part of Computer Vision. The objective in Computer Graphics is to represent objects and scenes as compactly and efficiently as possible; however, there is no recognition of any kind involved.
Motion Estimation: is the reconstruction of the precise movement between successive frames or between frames some time apart. It answers how the object or scene has moved. It helps to reconstruct the precise pose of objects or an observer's viewpoint.
Motion Classification: is the challenge of recognizing entire, completed motions, as in gesture recognition or human body movements. It answers what type of movement the object has carried out.
1.3 Areas of Application (Examples)
The following list of areas merely gives an overview of where computer vision techniques have been applied so far; the list also contains applications of image processing and machine vision, as those fields are related:
Medical imaging: registering pre-operative and intra-operative imagery; performing long-term studies
of people’s brain morphology as they age; tumor detection, measurement of size and shape of internal
organs; chromosome analysis; blood cell count.
Automotive safety: traffic sign recognition, detecting unexpected obstacles such as pedestrians on the
street, under conditions where active vision techniques such as radar or lidar do not work well.
Surveillance: monitoring for intruders, analyzing highway traffic, monitoring pools for drowning victims.
Gesture recognition: identifying hand postures in sign language, identifying gestures for human-computer interaction or teleconferencing.
Fingerprint recognition and biometrics: automatic access authentication as well as forensic applica-
tions.
Retrieval: as in image search on Google for instance.
Visual authentication: automatically logging family members onto your home computer as they sit
down in front of the webcam.
Robotics: recognition and interpretation of objects in a scene, motion control and execution through
visual feedback.
Cartography: map making from photographs, synthesis of weather maps.
Radar imaging: target detection and identification, guidance of helicopters and aircraft in landing, guidance of remotely piloted vehicles (RPV), missiles and satellites from visual cues.
Remote sensing: multispectral image analysis, weather prediction, classification and monitoring of ur-
ban, agricultural, and marine environments from satellite images.
Machine inspection: defect and fault inspection of parts: rapid parts inspection for quality assurance
using stereo vision with specialized illumination to measure tolerances on aircraft wings or auto body
parts; or looking for defects in steel castings using X-ray vision; parts identification on assembly lines.
The following are specific tasks which can often be solved with image processing techniques and pattern recognition methods, which is why they are often treated only marginally in computer vision textbooks - if at all:
Optical character recognition (OCR): identifying characters in images of printed or handwritten text,
usually with a view to encoding the text in a format more amenable to editing or indexing (e.g. ASCII).
Examples: mail sorting (reading handwritten postal codes on letters), automatic number plate recog-
nition (ANPR), label reading, supermarket-product billing, bank-check processing.
2D Code reading: reading of 2D codes such as Data Matrix and QR codes.
For more applications one can visit Google's TensorFlow.js, a version of TensorFlow designed to run in the browser: https://js.tensorflow.org/
1.4 Developing a Computer Vision System
There exist two philosophies for approaching the implementation of a computer vision task, see the two pathways in Figure 1. The left path represents the traditional approach and is sometimes called feature engineering. The right path stands for the modern approach and is sometimes called feature learning - it can be considered the more successful avenue these days, as it has proven to outperform many 'engineered' systems in many tasks. We elaborate a bit.
Figure 1: Organization of a computer vision system: feature engineering (left) versus feature learning (right). Feature engineering is the more traditional approach (Sections 6, 7, 15.2); feature learning the modern approach (Section 5). Both have advantages and disadvantages.
The feature engineering approach typically starts with an image processing phase whose aim is to prepare the image for the subsequent processing phases, as mentioned under 'Image Processing' in Section 1.1. In a second phase, the feature extraction phase, the image is searched for features that are robust under changes in light, for instance contour corners, so-called blobs, etc. Those are then matched in a third phase, the feature matching phase, to some sort of representation in order to arrive at a classification decision. This approach is sometimes labeled a white-box system, as the individual techniques are identifiable, the processing steps are relatively explicit and there are few parameters. In case of failure of a (single) recognition run, it is possible to pinpoint the insufficient processing step.
The feature learning approach is pursued by developers of the neural network methodology, nowadays simply called deep learning. The image is processed by a series of convolution phases, by which features are gradually extracted and then integrated into a 'global' image map, from which the decision is made. The learning procedure involves finding the 'weights' for the innumerable convolutions in an automatic manner. In comparison to the feature engineering approach, the techniques here are less clearly separable, the processing steps are less explicit, and the system has innumerable parameters. In case of failure, it is difficult to pinpoint an exact location of weakness.
The two depictions are exaggerated formulations - created by the proponents of the respective philosophies. In the engineering approach, not everything is as fully transparent as the label 'white box' would suggest. And in the learning approach, not everything is learned completely automatically - sometimes there is substantial tuning involved. Thus it would be more appropriate to use the terms brighter and darker, or more and less transparent. And because the deep learning approach has drawn inspiration from some of the engineered architectures, one should not be surprised if combined approaches appear in the near future.
1.4.1 Historical Note
In the early years of computer vision, the paradigm for recognition was formulated as a process which gradually and meticulously reconstructs the spatial 3D layout of the scene, starting from the 2D image layout. This 3D reconstruction process was often divided into low-level, mid-level and high-level vision, a division partly reflected in the above list of stages. It was inspired by the fact that we humans perceive the world as a 3D space. Over the years, it has become clear that this paradigm is too elaborate and too complicated. Presently, the focus lies on solving recognition tasks with 'brute-force' approaches, the two pathways depicted in Fig. 1, for which classical techniques such as edge detection or image segmentation hardly play a role. Some of the classical techniques have therefore moved a bit into the background. This is also reflected in recent textbooks. For instance, Forsyth and Ponce's book follows the structure of the classical paradigm (low/mid/high-level vision), but the treatment of edge detection and image segmentation is rather marginal; Szeliski's book is organized around the feature-matching approach (left side in Fig. 1), but still contains substantial material on image segmentation, for instance. But no book contains the latest, breathtaking developments, namely the use of Deep Neural Networks for image classification. We therefore start with that topic relatively early (Section 5).
1.5 From Development to Implementation
Matlab: (http://www.mathworks.com/) Is extremely convenient for prototyping (research) because its 'formulation' is very compact and because it probably has the largest set of functions and commands. It offers an image processing toolbox that is very rich in functionality, but one can manage without the toolbox - we give plenty of code examples to do so. For a few years now, Matlab has also featured a computer vision toolbox that is continuously growing in scope. Use doc or help to read about the functions and commands it provides. It is useful to familiarize yourself with the image processing toolbox by starting with doc images.
Octave, R: (https://www.gnu.org/software/octave/, https://www.r-project.org/) For training purposes
one can certainly also use software packages such as R and Octave, in which most functions have
the same name as in Matlab.
Python: (https://www.python.org/) Is perhaps the most popular language by now. Coding in Python is slightly more elaborate than in Matlab, and it does not offer the flexibility in image display that Matlab has. Python's advantage is that it can be interfaced relatively easily to other programming languages, suitable for mobile app development for instance, whereas for Matlab this is very difficult. In Python, the initialization process is a bit more explicit and the handling of data-types (integer, float, etc.) is also a bit more elaborate, issues that make Python code a bit lengthier than Matlab code.
If one switches between any of those high-level languages, then the following summary is useful:
http://mathesaurus.sourceforge.net/matlab-python-xref.pdf.
In the following we mention programming languages that are considered rather lower-level and that require more care in initializing and maintaining variables. If you process videos, then you will probably need to implement your time-consuming routines in one of those languages.
Cython: (https://www.cython.org/) Cython code is essentially Python code, but it offers to specify certain variables and procedures in more detail with a notation similar to C (C++). That additional notation can speed up the code severalfold. Cython is included in the Anaconda distribution. (Not to be confused with CPython, which is the canonical Python implementation.)
C++, C: For implementation in C or in one of its variants (e.g. C++), we merely point out that there exist C/C++ libraries on the web with implemented computer vision routines. The most prominent one is called OpenCV, see wiki OpenCV or https://opencv.org/. Many of the routines offered by those libraries can also be accessed easily through Python by importing them.
For more information on training material and coding we refer to Appendix G.
1.6 Reading
Here I list in particular books introducing the concepts. There are many more books providing details on
implementations, in particular on OpenCV.
Sonka, M., Hlavac, V., and Boyle, R. (2008). Image Processing, Analysis, and Machine Vision. Thomson, Toronto, CA. Introductions to topics are broad, yet the method sections are concise. Contains many precisely formulated algorithms. Exhaustive on texture representation. Oriented a bit towards classical methods, thus not all newer methods can be found. Written by three authors, but reads as if authored by one person only.
Szeliski, R. (2011). Computer Vision: Algorithms and Applications. Springer. Meticulous and visually beautiful exposition of many topics, including graphics and image processing. Strong at explaining feature-based recognition and alignment, as well as complex image segmentation methods, with only the essential equations. Compact yet still understandable appendices explaining matrix manipulations and optimization methods.
Forsyth, D. and Ponce, J. (2010). Computer Vision - A Modern Approach. Pearson, 2nd edition. Exhaustive on topics about object, image and texture classification and retrieval, with many practical tips on dealing with classifiers. Equally exhaustive on tracking. Strong at explaining object detection and simpler image segmentation methods. Slightly more practice-oriented than Szeliski. The only book to explain image retrieval and image classification with feature methods.
Davies, E. R. (2012). Computer and Machine Vision. Elsevier Academic Press, Oxford. Rather machine-vision oriented (than computer-vision oriented). Contains extensive summaries explaining the advantages and disadvantages of each method. Summarizes the different interest-point detectors better than any other book. Treats video surveillance and automotive vision very thoroughly; the only book to cover automotive vision.
Prince, S. (2012). Computer Vision: Models, Learning, and Inference. Cambridge University Press. Also a beautiful exposition of some computer vision topics; very statistically oriented, starting like a pattern recognition book. Contains up-to-date reviews of some topics.
Wikipedia Always good for looking up definitions, formulations and different viewpoints. Even textbooks sometimes point to Wikipedia pages. But Wikipedia's 'variety' - originating from the contributions of different authors - is also its shortcoming: it is hard to comprehend a topic as a whole from the individual articles. Wikipedia is what it was designed to be after all: an encyclopedia. Hence, textbooks remain irreplaceable.
Furthermore, because different authors are at work on Wikipedia, it can happen that an intuitive and clear illustration by one author is replaced by a less intuitive one by another author. I therefore recommend copying/pasting a well-illustrated problem into a word editor (e.g. Word) in order to keep it.
2 Simple Image Manipulations (First Steps)
To get acquainted with some of the basics, we perform a few simple image manipulations in this section.
Firstly, we learn about the image format and some basic operations such as thresholding and data type
conversions (Section 2.1). Then we mention some of the image enhancement techniques that are occa-
sionally used for computer vision systems (Section 2.2). Finally, we introduce a simple detector for face part
localization to understand the complexity of the intensity landscape (Section 2.3).
In Matlab one can load an image with the command imread. To convert it into a gray-level image there exists the function rgb2gray. To display the image we use the function imagesc (image scale), and for that purpose we initialize a figure with the function figure. The function clf clears the figure. With the command subplot we can pack several images into the same figure. The following code shows how to use those commands; its output is shown in Figure 2.
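A minimal sketch of such a script (the file name, the zoom window and the histogram call are illustrative assumptions; the variable names Igry and BWflw are used again below):

Irgb = imread('flower.jpg');                      % load RGB image (file name is a placeholder)
Igry = rgb2gray(Irgb);                            % convert to a gray-level image
figure(1); clf;
subplot(2,2,1); imagesc(Irgb);                    title('Original');
subplot(2,2,2); imagesc(Igry); colormap(gray);    title('Gray-Scale');
Izom = Igry(500:1100, 400:1000);                  % zoom into center (rows first, then columns)
subplot(2,2,3); imagesc(Izom);                    title('Sub-Selection (Zoom)');
Igrn = Irgb(:,:,2);                               % green channel
subplot(2,2,4); hist(single(Igrn(:)), 50);        title('Histogram of Green Channel');
BWflw = Igry>100;                                 % threshold: black-white (logical) image
Iblr  = conv2(single(Igry), ones(25,25), 'same'); % blurred image: sum over a 25x25 neighborhood
figure(2); clf; mesh(double(Igry));               % view the 'intensity landscape' (Fig. 3)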
To select a part of the image - see the comment 'zoom into center' - we specify the row numbers first (vertical axis), followed by the column numbers (horizontal axis). That is, one specifies the indices as in matrices in mathematics.
Black-White Image We can threshold an image by applying a relational operator, see line BWflw =
Igry>100, in which case the image is automatically converted into a logical data-type, that is true or false,
namely one bit (value one and zero respectively). An image of that data-type is also called a black-white
image sometimes, hence the variable’s name BW. In the code example above we attempted to separate
the flower from its background, a foreground/background segregation as it is also called. We have chosen
the threshold somewhat arbitrarily and of course it would make sense to choose a threshold based on a
histogram, a histogram of intensity values for instance, as shown in the figure. We elaborate on that in Sec-
tion 9. After we have segmented, one often manipulates the black-white image by so-called morphological
operations, to be introduced in Section 10.
Blurred Image Sometimes it is useful to blur an image because a blurred image helps analyzing the
’coarse’ structures of an image, which otherwise are difficult to detect in the original image with all its details.
We can blur an image by averaging over a local neighborhood at each pixel in the image, in Matlab done
with the function conv2. In our example we take a 25x25 pixel neighborhood, generated with ones(25,25)
and merely sum up its 625 pixel values: this summation operation is done for each pixel in the image. We
will come back to that in Section 3.
Data-Type Conversion (Casting) The function imread returns a jpeg image as data-type uint8, which is
not very practical for certain computations. For many image-processing functions we need to cast (convert)
the image into a floating-number data-type. In Matlab that can be done with the functions single or double,
returning lower and higher precision respectively. In the above code, we did that casting for the function
conv2, but we could also write a separate line if desired, Irgb = single(Irgb).
Many functions are flexible and will produce an output of the same data-type; others expect a specific data-type as input; and some functions produce a specific data-type as output, such as the thresholding operation. It is best to always be aware of what data-type an image (or any variable) is, and what types the functions expect and produce.
In Python the code looks very similar, but we need to ’import’ those functions from modules. In particular
the module skimage holds a lot of functions for computer vision and image processing, see also Appendix
I:
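A minimal Python sketch under the same assumptions (the file name is again a placeholder):

import matplotlib.pyplot as plt
from skimage import io, color

Irgb = io.imread('flower.jpg')          # load the RGB image
Igry = color.rgb2gray(Irgb)             # gray-level image, float values in [0, 1.0]

fig, ax = plt.subplots(1, 2)
ax[0].imshow(Irgb)                      # original
ax[1].imshow(Igry, cmap='gray')         # gray-scale
plt.show()

BWflw = Igry > 100/255.0                # threshold (note the [0, 1] intensity range)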
Note that in Matlab the function rgb2gray returns a map of data-type uint8 (if the input was of type uint8), whereas in Python the function skimage.color.rgb2gray immediately down-scales the intensity values to the interval [0, 1.0] and assigns them to a floating-point data-type.
Topology To obtain an impression of what type of stimulus an image represents, we recommend observing the image with the function mesh, as given in the last line of the Matlab code block above. This illustrates better that the image array holds an 'intensity landscape' (Fig. 3).
Figure 2: Output of the introductory code: the original image, its gray-scale version, a sub-selection (zoom) and the histogram of the green channel.
Figure 3: An image as observed from a three-dimensional perspective (command mesh in Matlab). The image appears as a landscape whose elevation - the vertical axis - represents intensity, with values ranging typically from zero to 255 (for images coded with 8 bits). It is not easy for a human to understand the semantic content of a scene from this perspective, because the human visual system is trained to interpret frontal views of images. But this perspective illustrates better what the computer vision system receives as input.
For that reason, computer vision scientists often borrow terminology from the field of topology to describe the operation of their algorithms, such as the term 'watershed' in the case of a segmentation algorithm.
If one intends to denoise a color image, then one would apply the median filter to the individual chromatic
channels.
In Python the function medfilt2d can be found in module scipy.signal.
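In Matlab, a minimal sketch could look as follows (medfilt2 belongs to the Image Processing Toolbox; the 3x3 window is an arbitrary choice):

Iden = Irgb;                                      % denoised copy of the color image
for c = 1:3                                       % red, green, blue channel
    Iden(:,:,c) = medfilt2(Irgb(:,:,c), [3 3]);   % 2D median filter applied per channel
end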
Figure 4: Image enhancement with simple manipulations using the image histogram for redistribution. This primarily has visual appeal, but it is sometimes also carried out for learning classification tasks.
Those profiles are shown in black in Fig. 5. In such profiles it is - in principle - relatively easy to locate facial features by observing local maxima, minima, etc. The 'raw' profile sums are a bit 'noisy' however, like most raw signals: they contain too many 'erratic' extrema that make detection of the facial features difficult. We therefore smoothen the profiles a bit by low-pass filtering them. The smoothened versions are shown in magenta in the figure; now it is easier to locate facial features. Smoothening can be done by averaging over a local neighborhood at each point in the profile. We have done this above already to obtain a blurred image in two dimensions with the function conv2; here we do it in one dimension only. And we do so with a so-called Gaussian filter, a function whose shape is a 'bump': it is an elegant way to filter a signal, see Appendix C.2 for its exact shape. (For the blurring of the image above we had used a flat filter, which is a rather crude approach.)
The size of the filter matters: a small size does not smoothen very much, a large size flattens the signal too much. We therefore make the filter size dependent on the image size by choosing a fraction of it. The Gaussian values are generated with the function pdf:
nPf = round(w*0.05); % # points: fraction of image width
LowFlt = pdf(’norm’, -nPf:nPf, 0, round(nPf/2)); % generate a Gaussian
Pverf = conv(Pver, LowFlt, ’valid’); % filter vertical profile
Phorf = conv(Phor, LowFlt, ’valid’); % filter horizontal profile
If you are not familiar with the convolution process you should consult Appendix B - for the moment, Appendix B.1 on one-dimensional convolution is sufficient.
Figure 5: Intensity profiles for a face image. The vertical profile is generated by summing the pixel intensity values along the y-axis (vertical; column-wise); the horizontal profile is generated by summing along the x-axis (horizontal; row-wise). To which face parts do the extrema in the profiles correspond? Code in Appendix J.2.
Those concepts are part of the field of signal processing and we cannot give lengthy introductions to them here. But with Matlab you can conveniently explore them and obtain a quick understanding of those operations, which is why we give a lot of code examples in the Appendix.
For extrema detection we can use the function findpeaks, which returns the local maxima as well as their
location in the signal. In order to find the local minima we invert the distribution. The code in Appendix J.2
shows how to apply those functions.
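As a rough sketch of the whole chain, consistent with the caption of Fig. 5 (the actual implementation is in Appendix J.2; findpeaks belongs to the Signal Processing Toolbox):

[h, w] = size(Igry);
Pver = sum(single(Igry), 1);          % vertical profile: one sum per column (left to right)
Phor = sum(single(Igry), 2);          % horizontal profile: one sum per row (top to bottom)
% ... smoothing with the Gaussian filter as shown above (Pverf, Phorf) ...
[pksMax, locMax] = findpeaks(Pverf);  % local maxima of the smoothed profile
[pksMin, locMin] = findpeaks(-Pverf); % local minima: invert the profile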
This type of face part localization is somewhat simple, but its time complexity is low, which is why this approach is often used as a first phase in a more elaborate facial-feature tracking system. Many applied computer vision systems consist of cascades of sub-systems that progressively carry out a task with increasing precision.
3 Image Processing I: Scale Space, Pyramid and Gradient Image
For many computer vision tasks, it is useful to blur the image and to analyze those different blurs separately.
We have introduced the idea of a blurred image already in the previous section, but here we generate this
blur repeatedly with increasing filter sizes arriving at a so-called scale space as shown in Fig. 6. This space
will be introduced in the subsequent Section 3.1. Then we introduce the image pyramid, Section 3.2, which
is - coarsely speaking - a reduction of the scale space.
Figure 6: Scale space of an image (panels from bottom to top: Original, Scale 1, Scale 2, Scale 3). Observe the increasing blur from bottom to top.
Original: unfiltered image Io.
Scale 1: Io smoothened with a Gaussian filter of sigma equal one, σ = 1.
Scale 2: Io smoothened with σ = 2.
Scale 3: Io smoothened with σ = 3.
It is called a space because one can regard it as a three-dimensional space, with the third dimension corresponding to a fine-to-coarse axis with variable σ. At coarser scales (larger values of σ) it is easier to find contours and regions as a whole, but structures are sometimes smeared with other, different structures. Coarser scales are often down-sampled to obtain a more compact representation of the scale space, which so forms a pyramid, see Fig. 7. Code in Appendix J.3. (In this case, the filtering was performed for each color channel separately: once for the red, once for the green and once for the blue image.)
For many computations it is also useful to know how the intensity landscape is ’oriented’ at each pixel.
Specifically, we would like to know the ’surface slope’ at each image pixel for a small pixel neighborhood.
This is expressed with the gradient image, to be explained in Section 3.3.
3.1 Scale Space [wiki: Scale space; Sze p127, s3.5; SHB p106, s4.3]
An image is blurred by convolving it with a two-dimensional filter that averages across a small neighborhood. In our introductory example of the previous section we merely used a summation filter, but typically image blurring is done with a Gaussian filter, as we did for filtering the face profiles (Section 2.3). Here the Gaussian filter is a two-dimensional function and looks like the first four patches of Fig. 11. It is expressed as g(x, y, σ), where x and y are the image axes and where σ is the standard deviation regulating the amount of blur - also called the smoothing parameter. In the language of signal processing one expresses the blurring process as a convolution, indicated by the asterisk ∗. One says the image Io(x, y) is convolved with the filter g(x, y, σ),

   Ic(x, y) = Io(x, y) ∗ g(x, y, σ),

resulting in a coarser image Ic. If you are unfamiliar with the 2D-convolution process, then consult Appendix B.2 now to familiarize yourself with it.
If this blurring is done repeatedly, then the image becomes increasingly coarser and the corresponding intensity landscape becomes smoother, as illustrated in Fig. 6. Practically, there are different ways to arrive at the scale space. The most straightforward implementation is to low-pass filter the original image repeatedly with 2D Gaussians of increasing size, that is, first with σ = 1, then with σ = 2, etc. That approach is however a bit slow, and there exist fast implementations, for which we refer to Appendix B.2.
The resulting stack of images, {Io , I1 , I2 , . . . }, is called a scale space. In a typical illustration of the scale
space, the bottom image corresponds to the original image; subsequent, higher images correspond to the
coarser images. In other words, the images are aligned vertically from bottom to top and one can regard
that alignment as a fine-to-coarse axis. That axis is now labeled σ. The resulting space is then expressed
as
S(x, y, σ). (3)
The axis σ can be understood as a smoothing variable: a σ-value equal zero corresponds to the original
image, that is, there is no smoothing; a σ-value equal one corresponds to the first coarse image, etc. In
applications, sigma values typically range from one to five, σ = 1, 2, .., 5.
Application The scale space is used for the following feature detection processes in particular:
1. Region finding: we can easily determine brighter and darker regions by subtracting the scales from
each other, a technique we will return to in Section 4.1.
2. Verification of structures across the scale (axis): for instance, if a specific feature can be found at
different scales, then it is less likely to be accidental, a technique to be used in feature detection in
general (Section 6).
3. Finding more ’coherence’: for instance, contours appear more continuous at coarser scales (Sections
4.2 and 12).
In Matlab we can create a Gaussian filter with the function fspecial and convolve the image with the
command conv2:
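A minimal sketch (the filter sizes and sigma values are illustrative choices; the stack SS is used again in Section 4.1):

sig = 2;
G   = fspecial('gaussian', 6*sig+1, sig);         % 2D Gaussian filter patch
Ic  = conv2(single(Igry), G, 'same');             % blurred (coarser) image
SS(:,:,1) = single(Igry);                         % scale 0: the original image
for s = 1:5
    SS(:,:,s+1) = conv2(single(Igry), fspecial('gaussian', 6*s+1, s), 'same');  % scales 1..5
end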
The image processing toolbox also offers commands such as imgaussfilt and imfilter to generate blurred images; those are probably the preferred functions, because they are optimized for speed.
In Python we can use the submodule scipy.ndimage, which contains the function gaussian_filter to blur images:
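A minimal sketch (the sigma values are assumptions):

from scipy import ndimage
import numpy as np

Ic = ndimage.gaussian_filter(Igry, sigma=2)                                     # one coarser image
SS = np.stack([ndimage.gaussian_filter(Igry, s) for s in range(1, 6)], axis=2)  # small scale space, sigma = 1..5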
Generating such a scale space gives us more information, which we intend to exploit to improve our feature extraction. This gain in information comes however at the price of more memory usage, as we now have multiple images. And multiple images mean more computation when searching for features. But we can reduce this scale space without much loss of information by 'compressing' the coarse scales, which leads to the pyramid treated in the next section.
Figure 7: Multi-resolution pyramid of an image. The images of the scale space (Figure 6) are down-sampled such that they form a pyramid.
Original: P0 = original image Io.
Level 1: P1 = sub-sampled Ic1 (Ic1 = Io ∗ g(1)).
Level 2: P2 = sub-sampled Ic2 (Ic2 = Io ∗ g(2)).
Level 3: P3 = sub-sampled Ic3 (Ic3 = Io ∗ g(3)).
The down-sampling can be done by selecting every other row, e.g.
Idwn = Ic1(1:2:end,:);  % sub-sampling rows
followed by sub-sampling the columns:
Idwn = Idwn(:,1:2:end); % sub-sampling columns
In Matlab there also exists the function downsample, which however works along the rows only if the input is a matrix; we then need to transpose the matrix to down-sample it along the columns. Matlab also offers the function impyramid, which carries out both the Gaussian filtering and the down-sampling. But we can also operate in the other direction and up-sample:
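A sketch of two possibilities (impyramid belongs to the Image Processing Toolbox):

Iup = Idwn(ceil((1:2*size(Idwn,1))/2), ceil((1:2*size(Idwn,2))/2));  % repeat each row and column (nearest neighbour)
Iup = impyramid(Idwn, 'expand');   % alternative: up-sampling including Gaussian interpolation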
If down- or up-sampling steps other than halving and doubling are required, then the function imresize may be more convenient.
Application In many matching tasks, it is more efficient to search for a pattern starting at the top level of the pyramid, the smallest resolution, and then to work downward toward the bottom levels, a strategy also called coarse-to-fine search (or matching). More specifically, only after a potential detection has been made at the coarse level does one start to verify it by moving towards the finer levels, where the search is more time-consuming due to the higher resolution.
3.3 Gradient Image
The gradient image describes the steepness at each point of the intensity landscape, more specifically how the local 'surface' of the landscape is inclined. At each pixel, two measures are determined: the direction of the slope, also called the gradient; and the magnitude - the steepness - of the slope. Thus the gradient image consists of two maps of values, the direction and the magnitude. That information is most conveniently illustrated with arrows, namely as a vector field, see Figure 8: the direction of the arrow corresponds to the gradient; the length of the arrow corresponds to the magnitude.
To determine the gradient, one takes the derivative in both dimensions, that is, the difference between neighboring pixels along both axes, ∂I/∂x and ∂I/∂y respectively. This operation is typically expressed with the nabla sign ∇, the gradient operator:

   ∇I = ( ∂I/∂x , ∂I/∂y )^T ,   (4)
whereby the gradient information is expressed as a vector. We give examples: a point on a flat plane has no inclination, hence zero magnitude and an irrelevant direction - the gradient is zero; a point on a slope has a certain direction - an angle value in the range [0, 2π] - and a certain magnitude representing the steepness. The direction is computed with the two-argument arctangent function (atan2), which returns a value ∈ [−π, π] in most software implementations. The magnitude, ‖∇I‖, is computed using the Pythagorean formula.
Determining the gradient field directly on the original image typically does not return 'useful' results, because the original image is often too noisy (irregular). Hence the image is typically first low-pass filtered with a small Gaussian, e.g. σ = 1, before its gradient field is determined.
In Matlab there exists the routine imgradient, which returns the magnitude and direction immediately (Image Processing Toolbox); if the derivatives are also needed later in the code, then one can use imgradientxy to obtain two matrices corresponding to the individual derivatives. If we lack the toolbox, we can use the following piece of code, whereby the function gradient returns two matrices - of the size of the image - which represent the gradients along the two dimensions:
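A minimal sketch (assuming Ilp is the slightly low-pass filtered image, e.g. with σ = 1, as described above):

[Dx, Dy] = gradient(Ilp);            % derivative along x (columns) and along y (rows)
Mag      = sqrt(Dx.^2 + Dy.^2);      % gradient magnitude (Pythagorean formula)
Dir      = atan2(Dy, Dx);            % gradient direction, in [-pi, pi]
Dir(Dir<0) = Dir(Dir<0) + 2*pi;      % shift the angles into the positive range [0, 2*pi]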
Figure 8: Gradient image. The arrows point in the direction of the gradient - the local slope of the neighborhood; the length of the arrows corresponds to the magnitude of the gradient. The gradient direction points toward high values (white = 255; black = 0). (What does the picture depict? Hint: generate a coarse scale by squinting your eyes.)
In the example code, the angle values are shifted into the positive range [0, 2π]. One can plot the gradient
field using the function quiver:
% --- Plotting:
ISize = size(I);                      % image size (rows, columns)
X = 1:ISize(2); Y = 1:ISize(1);
figure(1); clf; colormap(gray);
imagesc(I, [0 255]); hold on;
quiver(X, Y, Dx, Dy);
In Python, the equivalent is numpy.gradient, which returns the derivative along the rows (y) first: Dy, Dx = numpy.gradient(I)
Application The gradient image is used for a variety of tasks such as edge detection, feature detection
and feature description (coming up).
4 Feature Extraction I: Regions, Edges & Texture
Now we make a first step toward reading out the structure in images. We will try to find edges in the intensity landscape, dots, regions, repeating elements called texture, etc. This feature extraction is useful for locating objects and finding their outlines, and - before the arrival of Deep Learning - such features were also used for classification.
Regions can be relatively easily obtained by subtracting the images of the scale space from each other.
This is an operation called band-pass filtering and will be introduced in Section 4.1. Edges can be obtained
by observing where a steep drop in the intensity landscape occurs, to be treated in Section 4.2. And for
texture detection, the techniques are essentially a mixture of region and edge detection to be highlighted in
Section 4.3.
In Matlab we can generate this 'band space' by merely applying the difference operator to the scale space SS; note however that the signs for brighter and darker regions are then switched, due to the implementation of diff:
%% ========= DOG ==========
DOG = diff(SS,1,3); % brighter is negative, darker is positive
% --- set borders to 0:
for i=1:size(DOG,3), p=i*3;
DOG(:,:,i) = padarray(DOG(p:end-p+1,p:end-p+1,i),[p p]-1);
end
%% ========= BRIGHTER/DARKER ========
[BGT DRK] = deal(DOG); clear DOG; % clear DOG to save memory
BGT(BGT>0) = 0; % all positive values to 0
DRK(DRK<0) = 0; % all negative values to 0
BGT = -BGT; % switch sign to make brighter positive
The band-pass space DOG contains one level less than the scale space. In this piece of code we also set the border values to zero for reasons of simplicity. The variables BGT and DRK then hold only the responses of brighter and darker regions, respectively (left and right columns in Figure 9). Of course one could also take threshold values other than zero, to select perhaps only very bright or very dark regions.
Python provides functions for blob detection in the submodule skimage.feature, for example the function blob_dog.
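A sketch of its use (the parameter values are assumptions; blob_dog detects bright blobs, so a dark-blob image would be inverted first):

from skimage.feature import blob_dog
blobs = blob_dog(Igry, min_sigma=2, max_sigma=10, threshold=0.1)
# each row of 'blobs' is (row, column, sigma); sigma indicates the scale at which the blob was found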
Figure 9: Example of a band-pass space, in this case a Difference-of-Gaussian (DOG) space: it is suitable for detecting regions (example code in Appendix J.4.1). The left column represents brighter regions (BGT in the code): negative values were set to zero. The right column represents darker regions (DRK in the code): positive values were set to zero and the absolute value of the negative values is displayed. The second row, labeled 0-1, is the subtraction of the input image and its low-pass filtered version. The third row, labeled 1-2, is the subtraction of the corresponding two adjacent levels in the scale space; etc.
Now that we have regions, we are interested in manipulating them. For example we wish to eliminate
small regions and would like to know the approximate shape of the remaining larger regions. We continue
with that in Section 10.
Generating and dealing with the entire band-pass space is time consuming. If we develop a very specific task, it might suffice to choose a few 'dedicated' scales. For instance, to select nuclei from histological images whose sizes are several thousand pixels per side, generating the entire space might be too expensive. Rather, one would design a band-pass filter whose size matches approximately the size of a typical nucleus. The original image would then be low-pass filtered with the two corresponding, suitable sigmas, the resulting two filtered images subtracted from each other, and a threshold applied to the difference image.
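A sketch of such a 'dedicated' band-pass filter (the sigma values and the threshold are illustrative assumptions, to be matched to the typical nucleus size):

sig1 = 4; sig2 = 8;
I1 = imgaussfilt(single(Igry), sig1);    % fine low-pass
I2 = imgaussfilt(single(Igry), sig2);    % coarse low-pass
D  = I2 - I1;                            % band-pass (difference) image; dark blobs turn positive
BW = D > 5;                              % threshold applied to the difference image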
a first processing step in more sophisticated edge detectors. But thresholding the gradient image does not
always find the precise location of the edge, as the edge could be very large and that would result in the
detection of neighboring pixels as well.
Figure 10: Left: input image. Right: binary map with on-pixels representing edges as detected by an edge-detection
algorithm. There are different algorithms that can do that, the most elegant one is the Canny algorithm which is based
on the gradient image as introduced in Section 3. An edge following algorithm traces the contours in such a binary map.
To find edges more precisely, the principal technique is to convolve the image with two-dimensional filters
that emphasize such edges in the intensity landscape. The filter masks of such oriented filters therefore
exhibit an ’orientation’ in their spatial alignment. Here are two primitive examples for a vertical and diagonal
orientation filter respectively.
   −1  0  1            0  1  1
   −1  0  1    and    −1  0  1        (6)
   −1  0  1           −1 −1  0
Thus, the image is convolved multiple times with different orientation masks to detect all edge orientations, resulting in a corresponding number of different output maps. Those maps are then combined into a single output map by a mere logical AND operation. More sophisticated techniques interpolate between those principal orientations.
Matlab provides the function edge to perform this process, offering several different techniques. Python offers pretty much the same set of techniques, but in different modules, skimage.feature and skimage.filters. Those functions return as output a binary map where on-pixels correspond to the locations of edge pixels. The code example in Appendix J.4.2 displays the output of different edge detection techniques.
- Roberts detector: this detector is rather primitive and is outdated by now, but it is still used in some industrial applications: its advantage is its low processing time, its downside is that it does not detect all edges.
- Prewitt and Sobel detector: those detectors find more edges, but at the price of more computation.
- Canny detector: this is the most elaborate detection technique:
Medg = edge(I, 'canny', [], 1);
whereby the empty brackets [] stand for no particular choice of threshold parameters, and thus the threshold is determined automatically; the value 1 is the scale value. Typically, scale values up to 5 are specified. Hence, with this technique one can easily perform low-pass filtering and edge detection using one function only.
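A Python counterpart, as a sketch (the sigma value is an assumption):

from skimage import feature
Medg = feature.canny(Igry, sigma=1)   # binary edge map, analogous to edge(I,'canny',[],1) above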
4.3 Texture
Texture is observed in the structural patterns of object surfaces such as wood, grain, sand, grass and cloth. But even scenes that consist of many objects can be regarded as texture.
The term texture generally refers to the repetition of basic texture elements called textons (or texels in older literature). Natural textures are generally random, whereas artificial textures are often deterministic or periodic. Texture may be coarse, fine, smooth, granulated, rippled, regular, irregular, or linear.
One can divide texture-analysis methods into four broad categories:
Statistical: these methods are based on describing the statistics of individual pixel intensity values. We mention these methods only marginally (Section 4.3.1), as they have been outperformed by other methods.
Structural: in structural analysis, primitives are identified first, such as circles or squares, which are then grouped into more 'global' symmetries (see SHB ch. 15 for a discussion). We omit that approach in favor of better-performing approaches.
Spectral: in those methods, the image is firstly filtered with a variety of filters, such as blob and orientation filters, followed by a statistical description of the filtered output (Section 4.3.2). Caution: these methods are sometimes also called 'structural'.
Local Binary Patterns: turns local statistical information into a code. It is perhaps the most successful texture description to date (Section 4.3.3). For some databases - in particular those containing textures - the description performs almost as well as Deep Nets. In comparison to Deep Nets, it has a smaller representation and a negligible learning duration.
4.3.1 Statistical [ThKo p412]
One can distinguish between first-order statistics and second-order statistics. First-order statistics merely take measures from the distribution of gray-scale values. Second-order statistics attempt to also express spatial relationships between pixel values and coordinates.
First-Order Statistics Let v be the random variable representing the gray levels in the region of interest. The first-order histogram P(v) is defined as

   P(v) = nv / ntot ,   (7)

with nv the number of pixels with gray-level v and ntot the total number of pixels in the region (imhist in Matlab for an entire image). Based on the histogram (equ. 7), quantities such as moments, entropy, etc. are defined.
are defined. Matlab: imhist, rangefilt, stdfilt, entropyfilt
Second-Order Statistics [ThKo p414; SHB p723, s15.1.2] The features resulting from the first-order statistics provide information related to the gray-level distribution of the image, but they do not give any information about the relative positions of the various gray levels within the image. Second-order methods (cor)relate these values. There exist different schemes to do that; the most used one is the gray-level cooccurrence matrix (graycomatrix in Matlab), see Dav p213, s8.3 and SHB p723, s15.1.2 for more.
The use of the cooccurrence matrix is memory intensive but useful for categories with low intensity variability and limited structural variability - or textural variability in this case. For 'larger' applications, the use of a spectral approach may yield better performance.
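In Matlab, a minimal sketch could look as follows (graycomatrix and graycoprops are Image Processing Toolbox functions; the offset is an arbitrary choice):

GLCM  = graycomatrix(Igry, 'Offset', [0 1]);                                    % cooccurrence of horizontally neighboring pixels
stats = graycoprops(GLCM, {'Contrast','Correlation','Energy','Homogeneity'});   % derived texture measures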
4.3.2 Spectral
In the spectral approach the texture is analyzed at different scales and described as if it represented a spectrum, hence the name. One possibility to perform such a systematic analysis is the use of wavelets (wiki Wavelet), which have found great use in image compression with the jpeg format. There are two reasons why wavelets are not optimal for texture representation. First, they are based on a so-called mother wavelet only, that is, they are based on a single filter function. Second, they are useful for compression, but less so for classification. To represent texture, it is therefore more meaningful to generate more complex filters.
In the following we introduce a bank of filters that has been successful for texture description, see Figure
11 (see Appendix J.4.3 for code). It is one specific bank of filters, namely the one by authors Leung and
Malik. But other spectral filter banks look very similar. All these filters are essentially a mixture of the filtering
processes we have introduced before. There are four principally different types of filters in that bank:
Filter no. 1-4: those filters are merely Gaussian functions for different sigmas as we used them to generate
the scale space (Section 3.1).
Filter no. 5-12: those eight filters are bandpass filters as we mentioned them in Section 4.1. In this case
the blob filter is much larger and is a Laplacian-of-Gaussian (LoG) filter.
Filter no. 13-30 (rows 3, 4 and 5): these are oriented filters that respond well to a step or edge in an image
- it corresponds to edge detection as introduced above in Section 4.2. In this case, the first derivative
of the Gaussian function is used.
Filter no. 31-48 (bottom three rows): those filters correspond to a bar filter. It is generated with the second
derivative of the Gaussian.
To apply this filter bank, an image is convolved with each filter separately, each one returning a correspond-
ing response map. In order to detect textons in this large output, one applies a quantization scheme as
discussed in Section 7. Algorithm 1 is a summary of the filtering procedure.
1. In a first step, each filter is applied to obtain a response map Ri (i = 1..n). Ri can have positive and
negative values.
2. In order to find all extrema, we rectify the maps, thereby obtaining 2n maps Rj (j = 1..2n).
3. We try to find the ’hot spots’ (maxima) in the image, that potentially correspond to a texton, by aggre-
gating large responses. We can do this for example by convolving Rj with a larger Gaussian; or by
applying a max operation to local neighborhoods.
4. Finally, we locate the maxima using an ’inhibition-of-return’ mechanism that starts by selecting the
global maximum followed by selecting (the remaining) local maxima, whereby we ensure that we
do not return to the previous maxima, by inhibiting (suppressing) the neighborhood of the detected
maximum.
Thus, for each image we find a number of ’hot spots’, each one described by a vector xl . With those we
then build a dictionary as will be described in Section 7.
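A minimal sketch of this procedure in Python (the list of filter kernels FB - e.g. the 48 Leung-Malik filters of Figure 11 -, the float image I, the number of hot spots nSpots and the smoothing/suppression sizes are assumptions):

import numpy as np
from scipy.ndimage import convolve, gaussian_filter

R = [convolve(I, K) for K in FB]                                     # step 1: one response map per filter
R = [np.maximum(r, 0) for r in R] + [np.maximum(-r, 0) for r in R]   # step 2: rectification -> 2n maps
S = [gaussian_filter(r, sigma=3) for r in R]                         # step 3: aggregate large responses spatially
A = np.max(np.stack(S), axis=0)                                      # strongest aggregated response per pixel
HOT = []
for _ in range(nSpots):                                              # step 4: inhibition-of-return selection
    y, x = np.unravel_index(np.argmax(A), A.shape)
    HOT.append((y, x))
    A[max(0, y-8):y+8, max(0, x-8):x+8] = -np.inf                    # suppress the neighborhood of the maximum
X = [np.array([r[y, x] for r in R]) for (y, x) in HOT]               # one descriptor vector xl per hot spot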
Figure 11: A bank of filters for texture analysis (the Leung-Malik filter bank): blob and orientation filters for different
scales (see Appendix J.4.3 for code). Filter patches are generated for a size of 49 x 49 pixels.
Filters 1-4: Gaussian function (low-pass filter) at four different scales (sigmas).
Filters 5-12: a blob filter - a LoG in this case - at eight different scales.
Filters 13-30 (rows 3, 4 and 5): an edge filter - first derivative of the Gaussian in this case - for six different orientations
and three different scales.
Filter 31-48 (bottom three rows): a bar filter - generated analogously to the edge filter.
4.3.3 Local Binary Patterns wiki Local binary patterns
This texture descriptor observes the neighbouring pixels of each pixel, converts them into a binary code, and that
code is then transformed into a decimal number for further manipulation. Let us take the following 3x3 neighborhood.
The center pixel - with value equal 5 here - is taken as a reference and compared to its eight neighboring
pixels:
    3 4 6        0 0 1
    1 5 4   →    0 - 0                                                  (8)
    7 6 2        1 1 0
Neighboring values that are larger than the center value 5 are set to 1, neighboring values that are smaller are set to 0. This
results in an 8-bit code, which in turn can be converted into a decimal code. This descriptor is constructed
for each pixel of some window, sometimes even for the entire image. The descriptor numbers are then
histogrammed: in case of the 8-bit code turned into a decimal code that results in a 256-dimensional
vector. That vector is then classified with traditional machine learning methods. There exist many variations
of this texture descriptor.
Matlab extractLBPFeatures
Python skimage.feature.local_binary_pattern
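For instance, a minimal sketch in Python (the image name and the parameters are assumptions):

import numpy as np
from skimage import io
from skimage.feature import local_binary_pattern

I = io.imread('texture.png', as_gray=True)
LBP = local_binary_pattern(I, P=8, R=1)              # 8 neighbors on a circle of radius 1
H, _ = np.histogram(LBP, bins=256, range=(0, 256))   # 256-dimensional descriptor
H = H / H.sum()                                      # normalized histogram, input to a classifier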
5 Classification with Deep Neural Networks
We now turn toward image classification, the process of assigning a scene or an object label to an entire
image. One would think that this can be done with the feature extraction methods introduced in the pre-
vious section, and indeed there are innumerable attempts to do so. That approach is also called feature
engineering. It has however lost its appeal - at least for the moment - with the arrival of the approach of
Deep Learning, also called feature learning. With Deep Learning we let a so-called Deep Neural Network
(DNN) find out the features that are necessary for getting the image classes discriminated. Not only are
DNNs (or Deep Nets) more convenient - as we do not need to write elaborate feature extraction algorithms
- but they also perform better, sometimes much better.
Deep Neural Networks are elaborations of traditional Artificial Neural Networks (ANNs), a methodology that
exists already since decades. For a primer on ANNs we refer to Appendix D.1. In this section we immedi-
ately start with Deep Neural Networks. Its most popular type is the so-called Convolutional Neural Network
(CNN), which we introduce in Section 5.1. For this, we will switch to the programming language Python, as
it has become the dominant language to explore neural network architectures. The big tech companies pro-
vide libraries to run such networks. Those libraries use similar terminology, but have slightly different ways
to set up, initialize and run networks. One term that is often used is tensor, which is essentially a multi-
dimensional array, a generalization of vectors and matrices. In the context of Deep Nets, a tensor is often
understood as an object representing batches of images, which makes the tensor three-dimensional or
four-dimensional: two dimensions for the spatial dimensions x and y, one dimension for the number of images
in the batch, and another dimension if the image contains color information (typically 3: red, green, blue).
A tensor can also be two- or even one-dimensional, in which case it is simply a matrix or a vector. Practically
we do not need to learn novel algebra with this term, it is merely a different label for certain data structures.
PyTorch: this package is provided by Facebook. It is perhaps the most trending package. In particular
transfer learning can be conveniently done with those libraries, coming up in Section 5.2.
Tensorflow: this package is provided by Google. It is considered somewhat intricate to understand and to
apply, and that is why a simpler API was developed for it, called Keras. We use Keras for introducing
MLPs (Appendix D.1) and for introducing a simple CNN (Section 5.1 next).
FastAI: a package that promises to deliver state-of-the-art classification accuracies. It uses in particular en-
sembles of Deep Nets, which provide better results than individual Deep Nets. It is based on PyTorch.
For other packages we refer to the wiki pages: wiki Comparison of deep learning software.
A fundamental downside of Deep Learning is that it takes very long to train a Deep Net. One solution to
the problem is to apply massive computing power, available however mostly to big tech companies and
universities. Another way to mitigate the downside is to exploit the so-called CUDA cores of the graphics
cards in a PC. Software packages (PyTorch, Tensorflow) provide routines to exploit the available CUDA
cores on a computer. A third way to counteract the problem is to use the trick of transfer learning. The latter
two tricks will be addressed in particular in Section 5.2.
5.1 A Convolutional Neural Network (CNN)
Roughly speaking, a Convolutional Neural Network (CNN) can be regarded as an elaboration of an MLP
(Section D.1). It is elaborated by two principal tricks in particular, see also Figure 12:
1. CNNs make actual use of the two-dimensionality of images, namely by applying quasi-convolutions
to find discriminative, local features, similar to the feature extraction process discussed in Section 4
(Section 4.3.2 in particular). The difference is that in a ’traditional’ convolution there is only a single
kernel mask (K in equation 42), but in a CNN there are many individually learned kernel masks Ki,
one for each feature map, each applied across all local neighborhoods. The corresponding layers are also called feature layers or feature
maps.
2. CNNs subsample the feature layers to so-called pooling layers akin to building a spatial pyramid dis-
cussed in Section 3.1.
Figure 12: The typical (simplified) architecture of a Convolutional Neural Network (CNN) used for classifying images
(diagram layers from input to output: feature layer(s) - sub-sampling - pooling layer(s) - dense layer(s) - output).
It is essentially an MLP elaborated by convolution and subsampling processes as introduced in the previous sections.
The architecture has in particular several alternating feature and pooling layers. The term ’hidden’ layer is not really
used anymore in a CNN, as there are so many different types of hidden layers that more specific names are needed.
The feature layer(s) are sometimes also called feature map(s): they are the result of a convolution of the image,
however not one with a single, fixed kernel, but one with many individual, learned kernels: each unit observes a
local neighborhood in the input image and learns the appropriate weights.
The pooling layer is merely a lower-resolution version of the feature layer and is obtained by the process of
sub-sampling. This sub-sampling helps to arrive at a more global ’percept’.
Between the last pooling layer and the output layer, there lies typically a dense layer: it is a flat (linear) layer,
all-to-all connected with its previous pooling layer and its subsequent output layer, again as a complete
bipartite graph. This is sufficient introduction for the moment and we now start looking at some code as
written in Keras. The code is merely an extension of the MLP code example of Appendix D.1, extended by
convolution and pooling layers:
# https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py
# Trains a simple convnet on the MNIST dataset.
# Achieves 99.25% test accuracy after 12 epochs
from __future__ import print_function
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from LoadMNIST import LoadMNIST
batchSz = 128
nEpoch = 12
# ... (loading of the MNIST arrays TREN/TEST, their labels and the image size Isz via LoadMNIST is omitted here) ...
# extending by one dimension
TREN = TREN.reshape(60000, Isz[0], Isz[1], 1) # [60000 28 28 1]
TEST = TEST.reshape(10000, Isz[0], Isz[1], 1) # [10000 28 28 1]
A typical CNN has many alternations between feature and pooling layers: for small tasks CNNs frequently
have 3 or more such alternations, resulting in an 8-layer network or larger networks; for large tasks, CNNs
can consist of several hundreds of layers.
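The model definition itself is not reproduced above. A minimal sketch of how the alternating convolution and pooling layers are stacked in Keras, following the referenced mnist_cnn example (the exact layer sizes and the label arrays LBtren/LBtest - assumed to be one-hot encoded - are assumptions):

MOD = Sequential()
MOD.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(Isz[0], Isz[1], 1)))
MOD.add(Conv2D(64, (3, 3), activation='relu'))    # feature layers with learned kernels
MOD.add(MaxPooling2D(pool_size=(2, 2)))           # pooling layer (sub-sampling)
MOD.add(Flatten())
MOD.add(Dense(128, activation='relu'))            # dense layer
MOD.add(Dense(10, activation='softmax'))          # output layer: 10 digit classes
MOD.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
MOD.fit(TREN, LBtren, batch_size=batchSz, epochs=nEpoch, validation_data=(TEST, LBtest))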
5.2 Transfer Learning (PyTorch)
In transfer learning we start from a network that has been pre-trained on a large dataset and adapt it to our
own (smaller) dataset. In PyTorch, a pre-trained model is loaded from the module torchvision:
from torchvision import models
MOD = models.resnet18(pretrained=True)
There exist other models, e.g. alexnet, densenet, inception, etc. (Appendix D.3). Often they come in a
range of versions of different depth. For instance ‘resnet34’ is a deeper variant of ‘resnet18’: it can be more
accurate, but it also takes longer to train.
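Before training, the final fully-connected layer .fc is typically replaced so that its number of outputs matches our own number of classes; for fixed feature extraction the pretrained weights are additionally frozen. A minimal sketch (nClasses is an assumption):

from torch import nn

for prm in MOD.parameters():
    prm.requires_grad = False                         # freeze the pretrained weights (skip this for fine tuning, Section 5.2.2)
MOD.fc = nn.Linear(MOD.fc.in_features, nClasses)      # new final layer; its parameters remain trainable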
Now we perform data augmentation, a process to artificially enrich the labeled data set, see also Ap-
pendix F. This is carried out with the module transforms. The various types of transforms can be lumped
together into a single object, called AUGNtrain in the following code snippet:
from torchvision import transforms
AUGNtrain = transforms.Compose([
transforms.RandomResizedCrop(szImgTarg),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(ImgMean, ImgStdv) ])
5.2.1 Fixed Feature Extraction
Now we initialize the training parameters. With the function SGD we select an optimization procedure called
stochastic gradient descent. Note that we specify the training only for the final layer by selecting .fc.
By calling lr_scheduler.StepLR we specify more learning parameters. Then we choose a loss function,
a function that calculates the classification error to be minimized.
# only parameters of final layer are being optimized
optim = SGD(MOD.fc.parameters(), lr=lernRate, momentum=momFact)
crit = nn.CrossEntropyLoss()
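The learning-rate scheduler mentioned above could look as follows (the step size and decay factor are assumptions):

from torch.optim import lr_scheduler
sched = lr_scheduler.StepLR(optim, step_size=7, gamma=0.1)   # decay the learning rate every 7 epochs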
5.2.2 Fine Tuning
In fine tuning we adjust the parameters of the entire network, not only those of the final layer:
# .fc is omitted - as opposed to above:
optim = SGD(MOD.parameters(), lr=lernRate, momentum=momFact)
Adjusting the parameters of the entire net will take more time, but we gain classification accuracy. Other
than that we can use the same code example as in Appendix J.6.3.
6 Feature Extraction II: Patches and Transformations Sze p181, ch4, p205
We turn toward a feature that is based on an image patch - it is the basis for the feature engineering
approach as depicted in Fig. 1, left half. A patch is typically a small, square-sized pixel array of the
image. Patches are taken in particular at points where there appears to be a corner or other ’interesting’
structure (Figure 13). That process is called detection. The detected patch is then transformed, a process
called extraction and its result is then a vector called descriptor. The most frequent transformation is the
generation of a histogram of the local gradients of the patch. Such transformations make the patch relatively
unique and distinct and thus suitable for matching between images; such patches are therefore also called keypoint
features or interest points, and if they capture an angle in particular, they are also called corners.
Figure 13: Examples of image patches and their corresponding description, in this case a mere intensity histogram.
Center column: four patches detected in the picture. In this case the patches were selected randomly to illustrate the
potential information they can offer.
Right column: histogram of intensity values for each patch. This extraction and description process is rather simple
- more complex transformations will be introduced that are based on the intensity gradients (as introduced in Section
3.3).
Patches are applied in feature-based correspondence techniques such as stereo matching, image stitching,
fully automated 3D modeling, object instance detection as well as video stabilization. A key advantage of
matching with sets of keypoints is that it permits finding correspondences even in the presence of
clutter (occlusion) and large scale and orientation changes. Patches used to be employed also for image
classification, but were meanwhile surpassed by Deep Neural Networks.
Summarizing, the process of finding and matching keypoints consists of three stages:
1. Feature detection: search for unique patch locations, that are likely to match well in other images.
2. Feature extraction (description): conversion of the patch into a more compact and stable (invariant)
descriptor that can be matched against other descriptors.
3. Feature matching: weighting of feature descriptors and matching with descriptors of other images.
In Matlab, functions carrying out these processes start with a corresponding keyword, e.g. detectFeatures,
extractFeatures and matchFeatures. In Python, those processes are found in skimage.feature, i.e.
corner_xxx and match_descriptors.
Sze p185
6.1 Detection
FoPo p179, pdf 149
One way to attempt to find corners is to find edges - as introduced in Section 4.2 -, and then walk along the Dav p158, s6.5 s6.7
edges looking for a corner. This approach can work poorly, because edge detectors often fail at corners. SHB p156, s5.3.10
Also, at very sharp corners or unfortunately oriented corners, gradient estimates are poor, because the Pnc p281, s13.2.2
smoothing region covers the corner. At a ’regular’ corner, we expect two important effects. First, there
should be large gradients. Second, in a small neighborhood, the gradient orientation should swing sharply.
We can identify corners by looking at variations in orientation within a window, which can be done by
autocorrelating the gradients:

    H = Σ_window (∇I)(∇I)^T                                                  (9)

      ≈ Σ_window [ Ix^2    Ix·Iy ]
                 [ Ix·Iy   Iy^2  ]                                           (10)
whereby Ix = Io ∗ ∂g/∂x and Iy = Io ∗ ∂g/∂y (g is a Gaussian). In a window of constant gray level (that is without
any strong gradient), both eigenvalues of this matrix are small because all the terms are small. In a window
containing an edge, we expect to see one large eigenvalue associated with gradients at the edge and one
small eigenvalue because few gradients run in other directions. But in a window containing a corner, both
eigenvalues should be large. The Harris corner detector looks for local maxima of
    C = det(H) − k (trace(H)/2)^2                                            (11)
where k is some constant, typically set between 0.04 and 0.06. The detector tests whether the product of
the eigenvalues (which is det(H)) is larger than the square of the average (which is (trace(H)/2)^2). Large,
locally maximal values of this test function imply the eigenvalues are both big, which is what we want. These
local maxima are then tested against a threshold. This detector is unaffected by translation and rotation.
%% ----------- Step 1
gx = repmat([-1 0 1],3,1); % derivative of Gaussian (approximation)
gy = gx’;
Ix = conv2(I, gx, ’same’);
Iy = conv2(I, gy, ’same’);
%% ----------- Step 2 & 3
sigma = 2; % scale of the Gaussian smoothing (value to be chosen)
Glrg = fspecial(’gaussian’, max(1,fix(6*sigma)), sigma); % Gaussian Filter
Ix2 = conv2(Ix.^2, Glrg, ’same’);
Iy2 = conv2(Iy.^2, Glrg, ’same’);
Ixy = conv2(Ix.*Iy, Glrg, ’same’);
%% ----------- Step 4
k = 0.04;
HRS = (Ix2.*Iy2 - Ixy.^2) - k*(Ix2 + Iy2).^2;
As noted above already, we have presented the working principles of the Harris corner detector. There are
many types of feature detectors but they are all based on some manipulation of the gradient image (see for
instance Dav p177, s6.7.6). See also website links in FoPo p190,s5.5 for code examples.
To now select corners in the ’corner image’ (HRS) (step 5), we select maxima and suppress their neigh-
borhood to avoid the selection of very near-by values. Here’s a very primitive selection mechanism:
%% ----------- Step 5
% Extract local maxima by performing a grey scale morphological
% dilation and then finding points in the corner strength image that
% match the dilated image and are also greater than the threshold.
radius = 2; thresh = 1e-4;            % suppression radius and threshold (values to be tuned)
MX  = imdilate(HRS, true(2*radius+1)); % grey-scale dilation = local maximum filter
CRN = (HRS == MX) & (HRS > thresh);   % strength matches dilated image and exceeds threshold
[rr, cc] = find(CRN);                 % coordinates of the selected corners
In Matlab, the entire detection is also available as a single function:
FHar = detectHarrisFeatures(I);
where FHar is a structure that contains the locations and the significance of the detected features. See
Appendix J.7.1 for an example. An example of the detection output is shown in Fig. 14, along with the
output of three other feature detectors. As one can see there are substantial differences in location as
well as in detection count. Both location and count depend on the default parameter settings, but the
differences between algorithms remain.
Feature Tracking (in Scale Space): Most features found at coarse levels of smoothing are associated
with large, high-contrast image events because for a feature to be marked at a coarse scale, a large pool of
pixels need to agree that it is there. Typically, finding coarse-scale phenomena misestimates both the size
and location of a feature. At fine scales, there are many features, some of which are associated with smaller,
low-contrast events. One strategy for improving a set of features obtained at a fine scale is to track features
across scales to a coarser scale and accept only the fine-scale features that have identifiable parents at a
coarser scale. This strategy, known as feature tracking, can in principle suppress features resulting from
textured regions (often referred to as noise) and features resulting from real noise.
6.2 Extraction (Description)
The most famous descriptor is the scale invariant feature transform (SIFT), which is formed as follows: Pnc p284, s13.3.2
a) take the gradient (0-360 deg) at each pixel (from ∇I) in a 16 × 16 window around the detected keypoint
(Section 3.3), using the appropriate level of the Gaussian pyramid at which the keypoint was detected.
Figure 14: Output of various feature detector algorithms. Code in Appendix J.7.1
b) the gradient magnitudes are downweighted by a Gaussian fall-off function (shown as a circle in Figure
15) in order to reduce the influence of gradients far from the center, as these are more affected by
small misregistrations.
c) in each 4 × 4 quadrant, a gradient orientation histogram is formed by (conceptually) adding the weighted
gradient value to one of 8 orientation histogram bins. To reduce the effects of location and dominant
orientation misestimation, each of the original 256 weighted gradient magnitudes is softly added to
2 × 2 × 2 histogram bins using trilinear interpolation. (Softly distributing values to adjacent histogram
bins is generally a good idea in any application where histograms are being computed).
d) form a 4 · 4 · 8 = 128 component vector v by concatenating the histograms: the resulting 128 non-negative
values form a raw version of the SIFT descriptor vector. To reduce the effects of contrast or gain (additive
variations are already removed by the gradient), the 128-D vector is normalized to unit length:
u = v/√(v · v).
e) to further make the descriptor robust to other photometric variations, values are clipped to t = 0.2: form
w whose i’th element wi is min(ui , t). The resulting vector is once again renormalized to unit length:
d = w/√(w · w).
The following code fragments give an idea of how to implement steps a-c:
EdGrad = linspace(0,2*pi,9); % 9 bin edges to create 8 orientation bins
[yo xo] = deal(pt(1),pt(2)); % coordinates of an interest point
Pdir = Gbv.Dir(yo-7:yo+8,xo-7:xo+8); % 16 x 16 array from the dir map
Figure 15: Forming SIFT features.
Left: Gaussian weighting, shown for only an 8x8 field in this illustration.
Right: formation of histograms demonstrated on 2x2 quadrants.
[Source: Szeliski 2011; Fig 4.18]
In Matlab one uses the function extractFeatures and feeds both the image and the detected points as
parameters to it:
Dhar = extractFeatures(I, FHar);
The output variable Dhar is a structure containing the selected features and their descriptions, namely
vectors.
6.3 Matching
To compare two descriptor lists (originating from two different images for instance), di and dj (i =
1, ..k, j = 1, ..l), we take the pairwise distances and form a k × l distance matrix Dij . Then we take the
minimum in relation to one descriptor list, e.g. mini Dij , and obtain the closest descriptor from the other
descriptor list. That would be a simple correspondence and may suffice if we compare a list of image
descriptors and a list of category descriptors, as we will do for image classification (section 7). If we intend
to establish correspondences for registration (section 15), we want to find the mutual matches.
% L1, L2: the two descriptor lists, [nD1 x nDim] and [nD2 x nDim]
[nD1, nDim] = size(L1); nD2 = size(L2,1);
% ----- Compact version:
DM = pdist2(L1,L2);
% ----- Explicit version: (building DM ourselves)
DM = zeros(nD1,nD2);
for i = 1 : nD1
iL1rep = repmat(L1(i,:), nD2, 1); % replicate individual vector of L1 to size of L2
Di = sqrt(sum((iL1rep - L2).^2,2)); % Euclidean distance
DM(i,:) = Di; % assign to distance matrix
end
% ----- Correspondence with respect to one list
[Mx2 Ix1] = min(DM,[],1); % [1 x nD2]
[Mx1 Ix2] = min(DM,[],2); % [nD1 x 1]
% ----- Correspondence mutual
IxP1to2 = [(1:nD1)’ Ix2]; % [nD1 x 2] pairs with indices of 1st list and minima of 2nd list
IxP2to1 = [(1:nD2)’ Ix1’]; % [nD2 x 2] pairs with indices of 2nd list and minima of 1st list
bMut1 = ismember(IxP1to2, IxP2to1(:,[2 1]), ’rows’); % binary array of mutual matches in list 1
IxMut1 = find(bMut1); % mutually corresponding pairs with indexing to list 1
IxPMut1 = IxP1to2(IxMut1,:); % [nMut1 x 2] mutual pairs of list 1
One may also want to use a for-loop for the minimum operation, that is to take the minimum row-wise (to
avoid costly memory allocation of the full distance matrix).
For large databases with thousands of vectors, these ”explicit but precise” distance measurements are too
slow anyway. Instead, faster but slightly inaccurate methods are used, as for instance hashing functions or
kd-trees. In Matlab that is implemented with function matchFeatures.
6.4 Summarizing
Now that we have seen feature detection, feature extraction and feature matching, we can find correspond-
ing matches between two (similar) images. Appendix J.7.2 gives a full code example. This type of estab-
lishing correspondence is the starting point for many tasks and was the leading approach to recognition
before Deep Neural Networks took over the scene, see again Fig. 1.
C implementations with Matlab interface can be found here for instance:
http://www.vlfeat.org/
http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/
7 Feature Quantization FoPo p203, s6.2, pdf 164
We now move towards representations for objects and scenes using the patches as obtained in the previous
Section 6. For some time, that was the most successful method for scene classification, but it has
lost its appeal since Deep Neural Networks took over. Nevertheless, it is useful to understand the method,
because it has its advantages too: it does not need as many training samples as a DNN and it learns much,
much quicker.
A generic way to exploit such patches is to collect a large number of patches for a category and to find
clusters within them, that are representative for that category. To illustrate the idea, we look at an example
from image compression, specifically color encoding:
Example Quantization An image is stored with 24 bits/pixel and can have up to 16 million colors. Assume
we have a color screen with 8 bits/pixel that can display only 256 colors. We want to find the best 256 colors
among all 16 million colors such that the image using only the 256 colors in the palette looks as close as
possible to the original image. This is color quantization where we map from high to lower resolution. In
the general case, the aim is to map from a continuous space to a discrete space; this process is called
vector quantization. Of course we can always quantize uniformly, but this wastes the colormap by assigning
entries to colors not existing in the image, or would not assign extra entries to colors frequently used in
the image. For example, if the image is a seascape, we expect to see many shades of blue and maybe
no red. So the distribution of the colormap entries should reflect the original density as close as possible
placing many entries in high-density regions, discarding regions where there is no data. Color quantization
is typically done with the k-Means clustering technique.
The Principle Applied to Features In our case, we aim to find clusters amongst our features that rep-
resent typical ’parts’ of objects, or typical ’objects’ of scenes. In the domain of image classification and
object recognition, these clusters are sometimes also called (visual) ’words’, as their presence or absence
in an image corresponds to the presence or absence of words in a document; in texture recognition they
are also called ’textons’. The list of words represents a ’pooled’ representation or a ’dictionary’ (aka ’bag
of words’), with which we attempt to recognize objects and scenes. Thus, in order to apply this principle,
there are two phases: one is building a dictionary, and the other is applying it; which would correspond to
training and testing in machine learning terminology. Figure 16 summarizes the approach. We will merely
point out how to use the machine learning techniques and omit lengthy explanations in order to progress
with the concepts in computer vision.
7.1 Building a Dictionary
Algorithm 3 Building a dictionary for a category with xi ∈ DL . Compare with upper half of Figure 16.
1) Collect many training patches xi (d) (i = 1..np ; d = 1..nd )
2) Optional: apply the PCA: xi (d) → xri (d) (d = 1..nr )
3) Find k (nc ) cluster centers cj (d) (j = 1..nc ; d = 1..nr )
There exist different options for how k-Means calculates the clusters during its evolvement, see the options of kmeans.
7.2 Applying the Dictionary to an Image
Algorithm 4 Applying the dictionary (for one category), xi ∈ DT . Compare with lower half of Figure 16.
1) For each relevant pixel m in the image: compute the vector representation vm around that pixel
2) Obtain j, the index of the cluster center cj closest to that feature
3) Add a value of one to the corresponding bin of the histogram H(j).
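A minimal Python sketch of both phases with scikit-learn (the descriptor matrices and the number of clusters are assumptions):

import numpy as np
from sklearn.cluster import KMeans

# XTRAIN: [nPatches x nDim] descriptors of the dictionary-building images (assumed)
km = KMeans(n_clusters=200, n_init=10).fit(XTRAIN)    # algorithm 3: find the cluster centers ('words')

# XIMG: [nPatchesImg x nDim] descriptors of one image to be encoded (assumed)
idx = km.predict(XIMG)                                # algorithm 4, step 2: index of the closest center
H = np.bincount(idx, minlength=200).astype(float)     # step 3: histogram over the dictionary
H /= H.sum()                                          # normalized 'bag of words' vector for the classifier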
7.3 Classification
A classifier is trained (or learned) with a so-called training dataset. To estimate its prediction performance
it is applied to a so-called testing (or sampling) set: the classifier is
learned on the training set and then its performance is verified on the testing set. See also Appendix E for
implementation details. When working with a dictionary, we need to partition the training set as well: one
partition is used for building the dictionary, the other partition is used for generating histograms as training
’material’ for the classifier.
Example: We have 30 images per category. For each category we use 25 instances for training and 5 for
testing. Of the 25 training instances, we use 5 for building the dictionary (algorithm 3), the other 20 are
used for generating histograms (algorithm 4). The actual classifier is then trained with those 20 histogram
vectors and tested on the 5 testing histograms, which were also generated with algorithm 4. We do this for
3-5 folds (see appendix).
As you may have noticed, there are many parameters that influence performance. The optimization of
such a system is at least as challenging as developing the system itself - hence a willingness
to deal with a lot of code is of benefit. For the moment, we attempt to get the classification system going
with a moderate performance and leave fine tuning to experts in classification. We mention here only that
applying the principal component analysis, as well as tuning the feature thresholds, may result in the largest
performance improvements.
8 Object Detection FoPo p549, ch 17
Object detection is the localization of an object category in a scene. For example, we would like to determine
how many faces there are in a group photo; or how many pedestrians there are in a street scene. The
principal technique is to train a classifier that discriminates between the desired object and any other object,
a Deep Neural Network for instance. Then we move that classifier across the image to find potential
matches. This search is done on a subset of pixel locations, because it would be too costly to apply the
classifier at each pixel: one uses ’windows’ that typically overlap and are taken from a grid - it is a coarse
scanning of the image. This is sometimes also called the sliding window technique.
For the training phase, we collect two datasets of image windows, each window of the same size n × m.
One set contains windows of the object of relatively fixed size and reasonably centered in the image. The
other set contains ’distractor’ images that constitute the non-object information. We then train a classifier
to discriminate between these two sets of windows (classes). In the testing (or application) phase, we pass
n × m windows of a new image to the classifier; the window is moved by a step size of a few pixels to speed
up the search (e.g. ∆x and ∆y=3 pixels). There are three challenges with this technique:
1) Size invariance: the detection system should be invariant to object size. This can be achieved by a
search over scale, meaning by using the pyramid (Section 3.1): to find large objects, we search on
coarser scales (layers), to find small objects we search on finer scales. Put differently, we apply the
n × m window in each layer of the pyramid.
2) Avoiding multiple counts: the very same object instance in an image should not be counted multiple
times, which may happen due to the sliding search: the smaller the step sizes, the higher the chance
for repeated detection. To avoid multiple counts, the neighboring windows are suppressed, when a
local maximum was detected, also called nonmaximum suppression.
3) Accelerating spatial search: searching for a match in the highest image resolution is time consuming
and it is more efficient to search for a match in the top pyramidal layers first and then to verify on lower
layers (finer scales), that means by working top-down through the pyramid, e.g. first P3 , then P2 , etc.
This strategy is also known as coarse-to-fine matching.
The technique in summary: train a window classifier on object and distractor windows; slide the n × m window
across each layer of the image pyramid; suppress neighboring responses around detected local maxima; and work
top-down through the pyramid from coarse to fine.
There are obviously tradeoffs between search parameters (e.g. step sizes) and system performance (e.g.
detection and localization accuracy). For example, if we work with training windows that tightly surround the
object, then we might be able to improve object/distractor discrimination, but we will have to use smaller step
sizes for an actual performance improvement. Vice versa, if we use windows that surround the object only
loosely, then we can use larger step sizes but our discrimination and localization performance suffers.
In software packages, there are some functions to facilitate the search processes, but for detection of
complex objects one uses specialized classifiers as introduced in subsequent sections.
In Matlab this can be found in particular under ’Neighborhood and Block Operations’ in the image process-
ing toolbox. In particular function blockproc and nlfilter are useful here.
In Python one would have to look into the module numpy.lib.stride_tricks, but that does not appear to be
a much pursued direction.
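Independent of such helper functions, a minimal Python sketch of the window scan itself (window size, step size and the window classifier clf with a decision_function are assumptions; the scale search over the pyramid is omitted):

import numpy as np

def slide(I, clf, n=128, m=64, step=3):
    hits = []
    for y in range(0, I.shape[0] - n, step):
        for x in range(0, I.shape[1] - m, step):
            WIN = I[y:y+n, x:x+m].ravel()[None, :]    # one n x m window as a feature vector
            if clf.decision_function(WIN)[0] > 0:
                hits.append((y, x))                   # candidate detection, before nonmaximum suppression
    return hits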
Even though Deep Neural Networks are perhaps the best performing system nowadays, there are some
traditional systems that excel at speed in particular, such as the Viola-Jones algorithm for face detection,
introduced below. It is therefore worth mentioning them.
8.1 Face Detection
Face detection is ubiquitous nowadays: it is run in most of today’s digital cameras to enhance auto-focus; on
social-media sites to tag persons; in Google street view to blur persons, etc. There exist many algorithms
each one with advantages and disadvantages. A good face detection system combines different algorithms,
but most of them will run the Viola-Jones algorithm, to be introduced next. But first we mention general tricks
used for training a face detection system.
A typical face detection system uses the following tricks to improve performance. For the first two tricks see
also Appendix F.
a) Hard Negative Mining: non-face images are collected from aerial images or vegetation for instance
(Figure 17b).
b) Data Augmentation: the set of collected face images is expanded artificially by mirroring, rotating, scal-
ing, and translating the images by small amounts to make the face detectors less sensitive to such
effects (Figure 17a).
c) Image Enhancement: after an initial set of training images has been collected, some optional pre-
processing can be performed, such as subtracting an average gradient (linear function) from the
image to compensate for global shading effects and using histogram equalization to compensate for
varying camera contrast (Fig. 17c), see again Section 2.2.
Figure 17: Training a face detector (Rowley, Baluja, and Kanade 1998a):
a) Data augmentation: artificially mirroring, rotating, scaling, and translating training images to generate a training set
with larger variability.
b) Hard negative mining: using images without faces (looking up at a tree) to generate non-face examples.
c) Image enhancement: pre-processing the patches by subtracting a best fit linear function (constant gradient) and
histogram equalizing.
[Source: Szeliski 2011; Fig 14.3]
Viola-Jones Algorithm The most frequently used face detection algorithm is probably the one by Viola
and Jones. It uses features consisting of two to four rectangular patches of different polarity, see upper
row of figure 18. The sum of the pixel values inside the white rectangles is subtracted from the sum inside
the black rectangles. Computing the values for one rectangle can be done extremely efficiently with the integral image
(Section 8.1.1). To find out which combinations of rectangles (orientations and size) are representative for
a category, it is necessary to try out all combinations, which is a very time-intensive procedure - despite the
rapid computation of rectangle intensities. This feature selection can be done with a ’boosting’ classifier.
The two most significant features are shown in Figure 18; there exist also a number of other less significant
features. Testing an image occurs very rapidly by searching for the most significant features first; if they are
present, the search continues; if they are not present, the search is stopped.
The primary advantage of this detection system is that it is extremely fast and runs in real time. The
downside of the system is that it detects only vertically oriented faces (the majority of faces is vertically
oriented anyway), and the long learning duration as just mentioned.
In Python the algorithm can be applied through OpenCV. It comes in several variants and one specifies the
desired variant with the function CascadeClassifier, here the default variant haarcascade_frontalface_default.xml.
It is run with the method detectMultiScale:
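For instance (a minimal sketch; the image name is an assumption, and cv2.data.haarcascades points to the cascade files shipped with the opencv-python package):

import cv2

det = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
GRY = cv2.imread('group.jpg', cv2.IMREAD_GRAYSCALE)
face = det.detectMultiScale(GRY, scaleFactor=1.1, minNeighbors=5)   # list of [x y w h] boxes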
Other Applications Face detectors are built into video conferencing systems to center on the speaker.
They are also used in consumer-level photo organization packages, such as iPhoto, Picasa, and Windows
Live Photo Gallery.
Pnc p275
8.1.1 Rectangles
Rectangular regions can be detected rapidly by use of the integral image, aka summed area table. It is Dav p175
computed as the running sum of all the pixel values from the origin: SHB p101, alg 4.2
    Is(i, j) = Σ_{k=0..i} Σ_{l=0..j} Io(k, l).                               (12)
To now find the summed area (integral) inside a rectangle [i0 , i1 ] × [j0 , j1 ], we simply combine four samples
from the summed area table:

    S = Is(i1, j1) − Is(i0−1, j1) − Is(i1, j0−1) + Is(i0−1, j0−1)

Matlab: the integral image is computed with the function integralImage, or simply with cumsum(cumsum(Io,1),2).
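In Python, a minimal sketch with scikit-image (the rectangle coordinates i0, j0, i1, j1 are assumptions):

from skimage.transform import integral_image, integrate

Is = integral_image(Io)                    # summed area table of the image Io
S = integrate(Is, (i0, j0), (i1, j1))      # sum of Io inside the rectangle [i0,i1] x [j0,j1]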
8.2 Pedestrian Detection
According to Dalal and Triggs, one can typify the structure of pedestrians into ’standing’ and ’walking’:
- standing pedestrians look like lollipops (wider upper body and narrower legs).
- walking pedestrians have a quite characteristic scissors appearance.
Dalal and Triggs used histograms of gradients (HOG) descriptors, taken from a regular grid of overlapping
windows (Fig. 19). Windows accumulate magnitude-weighted votes for gradients at particular orientations,
just as in the SIFT descriptors (see previous section). Unlike SIFT, however, which is only evaluated at
interest point locations, HOGs are taken from a regular grid and their descriptor magnitudes are normalized
using an even coarser grid; they are only computed at a single scale and a fixed orientation. In order to
capture the subtle variations in orientation around a person’s outline, a large number of orientation bins is
used and no smoothing is performed in the central difference gradient computation.
Figure 19 left shows a sample input image, while Figure 19 center left shows the associated HOG descrip-
tors. Once the descriptors have been computed, a support vector machine (SVM) is trained on the resulting
high-dimensional continuous descriptor vectors. Figures 19 center right and right show the corresponding
weighted HOG responses. As you can see, there are a fair number of positive responses around the head,
torso, and feet of the person, and relatively few negative responses (mainly around the middle and the neck
of the sweater).
Matlab extractHOGFeatures
Python skimage.feature.hog
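A minimal sketch of the training stage in Python (the window arrays WINS and labels LBL are assumptions; the HOG parameter values are common defaults rather than the original settings):

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

X = np.array([hog(W, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for W in WINS])
clf = LinearSVC().fit(X, LBL)    # linear SVM on the HOG descriptors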
Applications Needless to say, pedestrian detectors can be used in automotive safety applications.
Matlab even has a code example for pedestrian detection.
8.3 Improvement by Knowing Context
The sliding window technique is obviously a bit simple. The technique works with objects that exhibit
limited variability in appearance (gradients) and structure, meaning objects that do not deform too much. Some
improvement could be made if we knew more about the scene. Let’s take pedestrian detection as an
example. Pedestrians (like most objects) appear in a typical context: pedestrians are all about the
same absolute size, have their feet on or close to the ground, and are usually seen outdoors, where the
ground is a plane. Thus, if we knew the horizon of the ground plane and the height of the camera above
that ground plane, we could exclude many windows immediately. For instance, windows whose base is
above the horizon would be suspect because they would imply pedestrians in the air; windows whose base
is closer to the horizon should be smaller (otherwise, we would be dealing with gigantic pedestrians). The
height of the camera above the ground plane matters because in this problem there is an absolute scale,
given by the average height of a pedestrian. Assume the horizon is in the center of the image. Then, for
cameras that are higher above the ground plane, legitimate pedestrian windows get smaller more quickly as
their base approaches the horizon. There are two strong sources of information about the horizon and the
camera height. First, the textures of the ground, buildings, and sky are all different, and these can be used
to make a rough decomposition of the image that suggests the horizon. Second, observing some reliable
detection responses should give us clues to where the horizon lies, and how high the focal point is above
the ground plane. Hoiem et al. (2008) show that these global geometric cues can be used to improve the
behavior of pedestrian and car detectors (see also Hoiem et al. (2006)).
9 Segmentation (Image Processing II) SHB p176, ch6
Image segmentation is the task of delineating objects or meaningful regions. For a scene, image segmenta- Sze p235, ch5,pdf267
tion aims at partitioning the scene into its constituent objects and regions, often with the aim at performing
a foreground/background segregation. For an object - in front of some background -, image segmenta-
tion aims at finding the exact silhouette and possibly the object’s parts. For other applications, the exact
goal may differ as well. Functionally speaking, segmentation is the search for groups of pixels of a certain
’coherence’.
Historical note. Image segmentation is one of the oldest and most widely studied problems in computer vision.
It was once thought to be an essential, early step in a systematic reconstruction of the semantic image content
(see introduction again). But due to the difficulties of obtaining consistent segmentation results across different
image types (scenes, objects, texture,...), which would correspond to human interpretation -, it has lost its
significance for recognition. Nowadays much recognition is performed without any segmentation - see Deep
Neural Networks. But segmentation algorithms are heavily used for example in medical image analysis (e.g. in
X-rays) and in consumer applications, where users initiate segmentation by pointing out which regions are to be
segmented.
If our task does not require precise segmentation, but merely a coarse localization of objects, then segmen-
tation by thresholding can be sufficient. We have given an example in Section 2 already, in the upcoming
Section 9.1 we elaborate on that. If our task requires reasonably precise outlines, then region growing
as introduced in Section 9.2 could be more suitable. If we desire a precise segregation between a large
foreground object and its background, then a statistical classifier is probably the best choice (Section 9.3).
9.1 Thresholding: Global, Local, Band and Multi-Level SHB p177, s6.1
Choosing an appropriate threshold is often done by looking at the intensity distribution of the image: if the
objects and background are of distinct intensity or colors, then the distribution is bimodal and we choose the
minimum between the two modes as the threshold, see for instance the histogram in the upper right of Fig.
20 (see also SHB p24, s.2.3.2). There exists a variety of methods on how to calculate the optimal threshold. One of
the early methods was developed by Otsu, see graph center left in Fig. 20.
Global, Local, Band Thresholding If one applies a single threshold to the entire image, then that is also
called global thresholding. It can however be more suitable sometimes, to apply a threshold that depends
on its neighborhood, that is an image window around the pixel to be thresholded; that would be called a
local threshold - because neighborhoods are local. We can also specify a range for which pixel values
remain unmodified, but the values outside that range would be set to zero: that would be called a band
threshold.
Multi-Level Thresholding For more complex scenes, thresholding is of limited use, but can be exploited
for locating objects (or regions), that possess a relatively distinct gray-level. For instance, thresholding has
been applied for road segmentation and for vehicle location in in-vehicle vision systems. The intensity distri-
bution for such images is often multi-modal and methods to identify the correct threshold or range of values
are sometimes called multi-level thresholding. One possible step toward that goal would be the smoothing
of the distribution and extrema detection as we did for face part detection (Section 2.3).
In Matlab one can look at the intensity histogram with the function imhist; the function graythresh finds an
optimal threshold between two peaks; the function im2bw thresholds the image using the level as specified
by graythresh. The function multithresh is based on the method used in function graythresh.
Caution graythresh/im2bw. The function graythresh always generates the level as a scalar between 0 (black)
and 1 (white) irrespective of the image’s class type (’uint8’, ’single’, etc.). Correspondingly, the function im2bw
expects a level specified between 0 and 1 and the original image class type. If you have converted the original
image to class ’single’ already - as recommended in previous exercises - then this may produce wrong results.
Here it is better to work with the original image class type.
(Figure 20: thresholding and segmentation of an example image; the panels include the intensity histogram (upper
right), the Otsu-thresholded image, and K-means and watershed segmentations.)
In Python those functions can be found in particular in sub-module skimage.filters, e.g. threshold_otsu.
Python offers more methods to determine thresholds than Matlab does.
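For example (the image name is an assumption):

from skimage import io, filters

I = io.imread('coins.png', as_gray=True)
t = filters.threshold_otsu(I)    # optimal global threshold
BW = I > t                       # binary (black-white) map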
Application Thresholding is particularly appealing for tasks that require fast segmentation, as in video
processing for example.
9.2 Region Growing
In region growing one starts with several selected pixels and then keeps expanding from those pixels
until some stopping condition is fulfilled. The most popular algorithm is the watershed algorithm and, as
its name implies, one floods the intensity landscape until the rising water level meets the watersheds. To
imagine that, we observe the image as an intensity landscape, see Fig. 3 again. The algorithm determines
the landscape’s (local) minima and from those one grows outward until the flood front encounters another
growing flood front. The points of encounter form lines that correspond to watersheds. The resulting regions
can be regarded as catchment basins where rain would flow into the same lake.
Watershed segmentation is usually applied to a smoothed version of the gradient magnitude image
(||∇I||, Section 3.3), thus finding smooth regions separated by visible (higher gradient) boundaries.
Watershed segmentation often leads to over-segmentation (see lower right in Figure 20), that is a seg-
mentation into too many regions. Watershed segmentation is therefore often used as part of an interactive
system, where the user first marks seed locations (with a click or a short stroke) that correspond to the
centers of different desired components.
Matlab watershed
Python skimage.segmentation.watershed
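A minimal sketch (the image name, the smoothing scale and the crude seed threshold are assumptions):

from scipy import ndimage as ndi
from skimage import io, filters, segmentation

I = io.imread('coins.png', as_gray=True)
G = filters.sobel(filters.gaussian(I, sigma=2))   # smoothed gradient magnitude
SEED, n = ndi.label(G < 0.02)                     # crude seeds: connected low-gradient regions
LBL = segmentation.watershed(G, SEED)             # flood the gradient landscape from the seeds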
Advantages no specification of the number of clusters necessary or any other parameter.
Disadvantages over-segmentation, slow
9.3.1 K-Means
We used the K-Means procedure already for feature quantization (Section 7.1.1), namely on a 128-dimensional
problem. Here we use it to segment much lower-dimensional spaces, for instance in Fig. 20 we used merely
one dimension, namely gray-scale intensity. If we have a color image, then it would be obvious to start ex-
ploring the three RGB dimensions.
The advantage of using the K-Means algorithm is that it works relatively fast; its downside is that one
needs to specify a number of clusters, meaning one needs to know how many different objects one expects.
51
Suppose we have a color image of a flower with a homogeneous background (leaves of other
plants). Then perhaps we could start the segmentation with k=3: one cluster for the flower leaves, one
cluster for the background, and one for the rest of the flower. Doing this only on the color channels we
would code:
Matlab kmeans
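The original Matlab snippet is not reproduced here; as a rough Python equivalent (the image name and k are assumptions):

from skimage import io
from sklearn.cluster import KMeans

I = io.imread('flower.jpg')
X = I.reshape(-1, 3).astype(float)            # one row per pixel, the RGB values as features
km = KMeans(n_clusters=3, n_init=10).fit(X)   # k = 3: flower, leaves, rest
LBL = km.labels_.reshape(I.shape[:2])         # label map of the segmentation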
Advantages relatively fast: faster than region growing, but slower than thresholding
Disadvantages specification of k: number of expected clusters needs to be specified beforehand
9.4 Notes
Color In many classification problems, color information is hardly of use as objects can appear in many
different colors. In segmentation tasks however, color information is more likely to be of actual benefit. It may
therefore be worth observing the images also in a different color space, other than the typical RGB space.
For instance, segmentation of natural objects (leaves, fruits, landscapes) is sometimes done in the Hue-
Saturation-Value (HSV) space, which can be obtained by converting with the command rgb2hsv. There exist
many other color spaces, see Appendix H for an overview. But be advised that even when research articles
state that a certain transformation helps improving segmentation, it is not necessarily what everyone agrees
on. For instance for the task of skin detection, there exist many studies advertising their transformations.
But other studies say that there is no real significant advantage in transforming the original RGB space.
Thus, perhaps trying to deal with the original RGB space could be sufficient, at least in a first step.
Moving on Many segmentation algorithms return a binary (logical) map, also called black-white image as
introduced in Section 2. That map is often manipulated with operations known as morphological processing,
coming up in the next Section 10.
10 Morphology and Regions (Image Processing III) Sze p112, s3.3.2, p127
Morphological processing is the local manipulation of the structure in an image - its morphology - toward
a desired goal. Often the manipulations aim at facilitating the measurement of features such as regions,
contours, shapes, etc. Morphological processing can also carry out filtering processes.
We have already given two use cases for morphological processing. In one example, morphological
processing ’polished’ the map of edge pixels that one has obtained with some edge-detection algorithm
(Section 12.2). And we have mentioned several times that after application of a segmentation algorithm
one would like to continue modifying the black-white image toward a specific goal. In both cases one would
apply so-called binary morphology, coming up in Section 10.1. But one can also apply such manipulations
to gray-scale images, called gray-scale morphology, which we will quickly introduce in Section 10.2.
After we have completed (segmentation and) morphological processing, we often want to count the
objects or regions in the black-white image. Or if we know their count already, then perhaps we intend to
localize the regions and describe them by a few parameters. This is called region finding or labeling. This
is a fairly straightforward process and will be explained in Section 10.3.
Figure 21: Binary morphological processing on a hand-written letter ’j’. From left to right:
Original: the object to be manipulated.
Erosion: a slimmed version of the original object: note that the resulting object is now ruptured.
Dilation: a thickened version of the original.
Closing: this process consists of two operations: first dilation of the original, followed by erosion. (the effects of any
actual closing are not really viewable in this case)
Opening: first erosion of the original, followed by dilation. Note that some of the ruptures are now more evident.
Erosion: the object loses one or several ’layers’ of pixels along its boundary.
Dilation: the object fattens by one or several layers of pixels along its boundary.
If one applies the above two operations in sequence, then that causes the following modifications:
Closing: dilation of the original followed by erosion; it fuses narrow breaks and fills small holes and
gaps.
Opening: erosion of the original followed by dilation; it eliminates small objects and sharpens peaks in
an object.
The last two sequential operations can be summarized as follows: they tend to leave large regions and
smooth boundaries unaffected, while removing small objects or holes and smoothing boundaries.
Other combinations of those basic operations are possible, leading to relatively complex filtering op-
erations such as the tophat and bothat operations, which however can also become increasingly time-
consuming. When developing an application, it is difficult to foresee which morphological operations are
optimal for the task: one simply has to try out a lot of combinations of such operations and observe carefully
the output.
More algorithmically speaking, the image is modified by use of a structuring element, which is moved
through the image - very much like in the convolution process. The structuring element can be any shape
in principle: it can be a simple 3 × 3 box filter as in the examples above; or it can be a more complicated
structure, for instance some simple shape that one attempts to find in the image, leading essentially to a
filtering process.
If one deals with large images, then one of the first steps toward object characterization or region finding
might be to eliminate very small regions, which in software programs we can do with a single function. This
is of benefit because morphological operations are carried out much faster after one has eliminated any
unlikely object candidates.
Matlab Binary manipulations are carried out with commands starting with the letters bw standing for black-
white. Most essential manipulations can be carried out with the function bwmorph, which takes the name of the
desired operation (e.g. ’thin’, ’skel’, ’clean’) as an argument.
The operation ‘thin’ creates skeleton-like structures. It is particularly useful for contour tracing: one would
first apply a thinning operation before trying to trace contours in a map.
More sophisticated manipulations can be achieved using a structuring element defined with strel. There
also exist functions such as bwareafilt and bwpropfilt to carry out complex filtering operations.
Python The functions are found in different sub-modules, in scipy.ndimage.binary_xxx as well as in
skimage.morphology. In examples:
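(A minimal sketch; the binary image BW, e.g. the output of a segmentation, is assumed.)

from scipy import ndimage as ndi
from skimage import morphology

BWe = ndi.binary_erosion(BW)            # erosion
BWd = ndi.binary_dilation(BW)           # dilation
BWo = morphology.binary_opening(BW)     # opening: erosion followed by dilation
BWc = morphology.binary_closing(BW)     # closing: dilation followed by erosion
SKL = morphology.skeletonize(BW)        # thinning to a skeleton (cf. 'thin' in bwmorph)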
10.2 Grayscale Morphology
The above introduced operations also exist in gray-scale morphology, but here the operation is not the
change of a bit value but the selection of an extremum value in the neighborhood under investigation. In
gray-scale morphology one talks of the structuring function: in the simplest case, the function takes the
maximum or minimum of the neighborhood.
Matlab The operations are found under the initial letters im, for instance imdilate, imerode, imopen and
imclose.
Python The operations can be found in scipy.ndimage.grey_xxx or some of them in skimage.morphology.
For instance skimage.morphology.dilation carries out dilation for a gray-scale image and can also be ap-
plied to a black-white image, but for the latter the above functions starting with binary are faster.
10.3 Region Finding (Labeling)
Label Matrix To find connected components, one runs an algorithm that labels the regions with integers:
1, 2, 3, ...nregions. The labeled regions remain in the map, whereby the value 0 signifies background. Labeling
can occur with two principally different types of connectivity: the ’conservative’ type uses the so-called 4-
connectivity or 4-connected neighborhood, i.e. only the 4 neighbors along the vertical and horizontal
axes. The ’liberal’ type uses the so-called 8-connectivity or 8-connected neighborhood, i.e. all 8
neighbors; it typically results in fewer and larger regions than the 4-connectivity.
Region Properties After we have located the regions, we may want to describe them. Software packages
typically provide a function that describes regions with a number of simple measures based on geometry or
statistics, e.g. various measures of the region’s extent in the image, statistical moments, etc.
Matlab’s function regionprops can be applied in a variety of ways; three of them are explained here. We can
apply the function regionprops directly to the black-white image BW, in which case we have skipped the
function bwconncomp - it is carried out by regionprops in that case. Or we apply the function bwconncomp
first and then feed its output to regionprops. Or we use the function bwboundaries, which is useful if we
intend to describe the silhouette of shapes, for example with a radial description as will be introduced in
Section 11 on the topic of shape.
I = imread(’cameraman.tif’);
BW = im2bw(I,128/255);
%% ===== RegionProps directly from BW ========
RPbw = regionprops(’table’,BW,I,’maxintensity’,’area’);
%% ===== RegionProps via bwconncomp ==========
CC = bwconncomp(BW); % connected components (default 8-connectivity)
RPcc = regionprops(’table’,CC,I,’maxintensity’,’area’);
%% ===== Boundaries for shape description =====
BND = bwboundaries(BW); % cell array, one [n x 2] boundary per region
The function regionprops provides only simple region descriptions. For more complex descriptions, one
inevitably moves toward the topic of shape description, which will be introduced in the upcoming section.
Python offers slightly less flexibility than Matlab. The sub-module skimage.measure provides the functions label and regionprops. The function label requires an array of type int as input, and the function regionprops takes only a label matrix as input:
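A sketch, assuming I and BW are the gray-scale image and its thresholded black-white version as in the Matlab example above:

from skimage import measure

L = measure.label(BW.astype(int))                    # label matrix of type int
props = measure.regionprops(L, intensity_image=I)    # one entry per region
print(props[0].area, props[0].max_intensity)         # analogous to 'area' and 'maxintensity'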
To obtain the listed values as a table - as a single array - we write a loop as follows, a formulation that is called a list comprehension in Python:
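For example, continuing the sketch above:

import numpy as np
Areas = np.array([rp.area for rp in props])      # one value per region
Cents = np.array([rp.centroid for rp in props])  # an n_regions x 2 array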
11 Shape Dav p227, ch 9 and 10
Shape means the geometry of an object or its form; it is a ’structure’ consisting of a few segments, typically
without texture. Shape description techniques are used in the following applications for example:
- Medical imaging: to detect shape changes related to illness (tumor detection) or to aid surgical planning
- Archeology: to find similar objects or missing parts
- Architecture: to identify objects that spatially fit into a specific structure
- Computer-aided design, computer-aided manufacturing: to process and to compare designs of mechan-
ical parts or design objects.
- Entertainment industry (movies, games): to construct and process geometric models or animations
In many applications, the task to be solved is the process of retrieval, namely the ordering (sorting) of
shapes, and less so the process of classification, see Section 1.2 again to understand the difference. In
a retrieval process, one shape is compared to all other shapes and that is computationally much more
intensive than just classification. Thus, one major concern in retrieval is the speed of the entire matching
process.
The number of shape matching techniques is almost innumerable. Each technique has its advantages and disadvantages and often works only for a specific task and only under certain conditions. Textbooks are typically shy of elaborating on this topic - with the exception of Davies’ book - because there is no dominating method and it would be somewhat unsatisfactory and endless to present all techniques. Perhaps the
two most important aspects in choosing a shape matching technique are its matching duration and its
robustness to shape variability.
Shape Variability Depending on the task or the collection, shapes can vary in size or in spatial orienta-
tion; they can be at different positions in the image; they can appear mirrored; their parts may be aligned
slightly differently amongst class instances; their context may differ. These different types of ’condition’ are
also called variability. Ideally, a shape matching technique would be invariant to all those variabilities. The
table below shows the terminology used with respect to those desired shape matching properties:
Variability                Invariance
size                       scaling
orientation                rotation
position                   translation
laterality (mirroring)     reflection
alignment of parts         articulation
blur, cracks, noise        deformation
presence of clutter        occlusion
Practically, it is impossible to account for all these invariances and therefore one needs to observe what
type of variability is present in the shape database and make a choice of the most suitable technique. This
choice is also important for classification techniques.
The first three sections introduce shape descriptions of increasing complexity. Section 11.1 introduces
simple shape descriptions suitable for rapid retrieval. Section 11.2 introduces techniques based on point
comparisons: those techniques have a longer matching duration but show better retrieval accuracy. Section
11.3 introduces part-based descriptions: they can be even more accurate, but the matching techniques are
rather complicated. Thus, the choice of a matching technique is also a matter of dealing with a speed-
accuracy tradeoff. In the final section 11.4, we have a word on shape classification systems.
In Section 11.1.1, we mention simple measures based on the boundary or interior of the shape. In Section 11.1.2 we introduce the radial description, which is probably the most efficient description that uses feature vectors.
Advantages: compact; useful if a very large number of shapes is to be matched; can serve as a triage.
Disadvantages: not very discriminative.
If the shape is a continuous curve, that is, a single closed curve, then we can extract many more useful parameters by determining its radial signature, also called centroidal profile, see Figure 22. For that we need the silhouette points of the shape - its boundary. In Matlab the boundary can be obtained with the function bwboundaries, see Section 10.3. In Python there exists the function skimage.measure.find_contours.
The radial signature is the sequence of distances R(s) from the shape’s center point to each silhouette (curve) point s. To obtain the center point, one simply averages the curve points. Sometimes one uses the angle Ω as the independent variable instead of the curve point s.
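A minimal Python sketch of this computation, assuming BW contains a single filled shape:

import numpy as np
from skimage import measure

C = measure.find_contours(BW.astype(float), 0.5)[0]  # boundary points (row, col)
center = C.mean(axis=0)                              # shape center = average of the curve points
R = np.sqrt(((C - center) ** 2).sum(axis=1))         # radial signature R(s)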
For a circle, the radial signature would be a constant value. For an ellipse, the signature would be
undulating with two ’mounds’. For a triangle, there would be three sharp peaks. For a square, there would
be four peaks. And so on. For a complex shape, such as the pigeon shape in the figure, the signature
is relatively complex.
There are two relatively straightforward analyses we can do with the radial signature. One is a Fourier
analysis, meaning we express the signature as a spectrum of frequencies. This is an enormously powerful
analysis, which merits its own lectures, but is rather the topic of a signal processing course. The other anal-
ysis is an investigation of the extrema present in the signature. The Fourier analysis is more discriminative
than the extrema analysis, but combining both can yield even better results.
Fourier Analysis The Fourier analysis transforms a signal into a spectrum, which in digital implemen-
tation is a sequence of so-called Fourier descriptors (FD). We apply this discrete Fourier transform to the
(unmodified) radial signature (R(s)) and normalize it by its first value:
FDabs = abs(fft(Rad));         % magnitude of the FFT of the radial signature Rad = R(s)
FDn = FDabs(2:end)/FDabs(1);   % normalization by the 1st Fourier descriptor
In the lower right of Fig. 22, the first 50 Fourier descriptors are shown. But typically, the first 5 to 10 Fourier
descriptors are sufficient for discriminating the shapes.
Figure 22: Left: a pigeon shape; the asterisk marks the beginning of the signature. Upper Right: radial signature:
the distances between pole (shape center) and the individual shape points. Lower Right: discrete Fourier descriptors
of the radial signature.
Derivative Analysis Finding extrema in the signature is easier if we first low-pass filter the signature - very much as in the analysis of facial profiles, see the introductory exercise. The number of extrema corresponds to the number of corners in a shape, whereby corner here means a point of locally higher curvature than its context. Two corners would correspond to an ellipse or bicorn shape, three corners to a triangle or trident shape, etc.
The presented descriptions cannot discriminate large sets of shapes, but their use may serve as a triage for a more complex description and matching. Appendix J.10.3 shows an example of how to generate the radial and Fourier descriptors.
11.2 Point-Wise
There are two cases of point-wise formats one can distinguish. In one format, a shape is expressed as a
single boundary, a sequence of points, which typically corresponds to its silhouette (Section 11.2.1). In the
other format, a shape is considered as a set of points (Section 11.2.2). Because the shape comparison
with such formats is relatively time-consuming, one prefers to know the approximate alignment between
the two shapes before an accurate similarity is determined. This is also known as the correspondence
problem, meaning which points in one shape correspond to which others in another shape, at least in an
approximate sense.
11.2.1 Boundaries
In this case, the list of points of one shape is compared to the point list of another shape by somehow determining a distance (or similarity) measure between the two lists. The simplest way would be to take the pairwise distances between the two lists of points and to sum the corresponding minima to arrive at a measure of similarity. The pairwise point matching is computationally costly: the computational complexity is said to be quadratic, also expressed as O(N^2), where N is the number of pixels and O is the symbol for complexity. To solve the correspondence problem one could simply shift the two shapes against each other to find a minimum for the correspondence. This would increase the complexity to O(N^3), that is, it becomes cubic and thus rather impractical.
Of course, the complexity would be greatly reduced if one used so-called landmarks or key-points only, namely points on the shape that are at locations of high curvature. We can find those using the radial, the curvature or the amplitude signatures, see Sections 11.1.2 and 12.3. The problem is that such key-points are difficult to determine consistently.
The most efficient boundary matching technique is based on observing the local orientations along the
boundary and including them in the matching process. The detailed steps are as follows:
1. Sample an equal number of points i = 1, .., N from each shape, equally spaced along the boundary.
2. Determine the local orientation o at each point, for instance the angle of the segment spanning several
pixels on both sides of the center pixel. Thus the shape is described by a list of N points with three
values per point: x- and y-coordinate, as well as orientation o.
3. Determine the farthest point using the radial description. This will serve as a correspondence.
When matching two shapes, one would take the point-wise distances (including orientation o) using the far-
thest point as reference. The point-wise distances are also taken in reverse order to account for asymmetric
shapes. Thus, the matching complexity is O(2N ) only.
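A sketch of this matching scheme follows. It assumes a hypothetical helper resample_boundary that returns the N equally spaced boundary points of a shape as an (N, 2) array, and it assumes that the two shapes have already been normalized for position and scale:

import numpy as np

def describe(P, step=3):
    # P: (N, 2) boundary points; local orientation from a segment spanning +-step points
    d = np.roll(P, -step, axis=0) - np.roll(P, step, axis=0)
    o = np.arctan2(d[:, 1], d[:, 0])                              # orientation at each point
    far = np.argmax(np.linalg.norm(P - P.mean(axis=0), axis=1))   # farthest point (radial description)
    return np.roll(P, -far, axis=0), np.roll(o, -far)             # start both lists at the farthest point

def match(P1, o1, P2, o2, w=1.0):
    # point-wise distances, taken forward and in reverse order -> O(2N)
    do_f = np.abs(np.angle(np.exp(1j * (o1 - o2))))               # wrapped orientation differences
    do_r = np.abs(np.angle(np.exp(1j * (o1 - o2[::-1]))))
    d_f = np.linalg.norm(P1 - P2, axis=1).mean() + w * do_f.mean()
    d_r = np.linalg.norm(P1 - P2[::-1], axis=1).mean() + w * do_r.mean()
    return min(d_f, d_r)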
1) Multiple Boundaries The most successful approach is called Shape Context (Belongie, Malik & Puzicha, 2002) and is based on taking local histograms at selected points of the shape. The selection of such key-points may not be completely consistent, but that is compensated by a flexible matching procedure. At each key-point, the positions of the remaining boundary points are collected in a coarse log-polar histogram, i.e. counted as a function of their radial distance and angle relative to the key-point.
2) Limited Variability; Few Points This case is rather useful for localization and less for retrieval (or
classification). We assume that we know the shape’s key-points and that its articulation is limited (see
again property list given in the introduction). The goal is then to find the target shape in another image. We
are then faced with two tasks: the correspondence problem and the transformation problem.
The distance transform SHB p19, s 2.3.1 is typically determined for a binary image. The transform calculates at each background (off) pixel the distance to the nearest object (on) pixel. This results in a scalar field called distance
map D(i, j). The distance map looks like a landscape observed in 3D, which is illustrated in Figure 23. The
distance values inside a rectangular shape form a roof-like shape, a chamfer (shapes used in woodworking
and industrial design); the interior of a circle looks like a cone. The distance transform is also sometimes
known as the grassfire transform or symmetric-axis transform, since it can also be thought of as a propagation process, namely a fire front that marches forward until it is canceled out by an oncoming fire front. Wherever such fronts meet, they form symmetric points, which correspond to the ridges in the distance map.
[Figure 23: original black/white images (left column) and their distance maps (right column).]
These ridges form a roof-like skeleton for the rectangle and a single symmetric point for a circle - the peak of the cone. Sometimes that skeleton is also called the medial axis.
The distance transform can be calculated with different degrees of precision. Simpler implementations are faster but provide only approximate distance values; they are based, for instance, on the Manhattan (city-block) distance. Precise implementations use the Euclidean distance.
In Matlab the function bwdist calculates the distance values for off-pixels, meaning on-pixels are understood as boundaries (as introduced above). It calculates Euclidean distances by default, but a simpler implementation can be specified as an option.
In Python the distance transform can be found in the scipy module, specifically scipy.ndimage.distance_transform_edt for Euclidean precision. It calculates the distances at on-pixels (in contrast to Matlab). Simpler implementations are provided by separate functions, e.g. distance_transform_cdt.
from scipy.ndimage import distance_transform_edt
DM = distance_transform_edt(BW) # distance calculated at ON pixels
Applications: binary image alignment (fast chamfer matching), nearest point alignment for range data
merging, level set, feathering in image stitching and blending
The skeleton (symmetric axis) itself can be computed directly, in Matlab with bwmorph, in Python with skimage.morphology:
BWs = bwmorph(BW,’skel’,Inf);   % Matlab: skeletonization
BWs = bwmorph(BW,’thin’,Inf);   % Matlab: thinning (an alternative)
BWs = medial_axis(BW)           # Python: from skimage.morphology import medial_axis
BWs = skeletonize(BW)           # Python: from skimage.morphology import skeletonize
The differences between those implementations are subtle and it is difficult to foresee which implementation is more appropriate for a specific task. One general difficulty with any of those implementations is that they are only moderately robust to subtle variability (see table 11 again). In particular, they are only partially invariant to articulation and deformation; even for a rotation of the same shape, the output of an algorithm may change.
11.4 Classification
When it comes to classification, the most successful systems are not the above introduced techniques, but
Deep Neural Networks as introduced in Section 5. Even for the digit database MNIST, the use of symmetric
axes, for instance, has not provided better prediction accuracies than machine learning algorithms.
There are two disadvantages with CNNs for shape recognition. One is that they require the shape to be fairly well centered in the image; thus, a search algorithm is necessary that finds the exact shape center. The other disadvantage is that they take fairly long to learn the features, as pointed out already in Section 5, though with transfer learning that shortcoming is almost eliminated.
The power of those networks comes from their robustness to local changes: small changes in the boundary have little consequence in networks. In contrast, for any of the shape description techniques introduced above, such small changes can result in relatively different features and thus tend to produce more wrong classifications than DNNs. Take for instance the two rectangular shapes in Figure 23: the corresponding skeletons in the lower right graph show sufficient differences to make a robust comparison difficult.
12 Contour
Contours outline the objects and parts in a scene. Contours can be detected by edge detection as in-
troduced in Section 4.2. But to understand the object’s shape (or scene part) we need a description. As
contours are often fragmented due to various types of noise and illumination effects, we require a description that can deal with open contours - as opposed to the radial description for closed contours (Section 11.1.2). In Section 12.3 we introduce the challenges toward that goal. But before we arrive there, we need to string together the individual edge pixels of the edge map, a process called edge following or edge tracing (Section 12.2).
If one needs to detect straight lines or circles only - as is the case in some specific applications - then we
can use statistical methods to find them (Section 12.1). Those methods are also relatively robust to contour
fragmentation.
Finally, it is worth pointing out that there exist also other contour types, used in some particular scenarios
(Section 12.4).
Edge following is also called edge tracing or edge tracking. Sometimes it is also called boundary tracing,
which makes sense when edges represent a region boundary.
There are two issues to resolve before we start tracing. One is that contours in edge maps are occasion-
ally broader than one pixel, which can make tracing difficult, because such clusters represent ambiguous
situations; they occur in particular when the contour runs along the diagonal axes. To avoid such thickened
contours, one can employ the operations of morphological processing as introduced previously, binary mor-
phology in particular (Section 10.1). With the so-called thinning operation we can ’slim’ those thick contour
locations. Or perhaps we wish to remove isolated pixels immediately by using a ‘clean’ operation:
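In Python there is no ‘clean’ option as in Matlab’s bwmorph; a sketch with skimage stand-ins (thin for thinning, remove_small_objects for isolated pixels; Medg is the assumed boolean edge map):

from skimage.morphology import thin, remove_small_objects

Medg_thin  = thin(Medg)                               # slim thick contour locations
Medg_clean = remove_small_objects(Medg, min_size=2)   # drop isolated single pixels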
Another issue is that we need to decide how to deal with junctions. Let us take a T-junction as an example: should we break it up at its branching point and trace the three segments individually? Or should we trace around it - and obtain a boundary - which we later partition into appropriate segments? This is, in principle, an issue of representation.
Matlab There are several ways to do this: two involve the use of Matlab functions, and a third would be to write our own routine; see also the example in Section J.11.1.
1. bwboundaries: this routine is useful for finding the boundaries of regions. In the case of a contour, the routine will trace around the contour, so that half of the returned pixels duplicate those on the opposite side: if one prefers unique pixel coordinates, then one has to eliminate the duplicates.
2. bwtraceboundary: here one specifies a starting point and the function will then trace until it coincides with its starting point. In contrast to the function bwboundaries above, it does not treat a contour as a region and pixels are ’recorded’ only once. Hence, we write a loop which detects starting points and traces contours individually. We need to take care of when tracing should stop for an individual contour.
3. Own routine: the easiest approach would be to trace contours using their end- and branchpoints. The
endpoints can be found with the command Mept = bwmorph(Medg,’endpoints’); the branchpoints
can be found with the option ’branchpoints’.
Python There exists no routine designed specifically for that task. Python however offers a routine to
measure iso-contours, called skimage.measure.find_contours. One specifies a height value at which
the map is thresholded and then the corresponding region boundaries are taken. This can be exploited
to emulate boundary detection in binary images. If one specifies a value of 0.8 (between 0.5 and 1.0),
then the contours lie closer to the region pixels; if one specifies a value of 0.2 (between zero and 0.5),
then the contours lie closer to the adjacent exterior pixels of the region. The example in Section J.11.1
demonstrates that.
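A sketch of that emulation (Medg as above; the level 0.8 follows the text):

from skimage import measure

contours = measure.find_contours(Medg.astype(float), 0.8)   # list of traced contours
for C in contours:                                          # each C is an (N, 2) array of (row, col)
    print(len(C), C[0])                                     # number of points and starting point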
12.3 Description
To properly describe an open or closed contour it is necessary to move a window along the contour and take some measure of the window’s subsegment. This results in a signature analogous to the radial signature introduced for shapes in Section 11.1.2. There are two types of signatures that have been pursued so far:
curvature and amplitude.
12.3.1 Curvature
The curvature measure is calculated by simply taking the derivatives of the contour. The second derivative
is our signature. Here is an improvised curvature measure, where Rf and Cf are our coordinates (row and
column indices):
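A sketch of such an improvised measure in Python; Rf and Cf are assumed to be the (already smoothed) row and column coordinates along the traced contour:

import numpy as np

dR,  dC  = np.gradient(Rf), np.gradient(Cf)   # first derivatives along the contour
ddR, ddC = np.gradient(dR), np.gradient(dC)   # second derivatives
K = np.abs(ddR) + np.abs(ddC)                 # curvature signature: peaks indicate corners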
Wherever a peak occurs in the signature, the contour exhibits a sharp curve - a point of highest curvature. This is sometimes used for shape finding and identification. If there is no peak, then we deal with a perfect, smooth arc or with a circle.
If the coordinates are integer values, then it is better to first smooth the coordinate values. Appendix J.11.2 demonstrates a complete example.
The shortcoming of this signature is that it is suitable only for a certain scale: we are not able to detect all points of highest curvature in arbitrarily sized contours. Broad curvatures are more easily detected when the signature has been smoothed with a correspondingly large low-pass filter. In order to cover all curvatures in an arbitrary contour, one therefore generates a space, the curvature scale space (CSS), by low-pass filtering the curve for a range of sigmas, analogous to the image scale space as introduced in Section 3.1.
12.3.2 Amplitude
The curvature measure has the downside that it modifies the signal by low-pass filtering it. It would be better to leave the signal as is and take an alternate measure. That alternate measure would be the amplitude of the subsegment, i.e. the distance between the chord (a line equation) and the curve pixels of the subsegment. The resulting signature does not look much different from a curvature signature, but in a scale space it allows a more reliable detection of points of highest curvature. The downside is that calculating the amplitude is a relatively costly procedure.
Figure 24: Left Column: river contours detected in a satellite image of an agricultural field. Those contours are
difficult to detect with edge detectors due to their tight spacing. Right Column: iso-contours taken at various intensity
levels; some of them represent well surfaces and also regions reflecting the light source.
13 Image Search & Retrieval FoPo p657, ch 21, pdf 627
An image retrieval system is a computer system for browsing, searching and retrieving images from a
large database of digital images. Browsing, searching and retrieving are search processes of increasing
specificity:
- browsing: the user looks through a set of images to see what is interesting.
- searching: the user describes what he wants, called a query, and then receives a set of images in
response.
- retrieval: the user uploads an image and in return obtains the most similar images, ranked by some
similarity measure.
Traditionally, methods of image retrieval utilized metadata such as captioning, keywords, or (textual) de-
scriptions to find similar images, that is they would not use any computer vision. Then, with increasingly
powerful computer vision techniques, search started to be carried out also with the actual ‘pixel content’
using simple image histogramming in the early days. This new approach was then called content-based
image retrieval (CBIR). For some time, CBIR was based on the (engineered) features as introduced in early
sections (Section 4), but modern CBIR uses also Deep Nets of course.
Note that images are selected by some measure of similarity - they are not classified as in image classification; see again Section 1.2 to understand the difference, or the introduction on shapes in Section 11. For a query image, a similarity to all other images is calculated and then the images are ranked according to that similarity measure.
The following section does therefore not offer novel computer vision methods, but introduces terminology
and methods from the field of information retrieval that optimize the outcome of a search. Although that
terminology was developed in particular for feature engineering techniques, it is equally useful for feature
learning techniques.
Applications
Finding Near Duplicates
1) Trademark registration: A trademark needs to be unique, and a user who is trying to register a trademark
can search for other similar trademarks that are already registered (see Figure 25 below).
2) Copyright protection.
Figure 25: A trademark identifies a brand; customers should find it unique and special. This means that, when one
registers a trademark, it is a good idea to know what other similar trademarks exist. The appropriate notion of similarity
is a near duplicate. Here we show results from Belongie et al. (2002), who used a shape-based similarity system to
identify trademarks in a collection of 300 that were similar to a query (the system mentioned in Section 11.2.2). The
figure shown below each response is a distance (i.e., smaller is more similar). This figure was originally published as
Figure 12 of Shape matching and object recognition using shape contexts, by S. Belongie, J. Malik, and J. Puzicha,
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. © IEEE, 2002.
Semantic Searches Other applications require more complex search criteria. For example, a stock photo
library is a commercial library that survives by selling the rights to use particular images. An automatic
method for conducting such searches will need quite a deep understanding of the query and of the images
in the collection. Internet image search shows one can build useful image searches without using deep
object recognition methods (it is a safe bet that commercial service providers do not understand object
recognition much better than the published literature). These systems seem to be useful, though it is hard
to know how much or to whom.
Trends and Browsing In data mining, one uses simple statistical analyses on large datasets to spot
trends. Such explorations can suggest genuinely useful or novel hypotheses that can be checked by domain
experts. Good methods for exposing the contents of images to data mining methods would find many
applications. For example, we might data mine satellite imagery of the earth to answer questions like: how
far does urban sprawl extend?; what acreage is under crops?; how large will the maize crop be?; how much
rain forest is left?; and so on. Similarly, we might data mine medical imagery to try and find visual cues to
long-term treatment outcomes.
We now review techniques from document retrieval that found their way into image retrieval. A typical text information retrieval system expects a set of query words. With those query words an initial set of putative matches is selected from an index (Section 13.1.1). From this list, documents with a large enough similarity measure between document and query are chosen (Section 13.1.2). These are ranked by a measure of significance, and returned (Section 13.1.3). For the purpose of image retrieval we can think of the words as being the ’visual words’ as developed in Section 7, and of the documents as being the images. Some more comparison is given in the final Section 13.1.4.
An index table records, for each word and each document, a measure f of how often the word occurs in that document. The table is sparse, as most words occur in only a few documents. We could regard the table as an array of
lists. There is one list for each word, and the list entries are the documents that contain that word. This
object is referred to as an inverted index, and can be used to find all documents that contain a logical
combination of some set of words. For example, to find all documents that contain any one of a set of
words, we would take each word in the query, look up all documents containing that word in the inverted
index, and take the union of the resulting sets of documents. Similarly, we could find documents containing
all of the words by taking an intersection, and so on. This represents a coarse search only, as the measure
f is used as a binary value only (and not as an actual frequency). A more refined measure would be the
word count, or even better, a frequency-weighted word count. A popular method is the following:
tf-idf stands for ’term frequency-inverse document frequency’ and consists of two mathematical terms,
one for ’term frequency’, the other for ’inverse document frequency’. With Nt as the number of documents
that contain a particular term, the idf is
idf = N_d / N_t ,    (15)

where N_d is the total number of documents.
Practical tip: add a value of one to the denominator to avoid division by zero. With
nt (j) for the number of times the term appears in document j and
nw (j) for the total number of words that appear in that document
the tf-idf measure for term t in document j is
f_{t,j} = ( n_t(j) / n_w(j) ) / log( N_d / N_t ) .    (16)
We divide by the log of the inverse document frequency because we do not want very uncommon words to
have excessive weight. The measure aims at giving most weight to terms that appear often in a particular
document, but seldom in all other documents.
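A small Python sketch of this weighting, applying equation (16) as reconstructed above to an assumed toy corpus of tokenized documents (the corpus, the names and the guard for unseen terms are illustrative assumptions; a term occurring in every document would make the logarithm zero and needs extra care):

import numpy as np

docs = [["red", "ball", "red"], ["green", "field", "ball"], ["red", "car"]]
Nd = len(docs)                                 # total number of documents

def tfidf(term, doc):
    Nt = sum(term in d for d in docs)          # documents containing the term
    if Nt == 0:                                # guard against division by zero
        return 0.0
    nt = doc.count(term)                       # n_t(j): occurrences in document j
    nw = len(doc)                              # n_w(j): total words in document j
    return (nt / nw) / np.log(Nd / Nt)         # equation (16)

print(tfidf("red", docs[0]))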
In response to a user query, a set of N_b documents is returned, where N_b needs to be specified, e.g. 100 items are returned to the user. To determine how fitting the selected documents are, one calculates two measures: recall, the fraction of all relevant documents in the collection that are among the returned items, and precision, the fraction of the returned items that are actually relevant.
F measure: To summarize the recall and precision values, different formulas can be used. A popular one
is the F1 -measure, which is a weighted harmonic mean of precision and recall:
F_1 = 2 PR / (P + R) .    (20)
Precision-Recall Curve To obtain a more comprehensive description of the system, one calculates the
precision values for increasing recall by systematically increasing N_b. One plots recall on the x-axis against precision on the y-axis; this curve is called the precision-recall curve.
Average Precision An important way to summarize a precision-recall curve is the average precision, which
is computed for a ranking of the entire collection. This statistic averages the precision at which each new
relevant document appears as we move down the list. P(r) is the precision of the first r documents in the ranked list, whereby r corresponds to N_b in equation 19; N_r is the total number of relevant documents in the collection. Then, the average precision is given by

A = (1/N_r) \sum_{k=1}^{N_r} P(r_k) ,    (21)

where r_k is the position in the ranked list at which the k-th relevant document appears.
Example In response to a query the (total of) 3 relevant items are found at positions 2, 13 and 36:
A = (1/3) ( 1/2 + 2/13 + 3/36 ) = 0.2457
Implementation Given an index vector with sorted retrieval positions, Ix, the measure is computed as
follows:
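A minimal Python sketch, assuming Ix holds the sorted (1-based) positions of the relevant items, e.g. Ix = [2, 13, 36]:

import numpy as np

def average_precision(Ix):
    Ix = np.asarray(Ix, dtype=float)          # sorted retrieval positions of the relevant items
    P = np.arange(1, len(Ix) + 1) / Ix        # precision at each of those positions: k / r_k
    return P.mean()                           # equation (21)

print(average_precision([2, 13, 36]))         # -> 0.2457...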
14 Tracking FoPo p356, ch 11, pdf 326
Tracking is the pursuit of one or multiple moving objects in a scene; it is often carried out in a 2D plane.
Tracking is used in many applications:
Surveillance: is the monitoring of activities and the warning when a problem case is detected. Example
airport traffic surveillance: different kinds of trucks should move in different, fixed patterns - if they do
not, it is suspicious; similarly, there are combinations of places and patterns of motions that should
never occur (e.g., no truck should ever stop on an active runway).
Targeting: A significant fraction of the tracking literature is oriented toward (a) deciding what to shoot,
and (b) hitting it. Typically, this literature describes tracking using radar or infrared signals (rather than
vision), but the basic issues are the same: What do we infer about an object’s future position from a
sequence of measurements? Where should we aim?
Motion Capture: is the recording of the 3D configuration of a moving person using markers, e.g. white
patches placed on the joints of a person dressed in a black suit. Such recordings are used for anima-
tion, e.g. rendering a cartoon character, thousands of virtual extras in a crowd scene, or a virtual stunt
avatar.
There exist two simple but effective methods for tracking:
1. Tracking by detection (Section 14.1): we have a strong model of the object, strong enough to identify
it in each frame. We localize it, link up the instances, and that would be our track.
2. Tracking by matching (Section 14.2): we have a model of how the object moves. We have a domain in
the nth frame in which the object sits, and then use this model to search for a domain in the (n + 1)’th
frame that matches it.
For precise tracking, we include estimations of where the target might go next (Section 14.3).
• Case ’1 object in each frame’: If there is only one object in each frame, we try to build a reliable
detector (for that object) and observe its position in each frame (Figure 26a). If the detector is not reliable
we treat the problem as if there were multiple objects, see next.
Example: Tracking a red ball on a green background: the detector merely needs to look for red pixels.
In other cases, we might need to use a more sophisticated detector, e.g. tracking a frontal face.
• Case ’Multiple objects (or unreliable detector)’: If objects enter or leave the frame (or the detector
occasionally fails), then it is not enough to just report the object’s position. We then must account for the fact
that some frames have too many (or too few) objects in them. Hence, we maintain a track, which represents
a timeline for a single object (Figure 26b).
Typically, the tracks from the previous frame are copied to the next frame, and then object detector
responses are allocated to the tracks. How we allocate depends on the application (we give some examples
below). Each track will get at most one detector response, and each detector response will get at most one
track. However, some tracks may not receive a detector response, and some detector responses may not
be allocated a track. Finally, we deal with tracks that have no response and with responses that have no
track. For every detector response that is not allocated to a track, we create a new track (because a new
object might have appeared). For every track that has not received a response for several frames, we prune
that track (because the object might have disappeared). Finally, we may postprocess the set of tracks to
insert links where justified by the application. Algorithm 6 breaks out this approach.
The main issue in allocation is the cost model, which will vary from application to application. We need
a charge for allocating detects to tracks. For slow-moving objects, this charge could be the image distance
between the detect in the current frame and the detect allocated to the track in the previous frame. For
objects with slowly changing appearance, the cost could be an appearance distance (e.g., a χ-squared
distance between color histograms). How we use the distance again depends on the application. In cases
where the detector is very reliable and the objects are few, well-spaced, and slow-moving, then a greedy
Figure 26:
a: the case of one object only - an object with a distinctive appearance in each frame: the detector re-
sponses are linked to form a simple space-time path.
b: if some instances drop out, we will need to link detector responses to abstract tracks: track 1 has mea-
surements for frames n and n + 2, but not for frame n + 1.
c: if there is more than one instance per frame, a cost function together with weighted bipartite matching
could be enough to build the track.
[Source: Forsyth/Ponce 2010; Fig 11.1]
algorithm (allocate the closest detect to each track) is sufficient. This algorithm might attach one detector
response to two tracks; whether this is a problem or not depends on the application.
The more general algorithm solves a bipartite matching problem, meaning tracks on one side of the
graph are assigned to the detector responses on the other side of the graph. The edges are weighted by
matching costs, and we must solve a maximum weighted bipartite matching problem (Figure 26c), which
could be solved exactly with the Hungarian algorithm, but the approximation of a greedy algorithm is often
sufficient. In some cases, we know where objects can appear and disappear, so that tracks can be created
only for detects that occur in some region, and tracks can be reaped only if the last detect occurs in a
disappear region.
Background subtraction FoPo p291, s 9.2.1 is often a simple-but-sufficient detector in applications where the back-
ground is known and all trackable objects look different from the background. In such cases, the background-
subtracted objects appear as blobs and those are taken as detector responses. It is the simplest form of
foreground/background segmentation.
Example: People tracking on a fixed background, such as a corridor or a parking lot. If the application
does not require a detailed report of the body configuration, and if we expect people to be reasonably
large in view, we can reason that large blobs produced by background subtraction are individual people.
Weaknesses: if people stand still for a long time, they might disappear; it would require more work to split
up the large blob of foreground pixels that occurs when two people are close together; and so on - many
applications require only approximate reports of the traffic density, or alarms when a person appears in a
particular view.
In Python we can access methods for background subtraction through OpenCV. Their names start with
createBackgroundSubtractor. Here is an example:
import cv2 as cv
Bs = cv.createBackgroundSubtractorKNN()
where we create an object for subtraction.
Algorithm 6 Tracking multiple objects (or tracking with unreliable object detector). i=time; t=track.
Notation:
Write xk (i) for the k’th response of the detector in the ith frame
Write t(k, i) for the k’th track in the ith frame
Write ∗t(k, i) for the detector response attached to the k’th track in the ith frame
(Think C pointer notation)
Assumptions: Detector is reasonably reliable; we know some distance d such that d(∗t(k, i − 1), ∗t(k, i))
is always small.
First frame: Create a track for each detector response.
N’th frame:
Link tracks and detector responses by solving a bipartite matching problem.
Spawn a new track for each detector response not allocated to a track.
Reap any track that has not received a detector response for some number of frames.
Cleanup: We now have trajectories in space time. Link anywhere this is justified (perhaps by a more
sophisticated dynamical or appearance model, derived from the candidates for linking).
Later, inside the frame loop, we call the method .apply to generate a black-white image M in which white pixels mark the tracked (foreground) objects; this black-white image is called the mask:
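A minimal sketch of such a frame loop (’video.avi’ is a placeholder path; the calls are the OpenCV functions named above):

import cv2 as cv

cap = cv.VideoCapture('video.avi')
Bs = cv.createBackgroundSubtractorKNN()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    M = Bs.apply(frame)      # mask: white pixels mark the moving (foreground) objects
cap.release()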
In many tracking tasks, nothing more complex is required. The trick of creating tracks promiscuously and
then pruning any track that has not received a measurement for some time is quite general and extremely
effective.
A full Matlab example is given in Appendix J.12. It serves to understand the process as a whole. For speed,
one would rather use OpenCV.
Example: Tracking soccer players on a television screen, with players of height 10-30 pixels. Detailed body-part dynamics cannot be tracked due to the low resolution and the frame rate (30 Hz). Instead, we assume
that the domain translates and we thus track the player as a box. We can model a player’s motion with two
components. The first is the absolute motion of a box fixed around the player and the second is the player’s
movement relative to that box. To do so, we need to track the box, a process known as image stabilization.
As another example of how useful image stabilization is, one might stabilize a box around an aerial view of
a moving vehicle; now the box contains all visual information about the vehicle’s identity.
In each example, the box translates. If we have a rectangle in frame n, we can search for the rectangle
of the same size in frame n + 1 that is most like the original, e.g. using the sum-of-squared differences (or
SSD) of pixel values as a test for similarity and search for its minimum over a small neighborhood.
In many applications the distance the rectangle can move in an inter-frame interval is bounded because
there are velocity constraints. If this distance is small enough, we could simply evaluate the sum of squared
differences to every rectangle of the appropriate shape within that bound, or we might consider a search
across the scale space (or even better the pyramid) for the matching rectangle.
Matching Principle The simplest way to establish an alignment between two images or image patches
is to shift one image relative to the other. Given a template image I0 (x) sampled at a set of discrete pixel
locations {x_i = (x_i, y_i)}, we wish to find where it is located in image I_1(x). A least squares solution to this
problem is to find the minimum of the sum of squared differences (SSD) function
E_{SSD}(u) = \sum_i [ I_1(x_i + u) - I_0(x_i) ]^2 = \sum_i e_i^2 ,    (22)
where u = (u, v) is the displacement and e_i = I_1(x_i + u) - I_0(x_i) is called the residual error (or the displaced
frame difference in the video coding literature). (We ignore for the moment the possibility that parts of I0
may lie outside the boundaries of I1 or be otherwise not visible.) The assumption that corresponding pixel
values remain the same in the two images is often called the brightness constancy constraint.
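A brute-force sketch of this search, assuming I0 is the h x w template and I1 the (h+2r) x (w+2r) search region centered on the template’s previous position (the names and the search radius r are assumptions):

import numpy as np

def ssd_match(I0, I1, r=10):
    h, w = I0.shape
    best, best_uv = np.inf, (0, 0)
    for v in range(-r, r + 1):                        # vertical displacement
        for u in range(-r, r + 1):                    # horizontal displacement
            patch = I1[r + v:r + v + h, r + u:r + u + w]
            e = ((patch - I0) ** 2).sum()             # E_SSD for this displacement, equation (22)
            if e < best:
                best, best_uv = e, (u, v)
    return best_uv                                    # displacement u with minimal SSD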
Tracking Initiation We can start tracks using an interest point operator. In frame 1, we find all interest
points. We then find the location of each of these in the next frame, and check whether the patch matches
the original one: if so, it belongs to a track; if not, the track has already ended. We now look for interest
points that do not belong to tracks and create new tracks there. Again, we advance tracks to the next frame,
check each against their original patch, reap tracks whose patch does not match well enough, and create
tracks at new interest points.
14.3.1 Kalman Filters SHB p793, s 16.6.1 FoPo p369, s 11.3, pdf 339
A Kalman filter assumes a linear model for the dynamics of the tracked state,

x_t = A x_{t-1} + w_t ,    (23)

where x_t and x_{t-1} are the current and previous state variables, A is the linear transition matrix, and w_t is
a noise (perturbation) vector, which is often modeled as a Gaussian (Gelb 1974). The matrices A and the
noise covariance can be learned ahead of time by observing typical sequences of the object being tracked
(Blake and Isard 1998). We here summarize the entire model following the book by SHB, whereby k now
represents time.
x_{k+1} = A_k x_k + w_k ,
z_k = H_k x_k + v_k .    (24)
w_k is zero-mean Gaussian noise with assumed covariance Q_k = E[w_k w_k^T]
H_k is the measurement matrix, describing how the observations are related to the model
v_k is another zero-mean Gaussian noise term, with covariance R_k = E[v_k v_k^T]
The Kalman gain matrix:

K_k = P_k^- H_k^T ( H_k P_k^- H_k^T + R_k )^{-1}    (25)

Covariances:

P_k^- = A_k P_{k-1}^+ A_k^T + Q_{k-1}    (26)
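A minimal sketch of one predict/update cycle based on equations (24)-(26); all matrices are assumptions, and the final state- and covariance-update lines follow the standard Kalman equations, which are not spelled out above:

import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    x_pred = A @ x                                           # predicted state
    P_pred = A @ P @ A.T + Q                                 # equation (26)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # gain, equation (25)
    x_new = x_pred + K @ (z - H @ x_pred)                    # correct with measurement z
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new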
14.3.2 Particle Filters Sze p243, s 5.1.2, pdf 276 SHB p798, s 16.6.2 FoPo p380, s 11.5, pdf 350
Strictly, a particle filter is a sampling method that approximates distributions by exploiting their temporal
structure; in computer vision they were popularized mainly by Isard and Blake in their CONditional DENSity
propagATION algorithm, short CONDENSATION.
Particle filtering techniques represent a probability distribution using a collection of weighted point samples, see the upper graph in Fig. 28. To update the locations of the samples according to the linear dynamics (deterministic drift), the centers of the samples are moved according to the deterministic part of the motion model, and multiple samples are generated for each point (lower graph in the Figure). These are then perturbed to account for the stochastic diffusion,
i.e., their locations are moved by random vectors taken from the distribution of w. Finally, the weights
of these samples are multiplied by the measurement probability density, i.e., we take each sample and
measure its likelihood given the current (new) measurements.
Figure 28: Top: each density distribution is represented using a superposition of weighted particles. Bottom:
the drift-diffusion-measurement cycle implemented using random sampling, perturbation, and re-weighting stages.
[Source: Szeliski 2011; Fig 5.7]
15 Motion Estimation I: Optic Flow, Alignment FoPo p397, ch 12, pdf 446
In the task of tracking, the focus was on a mere pursuit of the object, where one often is merely interested
in which direction the object moved in the 2D (image) plane. In motion estimation, we aim at describing the
movement in more detail: the object could rotate during the motion, something we would like to measure
now; the object could approach the observer, in which case it hardly moves in the image plane, but it grows
larger over time, etc. For these situations we require more detailed measurements of the motion. The idea
pursued in this section is to measure the vectors between the moved objects. The vectors are measured
between corresponding object points and the resulting vector flow-field gives us a lot of information. There
are three principal scenarios where this idea is applied:
Calculating Optical Flow: The vector-flow field is determined frame-by-frame (for movies), a calculation
also called optical flow: this is computationally very intensive but also informative (Section 15.1).
Determining Object Motion: we are given two images of the same object, in the second photograph the
object has been moved: now we would like to know how exactly it moved: was it rotated?; is it larger
and therefore nearer to the observer? We estimate the object motion; we estimate the object’s new
pose.
Determining Camera (Observer) Motion: we are given two images of the same scene, but taken from
slightly different viewpoints. In other words, the camera observer has moved between the two pho-
tographs and now we would like to know how exactly. We estimate the observer (camera) motion; we
estimate the observer’s new viewpoint.
To properly determine the vector flow field between two images, we need to know which points or features correspond to each other, an issue also known as the correspondence problem. Once we have established
that correspondence, then we can determine the exact motion parameters by using geometric transforma-
tions. The entire process is called alignment or registration and is essentially the same for estimation of
object or observer motion (Section 15.2). Finally, we have a few words on the topic of image registration
(Section 15.3).
15.1 Optical Flow Dav p506, s19.2 Sze p360, s8.4 FoPo p343, s10.6, pdf 313 SHB p757, s16.2
Optical flow (OF) is the vector flow field between two successive frames and is - loosely speaking - calcu-
lated by pixel-by-pixel tracking. This results in a vector field akin to the gradient field (Section 3.3), see Fig.
29. We discuss two cases of that figure in more detail and mention another case (under example 3):
Example 1: the vector field for an object moving across the image consists of the set of translation vectors
at each pixel of the object (Fig. 29b). Simple parametric models of optical flow are often good
representations for some more complex object motions, with transformations akin to those introduced
in section 15, but we leave that for the advanced course.
Example 2: an object approaching the observer creates a radial flow pattern (Fig. 29d). This is information
which can be exploited to calculate the time to contact.
Further simple observations tell us how quickly we will hit something. Assume the camera points at
the focus of expansion, and let the world move toward the camera. A sphere of radius R whose center
lies along the direction of motion and is at depth Z will produce a circular image region of radius
r = f R/Z. If it moves down the Z axis with speed V = dZ/dt, the rate of growth of this region in the
image will be dr/dt = −f RV /Z 2 . This means that
time to contact = -Z/V = r / (dr/dt) .    (28)
The minus sign signals that the sphere is moving down the Z axis, so Z is getting smaller and V is
negative.
Example 3: if the observer (viewer) moves, also called egomotion, then this creates a whole-field optical
flow, which can be exploited for image segmentation.
Figure 29: Interpretation of velocity flow fields (arrow
length corresponds to speed).
Optical flow can be calculated with a variety of methods. Perhaps the most intuitive one is matching small blocks, as was done for tracking objects: we optimize as outlined in equation 22, whereby u would represent the flow field. In fact, some sophisticated tracking methods rely on the computation of optical flow. Optical flow can also be estimated with phase correlation, for instance.
In Python we can access some of the methods through OpenCV, whose function names start with calcOpticalFlow,
for instance:
import cv2 as cv
Pts1, St, err = cv.calcOpticalFlowPyrLK(Iold, Inew, Pts0, None, **PrmOptFlo)
where Iold is the previous frame, Inew is the next frame, Pts0 is an array of feature coordinates from the previous frame and PrmOptFlo is a dictionary of parameters for that algorithm.
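For illustration, a typical parameter set for this call might look as follows (the concrete values, and the use of goodFeaturesToTrack to obtain Pts0 from the gray-scale frame Iold, are assumptions rather than prescriptions):

import cv2 as cv

PrmOptFlo = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv.TERM_CRITERIA_EPS | cv.TERM_CRITERIA_COUNT, 10, 0.03))
Pts0 = cv.goodFeaturesToTrack(Iold, 100, 0.3, 7)       # up to 100 corner features
Pts1, St, err = cv.calcOpticalFlowPyrLK(Iold, Inew, Pts0, None, **PrmOptFlo)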
15.2 Alignment
Alignment is the challenge of reconstructing how exactly a shape, object or scene has changed its pose from
one image to another. In mathematics one speaks of how the object transformed: we seek the transform
that describes the motion as informatively and precisely as possible. Figure 30 shows the common 2D
transformations.
In the simplest case, the object has moved straight from one location to another and otherwise it did
not change: that would be a mere translation, see Fig. 30. But the object may have also rotated during
the motion, in which case the motion corresponds to a so-called Euclidean transformation. The object may
have also shrunk or enlarged during the motion, in which case the transformation is expressed with a similarity transform. There are more complex transformations such as affine and projective, which capture distortion and change in perspective, respectively. However, from those we cannot easily infer translation, rotation and scaling anymore, although those three are very informative and intuitive.
Figure 30: Basic set of 2D planar transformations. The square shape at the origin is also called the moving image/points or source; the transformed shapes are the target, fixed or sensed image/points.
translation: a shift in position.
Euclidean: translation and rotation, aka rigid body motion.
similarity: translation, rotation and scaling.
affine: including distortion.
projective: including change of viewpoint.
[Source: Szeliski 2011; Fig 3.45]
In the most straightforward form, the two datasets have the same dimensionality, for instance we are
registering 2D data to 2D data or 3D data to 3D data, and the transformation is rotation, translation, and
perhaps scale. Here are some example applications:
Shape Analysis: A simple 2D application is to locate shapes, for instance finding one shape in another im-
age as employed in biology, archeology or robotics. If the transformations involve translation, rotation,
scaling and reflection, then it is also called Procrustes analysis wiki Procrustes analysis.
Medical Support: We have an MRI image (which is a 3D dataset) of a patient’s interior that we wish to
superimpose on a view of the real patient to help guide a surgeon. In this case, we need to know the
rotation, translation, and scale that will put one image on top of the other.
Cartography: We have a 2D image template of a building that we want to find in an overhead aerial image.
Again, we need to know the rotation, translation, and scale that will put one on top of the other; we
might also use a match quality score to tell whether we have found the right building.
In the following section we explain how some of the 2D motions of Figure 30 are expressed mathematically
(Section 15.2.1). Then we learn how to estimate them (Section 15.2.2), assuming that the correspondence
problem was solved. Finally, we learn how to robustly estimate them, that is even in the presence of noise
and clutter when the correspondence problem is a challenge (Section 15.2.3).
The transformations of Figure 30 can be written in two notations, summarized in the table below: notation 1 is more explicit with respect to the individual motions; notation 2 concatenates
the individual motions into a single matrix such that the entire motion can be expressed as a single matrix
multiplication, which can be convenient sometimes.
Translation: simplest type of transform - merely a vector t is added to the points in x. To express this as a
single matrix the identity matrix is used (unit matrix; square matrix with ones on the main diagonal and
zeros elsewhere) and the translation vector is appended resulting in a 2 × 3 matrix, see also Figure
31.
2D Euclidean Transform: consists of a rotation and a translation. The rotation is achieved with a so-called orthonormal rotation matrix R whose values are calculated by specifying the rotation angle θ.
In notation 2, the single matrix is concatenated as before and it remains a 2 × 3 matrix.
Scaled Rotation: merely a scale factor is included to the previous transform. The size of the single matrix
in notation 2 does not change.
Affine: Here a certain degree of distortion is allowed, but we do not elaborate on this transformation any
further here. The single matrix for notation 2 still remains of size 2 × 3.
In Appendix J.13 we show how the transforms are carried out on a set of points (the first section in that script).
The transformations can also be applied to entire images. That is what is frequently done for data augmen-
tation when training neural networks (Section 5).
In Matlab there exist individual functions to carry out single transformations, such as imtranslate, imrotate and imresize; imcrop is useful too for data augmentation. More complex transformations can be generated with the function imtransform, for which we prepare the transformation matrix with maketform, for instance. The function imwarp is a more general form for generating transformations.
In Python all those functions are found in skimage.transform.
In PyTorch all those functions are packed into the module torchvision.transforms.
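A small sketch with skimage.transform (I is assumed to be a gray-scale image loaded earlier):

import numpy as np
from skimage import transform

I_rot = transform.rotate(I, angle=30)                        # rotation
I_res = transform.rescale(I, 0.5)                            # scaling
T = transform.SimilarityTransform(scale=1.2, rotation=np.deg2rad(10),
                                  translation=(5, -3))       # similarity: scale, rotation, translation
I_sim = transform.warp(I, T.inverse)                         # apply the transform to the whole image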
Given a set of corresponding points in the two images, how can we reconstruct the best estimate of the motion parameter values? The usual way to do this is to use least squares, i.e., to minimize the sum of squared residuals

E_{LS} = \sum_i || r_i ||^2 = \sum_i || f(x_i; p) - x'_i ||^2 ,    (30)

where

r_i = x'_i - f(x_i; p) = \hat{x}'_i - \tilde{x}'_i    (31)

is the residual between the measured location \hat{x}'_i and its corresponding current predicted location \tilde{x}'_i = f(x_i; p).
For simplicity we assume now a linear relationship between the amount of motion \Delta x = x' - x and the unknown parameters p:

\Delta x = x' - x = J(x) p .    (32)
where J = ∂f /∂p is the Jacobian of the transformation f with respect to the motion parameters p. J is
shown in figure 31 and has a particular form for each transformation. In this case, a simple linear regression
(linear least-squares problem) can be formulated as

E_{LLS} = \sum_i || J(x_i) p - \Delta x_i ||^2 ,    (33)

which can be expanded into the quadratic form

p^T A p - 2 p^T b + c .    (34)

The minimum can be found by solving the symmetric positive definite (SPD) system of normal equations

A p = b ,    (35)

where

A = \sum_i J^T(x_i) J(x_i)    (36)

and b = \sum_i J^T(x_i) \Delta x_i.
Practically, we write a loop that takes each point, calculates the products J^T(x_i) J(x_i) and J^T(x_i) \Delta x_i, and accumulates them to build A and b. Then we call a script that does the least-squares estimation. This is shown in the second part of the code in J.13. The function lsqlin carries out the least-squares estimation. We also show how to use Matlab's function fitgeotrans, which does both in a single script - building A and b, as well as the estimation of the motion parameters. In that code we also show how to use the function procrustes, which also does everything at once.
When using functions in software packages the preferred terminology is as follows: the first image is
referred to as the moving image (or points), or the source; the second image is referred to as the target,
fixed or sensed image (points).
In Python, the Procrustes analysis can be found in the module scipy.spatial. I am not aware of a single script in Python, but OpenCV provides functions that do the entire alignment process.
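A sketch of the estimation step in Python using skimage.transform.estimate_transform; the point sets are simulated here purely for illustration:

import numpy as np
from skimage import transform

src = np.array([[0, 0], [10, 0], [10, 5], [0, 5]], dtype=float)   # moving points
T_true = transform.SimilarityTransform(scale=1.1, rotation=0.2, translation=(3, -2))
dst = T_true(src)                                                 # simulated sensed points

T_est = transform.estimate_transform('similarity', src, dst)      # least-squares estimate
print(T_est.scale, T_est.rotation, T_est.translation)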
As pointed out already, in complex situations it is not straightforward to establish the correspondence be- Pnc p342, s 15.6
tween pairs of points. Often, when we compare two images - or two shapes in different images - we find SHB p461,s.10.2
only part of the shape points and sometimes wrong points or features were determined. One can summa-
rize such ’misses’ as outliers and noise, and they make motion estimation in real world applications difficult.
It therefore requires more robust estimation techniques, for instance an iterative process consisting of the
following two steps: (1) estimate the motion parameters from a subset of the correspondences; (2) determine which correspondences are consistent with that estimate (the inliers) and re-estimate from those.
Figure 31: Jacobians of the 2D coordinate transformations x0 = f (x; p) (see table before), where we have re-
parameterized the motions so that they are identity for p = 0. [Source: Szeliski 2011; Tab 2.1]
Algorithm Fischler and Bolles (1981) formalized this approach into an algorithm called RANSAC, for
RANdom SAmple Consensus (algorithm 7). It starts by finding an initial set of inlier correspondences,
i.e., points that are consistent with a dominant motion estimate: it selects (at random) a subset of n cor-
respondences, which is then used to compute an initial estimate for p. The residuals of the full set of
correspondences are then computed as r_i = \hat{x}'_i - \tilde{x}'_i (cf. equation 31), where \tilde{x}'_i are the estimated (mapped) locations and \hat{x}'_i are the sensed (detected) feature point locations. The RANSAC technique then counts the number of inliers that are within ε of their predicted location, i.e., whose ||r_i|| ≤ ε. The value of ε is application dependent but is often around 1-3 pixels. The random selection
process is repeated k times and the sample set with the largest number of inliers is kept as the final solution
of this fitting stage. Either the initial parameter guess p or the full set of computed inliers is then passed on
to the next data fitting stage.
Matlab’s computer vision toolbox provides a function for this. In Appendix J.14 we provide a primitive version to understand it step by step. Python offers the function ransac in the submodule skimage.measure.
Algorithm 7 RANSAC: Fitting structures using Random Sample Consensus. FoPo p332, pdf 305
Input: the data set D
Parameters:
n - the smallest number of points required (e.g., for lines n = 2, for circles n = 3)
k - the number of iterations required
t - the threshold used to identify a point that fits well
d - the number of nearby points required to assert that a model fits well
Until k iterations have occurred:
- Draw a sample of n points from the data D uniformly and at random → D_s; set D_c = D \ D_s
- Fit the structure to D_s and obtain the parameter estimate p
- For each data point x_i ∈ D_c: if the point is close to the fitted structure, i.e. its residual satisfies ‖r_i‖² < t, then add x_i to D_good
- If there are d or more points close to the structure (|D_good| ≥ d), then p is kept as a good fit: refit the structure using all these points and add the result to the collection of good fits → P_good
end
Choose the best fit from P_good, using the fitting error as a criterion
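A minimal usage sketch of ransac from skimage.measure for the point-correspondence case discussed above; the toy data and the choice of a similarity model are assumptions for this example:

import numpy as np
from skimage.measure import ransac
from skimage.transform import SimilarityTransform

# toy correspondences: a rotation + translation, with 10 corrupted (outlier) pairs
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, (50, 2))
dst = src @ np.array([[0.8, -0.6], [0.6, 0.8]]).T + np.array([10.0, 5.0])
dst[:10] += rng.uniform(-30, 30, (10, 2))

model, inliers = ransac((src, dst), SimilarityTransform,
                        min_samples=3,           # n in Algorithm 7
                        residual_threshold=2.0,  # t (here in pixels)
                        max_trials=100)          # k
print(inliers.sum(), 'inliers; scale', model.scale, 'translation', model.translation)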
Registration methods can also be distinguished by the points selected for registration. In feature-based methods, one detects characteristic features as introduced in Section 6 and then estimates the motion, using RANSAC for instance; that is a registration based on a subset of points. In intensity-based methods, one uses all pixel values at hand - the entire image - which can be more accurate but also requires more computation, analogous to the computation of optical flow.
In Matlab, image registration can be carried out with imregister. In Python, one typically uses functions from OpenCV; a feature-based sketch follows.
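A sketch of feature-based registration with OpenCV (ORB key points, brute-force matching, robust homography estimation); the file names are placeholders:

import cv2
import numpy as np

moving = cv2.imread('moving.jpg', cv2.IMREAD_GRAYSCALE)   # source image (placeholder name)
fixed  = cv2.imread('fixed.jpg',  cv2.IMREAD_GRAYSCALE)   # target image (placeholder name)

orb = cv2.ORB_create(1000)
k1, d1 = orb.detectAndCompute(moving, None)
k2, d2 = orb.detectAndCompute(fixed, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)

src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)    # robust motion estimate
registered = cv2.warpPerspective(moving, H, fixed.shape[::-1])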
16 3D Vision SHB p546, ch11, ch12
3D Vision is the task to perceive depth, the third dimension of space. Depth perception is useful, for
example, for industrial robots reaching for tools and for autonomous vehicles trying to avoid obstacles.
Depth information is also exploited in gesture recognition sometimes, with the most famous example being
the Kinect system.
Depth information can be obtained with different cues. One can try to obtain it from the two-dimensional
image only by elaborate reconstruction. Or one can simply use a camera that senses depth directly, a
technique called range imaging, see again Appendix A. Each cue and technique has advantages and
disadvantages. We here mention only the popular ones, of which the first one, stereo correspondence,
uses two-dimensional information only:
Stereo Correspondence: uses two or more images taken from slightly different positions and determines the offset (disparity) between the images - a process also carried out by the human visual system to estimate depth. To determine disparity one needs to find the corresponding pixels across the images, a process that is relatively challenging. The advantage of the technique is that it can be carried out with cheap, light-sensitive (RGB) cameras. We elaborate on that method in Section 16.1.
Lidar (Light Detection And Ranging): is a surveying method that measures the distance to a target by illuminating the target with pulsed laser light and measuring the reflected pulses with a sensor. It typically works for a specific depth range. This sensor is particularly used for autonomous vehicles; in order to cover the entire depth of a street scene, one employs several Lidar sensors, each tuned to a specific depth range.
Time-of-Flight (ToF): a relatively new technique that uses only a single pulse to scan the environment - as opposed to Lidar (or Radar). Early implementations required a relatively long time to compute the distance - as opposed to Lidar and radar - but recent implementations can provide up to 160 frames per second.
The output of such a technique is typically a two-dimensional map in which the intensity values correspond to depth measured in some unit (e.g. centimeters). That map is also called a depth map, or a range map if obtained with range imaging. The analysis of such a depth map does not pose any novel computational challenges: to obtain a segregation between foreground and background one can use the segmentation techniques introduced previously. For that reason there is not much more to explain here; in the following we merely elaborate on the process of stereo correspondence, as that is an algorithmic issue.
Depth information can also be obtained from single images in principle - sometimes called shape from X - but that has not yet been as practical as the above techniques. We mention those methods nevertheless (Section 16.2).
16.1 Stereo Correspondence FoPo p227, ch7
Stereopsis is the perception of depth using two (or more) images taken from slightly different viewpoints. The corresponding points between the two images have a slight, horizontal offset (motion), the so-called disparity, which is inversely proportional to depth. Thus the main challenge is to determine the (stereo) correspondence, which is usually solved by assuming some constraints, see SHB for a nice overview (SHB p584, s11.6.1).
Stereo correspondence algorithms can be divided into two groups:
- Low-level, correlation-based, bottom-up, dense correspondence
- High-level, feature-based, top-down, sparse correspondence. Features are for instance the ones introduced in Section 6.
Epipolar Constraint A popular constraint used to compute stereo correspondence, see Fig. 32. Pixel x_0 in the left image corresponds to the epipolar line segment x_1 e_1 in the right image, and vice versa for x_1 and x_0 e_0. The corresponding rays and segments lie in a common epipolar plane (right graph).
Figure 32: Epipolar geometry. c camera center; e epipole. Left: epipolar line segment corresponding to one ray;
Right: corresponding set of epipolar lines and their epipolar plane. [Source: Szeliski 2011; Fig 11.3]
The fundamental matrix F has seven free parameters: the two epipoles contribute two coordinates each (giving 4 degrees of freedom), while another three come from the mapping of any three epipolar lines in the first image to the second. Alternatively, we note that the nine components of F are given only up to an overall scale and that we have the additional constraint det F = 0, yielding again 9 − 1 − 1 = 7 free parameters.
The correspondence of seven points in the left and right images allows the computation of the fundamental matrix F using a non-linear algorithm [Faugeras et al., 1992], known as the seven-point algorithm. If eight points are available, the solution becomes linear and is known as the eight-point algorithm.
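In OpenCV the fundamental matrix can be estimated directly from point correspondences; the following sketch generates toy correspondences from a synthetic stereo pair (the camera parameters are arbitrary assumptions):

import cv2
import numpy as np

# synthetic data: random 3D points seen by two cameras separated by a horizontal baseline
rng = np.random.default_rng(0)
P3 = rng.uniform([-1, -1, 4], [1, 1, 8], (50, 3))            # 3D points in front of the cameras
K = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1.0]])    # shared intrinsics

def project(P, t):
    p = (K @ (P + t).T).T
    return (p[:, :2] / p[:, 2:3]).astype(np.float32)

pts1 = project(P3, np.array([0.0, 0, 0]))                    # left image points
pts2 = project(P3, np.array([-0.2, 0, 0]))                   # right camera shifted by the baseline

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)  # eight-point algorithm
# cv2.FM_7POINT uses exactly seven correspondences; cv2.FM_RANSAC adds robustness to outliers
print(F)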
16.2 Shape from X
Shape from X is a generic name for techniques that aim to extract shape from intensity images. Many of these methods estimate local surface orientation (e.g., the surface normal) rather than absolute depth. If, in addition to this local knowledge, the depth of some particular point is known, the absolute depth of all other points can be computed by integrating the surface normals along a curve on the surface [Horn, 1986]. There exist:
- from Stereo: as mentioned above.
- from Shading: see below.
- from Photometric Stereo: a way of making ’shape from shading’ more reliable by using multiple light
sources that can be selectively turned on and off.
- from Motion: e.g. optic flow.
- from Texture: see below.
- from Focus: is based on the fact that lenses have finite depth of field, and only objects at the correct
distance are in focus; others are blurred in proportion to their distance. Two main approaches can be
distinguished: shape from focus and shape from de-focus.
- from Contour: aims to describe a 3D shape from contours seen from one or more view directions.
16.2.1 Shape from Shading FoPo p89, s2.4
Albedo (Latin for "whiteness"), or reflection coefficient, is the diffuse reflectivity or reflecting power of a surface: the ratio of the radiation reflected from the surface to the radiation incident upon it. It is dimensionless and can be expressed as a percentage; it is measured on a scale from 0 (no reflection, a perfectly black surface) to 1 (perfect reflection, a white surface).
Algorithm Most shape from shading algorithms assume that the surface under consideration has a uniform albedo and reflectance, and that the light source directions are either known or can be calibrated by the use of a reference object. Under the assumption of distant light sources and a distant observer, the variation in intensity (irradiance equation) becomes purely a function of the local surface orientation; this is used, for instance, for scanning plaster casts. In practice, surfaces are rarely of a single uniform albedo, so shape from shading needs to be combined with some other technique or extended in some way to make it useful. One way to do this is to combine it with stereo matching (Fua and Leclerc 1995) or known texture (surface patterns) (White and Forsyth 2006). The stereo and texture components provide information in textured regions, while shape from shading helps fill in the information across uniformly colored regions and also provides finer information about surface shape.
16.2.2 Shape from Texture SHB p613, s12.1.2
The variation in foreshortening observed in regular textures can also provide useful information about local
surface orientation. Figure 35 shows an example of such a pattern, along with the estimated local surface
orientations. Shape from texture algorithms require a number of processing steps, including the extraction
of repeated patterns or the measurement of local frequencies in order to compute local affine deformations,
and a subsequent stage to infer local surface orientation. Details on these various stages can be found
in the research literature (Witkin 1981; Ikeuchi 1981; Blostein and Ahuja 1987; Garding 1992; Malik and
Rosenholtz 1997; Lobay and Forsyth 2006).
Figure 34: Reflectance maps for Lambertian surfaces: Left: contours of constant intensity plotted in gradient (p, q)
space for the case where the source direction s (marked by a black dot) is along the viewing direction v (0, 0) (the
contours are taken in steps of 0.125 between the values shown); Right: the contours that arise where the source
direction (ps , qs ) is at a point (marked by a black dot) in the positive quadrant of (p, q) space: note that there is a well-
defined region, bounded by the straight line 1 + pps + qqs = 0, for which the intensity is zero (the contours are again
taken in steps of 0.125). [Source: Davies 2012; Fig 15.9]
17 Pose Estimation (of Humans and Objects)
Pose estimation is a general term for recognizing the position and/or posture of an object. There exists
two specific tasks in particular. In pose estimation of humans, one is interested in the posture of a person,
the detailed alignment of the body parts (Section 17.1). In pose estimation of objects, one estimates from
which viewpoint in space the object is seen (Section 17.2). Such tasks used to be solved with feature point
detection (Section 6): one would learn the configuration of key-points in a set of training images, and then
find and match those points in the testing images. Such systems can be fairly complex. Nowadays, deep
CNNs provide an easier way to train such poses: those networks are not less complex than traditional
approaches, but they are easier to use and achieve better results - sometimes much better results.
Figure 36: Posture estimation, here focusing in particular on joints. Estimate generated by OpenPose, a
deep CNN that was trained on thousands of poses. See J.15.
Figure 36 shows the output of that network. 15 key-points can be detected; with some network variants
even more. With the same network architecture one can also train to estimate the pose of faces (Fig. 37),
hands (Fig. 38), and in the near future feet:
https://github.com/CMU-Perceptual-Computing-Lab/openpose
The code in Appendix J.15 shows how to feed a single image to the network and how to read out the key-points from the network output; a condensed sketch using OpenCV's DNN module is given below. The ordering of the key-points can be looked up at:
https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/output.md
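As a rough illustration of what J.15 does, the following sketch runs an OpenPose body model through OpenCV's DNN module; the prototxt/caffemodel file names, the input size of 368 x 368 and the confidence threshold are assumptions and must match the model that is actually downloaded:

import cv2

net = cv2.dnn.readNetFromCaffe('pose_deploy.prototxt',          # assumed file names of the
                               'pose_iter_440000.caffemodel')   # downloaded OpenPose model
img = cv2.imread('person.jpg')                                   # placeholder input image
h, w = img.shape[:2]

blob = cv2.dnn.blobFromImage(img, 1.0/255, (368, 368), (0, 0, 0), swapRB=False, crop=False)
net.setInput(blob)
out = net.forward()                       # shape 1 x nMaps x H x W: one heat map per key-point

points = []
for i in range(15):                       # 15 body key-points for this model variant
    _, conf, _, peak = cv2.minMaxLoc(out[0, i, :, :])
    x = int(w * peak[0] / out.shape[3])   # rescale the heat-map peak to image coordinates
    y = int(h * peak[1] / out.shape[2])
    points.append((x, y) if conf > 0.1 else None)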
Figure 37: Face pose estimation, also estimated by the OpenPose network, trained on images of faces.
See J.15.
Figure 38: Hand pose estimation. Also estimated by the OpenPose system, trained on images of hands.
See J.15.
the photo is taken from an unusual viewpoint: now we compare in our head the previously seen viewpoint
with the one on the photo and infer the viewpoint. We can invert the problem formulation and say we
estimate the pose of the object in space for which we require a different reference point than the one of the
current observer. This task arises particularly in robot navigation, where one attempts to learn a map of the robot's environment for easier navigation, a problem called simultaneous localization and mapping (SLAM).
In Computer Vision, the traditional approach to perform this localization is to detect and match features between the two photographs and then reconstruct the motion in order to make the estimate as precise as possible, as introduced in Sections 6 and 15, respectively. This is computationally intensive, both in calculation and in memory requirements. A recent DNN has however achieved stunning results by solving that task more quickly and with less memory. We sketch that network here.
18 Classifying Motions
Classifying motions is the task of understanding a sequence of motions. First we track the motion, which can include several parts of one object, such as a moving person; then we classify the trajectories of the tracked objects and parts. Systems solving such tasks are typically complex - though there exist exceptions such as the Kinect user interface (Section 18.2). Two frequent tasks are gesture recognition and body motion classification.
One severe challenge with recognizing human motions is the segregation of the motion from its background. For the tracking task as introduced in Section 14, this segregation was a lesser problem, as the objects in the scene are relatively small and embedded in a homogeneous background - at least phase-wise and locally. But when following human motions, one typically zooms into the silhouette for reasons of resolution, and in that situation chances are higher that the gesture's background is heterogeneous, which in turn makes segregation more difficult: think of the background that a laptop camera faces when the user sits in their room at home. For that reason one often seeks additional information by using a depth camera (Appendix A), but even with that depth information segregation sometimes remains challenging. It is therefore not surprising that recognizing human motions works best on a homogeneous background with fairly constant illumination. Skeptical voices even say that there still does not exist a convincing implementation of motion classification.
18.2 Body Motion Classification with Kinect FoPo p476, s14.5, pdf446
Kinect is a video game technology developed by Microsoft that allows its users to control games using
natural body motions. Kinect’s sensor delivers two images: a 480x640 pixel depth map and a 1200x1600
color image. Depth is measured by projecting an infrared pattern, which is observed by a black-and-white
camera. The two main features of this sensor are its speed - it is much faster than conventional range
finders using mechanical scanning - and its low cost (only ca. 200 Euros).
Range images have two advantages in this context:
1) They relatively easily permit the separation of objects from their background, and all the data processed
by the pose estimation procedure presented below, is presegmented by a separate and effective
background subtraction module.
2) They are much easier to simulate realistically than ordinary photographs (no color, texture, or illumi-
nation variations). In turn, this means that it is easy to generate synthetic data for training accurate
classifiers without overfitting.
18.2.1 Features
Kinect features simply measure depth differences in the neighborhood of each pixel. Concretely, let us
denote by z(p) the depth at pixel p in the range image. Given image displacements λ and µ, a scalar is
computed as follows:
f_{λ,μ}(p) = z( p + λ/z(p) ) − z( p + μ/z(p) )    (40)
In turn, given some allowed range of displacements, one can associate with each pixel p the feature vector
x(p) whose components are the D values of fλ,µ (p) for all distinct unordered pairs (λ, µ) in that range.
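A small numpy sketch of this depth-difference feature; clamping out-of-image offsets to the border is a simplification of this example (the original system uses a large constant depth there):

import numpy as np

def kinect_feature(Z, p, lam, mu):
    # f_{lam,mu}(p) = z(p + lam/z(p)) - z(p + mu/z(p)) for a depth map Z
    def depth_at(q):
        r = int(np.clip(q[0], 0, Z.shape[0] - 1))    # simplification: clamp to the image
        c = int(np.clip(q[1], 0, Z.shape[1] - 1))
        return Z[r, c]
    z_p = Z[p]                                        # depth at the reference pixel
    q1 = (p[0] + lam[0] / z_p, p[1] + lam[1] / z_p)   # depth-normalized offsets
    q2 = (p[0] + mu[0]  / z_p, p[1] + mu[1]  / z_p)
    return depth_at(q1) - depth_at(q2)

Z = 2.0 + np.random.rand(480, 640)                    # toy depth map (metres)
print(kinect_feature(Z, (240, 320), (0.0, 50.0), (0.0, -50.0)))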
As explained below, these features are used to train an ensemble of decision tree classifiers, in the form
of a random forest. After training, the feature x associated with each pixel of a new depth map is passed to
the random forest for classification.
Random Forest Research in classification has shown that some data sets are better classified with mul-
tiple classifiers, also called ensemble classifiers. In this case, the pixel/bodypart classification takes place
with multiple tree classifiers, a random forest, see figure 39.
The feature x(p) (from above) is passed to every tree in the forest, where it is assigned some (tree-
dependent) probability of belonging to each body part. The overall class probability of the pixel is finally
computed as an average probability of the different trees.
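A rough scikit-learn sketch of such a per-pixel random forest; the toy feature matrix, the number of body-part classes and the use of scikit-learn itself are assumptions (the original system uses its own implementation and far more data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy stand-in: 5000 'pixels', each with 100 depth-difference features, and 31 body-part labels
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))
y = rng.integers(0, 31, size=5000)

forest = RandomForestClassifier(n_estimators=3, max_depth=20)   # 3 trees of depth 20, as quoted below
forest.fit(X, y)
proba = forest.predict_proba(X[:5])     # per-pixel class probabilities, averaged over the trees
print(proba.shape)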
Creation of Training Set The primary source is a set of several hundred motion capture sequences fea-
turing actors engaged in typical video game activities such as driving, dancing, kicking, etc. After clustering
close-by pictures and retaining one sample per cluster, a set of about 100K poses is obtained. The mea-
sured articulation parameters are transferred (retargeted) to 15 parametric mesh models of human beings
with a variety of body shapes and sizes. Body parts defined manually in texture maps are also transferred
to these models, which are then skinned by adding different types of clothing and hairstyle, and rendered
from different viewpoints as both depth and label maps using classical computer graphics techniques. 900k
labeled images in total were created this way.
Training Classifier The classifier is trained as a random forest, using the features described above, but
replacing the bootstrap sample used for each tree by a random subset of the training data (2,000 random
pixels from each one of hundreds of thousands of training images).
The experiments described in Shotton et al. (2011) typically use 2,000 pixels per image and per tree
to train random forests made of 3 trees of depth 20, with 2,000 splitting coordinates and 50 thresholds per
node. This takes about one day on a 1,000-core cluster for up to one million training images. The pixelwise
classification rate is ca. 60% (an error of 40%!), which may appear low, but chance level is much lower (what is it?).
to pixels labeled k, or using some voting scheme. To improve robustness, it is also possible to use mean
shifts to estimate the mode of the following 3D density distribution:
f_k(X) ∝ Σ_{i=1}^{N} P(k|x_i) A(p_i) exp( −‖X − X_i‖² / σ_k² ),    (41)

where X_i denotes the position of the 3D point associated with pixel p_i, and A(p_i) is the area in world units of a pixel at depth z(p_i), proportional to z(p_i)², so as to make the contribution of each pixel invariant to the
distance between the sensor and the user. Each mode of this distribution is assigned the weighted sum
of the probability scores of all pixels reaching it during the mean shift optimization process, and the joint is
considered to be detected when the confidence of the highest mode is above some threshold. Since modes
tend to lie on the front surface of the body, the final joint estimate is obtained by pushing back the maximal
mode by a learned depth amount.
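A tiny numpy sketch that evaluates this density at a candidate joint position X; the variable names are illustrative and the mean-shift iteration itself is omitted:

import numpy as np

def joint_density(X, Xi, P_k, A, sigma_k):
    # X: candidate 3D joint position (3,);  Xi: (N, 3) back-projected pixel positions X_i
    # P_k: (N,) probability of body part k per pixel;  A: (N,) pixel areas, proportional to z(p_i)**2
    d2 = np.sum((Xi - X)**2, axis=1)
    return np.sum(P_k * A * np.exp(-d2 / sigma_k**2))

rng = np.random.default_rng(0)
Xi = rng.normal(size=(1000, 3))
print(joint_density(np.zeros(3), Xi, rng.random(1000), rng.random(1000), sigma_k=0.1))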
The average per-joint precision over all joints is 0.914 for the real data, and 0.731 for the synthetic
one (tolerance of 0.1m). Thus, the voting procedures are relatively robust to 40% errors among individual
voters. The synthetic precision is lower due to a great variability in pose and body shape. In realistic game
scenarios, the precision of the recovered joint parameters is good enough to drive a tracking system that
smoothly and very robustly recovers the parameters of a 3D kinematic model (skeleton) over time, which
can in turn be used to effectively control a video game with natural body motions.
Closing Note
The final component of the system is a tracking algorithm whose details are proprietary but, like any other
approach to tracking, it has temporal information at its disposal for smoothing the recovered skeleton pa-
rameters and recovering from joint detection errors.
19 More Systems
19.1 Video Surveillance Dav p578, ch22
Surveillance is useful for monitoring road traffic, monitoring pedestrians, assisting riot control, monitoring crowds on football pitches, checking for overcrowding in underground stations, and generally is exploited to help with safety as well as crime prevention; and of course surveillance is used in military applications. Traditional video surveillance relies heavily on simple matching operations and on background subtraction, but recent systems that use the principles of feature detection and matching (as introduced in Section 6) often outperform the traditional systems easily. It may thus be somewhat futile to dwell too long on the old methods rather than applying the new ones, although a combination of old and new methods may always be worth trying. We already mentioned some aspects of surveillance (e.g. in Sections 14 and 9), but we round off the picture by adding some aspects and giving some examples.
The Geometry The ideal camera position is above the pedestrian, at some height Hc, and the camera's optical axis has a declination (angle δ) from the horizontal axis (see Fig. 40). This is simply the most suitable way to estimate the distance and height Ht of the pedestrian, exploiting triangulation and the knowledge of the position of the pedestrian's feet.
Figure 40: 3-D monitoring: camera tilted downwards. δ is the angle of declination of the camera optical axis.
[Source: Davies 2012; Fig 22.2]
The idea of background subtraction is to eliminate the 'static' background in order to obtain only the (moving) foreground objects. Although that appears to be a straightforward idea in principle, it is a challenging task because the background also changes continuously, e.g. the illumination changes throughout the day, vegetation flutters, shadows wander (of fixed objects and of clouds), etc. Furthermore, some of these background changes aggravate the challenge of detecting initial object motion. As mentioned before, by using interest point features, much more stable motion detection and tracking can be achieved (Section 14.2).
Vehicle License Plate Detection License plate recognition is a challenging task due to the diversity
of plate formats and the nonuniform outdoor illumination conditions during image acquisition. Therefore,
most approaches work only under restricted conditions such as fixed illumination, limited vehicle speed,
designated routes, and stationary backgrounds. Algorithms (in images or videos) are generally composed
of the following three processing phases:
1) Extraction of a license plate region. An effective method is described in Figure 41.
2) Segmentation of the plate characters. Thresholding as described previously (Section 9.1) will do.
3) Recognition of each character. A typical optical-character recognition system will do, see Pattern Recog-
nition.
Search Google Scholar for "Anagnostopoulos" for a review on this topic.
Vehicle Recognition (Research) Depending on the purpose, vehicles are categorized into types (e.g.,
car, truck, van,...) or more specifically, their make and model is identified (e.g. Volkswagen, Golf). A type
classification has been approached with an analysis of depth images. For make/model identification, many
systems have focused on analyzing the car’s front, which can be regarded as its face. A typical system
locates the license plates first and then measures other front features in relation to it, such as head lights,
radiator grill, etc.:
Figure 42: Make and model identification using the fronts features and
their relations to the license plate.
So far, (exploratory) studies have used up to 100 car models, tested on still images. Search Google Scholar for "Pearce and Pears" for a recent publication.
19.2 Autonomous Vehicles Dav p636, ch23
Two decades ago, many experts would have considered it impossible that one day fully autonomous vehicles would drive on the road. Meanwhile, most car companies are developing systems capable of nearly or fully autonomous driving. Here we mention some of the principal issues addressed in such a system.
Such a system contains four interacting processes: environment perception, localization, planning and control. About two thirds of the system are involved in perception; interestingly enough, this proportion is about the same for the human brain.
Vision processing (environment perception) relies to a large degree on range cameras, because a range image can be segmented more easily than an intensity image (see Kinect). A set of different range cameras is used, covering different depth ranges, with angular resolution down to 0.1 degree and depth up to 300 m. Both radar and lidar are used (Appendix A).
The perception processes consist of the detection of pedestrians, bicyclists, cars, traffic signs, traffic lights, telephone poles, curbs, etc. The algorithms all appear to be based on techniques as introduced so far, meaning there is no particular magic involved; rather, an elaborate coordination of the different detection processes is required. One example was given already with license plate recognition; we give more examples below. The recognition tasks are typically solved using multiple techniques complementing each other, as a single technique is often insufficient to solve the problem reliably. We here mention in particular techniques used in a 2D gray-level image.
Roadway/Roadmarker Location: is addressed for example with multilevel thresholding (Section 9.1); or
with vanishing point detection using RANSAC on edge points (see Figure 43).
Figure 43: Left: Lane marking identification with RANSAC by placing straight lines through detected edge points; the lane markings converge to approximately the right point on the horizon line. Right: Flow chart of a lane detector algorithm. [Source: Davies 2012; Fig 23.2], [Source: Davies 2012; Fig 23.4]
Locating of Vehicles: there are two very simple tricks to detect cars (in 2D gray-scale images):
a) shadow induced by vehicle: Importantly, the strongest shadows are those appearing beneath the ve-
hicle, not least because these are present even when the sky is overcast and no other shadows are
visible. Such shadows are again identified by the multilevel thresholding approach (Section 9.1).
b) symmetry: The approach used is the 1-D Hough transform, taking the form of a histogram in which the
bisector positions from pairs of edge points along horizontal lines through the image are accumulated.
When applied to face detection, the technique is so sensitive that it will locate not only the centerlines
of faces but also those of the eyes. Symmetry works also for plant leaves and possibly other symmetric
shapes.
Figure 44: Detecting cars by exploiting the symmetry of vertical segments. [Source: Davies 2012; Fig 23.6]
Locating Pedestrians: We have introduced a pedestrian detection algorithm already in Section 8.2, but
here we mention some other techniques that ensure a high recognition accuracy:
a) detection of body parts, arms, head, legs. The region between legs often forms an (upside-down) V
feature.
b) Harris detector (Section 6.1) for feet localization.
c) skin color (see also Section H.1).
A substantial effort in developing such perception software is spent on optimizing the run time of the implemented algorithms. Algorithms are not only developed by the automobile companies themselves; supplier companies providing car parts have also started to develop recognition software.
19.3 Remote Sensing
Remote sensing is the analysis of images taken by satellites or aircraft (wiki Satellite imagery, wiki Aerial photography). The images one obtains are huge and manipulating them therefore takes a lot of time - it is practically impossible to apply sophisticated image-processing methods to large areas; for instance, scanning the ocean for floating debris is infeasible at this point. From a methodological viewpoint, there is no new technique to explain here. We merely give some background information on this topic, in particular on satellite imagery.
In satellite imagery one can distinguish between four types of resolution: spatial, spectral, temporal and
radiometric. The more modern the satellite, the higher are those resolutions in general.
- Spatial Resolution: nowadays, the resolution can be approximately 30 centimeters per pixel.
- Spectral Resolution: satellite images can be simple RGB photographs, but typically a broader range of electromagnetic waves is measured. Early satellites recorded so-called multi-spectral images, where at each pixel several bands of the electromagnetic spectrum were recorded, sometimes up to 15 bands wiki Multispectral image. Table 1 gives an impression of how some of those bands can be exploited. Meanwhile there exist satellites that record several tens or even hundreds of bands, generating so-called hyperspectral images, which permit detailed selections; the amount of storage required for such images is however very large.
Table 1: Example of bands and their use. Given ranges are approximate - exact values depend on the satellite.
Band Label             Range (nm)     Comments
Blue                   450-520        atmosphere and deep water imaging; depths up to 150 feet (50 m) in clear water.
Green                  515-600        vegetation and deep water; up to 90 feet (30 m) in clear water.
Red                    600-690        man-made objects, in water up to 30 feet (9 m), soil, and vegetation.
Near-infrared (NIR)    750-900        primarily for imaging vegetation.
Mid-infrared (MIR)     1550-1750      imaging vegetation, soil moisture content, and some forest fires.
Far-infrared (FIR)     2080-2350      imaging soil, moisture, geological features, silicates, clays, and fires.
Thermal infrared       10400-12500    emitted instead of reflected radiation to image geological structures, thermal differences in water currents, and fires, and for night studies.
Radar & related tech   -              mapping terrain and detecting various objects.
- Temporal Resolution: if one intends to track changes over time, then the temporal resolution is of interest: it can be several days. It is sometimes also called the revisit frequency.
- Radiometric Resolution: concerns the 'range' of values and typically starts at 8 bits (256 values).
There exist software tools that preprocess the raw satellite images in order to transform them into a format that is more suitable for object detection and classification, see wiki Remote sensing application. The larger the area under investigation, the more time-consuming this transformation becomes.
A Image Acquisition
Digital image acquisition is the process of analog-to-digital conversion of the ’outer’ signal to a number,
carried out by one or several image sensors (cameras). The conversion can be very complex and often
involves the generation of an output that is ’visible’ to the human eye.
One principal distinction between acquisition methods is passive versus active. In passive methods, the scene is observed based on what it offers: the simplest case is the regular light-sensitive camera that measures the environment's luminance - the reflection of the illuminating sun. In active methods, a signal is sent out to probe the environment, analogous to a radar. Some sensors combine both methods.
The following are the principal sensors used for acquiring images.
Light-Sensitive Camera: measures from the visible part of the electromagnetic spectrum, typically red,
green and blue dominance; the RGB camera is an example. Light is the preferred energy source for
most imaging tasks because it is safe, cheap, easy to control and process with optical hardware, easy
to detect using relatively inexpensive sensors, and readily processed by signal processing hardware.
Multi/Hyper-Spectral Sensors: measure from a broader part of the electromagnetic spectrum (than the
light-sensitive cameras) with individual sensors tuned to specific bands. Originally it was developed for
remote sensing (satellite imagery, Section 19.3), but is now also employed in document and painting
analysis.
Range Sensor (Rangefinder): is a device that measures the distance from the observer to a target, in a process called ranging or rangefinding. Methods include laser, radar, sonar, lidar and ultrasonic rangefinding. Applications are, for example, surveying and navigation, and more specifically ballistics, virtual reality (to detect operator movements and locate objects) and forestry.
Tomography Device: generates an image of a body by sections or sectioning, through the use of any
kind of penetrating wave. The method is used in radiology, archeology, biology, atmospheric science,
geophysics, oceanography, plasma physics, materials science, astrophysics, quantum information,
and other areas of science.
The obtained 'raw' image may require some manipulation, such as re-sampling in order to ensure that the image coordinate system is correct, or noise reduction in order to ensure that sensor noise does not introduce false information.
Summarizing, the output of sensors is an array of pixels, whose values typically correspond to light
intensity in one or several spectral bands (gray-scale, colour, hyper-spectral, etc.), but can also be related
to various physical measures, such as depth, absorption or reflectance of sonic or electromagnetic waves,
or nuclear magnetic resonance.
B Convolution [Signal Processing]
Expressed in the terminology of applied mathematics, convolution is an operation that combines two functions into a third: one function is repeatedly multiplied with shifted versions of the other and the products are summed; it is closely related to cross-correlation. In signal processing terms, the first function is the signal - in our case often an image - and the second function is a so-called kernel, which manipulates a local neighborhood of that image at each location. This is easier to understand in one dimension first (Section B.1); then we introduce it for two dimensions (Section B.2).
%% ------ Plotting
figure(1);clf;
Xax = 1:nPix;
plot(Xax, S, ’k’); hold on;
plot(Xax, Sa, ’r*’);
plot(Xax, Sb, ’b+’);
plot(Xax, Sg, ’g.’);
So far, not much has happened. The new function looks like its kernel, but scaled in amplitude. It becomes
more interesting if we make the signal more complicated: turn on another pixel in signal S and observe. To
ensure that you understand the detailed convolution process, look at the following explicit example:
Sa2 = zeros(1,nPix);
for i = 2:nPix-1
Nb = S(i-1:i+1); % the neighborhood
Sa2(i) = sum(Nb .* Ka); % multiplication with kernel and summation
end
assert(nnz(Sa2-Sa)==0); % verify that it is same as ’conv’
The example is a simplified implementation of the convolution process: we do not treat the boundary values and it works only for a kernel of length three, with i running from 2 to nPix-1. But it contains the gist of the convolution operation.
When applying a convolution function you need to pay attention to what type of treatment you prefer for the boundary values. Matlab offers three options: full, valid and same; they return outputs of different sizes - we refer to the documentation for details. If you prefer to set your own boundary values, then compute only the central part of the convolution - in Matlab with the option valid - and then use padarray to set the boundary values. We did that in the example of the face profiles, see Appendix J.2.
Mathematical Formulations In engineering notation the discrete convolution can be written as

(S ∗ K)[k] = Σ_i S[i] K[k − i],    (42)

where ∗ is the convolution symbol, i is the signal's index running over its n pixels, and k is the index of the output (the shift). The continuous convolution is written as

(S ∗ K)(k) = ∫_{−∞}^{∞} S(i) K(k − i) di.    (43)
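A numpy counterpart of the discrete formula, checking np.convolve against an explicit evaluation of the sum (the signal and kernel are arbitrary examples):

import numpy as np

S = np.array([0.0, 0, 1, 0, 0, 2, 0, 0])    # signal
K = np.array([1.0, 2, 1]) / 4.0             # kernel

C = np.convolve(S, K, mode='full')          # 'full', 'same' and 'valid' as in Matlab

# explicit evaluation of (S*K)[k] = sum_i S[i] K[k-i]
C2 = np.zeros(len(S) + len(K) - 1)
for k in range(len(C2)):
    for i in range(len(S)):
        if 0 <= k - i < len(K):
            C2[k] += S[i] * K[k - i]
print(np.allclose(C, C2))                   # True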
%% ------ Plotting
figure(1);clf;[nr nc] = deal(2,2);
subplot(nr,nc,1); imagesc(I,[0 1]);
subplot(nr,nc,2); imagesc(Ia); colorbar;
subplot(nr,nc,3); imagesc(Ig); colorbar;
Speeding Up Because image convolution is a relatively time-consuming operation due to the repeated multiplication of two matrices, there exist methods to speed it up. Those speed-ups work only if the kernel has specific characteristics; in particular, it needs to be separable, i.e. expressible as the outer product of two one-dimensional kernels. The Gaussian function, for instance, is separable. In that case, an image convolution with a 2D Gaussian function can be separated into convolving the image twice with a 1D Gaussian function in two different orientations:
The product Kg1’*Kg1 creates a 2D Gaussian - we could have also generated it using fspecial(’gaussian’,[5
5],1) for instance.
For small images, the durations for each of those three different versions are not much different - the
duration differences become evident for larger images. Use tic and toc to measure time; or use clock.
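In Python, the separability of the Gaussian can be checked as follows; this is a sketch with scipy.signal and an arbitrary test image, not a tuned speed-up:

import numpy as np
from scipy.signal import convolve2d

def gauss1d(size, sigma):
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

k1 = gauss1d(5, 1.0)                     # 1D Gaussian
K2 = np.outer(k1, k1)                    # 2D Gaussian (the analogue of Kg1'*Kg1 in Matlab)
I = np.random.rand(256, 256)

I2d  = convolve2d(I, K2, mode='same')                        # one 2D convolution
Isep = convolve2d(convolve2d(I, k1[None, :], mode='same'),
                  k1[:, None], mode='same')                  # two 1D convolutions
print(np.allclose(I2d, Isep))            # True (up to floating-point rounding)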
C Filtering [Signal Processing]
An image can be filtered in different ways depending on the objective: one can measure, emphasize or even search for certain image characteristics. The term filter is defined differently depending on the specific topic (signal processing, computing, etc.). Here the term is understood as a function which takes a small neighborhood and computes a certain value from its pixel values; that computation is done for each neighborhood of the entire image. That local function is sometimes called a kernel.
For educational purposes, we differentiate here between five types of filtering. In the first one, the kernel calculates simple statistics of the neighborhood's pixel values (Section C.1). Then there are three basic techniques to emphasize a certain 'range' (band) of signal values: low-pass, band-pass and high-pass filtering (Sections C.2, C.3 and C.4, respectively). Finally, there can be complex filter kernels.
In two dimensions, the Gaussian looks as depicted in Figure 11, namely in the first four patches (filter no.
1-4).
One can also take other functions for low-pass filtering, such as a simple average, see the code block above or the code in the Appendix on Convolution, where Ka = ones(3,3)/9 was used. The reason why the Gaussian function is so popular is that it has certain mathematical advantages.
C.3 Band-Pass Filtering
One example of band-pass filtering is region detection with the Difference-of-Gaussian (DoG) as introduced in Section 4.1. Other examples are the Laplacian-of-Gaussian (LoG) filters used for texture detection (Figure 11).
Figure 46: The Gaussian function (green) and its first (blue) and second (red) derivative. The Gaussian function itself is often used as a low-pass filter. The first derivative of the Gaussian is often used as a high-pass filter, for example in edge detection. The second derivative is occasionally used as a band-pass filter, for example for region detection.
D Neural Networks
The basic element of a neural network (NN) is the so-called Perceptron. A Perceptron essentially corresponds to a model of a linear classifier as in traditional Machine Learning. Formulated in the terminology of neural networks, a Perceptron consists of a 'neural' unit that receives input from other units. The inputs are weighted, then summed and then thresholded. For a two-class problem, there is essentially one integrating unit; for a multi-class problem there are several integrating units, whose count corresponds to the number of classes to be discriminated.
If such Perceptrons are stacked, then we talk of a Multi-Layer Perceptron (MLP). We show how to tune such an MLP in Keras, an API for Google's TensorFlow, coming up next. There are also 'richer' ways to wire neural units; we introduce one such type of network in Section D.2.
In Section D.3 we survey the available, pretrained networks.
D.1 Multi-Layer Perceptron (MLP)
An MLP is a neural network consisting of four or more layers (Figure 47): an input layer, receiving the image in our case; two or more hidden layers that combine information; and an output layer that indicates the selected class for an input image. A layer consists of (neural) units. The input layer has a unit count that typically corresponds to the number of pixels of the input image. The hidden layers have a unit count that is often a multiple of the input layer count. The output layer has a unit count that corresponds to the number of classes to be distinguished.
An Example We demonstrate how such a network can be programmed in Keras, a Python library (module) designed to test network structures wiki Keras. It is recommended to use Python version 3.5, for which the TensorFlow and Keras modules are available.
We use a database of handwritten digits (0-9), the so-called MNIST database, containing 60000 training images and 10000 testing images. How those images can be loaded is shown in Appendix J.5; we import that function into our script with the line from LoadMNIST import LoadMNIST. Note that the function loads the images with two dimensions (28 x 28): the training images are therefore stored as a three-dimensional array 60000 x 28 x 28. A single digit image can be viewed with imshow(TREN[1,:]) (not shown in the code).
# https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py
# Trains a Multi-Layer Perceptron on the MNIST dataset.
# Achieves 98.20% test accuracy after 12 epochs
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from LoadMNIST import LoadMNIST
batchSz = 128 # batch size (# images per learning step)
nEpoch = 12 # number of epochs = learning duration
For the MLP network we do not even take into account that the image has two dimensions; we 'linearize' the image by aligning all pixels in a single vector: the input to the network is therefore a vector of length 784, obtained with TREN.reshape(60000,784). We now discuss the three sections 'Build Network', 'Learning' and 'Evaluation'.
Build Network The network architecture is determined with five N.add commands. The first one determines the weights between the input layer and the first hidden layer, meaning there are 512*784 = 401408 weight parameters between those two layers. The second N.add specifies a certain drop-out rate, which helps the learning process by occasionally ignoring certain weights. The third N.add adds another hidden layer with 512 units; thus we have 512*512 = 262144 weight values between the first and second hidden layer. The fourth N.add specifies again the drop-out rate. The fifth N.add specifies the output layer, namely 10 units for 10 classes. With N.summary() the unit count is displayed.
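For reference, the five N.add commands described above look as follows; this sketch follows the referenced Keras example (mnist_mlp.py), from which the drop-out rate of 0.2 is taken:

from keras.models import Sequential
from keras.layers import Dense, Dropout

N = Sequential()
N.add(Dense(512, activation='relu', input_shape=(784,)))   # input layer -> 1st hidden layer
N.add(Dropout(0.2))                                        # drop-out rate
N.add(Dense(512, activation='relu'))                       # 2nd hidden layer
N.add(Dropout(0.2))
N.add(Dense(10, activation='softmax'))                     # output layer: 10 classes
N.summary()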
There are many types of learning schemes and tricks that can be applied to train a neural network. Here, those options are specified with the command N.compile.
Learning Learning occurs with the N.fit command. We learn on the training set TREN and the corresponding labels in Lbl.TrenMx. The parameter batchSz determines how many images are used per training step. There exists a rough optimum: too few or too many images per batch result in less efficient learning. The parameter nEpoch essentially corresponds to the learning duration.
Evaluation Evaluation takes place with N.evaluate. Here we employ the test set TEST and the corresponding labels Lbl.TestMx to estimate the prediction accuracy of the network.
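The corresponding compile, fit and evaluate calls, again following the referenced Keras example (the loss, optimizer and metric settings are taken from that example; the variable names continue the script above):

N.compile(loss='categorical_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
hist = N.fit(TREN.reshape(60000, 784), Lbl.TrenMx,
             batch_size=batchSz, epochs=nEpoch, verbose=1,
             validation_data=(TEST.reshape(10000, 784), Lbl.TestMx))
score = N.evaluate(TEST.reshape(10000, 784), Lbl.TestMx, verbose=0)
print('Test accuracy:', score[1])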
The NNs introduced so far are meanwhile considered traditional Neural Networks or shallow NNs, because they use only a few layers, as opposed to the many layers used by a Deep Net.
D.2 Deep Belief Networks
A Deep Belief Network (DBN) is another type of Deep Neural Network. It is employed more rarely than the CNN because its learning is slow in comparison, but it has the advantage that its overall architecture is simpler: often, with two layers one can achieve almost the same performance as with a CNN. A DBN has essentially a similar architecture to the Multi-Layer Perceptron, but it also contains connections within the same layer. That particular characteristic makes it extremely powerful, but the downside is that learning is terribly slow.
A Belief Network is a network that operates with so-called conditional dependencies. A conditional
dependence expresses the relation of variables more explicitly than just by combining them with a weighted
sum - as is done in most other classifiers. However determining the full set of parameters for such a network
is exceptionally difficult. Deep Belief Networks (DBNs) are specialized in approximating such networks of
conditional dependencies in an efficient manner, that is at least partially and in reasonable time. Popular
implementations of such DBNs consist of layers of so-called Restricted Boltzmann Machines (RBMs).
Architecture The principal architecture of an RBM is similar to the dense layer as used in an MLP or CNN. An RBM, however, contains an additional set of bias weights. Those additional weights make learning more difficult but also more capable - they help to capture the conditional dependencies. The typical learning rule for an RBM is the so-called contrastive-divergence algorithm.
The choice of an appropriate topology for the entire DBN is relatively simple. With two hidden layers made of RBMs, one can already obtain very good results. A third layer rarely helps in improving classification accuracy.
Learning Learning in a DBN occurs in two phases. In the first phase, the RBM layers are trained individually, one at a time, in an unsupervised manner (using the contrastive-divergence algorithm): the RBMs essentially perform a clustering process. Then, in the second phase, the entire network is fine-tuned with a so-called back-propagation algorithm. As with CNNs, a Deep Belief Network takes much time to train. The downside of a DBN is that there exists no method (yet) to speed up the learning process, as there is for CNNs.
Code Keras does not (yet?) offer methods to run a DBN, but TensorFlow does. TensorFlow is, however, a bit trickier when specifying and running a network.
D.3 Pretrained Networks
The following networks were trained and evaluated on commonly used testing sets, such as ImageNet, CIFAR, MNIST, etc., to name a few. The typical input size is a 256 x 256 pixel image, which is then cropped to 224 x 224 pixels. We list some popular ones, of which the first three can be accessed in PyTorch through the module torchvision.models.
ResNet (Deep Residual Learning for Image Recognition): by Microsoft. Comes with 18, 34, 50, 101 and
152 layers, ranging from 46 to 236 MB of weights. They show the best prediction accuracies on the testing
sets.
DenseNet (Densely Connected Convolutional Networks): by Cornell University and FaceBook. Comes with 121, 169, 201 and 264 layers, with 32 to 114 MB of weights. They show competitive (near best) prediction accuracies on the testing sets, but their weight set is smaller - about half of that for ResNet at approximately the same prediction accuracy.
InceptionV3 mainly by Google. Comes as a single instantiation with 107 MB of weights. Shows competi-
tive performance. Appears to use the fewest weights.
PoseNet (A Convolutional Network for Real-Time 6-DOF Camera Relocalization): by Cambridge University. The net solves the task of pose (viewpoint) estimation: given a few images of an object as training material, it can predict the viewpoint of a new image. It consists of 23 layers and has 50 MB of weights.
E Classification, Clustering [Machine Learning]
Given are the data DAT and - if present - some class (group) labels Lbl:
DAT: matrix of size [nObs x nDim] with rows corresponding to images (whose pixels are taken columnwise)
or features (some attributes extracted from images); and columns corresponding to the dimensionality
(no. pixels per image or no. of attributes, respectively).
Lbl: a one-dimensional array [nObs x 1] with entries corresponding to the class (group or category)
membership. This label vector allows us to train classifiers; if this array does not exist, then we can
apply only clustering algorithms.
E.1 Classification
To properly determine the classification performance, the data set DAT is divided into a training set and a
testing set. With the training set the classifier model is learned, with the testing set the model’s performance
is determined. To generate those two sets the label vector Lbl is used. We generate indices for training and
testing set with crossvalind, then loop through the folds using classify or other classification functions:
Ixs = crossvalind(’Kfold’, Lbl, 3); % indices for 3 folds
Pf = []; % initialize performance structure
for i = 1 : 3
IxTst = Ixs==i; % i’th testing set
IxTrn = Ixs~=i; % i’th training set
LbOut = classify(DAT(IxTst,:), DAT(IxTrn,:), Lbl(IxTrn)); % classification
nTst = nnz(IxTst);
Pf(i) = nnz(LbOut==Lbl(IxTst))/nTst*100; % in percent
end
fprintf(’Pc correct: %4.2f\n’, mean(Pf));
If the bioinfo toolbox is available, then the command classperf can be used to determine the performance slightly more conveniently.
Classification Errors? Chances are fairly good that you will not succeed with such a straightforward classification attempt. Matlab may complain with some error that the covariance matrix cannot be properly estimated. In that case, there are several options (a Python sketch using scikit-learn follows the list):
1. You can always try a kNN classifier (knnclassify in the bioinfo toolbox): it is somewhat simple, but you always obtain some results.
2. Use the Principal Component Analysis to reduce the dimensionality, see the next Section E.1.1.
3. Use a Support-Vector Machine: svmclassify (bioinfo toolbox). This is a very powerful classifier, but it discriminates between two classes only. To exploit this classifier you need to train c classifiers, each one distinguishing between one category and all others (c = number of classes).
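For completeness, a rough Python counterpart of these options using scikit-learn (scikit-learn is otherwise not used in this script; the toy data and the concrete parameter settings are placeholders):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

DAT = np.random.rand(90, 50)             # toy data: 90 observations, 50 dimensions
Lbl = np.repeat([0, 1, 2], 30)           # three classes

knn = KNeighborsClassifier(n_neighbors=5)                              # option 1: kNN
pca_knn = make_pipeline(PCA(n_components=10), KNeighborsClassifier())  # option 2: reduce dimensionality first
svm = SVC()                              # option 3: SVM (multi-class handled internally, one-vs-one)

for clf in (knn, pca_knn, svm):
    print(np.mean(cross_val_score(clf, DAT, Lbl, cv=3)) * 100, '% correct')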
E.2 Clustering
Given are the data DAT in the format as above, but there are no labels available: in some sense we try to
find labels for the samples.
K-Means To apply this clustering technique we need to provide the number of assumed groups k (denoted nc in our case):
Ixs = kmeans(DAT, nc);
Ixs is a one-dimensional array of length nObs that contains the numbers (∈ 1..nc) representing the cluster labels. We then need to write a loop that finds the corresponding indices, for instance with the following function, in which the variable Pts corresponds to the variable DAT in our notation above; a Python counterpart using scipy.cluster follows after the function.
% Cluster info.
% IN Cls vector with labels as produced by a clustering algorithm
% Pts points (samples)
% minSize minimum cluster size
% strTyp info string
% OUT I .Cen centers
% .Ix indices to points
% .Sz cluster size
%
function I = f_ClsInfo(Cls, Pts, minSize, strTyp)
nCls = max(Cls);
nDim = size(Pts,2);
H = hist(Cls, 1:nCls);
IxMinSz = find(H>=minSize);
I.n = length(IxMinSz);
I.Cen = zeros(I.n,nDim,’single’);
I.Ix = cell(I.n,1);
I.Sz = zeros(I.n,1);
for i = 1:I.n
bCls = Cls==IxMinSz(i);
cen = mean(Pts(bCls,:),1);
I.Cen(i,:) = cen;
I.Ix{i} = single(find(bCls));
I.Sz(i) = nnz(bCls);
end
nP = size(Pts,1);
I.notUsed = nP-sum(I.Sz);
%% ---- Display
fprintf(’%2d Cls %9s Sz %1d-%2d #PtsNotUsed %d oo %d\n’, ...
I.n, strTyp, min(I.Sz), max(I.Sz), I.notUsed, nP);
end % MAIN
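A Python counterpart using scipy.cluster (cf. Appendix I); the data array is a toy example:

import numpy as np
from scipy.cluster.vq import kmeans2

DAT = np.random.rand(200, 5)                 # toy data: 200 samples, 5 dimensions
nc = 4                                       # assumed number of groups
Cen, Ixs = kmeans2(DAT, nc, minit='points')  # centers [nc x 5], labels [200]

for c in range(nc):                          # indices and size per cluster
    ix = np.flatnonzero(Ixs == c)
    print('cluster', c, 'size', ix.size)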
F Learning Tricks [Machine Learning]
Here we summarize general learning tricks that are used to improve the prediction accuracy of a classifica-
tion system. They can be used for any classification method, for traditional or for modern (deep learning)
methods.
• Data Augmentation
One challenging issue of training an image classification system is the collection of labeled image material - we need to instruct the system what the categories (classes) are. The collection is a laborious process and sometimes one has only few images per class; think of medical images, where images of affected patients are rarer than those of healthy patients. In order to maximally exploit the available image material, one often enlarges the material by introducing slight manipulations of the images, a process called data augmentation. Data augmentation aims to increase the (class) variability so as to mimic the variability present in real-world classes - think of how many types of cars or chairs there exist. Those manipulations consist of stretching the images, cropping them, rotating, flipping, etc. Some of those manipulations are introduced in Section 15.2.1.
In Matlab one would use functions such as imtranslate, imcrop, imresize, imtransform, flipud, fliplr,
etc.
In Python all those functions are found in the module skimage.transform.
In PyTorch the functions are provided in the module torchvision.transforms, starting with the term Random, for example RandomResizedCrop, as introduced in 5.2, see the code in J.6.3; a short sketch follows.
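A minimal torchvision sketch of such an augmentation pipeline; the chosen transforms, sizes and the file name are placeholders:

from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random crop and rescale
    transforms.RandomHorizontalFlip(),        # random left-right flip
    transforms.RandomRotation(10),            # random rotation of up to +/- 10 degrees
    transforms.ToTensor(),
])

img = Image.open('someImage.jpg')             # placeholder file name
x = augment(img)                              # a different random variant at every call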
• Drop Out
Drop Out is a method to prevent overtraining. It was originally developed for Neural Networks, but can also be applied to other classification methods. In Neural Networks, during each learning step, a small percentage of a layer's neural units is temporarily eliminated; this percentage is specified as the drop-out rate. The technique prevents the classifier from overtraining, i.e. from becoming overly specific to input patterns that do not reflect the classes well.
G Resources
Databases
List of databases:
http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
The Kaggle website offers databases and a way to compare results by participating in competitions:
https://www.kaggle.com
Libraries
OpenCV is probably the largest (open-source) computer vision library. It contains binaries and their corresponding source
code, written in C and C++ :
https://www.opencv.org/
https://github.com/opencv
https://www.learnopencv.com
There is no separate documentation yet for using OpenCV through Python, so you need to browse the tutorials to find out how to use it:
https://docs.opencv.org/master/d6/d00/tutorial_py_root.html
Code for Matlab can be found in particular on:
http://www.mathworks.com/matlabcentral/fileexchange
Other:
http://www.vlfeat.org/
Coding in Python:
http://programmingcomputervision.com/
H Color Spaces wiki Color space
There are certain color spaces in which segmentation of specific objects is sometimes easier, see table 4.
To obtain these color spaces, one uses commands such as rgb2hsv for example; for complex spaces we
need to call two commands:
R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);
Bmn = R>95 & G>40 & B>20;
For tumor/skin discrimination in medical images, the following color information has been used (Vezhnevets
et al. 2003):
- The blue channel of the (original) RGB space
- The ratio between the red and the green channel of the RGB space
- The a channel of the Lab color space
- The H channel of the HSV space
I Python Modules and Functions
Python offers functions for computer vision and image processing in various modules. Some functions exist
in multiple modules. Most functions can be found in module skimage, which also contains a number of
graphics routines. Its documentation can be found at http://scikit-image.org. A number of functions can also be found in the submodule scipy.ndimage. For signal processing and clustering we employ scipy.signal and scipy.cluster, respectively.
J Code Examples
In the following you will find the Matlab code for the example tasks mentioned throughout the script; it is
placed on a light green background shade. For many tasks I also provide Python code, placed on a light
blue shade.
Loading an image in Python in three popular ways, with skimage, with PIL (Python Imaging Library) and
with OpenCV. Note that with OpenCV the chromatic channels are loaded reversed (compared to most other
loading functions): bgr not rgb.
imsave(’C:/ztmp/ImgSave.jpg’, Irgb) # saves image as jpeg
import imageio
filename = ’pathToVideo.mp4’
J.2 Face Profiles
clear;
Iorg = imread(’KlausIohannisCrop.jpg’); % image is color [m n 3]
Ig = rgb2gray(Iorg); % turn into graylevel image
Ig = Ig(35:end-35,35:end-35); % crop borders a bit more
[h w] = size(Ig); % image height and width
LowFlt = LowFlt / LowFlt.sum() # normalize the filter
Pverf = convolve(Pver, LowFlt, ’same’) # filter vertical profile
Phorf = convolve(Phor, LowFlt, ’same’) # filter horizontal profile
J.3 Image Processing I: Scale Space and Pyramid
clear;
Icol = imread(’autumn.tif’); % uint8 type; color
Ig = single(rgb2gray(Icol)); % turn into single type
%% ------ Initialize
nLev = 5;
[SS PY aFlt] = deal(cell(nLev,1));
SS{1} = Ig; % scale space: make original image first level
PY{1} = Ig; % pyramid: " " " " "
%% ----- Plotting
figure(1);clf;
[nr nc] = deal(nLev,3);
for i = 1:nLev
if i<nLev,
subplot(nr,nc,i*nc-2);
imagesc(aFlt{i});
end
subplot(nr,nc,i*nc-1);
imagesc(SS{i});
subplot(nr,nc,i*nc);
imagesc(PY{i});
end
For the Python code, we also wrote a function fspecialGauss, which mimics Matlab’s fspecial function,
see separate code block below.
from numpy import mgrid, exp
from skimage import data
from skimage.color import rgb2gray
from skimage.transform import resize
from scipy.signal import convolve2d
def fspecialGauss(size,sigma):
    x, y = mgrid[-size//2 + 1:size//2 + 1, -size//2 + 1:size//2 + 1]
    g = exp(-((x**2 + y**2)/(2.0*sigma**2)))
    return g/g.sum()
Ilpf = convolve2d(Ig, Flt, mode=’same’) # low-pass filtered image
SS[i] = Ilpf
# --- Downsampling with stp
stp = 2**i
PY[i] = resize(Ilpf,(m//stp,n//stp))
J.4 Feature Extraction I
J.4.1 Regions
clear;
Icol = imread(’tissue.png’);
I = single(rgb2gray(Icol));
%% ----- Plotting
figure(1);clf;
[nr nc] = deal(3,2);
subplot(nr,nc,1), imagesc(Icol);
subplot(nr,nc,2), imagesc(I);
subplot(nr,nc,3), imagesc(Idog);colorbar;
subplot(nr,nc,4), imagesc(BWblobs);
subplot(nr,nc,5), imagesc(Ilog);colorbar;
subplot(nr,nc,6), imagesc(BWlog);
For the Python example we assume that the 2D Gaussian filter - as used in example J.3 above - has been
placed into a separate module:
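A minimal sketch under these assumptions (the helper module is called fspecialGauss here; the blob maps are obtained by simply thresholding the filter responses, with arbitrary threshold values):
from numpy import abs as npabs
from skimage.io import imread
from skimage.color import rgb2gray
from scipy.signal import convolve2d
from scipy.ndimage import gaussian_laplace
from fspecialGauss import fspecialGauss        # hypothetical module holding the helper from J.3
I = rgb2gray(imread('tissue.png')).astype('float32')
# Difference-of-Gaussians: subtract two low-pass filtered versions of the image
Idog    = convolve2d(I, fspecialGauss(9,1), mode='same') - convolve2d(I, fspecialGauss(9,2), mode='same')
BWblobs = npabs(Idog) > 0.05                   # arbitrary threshold on the DoG response
# Laplacian-of-Gaussian, here via scipy
Ilog  = gaussian_laplace(I, sigma=2)
BWlog = npabs(Ilog) > 0.05                     # arbitrary threshold on the LoG response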
J.4.2 Edge Detection
In the last few lines we also show how to thin the black-white map, an action that is useful for contour tracing
(coming up in a later section).
clear; format compact;
sgm = 1; % scale, typically 1-5
#%% ------- Cleaning Edge Map for Contour Tracing -------
from skimage.morphology import remove_small_objects, thin
Medg = ME1.copy() # we copy for reasons of clarity
remove_small_objects(Medg,2,in_place=True) # removes isolated pixels
Medg = thin(Medg) # turns 'thick' contours into 1-pixel-wide contours
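The edge map ME1 used above can be obtained, for instance, with skimage's Canny detector - a sketch, where sgm corresponds to the scale parameter of the Matlab code:
from skimage.feature import canny
ME1 = canny(Ig, sigma=sgm)   # boolean edge map of the gray-level image Ig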
J.4.3 Texture Filters
%% ----- Plotting
figure(figNo); clf; colormap(gray);
for i = 1:ntFlt
subplot(8,6,i)
I = F(:,:,i);
imagesc(I);
set(gca,’fontsize’,4);
title(num2str(i),’fontsize’,6,’fontweight’,’bold’);
end
end % MAIN
function D = ff_Norm(D)
D = D - mean(D(:));
D = D / sum(abs(D(:)));
end % SUB
J.5 Loading the MNIST dataset
Note that this function contains two subfunctions, ff_LoadImg and ff_ReadLab.
% Loads the MNIST data - the 4 files on the following website -
% and converts them from ubyte to single.
% http://yann.lecun.com/exdb/mnist/
% IN - (no input arguments)
% OUT TREN training images column wise [60000 28*28]
% TEST testing (sample) images [10000 28*28]
% Lbl struct with training and testing class labels as matrices
% .Tren [60000 10] binary matrix with training labels
% .Test [10000 10] binary matrix with testing labels
function [TREN TEST Lbl] = LoadMNIST()
fprintf(’Loading MNIST...’);
filePath = ’c:\kzimg_down\MNST\’;
FnTrainImg = [filePath ’train-images.idx3-ubyte’];
FnTrainLab = [filePath ’train-labels.idx1-ubyte’];
FnTestImg = [filePath ’t10k-images.idx3-ubyte’];
FnTestLab = [filePath ’t10k-labels.idx1-ubyte’];
TREN = single(TREN)/255.0;
TEST = single(TEST)/255.0;
fprintf(’done. Normalized\n’);
end % MAIN FUNCTION
# Loads the MNIST data - the 4 files on the following website -
# and converts them from ubyte to float.
# http://yann.lecun.com/exdb/mnist/
# IN - (no input arguments)
# OUT TREN training images as 3-dimensional array [60000 28 28]
# TEST testing (sample) images as 3-dim array [10000 28 28]
# Lbl struct with training and testing class labels as vectors and
# matrices:
# .Tren [60000 1] vector with training labels
# .Test [10000 1] vector with testing labels
# .TrenMx [60000 10] binary matrix with training labels
# .TestMx [10000 10] binary matrix with testing labels
def LoadMNIST():
from numpy import fromfile, int8, uint8
from collections import namedtuple
import struct
from keras.utils import to_categorical
print(’Loading MNIST...’)
filePath = ’c:/kzimg_down/MNST/’
Lbl = namedtuple(’Lbl’,[’Tren’,’Test’, ’TrenMx’, ’TestMx’])
#%% -------- TRAINING DATA ------------
fipaImg = filePath + ’train-images.idx3-ubyte’
fipaLab = filePath + ’train-labels.idx1-ubyte’
print(’done. Normalized\n’)
return TREN, TEST, Lbl
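The parsing of the idx-ubyte files themselves can be done with a small helper along these lines (a sketch, assuming the standard idx format of the MNIST files):
import struct
from numpy import fromfile, uint8
def read_idx(filename):
    # idx header: two zero bytes, a type code (0x08 = unsigned byte), the number of
    # dimensions, then one big-endian int32 per dimension; the raw data follows
    with open(filename, 'rb') as f:
        _, _, dtype, ndim = struct.unpack('4B', f.read(4))
        shape = struct.unpack('>' + 'I'*ndim, f.read(4*ndim))
        return fromfile(f, dtype=uint8).reshape(shape)
# e.g. TREN = read_idx(fipaImg).astype('float32') / 255.0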
J.6 CNN Examples [TensorFlow/Keras and PyTorch]
J.6.1 Loading the CIFAR-10 files
%% ---- Train
bt1 = [PATH.IMGS ’data_batch_1’]; BT1 = load(bt1);
bt2 = [PATH.IMGS ’data_batch_2’]; BT2 = load(bt2);
bt3 = [PATH.IMGS ’data_batch_3’]; BT3 = load(bt3);
bt4 = [PATH.IMGS ’data_batch_4’]; BT4 = load(bt4);
bt5 = [PATH.IMGS ’data_batch_5’]; BT5 = load(bt5);
%% ---- Test
btt = [PATH.IMGS ’test_batch’]; BTT = load(btt);
TEST = BTT.data;
LblTest = BTT.labels;
%% ----
bm = [PATH.IMGS ’batches.meta.mat’]; BM = load(bm);
CatNames = BM.label_names;
end
PyTorch provides a method to load this data set; in case you use TensorFlow only (without access to the
PyTorch routine), here is an example of how to read the data set:
# Loads the CIFAR10 data set. Here we load the Matlab files (!)
# For the python files see ’http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz’
# IN - (no input arguments)
# OUT TREN training images as 3-dimensional array [50000 32 32]
# TEST testing (sample) images as 3-dim array [10000 32 32]
# Lbl struct with training and testing class labels as vectors and
# matrices:
# .Tren [50000 1] vector with training labels
# .Test [10000 1] vector with testing labels
# .TrenMx [50000 10] binary matrix with training labels
# .TestMx [10000 10] binary matrix with testing labels
#from __future__ import absolute_import
from numpy import shape, zeros, reshape, transpose
import os
import scipy.io as sio
#import pickle
from collections import namedtuple
from keras.utils import to_categorical
def LoadImgCIF10():
    path = 'c:/kzimg_down/CIFAR10/'
    Lbl = namedtuple('Lbl',['Tren','Test'])
    TREN = zeros((50000, 3072))      # container for the five training batches (assumed initialization)
    Lbl.Tren = zeros((50000, 1))     # container for the training labels (assumed initialization)
    for i in range(1, 6):
        fpath = os.path.join(path, 'data_batch_' + str(i))
        DAT = sio.loadmat(fpath)
        TREN[ (i-1)*10000 : i*10000, :] = DAT['data']
        Lbl.Tren[(i-1)*10000 : i*10000] = DAT['labels']
    DAT = sio.loadmat(os.path.join(path, 'test_batch'))   # the single test batch
    TEST = DAT['data']
    Lbl.Test = DAT['labels']
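As mentioned above, PyTorch also ships a ready-made loader for this data set; a minimal sketch (the root path is an example, and Keras users can call keras.datasets.cifar10.load_data() analogously):
from torchvision import datasets, transforms
TrenSet = datasets.CIFAR10(root='c:/kzimg_down/CIFAR10/', train=True,  download=True,
                           transform=transforms.ToTensor())
TestSet = datasets.CIFAR10(root='c:/kzimg_down/CIFAR10/', train=False, download=True,
                           transform=transforms.ToTensor())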
N.add(MaxPooling2D(pool_size=(2,2)))
N.add(Dropout(0.25))
N.add(Flatten())
N.add(Dense(512))
N.add(Activation(’relu’))
N.add(Dropout(0.5))
N.add(Dense(10))
N.add(Activation(’softmax’))
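The listing above shows only the layer stack of the network N; to actually train it, one would compile and fit the model roughly as follows (a sketch - the optimizer and epoch count are arbitrary choices, and the arrays are assumed to come from the loader in J.6.1, reshaped to [nImages 32 32 3] and scaled to [0,1]):
N.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
N.fit(TREN, Lbl.TrenMx, batch_size=32, epochs=10, validation_data=(TEST, Lbl.TestMx))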
J.6.3 Example of Transfer Learning [PyTorch] Section 5.2
Below are three code blocks. The first code block trains the model. The second code block saves the model
- it is a single line. The third code block demonstrates how to apply the model on a single (novel) image.
Block 1 The string variable dirImgs is the directory of your database, with two folders, ‘train’ and ‘valid’,
for training and validation. Each of these two folders contains the exact same list of subfolders, whose names
represent the category labels; those subfolders in turn hold the corresponding images. For small data sets,
place two thirds of the images into the corresponding training folder (i.e. train/dog/) and one third into the
corresponding validation folder (valid/dog/). To obtain optimal results, one would calculate the (normalized)
mean and standard deviation of the entire dataset - here we take arbitrary values.
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from torch.optim import SGD, lr_scheduler
import torch.nn as nn
import time
from os.path import join
import copy
dirImgs = ’C:/ImgCollection/’
# computed mean values for all images of entire database
ImgMean = [0.50, 0.50, 0.50] # we assume them to be 0.5
ImgStdv = [0.22, 0.22, 0.22] # we assume them to be 0.22
szImgTarg = 224
#%% ================ Parameters ================
nEpo = 6 # number of epochs (learning steps)
lernRate = 0.001 # learning rate
momFact = 0.9 # momentum factor
szStep = 7 # step size
gam = 0.1 # gamma
szBtch = 4 # number of images per batch
# --- the model
MOD = models.resnet18(pretrained=True)
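# --- sketch of the subsequent steps (nClasses is a hypothetical value, see Section 5.2)
nClasses = 2                                       # number of category folders in 'train'
MOD.fc = nn.Linear(MOD.fc.in_features, nClasses)   # replace the final fully-connected layer
# data loaders for the 'train' and 'valid' folders
f_Trans = transforms.Compose([transforms.Resize((szImgTarg, szImgTarg)),
                              transforms.ToTensor(),
                              transforms.Normalize(ImgMean, ImgStdv)])
DSets = {ph: ImageFolder(join(dirImgs, ph), f_Trans) for ph in ['train', 'valid']}
DLoad = {ph: DataLoader(DSets[ph], batch_size=szBtch, shuffle=True) for ph in ['train', 'valid']}
nImgValid = len(DSets['valid'])                    # used for the per-epoch statistics below
# optimizer and learning-rate scheduler, built from the parameters above
optzr = SGD(MOD.parameters(), lr=lernRate, momentum=momFact)
sched = lr_scheduler.StepLR(optzr, step_size=szStep, gamma=gam)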
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
MOD = MOD.to(device)
crit = nn.CrossEntropyLoss()
# statistics
lossRun += loss.item() * szBtch
cCrrRun += torch.sum(LbPred==LbBtc.data)
# --- performance per epoch
lossEpo = lossRun / nImgValid
accuEpo = cCrrRun.double() / nImgValid
print(f’valid loss: {lossEpo:.4f} acc: {accuEpo:.4f}’)
PrfVal[i,:] = [accuEpo, lossEpo] # record performance
# keep weights if better than previous weights
if accuEpo>accuBest:
accuBest = accuEpo
WgtsBest = copy.deepcopy(MOD.state_dict())
# ---- Concluding
tElaps = time.time() - t0
print(’Training duration {:.0f}min {:.0f}sec’.format(tElaps // 60, tElaps % 60))
print(f’Max accuracy {accuBest:4f}’)
torch.save(MOD.state_dict(), pathVariable)
import torch
from torchvision import models, transforms
import torch.nn as nn
from PIL import Image
f_PrepImg = transforms.Compose([
transforms.Resize((szImgTarg,szImgTarg)),
transforms.ToTensor(),
transforms.Normalize(ImgMean, ImgStdv) ])
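# --- sketch of the steps before classification: rebuild the network, load the saved
#     weights, and read one image (nClasses and pathVariable are as in the blocks above)
MOD = models.resnet18()
MOD.fc = nn.Linear(MOD.fc.in_features, nClasses)   # same architecture as during training
MOD.load_state_dict(torch.load(pathVariable))      # load the saved weight values
MOD.eval()                                         # switch to evaluation mode
Irgb = f_PrepImg(Image.open('someNovelImage.jpg')) # hypothetical image -> tensor [3 224 224]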
# make it 4-dimensional by adding 1 dimension as axis=0
Irgb = Irgb.unsqueeze(0)
# now we feed it to the network:
Post = MOD(Irgb) # posteriors [1 nClasses]
psT,lbT = Post.max(1) # posterior and predicted label (still as tensors!)
lbPred = lbT.item() # label as a single scalar value (now as scalar in regular python)
J.7 Feature Extraction
J.7.1 Detection
Comparison of four feature detectors.
clear;
I = imread(’cameraman.tif’);
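A comparable comparison can be sketched in Python with detectors from skimage (the choice of the four detectors here is ours, and the parameter values are arbitrary):
from skimage import data
from skimage.feature import corner_harris, corner_shi_tomasi, corner_peaks, blob_log, ORB
I = data.camera().astype('float32') / 255.0                # the 'cameraman' test image
PtHar = corner_peaks(corner_harris(I), min_distance=5)     # Harris corners
PtShi = corner_peaks(corner_shi_tomasi(I), min_distance=5) # Shi-Tomasi corners
Blobs = blob_log(I, max_sigma=10, threshold=0.1)           # Laplacian-of-Gaussian blobs
orb = ORB(n_keypoints=100)
orb.detect(I)                                              # ORB keypoints
PtOrb = orb.keypoints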
J.8 Object Detection
An example of face detection using the venerable Viola-Jones algorithm, which is based on Haar features (Section
8.1):
import cv2
imPth = ’pathToSomeImage.JPG’
cascPth = ’C:/SOFTWARE/opencv/build/etc/haarcascades/’
FaceDet = cv2.CascadeClassifier(cascPth+’haarcascade_frontalface_default.xml’)
EyeDet = cv2.CascadeClassifier(cascPth+’haarcascade_eye.xml’)
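The detectors are then applied to a gray-level version of the image; a minimal sketch of that step (the parameter values are typical defaults):
I = cv2.imread(imPth)                                    # BGR image
Igry = cv2.cvtColor(I, cv2.COLOR_BGR2GRAY)               # gray-level image for the detectors
Faces = FaceDet.detectMultiScale(Igry, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in Faces:                               # one rectangle per detected face
    cv2.rectangle(I, (x, y), (x+w, y+h), (255, 0, 0), 2)
    Eyes = EyeDet.detectMultiScale(Igry[y:y+h, x:x+w])   # search for eyes inside the face region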
J.9 Image Processing II: Segmentation
clear;
I = imread(’coins.png’);
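%% ----- Thresholding (a sketch: Otsu and median threshold, matching the plots below)
tOts  = graythresh(I);       % Otsu threshold, in [0,1]
BWots = im2bw(I, tOts);      % binary map from the Otsu threshold
BWmed = I > median(I(:));    % binary map from the median gray value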
%% ----- K-means
Ix = kmeans(I(:),2);
BWkmen = false(size(I));
BWkmen(Ix==2) = true;
%% ----- Watershed
Ilpf = conv2(I,fspecial(’Gaussian’,[5 5],2));
W = watershed(Ilpf);
%% ----- Plotting
figure(1);clf;[nr nc] = deal(3,2);
subplot(nr,nc,1);
imagesc(I); colormap(gray);
subplot(nr,nc,2);
bar(histc(I(:),[0:255])); hold on;
tOts = tOts*256;
plot([tOts tOts],[0 1000],’b:’);
title(’Histogram’);
subplot(nr,nc,3);
imagesc(BWots); title(’Otsu’);
subplot(nr,nc,4);
imagesc(BWmed); title(’Median Value’);
subplot(nr,nc,5);
imagesc(BWkmen); title(’K-means’);
subplot(nr,nc,6);
imagesc(W); title(’Watershed’);
IxNN = DIS.argmin(axis=1) # nearest neighbor
BWkmen = IxNN.reshape((m,n)) # reshape to image for plotting
J.10 Shape
In subsection J.10.1 we create a set of shapes. In subsection J.10.2 we measure their similarity with simple
properties; in subsection J.10.3 we use the radial signature to provide better similarity performance.
clear;
I = ones(150,760)*255; % empty image
sz = 28;
Ix = 65:90; % indices for A to Z
nShp = length(Ix);
figure(1);clf;
imagesc(I); hold on; axis off;
for i = 1:nShp
ix = Ix(i);
text(i*sz,20,char(ix),’fontweight’,’bold’);
text(i*sz,50,char(ix),’fontweight’,’bold’ ,’fontsize’,12);
text(i*sz,110,char(ix),’fontweight’,’bold’,’fontsize’,12,’rotation’,45);
text(i*sz,80,char(ix),’fontweight’,’bold’ ,’fontsize’,14);
text(i*sz,140,char(ix),’fontweight’,’bold’,’fontsize’,16);
end
print(’ShapeLetters’,’-djpeg’,’-r300’);
clear;
BW = rgb2gray(imread(’ShapeLetters.jpg’)) < 80;
%% ===== Shape Properties
RG = regionprops(BW, ’all’);
Ara = cat(1,RG.Area);
Ecc = cat(1,RG.Eccentricity);
EqD = cat(1,RG.EquivDiameter);
BBx = cat(1,RG.BoundingBox);
Vec = [Ara Ecc EqD]; % [nShp 3] three dimensions
nShp = length(Ara);
fprintf(’# Shapes %d\n’, nShp);
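% --- distance matrix between property vectors (a sketch; z-scoring is one sensible normalization)
VecN = (Vec - repmat(mean(Vec,1),nShp,1)) ./ repmat(std(Vec,0,1),nShp,1);
DM   = squareform(pdist(VecN));  % [nShp nShp] pairwise (Euclidean) distances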
DM(diag(true(nShp,1))) = nan; % inactivate own shape
[DO O] = sort(DM,2,’ascend’); % sort along rows
%% ----- Plotting First nSim Similar Shapes for Each Found Shape
% Bounding box
UL = floor(BBx(:,1:2)); % upper left corner
Wth = BBx(:,3); % width
Hgt = BBx(:,4); % height
mxWth = max(Wth)+2;
mxHgt = max(Hgt)+2;
nSim = 20; % # similar ones we plot
nShp2 = ceil(nShp/2);
[ID1 ID2] = deal(zeros(nShp2*mxWth,nSim*mxHgt));
for i = 1:nShp
figure(1);clf;
subplot(1,2,1); imagesc(ID1); title(’First Half of Letters’);
subplot(1,2,2); imagesc(ID2); title(’Second Half of Letters’);
from numpy import asarray, asmatrix, concatenate, nan, diagflat, floor, ceil, \
arange, zeros, ones, ix_, sort, argsort
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.measure import label, regionprops
from scipy.spatial.distance import pdist, squareform
O = argsort(DM,axis=1) # obtain indices separately
#%% ----- Plotting First nSim Similar Shapes for Each Found Shape
BBx = asarray([r.bbox for r in RG]) # Bounding box
UL = floor(BBx[:,0:2]) # upper left corner
#LR = floor(BBx[:,2:4]) # lower right corner
Hgt = BBx[:,2]-BBx[:,0] # height
Wth = BBx[:,3]-BBx[:,1] # width
mxHgt = max(Hgt)+2
mxWth = max(Wth)+2
nSim = 20 # similar ones we plot
nShp2 = int(ceil(nShp/2))
ID1 = zeros((nShp2*mxHgt,nSim*mxWth))
ID2 = ID1.copy()
for i in range(nShp):
# --- given/selected shape in 1st row
Row = arange(0,Hgt[i])+UL[i,0]-1 # rows
Col = arange(0,Wth[i])+UL[i,1]-1 # columns
Sbw = BW[ix_(Row.astype(int), Col.astype(int))]
Szs = Sbw.shape
RgV = arange(0,Szs[0]) # vertical range
RgH = arange(0,Szs[1]) # horizontal range
if i < nShp2: ID1[ix_(RgV+i*mxHgt, RgH)] = Sbw*2
else: ID2[ix_(RgV+(i-nShp2)*mxHgt, RgH)] = Sbw*2
clear;
BW = imread(’text.png’); % the stimulus
aBonImg = bwboundaries(BW);
nShp = length(aBonImg);
fprintf(’# Shapes %d\n’, nShp);
plot(aBon{i}(:,2), aBon{i}(:,1));
pause();
end
end
clear aBonImg
%% ----- Plotting First nSim Similar Shapes for Each Found Shape
sz = 19;
nSim = 20; % # similar ones we plot
nShp2 = ceil(nShp/2);
[ID1 ID2] = deal(zeros(nShp2*sz,nSim*sz));
figure(1);clf;
subplot(1,2,1); imagesc(ID1); title(’First Half of Letters’); hold on;
subplot(1,2,2); imagesc(ID2); title(’Second Half of Letters’); hold on;
for i = 1:nShp
J.11 Contour
J.11.1 Edge Following
Contour tracing is most easily carried out by first thinning the black-white map. We have shown how to do that
in the last few lines of the previous example J.4.2. Here we use a toy example to illustrate boundary tracing.
clear;
% ------ Stimulus ---------
M = zeros(15,15); % an empty map
M([5 10],5:10) = 1; % upper & lower sides of a rectangle
M(5:10,[5 10]) = 1; % left & right sides of a rectangle
M(3:8, 3) = 1; % straight line
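aBon = bwboundaries(M);      % trace the boundaries of the toy map (sketch of the assumed call)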
Bon = aBon{1}; % use row/col (matrix) axis
Ro = Bon(:,2); % original row (for plotting)
Co = Bon(:,1); % original column
% adds inter-point spacing as 3rd column:
Bon = [Bon [0; sqrt(sum(diff(Bon,1).^2,2))]];
subplot(2,2,1);
hold on;
plot(Bon(:,1), Bon(:,2));
plot(Cf,Rf);
plot(Bon(ixMxs,1), Bon(ixMxs,2),’r^’);
legend(’original’,’low-pass filtered’,’max curvature’,’location’,’northwest’);
plot(Cf(ixMxs),Rf(ixMxs),’r^’);
plot(Bon(1,1),Bon(1,2),’*’); % starting pixel
axis equal;
J.12 Tracking
The first example shows background subtraction in Python using OpenCV. The second is a complete tracking
example in Matlab, provided for instructional purposes.
import cv2 as cv
vidname = ’videoname.mp4’
cap = cv.VideoCapture(vidname)
cap.set(cv.CAP_PROP_POS_MSEC, 62*1000) # set to appropriate starting time (milliseconds)
while(1):
    r, I = cap.read()                             # reads one frame
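    if not r: break                               # stop at the end of the video
    # --- sketch of the processing step (assumed): apply a background-subtraction model
    #     created before the loop, e.g. with fgbg = cv.createBackgroundSubtractorMOG2()
    Mfg = fgbg.apply(I)                           # foreground mask of the current frame
    cv.imshow('foreground', Mfg)
    if cv.waitKey(30) & 0xFF == 27: break         # press ESC to stop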
cap.release()
cv.destroyAllWindows()
To play the movie for illustration, uncomment the line starting with ’movie(MOC,...’ (line 14 approximately).
clear;
sR = 7; % radius of search region (block matching)
wR = sR*2+1; % diameter of search region
%% Load Movie
ObjVid = VideoReader(’xylophone.mpg’); % movie ’handler’
FRAMES = read(ObjVid); % actual movie data
[nRow nCol dmy nFrm] = size(FRAMES); % rows/columns/colors/frames
MOC(1:nFrm) = struct(’cdata’, zeros(nRow, nCol, 3, ’uint8’), ’colormap’, []);
MOG = zeros(nRow,nCol,nFrm,’uint8’); % movie in grayscale
for k = 1:nFrm
MOC(k).cdata = FRAMES(:,:,:,k); % movie in color
MOG(:,:,k) = rgb2gray(MOC(k).cdata); % movie in grayscale
end
% movie(MOC, 1, ObjVid.FrameRate); % plays movie
%% ======================== LOOPING FRAMES
b_pause = 1; figure(1); clf;
for i = 2:nFrm
nLrg = nnz(bLrg); % # of large ones
IXPCH = IXPCH(Ixs(bLrg)); % reduce to detected large changes
fprintf(’reduced to %d large patches\n’, nLrg);
%% -------------------- Bounding Box of patches
PchBB = []; cP = 0;
for k = 1:nLrg
IxPix = IXPCH{k};
[Row Col] = ind2sub([nRow nCol], IxPix);
[leb rib] = deal(min(Col), max(Col)); % left and right boundary
[upb lob] = deal(min(Row), max(Row)); % upper and lower boundary
[leb rib] = deal(max(leb,sR+1), min(rib,nCol-sR-1));
[upb lob] = deal(max(upb,sR+1), min(lob,nRow-sR-1));
% --- check width and height (maybe 0 or even negative)
width = rib-leb;
height = lob-upb;
if width<1 || height<1, continue; end % if any 0, then move on
% --- patch bounding box (coordinates)
cP = cP+1;
PchBB(cP,:) = [upb lob leb rib]; % upper/lower/left/right
end
%% -------------------- Match each motion patch with its neighborhood
MchBB = [];
for k = 1:cP
Co = PchBB(k,:); % coordinates [1 4]
rRow = Co(1):Co(2); % range rows
rCol = Co(3):Co(4); % range cols
Pprv = Fprv(rRow,rCol); % patch in previous frame
ps = size(Pprv);
% --- correlation with neighboring patches
CorrPtch = zeros(wR,wR);
for m = -sR:1:sR
for n = -sR:1:sR
Pnow = Fnow(rRow+m, rCol+n);
CorrPtch(m+sR+1,n+sR+1) = corr2(Pprv,Pnow);
end
end
CorrPtch(sR+1,sR+1) = 0; % set own match to 0
% --- selection
[vl ix] = max(CorrPtch(:)); % select highest correlation
[rr cc] = ind2sub([wR wR], ix); % linear index to subindices
MchBB(k,1:2) = Co(1:2)+rr-sR-1; % store bounding box of best match
MchBB(k,3:4) = Co(3:4)+cc-sR-1;
if 1
figure(10); clf
imagesc(CorrPtch); colormap(gray); hold on;
plot(cc,rr,’*’)
pause();
end
end
%% -------------------- Plotting
MotPresFrm2(i) = sum(FDf(:));
if b_pause,
figure(1); [rr cc] = deal(2,2);
subplot(rr,cc,1), imagesc(Fprv); colormap(gray); title([’Frame ’ num2str(i)]);
subplot(rr,cc,2), imagesc(255-FDforig); title(’Difference Image’);
for k = 1:cP
Ix = PchBB(k,:); Lo1 = [Ix(3) Ix(1) Ix(4)-Ix(3) Ix(2)-Ix(1)];
Ix = MchBB(k,:); Lo2 = [Ix(3) Ix(1) Ix(4)-Ix(3) Ix(2)-Ix(1)];
rectangle(’position’, Lo1, ’edgecolor’, ’b’);
rectangle(’position’, Lo2, ’edgecolor’, ’r’);
end
subplot(rr,cc,3), imagesc(255-FDf); title(’Thresholded Diff Img’);
subplot(4,2,6), hist(FDforig(:),1:255); title(’Histogram of Differences’);
subplot(4,2,8), hist(Sz,1:10:1000); title(’Histogram of Patch Sizes’);
pause();
end
end
J.13 2D Transformations
As shape we choose a distorted (self-intersecting) rectangle, specified by the coordinates in Co. In section
‘Transforming the Shape’ it is then transformed in various ways, resulting in shape coordinates Crot, Cshr,
etc. We then try to estimate the motion, once for a similarity and once for an affine transformation, for which
we build the corresponding variables Asim/bsim and Aaff/baff, respectively. The estimation is carried out
with lsqlin or lsqnonneg, to which A and b are passed as arguments. At the very end we use Matlab’s
functions fitgeotrans and procrustes.
% Examples of 2D transforms and their motion estimation. Sze pdf 36, 312.
clear;
%% ------ Stimulus: a distorted rectangle
Co = [0 0; 0 .8; 0.6 0.75; 1.2 .8; 1.0 0.9; 1.2 0; 0 0]*4+2;
np = size(Co,1); % # of points
cpt = mean(Co,1); % center point
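% --- sketch (assumed): centered coordinates and the similarity Jacobian,
%     analogous to the affine Jacobian JAff below
Coc  = Co - repmat(cpt,np,1);             % original coordinates, centered on (0,0)
JSim = inline('[1 0 x -y; 0 1 y x]');     % Jacobian of the similarity transform (params tx,ty,a,b)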
JAff = inline(’[1 0 x y 0 0; 0 1 0 0 x y]’);
% JEuc = inline(’[1 0 -sin(r)*x -cos(r)*y; 0 1 -cos(r)*x -sin(r)*y]’);
Asim = zeros(4,4); bsim = 0; % init A and b for similarity case
Aaff = zeros(6,6); baff = 0; % init A and b for affinity case
DltSim = Csim(:,[1 2])-Coc; % delta (transformed-original 0,0 centered)
DltAff = Caff(:,[1 2])-Coc; % delta (transformed-original 0,0 centered)
for i = 1 : np
pt = Coc(i,:); % original 0,0 centered
Jp = JSim(pt(1),pt(2));
Asim = Asim + Jp’*Jp;
bsim = bsim + DltSim(i,:)*Jp;
Jp = JAff(pt(1),pt(2));
Aaff = Aaff + Jp’*Jp;
baff = baff + DltAff(i,:)*Jp;
end
%% ========= Least-Square for A and b for Similarity =========
disp(’Similarity’);
[Prm1 resn1] = lsqnonneg(Asim, bsim’); % Prm1(3:4) contain estimates for a and b
[Prm2 resn2] = lsqlin(Asim, bsim’);
Prm1’
Prm2’
(Prm1(1:2)-cpt’)’ % translation parameters (tx, ty)
%% ========= Least-Square for A and b for Affinity =========
disp(’Affinity’);
[Prm resn] = lsqlin(Aaff, baff’); % Prm(3:6) contain estimates for a00, a01, a10, a11
Prm’
(Prm(1:2)-cpt’)’ % tx and ty
J.14 RanSAC
Example of a primitive version of the Random Sampling Consensus algorithm. Below are two scripts: a
function script f_RanSaC and a testing script that runs the function.
% Random Sampling Consensus for affine transformation.
% No correspondence determined - assumes list entries correspond already.
% IN: - Pt1 list of original points [np1,2]
% - Pt2 list of transformed points [np2,2]
% - Opt options
% OUT: - TP struct with estimates
% - PrmEst [nGoodFits x 6] parameters from lsqlin
% - ErrEst [nGoodFits x 1] error
function TP = f_RanSaC(Pt1, Pt2, Opt, b_plot)
TP.nGoodFits = 0;
if isempty(Pt1) || isempty(Pt2), return; end
JAff = inline(’[1 0 x y 0 0; 0 1 0 0 x y]’);
np1 = size(Pt1,1);
np2 = size(Pt2,1);
fprintf(’Original has %d points, transformed has %d points\n’, np1, np2);
nMinPt = Opt.nMinPts; % # of minimum pts required for transformation
nCom = np1-nMinPt; % # of complement pts
if ~nCom, warning(’no complementing points’); end
cpt = mean(Pt1); % center point of original
Pt1Cen = Pt1 - repmat(cpt,np1,1); % original 0,0 centered
%% ========== Plotting
if b_plot.dist
Dis = sort(Di, ’ascend’);
figure(10); clf; [rr cc] = deal(1,2);
subplot(rr,cc,1);
plot(Dis);
set(gca, ’ylim’, [0 1500]);
title([Opt.xlb ’ ’ num2str(cIter)]);
xlabel(Opt.Match, ’fontweight’, ’bold’, ’fontsize’, 12);
subplot(rr,cc,2);
plot(Dis); hold on;
set(gca, ’xlim’, [0 nCom*0.25]);
set(gca, ’ylim’, [0 Opt.thrNear*2]);
plot([0 nCom], ones(1,2)*Opt.thrNear, ’k:’);
pause();
end
cIter = cIter + 1;
fprintf(’\n’);
end
TP.PrmEst = Prm;
TP.ErrEst = Err;
TP.nGoodFits = nGoodFits;
fprintf(’#GoodFits %d out of %d iterations\n’, nGoodFits, Opt.nMaxIter);
end % function
J.15 Posture Estimation
Two files need to be downloaded, see the first two comment lines of the code block below:
- pose_deploy_linevec_faster_4_stages.prototxt: a text file that contains the architecture of the
CNN. It is written in a format that is suitable for the CAFFE software.
- pose_iter_160000.caffemodel: a file containing the corresponding weight values, ca. 200 MB.
Those two files are loaded into the program with the function cv.dnn.readNetFromCaffe from OpenCV,
which generates the network called NET here.
The output of the estimate, OMPS, consists of ca. 40 confidence maps, of which the first 15 correspond to
body parts and the 16th to the background.
# https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/models/pose/mpi/pose_deploy_linevec_faster_4_stages.prot
# http://posefs1.perception.cs.cmu.edu/OpenPose/models/pose/mpi/pose_iter_160000.caffemodel
from numpy import zeros
import cv2 as cv
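# --- sketch of the network setup and the forward pass (assumed; the file names follow the
#     two downloads above, and the input size of 368x368 is a typical choice)
protoFile   = 'pose_deploy_linevec_faster_4_stages.prototxt'
weightsFile = 'pose_iter_160000.caffemodel'
NET  = cv.dnn.readNetFromCaffe(protoFile, weightsFile)       # build the network
I    = cv.imread('somePersonImage.jpg')                      # hypothetical test image
blob = cv.dnn.blobFromImage(I, 1.0/255, (368, 368), (0, 0, 0), swapRB=False, crop=False)
NET.setInput(blob)
OMPS = NET.forward()   # confidence maps [1 nMaps h w]; first 15: body parts, 16th: background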
BodyPart = [None]*16
BodyPart[0] = ’Head’
BodyPart[1] = ’Neck’
BodyPart[2] = ’Right Shoulder’
BodyPart[3] = ’Right Elbow’
BodyPart[4] = ’Right Wrist’
BodyPart[5] = ’Left Shoulder’
BodyPart[6] = ’Left Elbow’
BodyPart[7] = ’Left Wrist’
BodyPart[8] = ’Right Hip’
BodyPart[9] = ’Right Knee’
BodyPart[10] = ’Right Ankle’
BodyPart[11] = ’Left Hip’
BodyPart[12] = ’Left Knee’
BodyPart[13] = ’Left Ankle’
BodyPart[14] = ’Chest’
BodyPart[15] = ’Background’
For the face pose recognition system, one downloads the following two files to build the network; apart from
that, the code remains essentially the same as above.
# https://raw.githubusercontent.com/CMU-Perceptual-Computing-Lab/openpose/master/models/face/pose_deploy.prototxt
# http://posefs1.perception.cs.cmu.edu/OpenPose/models/face/pose_iter_116000.caffemodel
For the hand pose recognition system, one downloads the following two files:
# https://raw.githubusercontent.com/CMU-Perceptual-Computing-Lab/openpose/master/models/hand/pose_deploy.prototxt
# http://posefs1.perception.cs.cmu.edu/OpenPose/models/hand/pose_iter_102000.caffemodel