Artificial Intelligence & Machine Learning

Dr. M. Kamakshaiah
kamakshaiah.m@gmail.com

About the author


Kamakshaiah Musunuru is an open source software evangelist and full-stack developer. He is an academic at GITAM (Deemed to be) University, Visakhapatnam, India, and also a freelance trainer in data science and analytics. His teaching interests are MIS, data science, and business analytics, including functional analytics such as marketing analytics and finance analytics. He teaches subjects such as R for business analytics, Python for business analytics, IoT, and Big Data analytics using Hadoop. His research interests relate to health care management, educational management, food and agriculture business, and the utilization of open source software for business applications. He is an advocate of open source software and has conducted several workshops and seminars on open source software tools and their utility for business. He lived in Ethiopia, North-East Africa, for two years to teach business management and visited Bangkok, Thailand, on research assignments.

Contents

1 Introduction
 1.1 About AI
 1.2 Graph Theory
  1.2.1 Definition
  1.2.2 Directed graph
  1.2.3 History
  1.2.4 Applications
 1.3 State space search
  1.3.1 Representation
  1.3.2 Examples of state-space search algorithms
 1.4 Depth-first search
  1.4.1 Properties
  1.4.2 Example
  1.4.3 Output of a depth-first search
  1.4.4 Vertex orderings
  1.4.5 Complexity
  1.4.6 Applications
  1.4.7 Python practice
  1.4.8 Handling A Disconnected Graph
  1.4.9 Python implementation
 1.5 Breadth-first search
  1.5.1 More details
  1.5.2 Analysis
  1.5.3 BFS ordering
  1.5.4 Applications
  1.5.5 Breadth First Search using Python
 1.6 Iterative deepening (Depth-first search)
  1.6.1 Properties
  1.6.2 Analysis
  1.6.3 Example
  1.6.4 Python practice for Iterative Deepening
 1.7 Informed/Heuristic breadth-first search
  1.7.1 Procedure
 1.8 A* Search
  1.8.1 Description
  1.8.2 Complexity
  1.8.3 Python practice
 1.9 Branch and bound
  1.9.1 Overview
  1.9.2 Generic version
  1.9.3 Knapsack problem
  1.9.4 8 puzzle problem
  1.9.5 Job Assignment Problem
  1.9.6 Traveling Salesman Problem
2 Learning systems
 2.1 Definitions
  2.1.1 What is the learning?
  2.1.2 What is a system?
 2.2 Goals & Applications
  2.2.1 Image Recognition
  2.2.2 Speech Recognition
  2.2.3 Traffic prediction
  2.2.4 Product recommendations
  2.2.5 Self-driving cars
  2.2.6 Email Spam and Malware Filtering
  2.2.7 Virtual Personal Assistant
  2.2.8 Online Fraud Detection
  2.2.9 Stock Market trading
  2.2.10 Medical Diagnosis
  2.2.11 Automatic Language Translation
 2.3 Characteristics
  2.3.1 Problem Definition
  2.3.2 Data
  2.3.3 Evaluation
  2.3.4 Features
  2.3.5 Modeling
  2.3.6 Experimentation
 2.4 Design
  2.4.1 Handwriting recognition learning problem
  2.4.2 Spam Mail detection learning problem
  2.4.3 Checkers learning problem
  2.4.4 Choosing and representing Target Function
 2.5 Develop
  2.5.1 Training data
  2.5.2 Validation dataset
  2.5.3 Cross-validation
  2.5.4 Concept representation
  2.5.5 Function approximation
3 Artificial Neural Networks
 3.1 Neural Networks
 3.2 Overview
 3.3 History
 3.4 Artificial intelligence
 3.5 Applications
 3.6 von Neumann architecture
 3.7 Turing’s B-type machines
 3.8 Hebbian Learning
 3.9 Python Practice
  3.9.1 Neural Networks
 3.10 Feedforward networks
  3.10.1 Single-layer perceptron
  3.10.2 Multi-layer perceptron
  3.10.3 Other feedforward networks
  3.10.4 Python practice
 3.11 Backpropagation network
  3.11.1 Overview
  3.11.2 Learning as an optimization problem
  3.11.3 Loss function
  3.11.4 Python practice
 3.12 Agents in Artificial Intelligence
  3.12.1 What is an Agent?
  3.12.2 Intelligent Agents
  3.12.3 Rational Agent
  3.12.4 Structure of an AI Agent
  3.12.5 PEAS Representation
  3.12.6 Types of agents
 3.13 Agent Environment in AI
  3.13.1 Features of Environment
 3.14 Turing Test in AI
  3.14.1 Chatbots to attempt the Turing test
 3.15 The Chinese Room Argument
4 Natural language processing
 4.1 History
  4.1.1 Symbolic NLP (1950s – early 1990s)
  4.1.2 Statistical NLP (1990s–2010s)
  4.1.3 Neural NLP (present)
  4.1.4 Components of NLP
  4.1.5 Natural Language Understanding (NLU)
  4.1.6 Natural Language Generation (NLG)
  4.1.7 Difference between NLU and NLG
 4.2 Applications of NLP
 4.3 Speech recognition using Python
 4.4 How to build an NLP pipeline
 4.5 Python practice for NLP pipelines
  4.5.1 spaCy for NLP
  4.5.2 Phases of NLP
5 Machine Learning
 5.1 Supervised learning
  5.1.1 Logistic regression
  5.1.2 K-Nearest Neighbors Algorithm (k-NN)
  5.1.3 Decision Trees
  5.1.4 Support Vector Machines
 5.2 Unsupervised learning
  5.2.1 Approaches
  5.2.2 Clustering
  5.2.3 Algorithms for clustering
  5.2.4 K-Means clustering
  5.2.5 K-Means clustering using Python
  5.2.6 Hierarchical clustering

Chapter 1
Introduction

1.1 About AI

Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen. ’Strong’ AI is usually labelled as AGI (Artificial General Intelligence) while attempts to emulate ’natural’ intelligence have been called ABI (Artificial Biological Intelligence). Leading AI textbooks define the field as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term “artificial intelligence” is often used to describe machines (or computers) that mimic ”cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.

As machines become increasingly capable, tasks considered to require “intelligence” are often removed from the definition of AI, a phenomenon known as the AI effect. A quip in Tesler’s Theorem says “AI is whatever hasn’t been done yet.” 1 For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology. Modern machine capabilities generally classified as AI include successfully understanding human speech, competing at the highest level in strategic game systems (such as chess and Go) and in imperfect-information games like poker, self-driving cars, intelligent routing in content delivery networks, and military simulations.

Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an “AI winter”), followed by new approaches, success and renewed funding. 2 After AlphaGo successfully defeated a professional Go player in 2015, artificial intelligence once again attracted widespread global attention.3 For most of its history, AI research has been divided into sub-fields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. “robotics” or “machine learning”), the use of particular tools (“logic” or artificial neural networks), or deep philosophical differences. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).

The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field’s long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics. The AI field draws upon computer science, information engineering, mathematics, psychology, linguistics, philosophy, and many other fields.

The field was founded on the assumption that human intelligence “can be so precisely described that a machine can be made to simulate it”. This raises philosophical arguments about the mind and the ethics of creating artificial beings endowed with human-like intelligence. These issues have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabated. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment.

In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science, software engineering and operations research.

1.2 Graph Theory

Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.

1.2.1 Definition

Definitions in graph theory vary. The following are some of the more basic ways of defining graphs and related mathematical structures. In one restricted but very common sense of the term, a graph is an ordered pair G = (V, E) comprising:

  1. V, a set of vertices (also called nodes or points);

  2. E ⊆ {{x, y} | x, y ∈ V and x ≠ y}, a set of edges (also called links or lines), which are unordered pairs of distinct vertices.

To avoid ambiguity, this type of object may be called precisely an undirected simple graph.

In the edge {x,y}, the vertices x and y are called the endpoints of the edge. The edge is said to join x and y and to be incident on x and on y. A vertex may exist in a graph and not belong to an edge. Multiple edges, not allowed under the definition above, are two or more edges that join the same two vertices.

In one more general sense of the term allowing multiple edges, a graph is an ordered triple G = (V, E, ϕ) comprising:

  1. V, a set of vertices (also called nodes or points);

  2. E, a set of edges (also called links or lines);

  3. ϕ : E → {{x, y} | x, y ∈ V and x ≠ y}, an incidence function mapping every edge to an unordered pair of distinct vertices.

To avoid ambiguity, this type of object may be called precisely an undirected multigraph. A loop is an edge that joins a vertex to itself. Graphs as defined in the two definitions above cannot have loops, because a loop joining a vertex x to itself is the edge (for an undirected simple graph) or is incident on (for an undirected multigraph) {x, x} = {x}, which is not in {{x, y} | x, y ∈ V and x ≠ y}. So to allow loops the definitions must be expanded. For undirected simple graphs, the definition of E should be modified to E ⊆ {{x, y} | x, y ∈ V}. For undirected multigraphs, the definition of ϕ should be modified to ϕ : E → {{x, y} | x, y ∈ V}. To avoid ambiguity, these types of objects may be called undirected simple graph permitting loops and undirected multigraph permitting loops, respectively.

V and E are usually taken to be finite, and many of the well-known results are not true (or are rather different) for infinite graphs because many of the arguments fail in the infinite case. Moreover, V is often assumed to be non-empty, but E is allowed to be the empty set. The order of a graph is |V|, its number of vertices. The size of a graph is |E|, its number of edges. The degree or valency of a vertex is the number of edges that are incident to it, where a loop is counted twice. The degree of a graph is the maximum of the degrees of its vertices. In an undirected simple graph of order n, the maximum degree of each vertex is n − 1 and the maximum size of the graph is n(n − 1)/2.

The edges of an undirected simple graph permitting loops G induce a symmetric homogeneous relation ∼ on the vertices of G that is called the adjacency relation of G. Specifically, for each edge {x, y}, its endpoints x and y are said to be adjacent to one another, which is denoted x ∼ y.
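To make these definitions concrete, the following small Python sketch (an illustration, with vertices and edges assumed rather than taken from the text) stores an undirected simple graph as a vertex set and a set of two-element edges, and computes the order, size, vertex degrees and the adjacency relation described above.

# Undirected simple graph as a vertex set V and a set of unordered edges E.
# The particular vertices and edges are assumed for illustration only.
V = {1, 2, 3, 4}
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}

order = len(V)                              # |V|, the order of the graph
size = len(E)                               # |E|, the size of the graph

def degree(v):
    """Number of edges incident to v (a simple graph has no loops)."""
    return sum(1 for e in E if v in e)

def adjacent(x, y):
    """The adjacency relation: x ~ y iff {x, y} is an edge."""
    return frozenset((x, y)) in E

print(order, size)                          # 4 4
print([degree(v) for v in sorted(V)])       # [2, 2, 3, 1]
print(adjacent(1, 4), adjacent(3, 4))       # False True
print(order * (order - 1) // 2)             # 6: the maximum size n(n - 1)/2 for order n = 4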

1.2.2 Directed graph

A directed graph or digraph is a graph in which edges have orientations. In one restricted but very common sense of the term, a directed graph is an ordered pair G = (V, E) comprising:

  1. V, a set of vertices (also called nodes or points);

  2. E ⊆ {(x, y) | (x, y) ∈ V² and x ≠ y}, a set of edges (also called directed edges, arcs or arrows), which are ordered pairs of distinct vertices.

To avoid ambiguity, this type of object may be called precisely a directed simple graph. In the edge (x,y) directed from x to y, the vertices x and y are called the endpoints of the edge, x the tail of the edge and y the head of the edge. The edge is said to join x and y and to be incident on x and on y. A vertex may exist in a graph and not belong to an edge. The edge (y,x) is called the inverted edge of (x,y). Multiple edges, not allowed under the definition above, are two or more edges with both the same tail and the same head.


Figure 1.1:A directed graph with three vertices and four directed edges (the double arrow represents an edge in each direction).

In one more general sense of the term allowing multiple edges, a directed graph is an ordered triple G = (V, E, ϕ) comprising:

  1. V, a set of vertices (also called nodes or points);

  2. E, a set of edges (also called directed edges, arcs or arrows);

  3. ϕ : E → {(x, y) | (x, y) ∈ V² and x ≠ y}, an incidence function mapping every edge to an ordered pair of distinct vertices.

To avoid ambiguity, this type of object may be called precisely a directed multigraph. A loop is an edge that joins a vertex to itself. Directed graphs as defined in the two definitions above cannot have loops, because a loop joining a vertex x to itself is the edge (for a directed simple graph) or is incident on (for a directed multigraph) (x, x), which is not in {(x, y) | (x, y) ∈ V² and x ≠ y}. So to allow loops the definitions must be expanded. For directed simple graphs, the definition of E should be modified to E ⊆ {(x, y) | (x, y) ∈ V²}. For directed multigraphs, the definition of ϕ should be modified to ϕ : E → {(x, y) | (x, y) ∈ V²}. To avoid ambiguity, these types of objects may be called precisely a directed simple graph permitting loops and a directed multigraph permitting loops (or a quiver) respectively. The edges of a directed simple graph permitting loops G induce a homogeneous relation ∼ on the vertices of G that is called the adjacency relation of G. Specifically, for each edge (x, y), its endpoints x and y are said to be adjacent to one another, which is denoted x ∼ y.

1.2.3 History

The paper written by Leonhard Euler on the Seven Bridges of Königsberg and published in 1736 is regarded as the first paper in the history of graph theory. 4 This paper, as well as the one written by Vandermonde on the knight problem, carried on with the analysis situs initiated by Leibniz. Euler’s formula relating the number of edges, vertices, and faces of a convex polyhedron was studied and generalized by Cauchy and L’Huilier, and represents the beginning of the branch of mathematics known as topology.


Figure 1.2:Seven Bridges of Königsberg

More than one century after Euler’s paper on the bridges of Königsberg and while Listing was introducing the concept of topology, Cayley was led by an interest in particular analytical forms arising from differential calculus to study a particular class of graphs, the trees. This study had many implications for theoretical chemistry. The techniques he used mainly concern the enumeration of graphs with particular properties. Enumerative graph theory then arose from the results of Cayley and the fundamental results published by Pólya between 1935 and 1937. These were generalized by De Bruijn in 1959. Cayley linked his results on trees with contemporary studies of chemical composition. The fusion of ideas from mathematics with those from chemistry began what has become part of the standard terminology of graph theory.

In particular, the term “graph” was introduced by Sylvester in a paper published in 1878 in Nature, where he draws an analogy between “quantic invariants” and “co-variants” of algebra and molecular diagrams:

“[...] Every invariant and co-variant thus becomes expressible by a graph precisely identical with a Kekulé diagram or chemicograph. [...] I give a rule for the geometrical multiplication of graphs, i.e. for constructing a graph to the product of in- or co-variants whose separate graphs are given. [...]”

The first textbook on graph theory was written by Dénes Kőnig, and published in 1936. 5 Another book by Frank Harary, published in 1969, was ”considered the world over to be the definitive textbook on the subject”, and enabled mathematicians, chemists, electrical engineers and social scientists to talk to each other. Harary donated all of the royalties to fund the Pólya Prize.6

One of the most famous and stimulating problems in graph theory is the four color problem: “Is it true that any map drawn in the plane may have its regions colored with four colors, in such a way that any two regions having a common border have different colors?” This problem was first posed by Francis Guthrie in 1852 and its first written record is in a letter of De Morgan addressed to Hamilton the same year. Many incorrect proofs have been proposed, including those by Cayley, Kempe, and others. The study and the generalization of this problem by Tait, Heawood, Ramsey and Hadwiger led to the study of the colorings of the graphs embedded on surfaces with arbitrary genus. Tait’s reformulation generated a new class of problems, the factorization problems, particularly studied by Petersen and König. The works of Ramsey on colorations, and more especially the results obtained by Turán in 1941, were at the origin of another branch of graph theory, extremal graph theory.

The four color problem remained unsolved for more than a century. In 1969 Heinrich Heesch published a method for solving the problem using computers. A computer-aided proof produced in 1976 by Kenneth Appel and Wolfgang Haken makes fundamental use of the notion of “discharging” developed by Heesch. The proof involved checking the properties of 1,936 configurations by computer, and was not fully accepted at the time due to its complexity. A simpler proof considering only 633 configurations was given twenty years later by Robertson, Seymour, Sanders and Thomas.

The autonomous development of topology between 1860 and 1930 fertilized graph theory back through the works of Jordan, Kuratowski and Whitney. Another important factor of common development of graph theory and topology came from the use of the techniques of modern algebra. The first example of such a use comes from the work of the physicist Gustav Kirchhoff, who published in 1845 his Kirchhoff’s circuit laws for calculating the voltage and current in electric circuits.

The introduction of probabilistic methods in graph theory, especially in the study by Erdős and Rényi of the asymptotic probability of graph connectivity, gave rise to yet another branch, known as random graph theory, which has been a fruitful source of graph-theoretic results.

1.2.4 Applications

Graphs can be used to model many types of relations and processes in physical, biological, social and information systems. Many practical problems can be represented by graphs. Emphasizing their application to real-world systems, the term network is sometimes defined to mean a graph in which attributes (e.g. names) are associated with the vertices and edges, and the subject that expresses and understands the real-world systems as a network is called network science.

Computer science

In computer science, graphs are used to represent networks of communication, data organization, computational devices, the flow of computation, etc. For instance, the link structure of a website can be represented by a directed graph, in which the vertices represent web pages and directed edges represent links from one page to another. A similar approach can be taken to problems in social media, travel, biology, computer chip design, mapping the progression of neuro-degenerative diseases, and many other fields. The development of algorithms to handle graphs is therefore of major interest in computer science. The transformation of graphs is often formalized and represented by graph rewrite systems. Complementary to graph transformation systems focusing on rule-based in-memory manipulation of graphs are graph databases geared towards transaction-safe, persistent storing and querying of graph-structured data.

Linguistics

Graph-theoretic methods, in various forms, have proven particularly useful in linguistics, since natural language often lends itself well to discrete structure. Traditionally, syntax and compositional semantics follow tree-based structures, whose expressive power lies in the principle of compositionality, modeled in a hierarchical graph. More contemporary approaches such as head-driven phrase structure grammar model the syntax of natural language using typed feature structures, which are directed acyclic graphs. Within lexical semantics, especially as applied to computers, modeling word meaning is easier when a given word is understood in terms of related words; semantic networks are therefore important in computational linguistics. Still, other methods in phonology (e.g. optimality theory, which uses lattice graphs) and morphology (e.g. finite-state morphology, using finite-state transducers) are common in the analysis of language as a graph. Indeed, the usefulness of this area of mathematics to linguistics has borne organizations such as TextGraphs, as well as various ’Net’ projects, such as WordNet, VerbNet, and others.

Physics and chemistry

Graph theory is also used to study molecules in chemistry and physics. In condensed matter physics, the three-dimensional structure of complicated simulated atomic structures can be studied quantitatively by gathering statistics on graph-theoretic properties related to the topology of the atoms. Also, ”the Feynman graphs and rules of calculation summarize quantum field theory in a form in close contact with the experimental numbers one wants to understand.” In chemistry a graph makes a natural model for a molecule, where vertices represent atoms and edges bonds. This approach is especially used in computer processing of molecular structures, ranging from chemical editors to database searching. In statistical physics, graphs can represent local connections between interacting parts of a system, as well as the dynamics of a physical process on such systems. Similarly, in computational neuroscience graphs can be used to represent functional connections between brain areas that interact to give rise to various cognitive processes, where the vertices represent different areas of the brain and the edges represent the connections between those areas. Graph theory plays an important role in the electrical modeling of electrical networks; here, weights are associated with the resistance of the wire segments to obtain electrical properties of network structures. Graphs are also used to represent the micro-scale channels of porous media, in which the vertices represent the pores and the edges represent the smaller channels connecting the pores. Chemical graph theory uses the molecular graph as a means to model molecules. Graphs and networks are excellent models to study and understand phase transitions and critical phenomena. Removal of nodes or edges leads to a critical transition where the network breaks into small clusters, which is studied as a phase transition. This breakdown is studied via percolation theory.

Social sciences

Graph theory is also widely used in sociology as a way, for example, to measure actors’ prestige or to explore rumor spreading, notably through the use of social network analysis software. Under the umbrella of social networks are many different types of graphs. Acquaintanceship and friendship graphs describe whether people know each other. Influence graphs model whether certain people can influence the behavior of others. Finally, collaboration graphs model whether two people work together in a particular way, such as acting in a movie together.

Biology

Likewise, graph theory is useful in biology and conservation efforts, where a vertex can represent regions where certain species exist (or inhabit) and the edges represent migration paths or movement between the regions. This information is important when looking at breeding patterns or tracking the spread of disease and parasites, or when studying how changes to movement can affect other species. Graphs are also commonly used in molecular biology and genomics to model and analyse datasets with complex relationships. For example, graph-based methods are often used to ’cluster’ cells together into cell-types in single-cell transcriptome analysis. Another use is to model genes or proteins in a pathway and study the relationships between them, such as metabolic pathways and gene regulatory networks. Evolutionary trees, ecological networks, and hierarchical clustering of gene expression patterns are also represented as graph structures. Graph-based methods are pervasive in some fields of biology, and they will only become more widespread as technology develops to leverage this kind of high-throughput multidimensional data. Graph theory is also used in connectomics; nervous systems can be seen as a graph, where the nodes are neurons and the edges are the connections between them.

Mathematics

In mathematics, graphs are useful in geometry and certain parts of topology such as knot theory. Algebraic graph theory has close links with group theory. Algebraic graph theory has been applied to many areas including dynamic systems and complexity.

1.3 State space search

State space search is a process used in the field of computer science, including artificial intelligence (AI), in which successive configurations or states of an instance are considered, with the intention of finding a goal state with the desired property.

Problems are often modelled as a state space, a set of states that a problem can be in. The set of states forms a graph where two states are connected if there is an operation that can be performed to transform the first state into the second.

State space search often differs from traditional computer science search methods because the state space is implicit: the typical state space graph is much too large to generate and store in memory. Instead, nodes are generated as they are explored, and typically discarded thereafter. A solution to a combinatorial search instance may consist of the goal state itself, or of a path from some initial state to the goal state.

1.3.1 Representation

In state space search, a state space is formally represented as a tuple ⟨S, A, Action(s), Result(s, a), Cost(s, a)⟩, in which:

  1. S is the set of all possible states;

  2. A is the set of possible actions, defined over the whole state space rather than tied to a particular state;

  3. Action(s) is the set of actions that can be performed in state s;

  4. Result(s, a) is the state reached by performing action a in state s;

  5. Cost(s, a) is the cost of performing action a in state s.
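As a rough illustration of this representation, the sketch below (an assumed toy problem, not from the text) models an implicit state space in Python: states are integers, the actions are "+1" and "*2", and an uninformed breadth-first search generates successor states only as they are explored, as described above.

# Implicit state space: states are integers, actions are "+1" and "*2".
from collections import deque

ACTIONS = ["+1", "*2"]                     # A: the set of actions

def result(s, a):                          # Result(s, a): the successor state
    return s + 1 if a == "+1" else s * 2

def cost(s, a):                            # Cost(s, a): every action costs 1 here
    return 1                               # (unused by plain BFS; a lowest-cost-first search would use it)

def search(initial, goal, max_state=1000):
    """Uninformed breadth-first search over the implicit state space."""
    frontier = deque([(initial, [])])      # (state, list of actions taken so far)
    explored = {initial}
    while frontier:
        s, path = frontier.popleft()
        if s == goal:                      # goal test
            return path
        for a in ACTIONS:                  # Action(s): here every action applies in every state
            t = result(s, a)
            if t not in explored and t <= max_state:
                explored.add(t)            # successor states are generated only as they are explored
                frontier.append((t, path + [a]))
    return None

print(search(2, 11))                       # ['*2', '+1', '*2', '+1']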

1.3.2 Examples of state-space search algorithms

Uninformed search

According to Poole and Mackworth, the following are uninformed state-space search methods, meaning that they do not have any prior information about the goal’s location.

  1. Traditional depth-first search

  2. Breadth-first search

  3. Iterative deepening

  4. Lowest-cost-first search

Heuristic search

Some algorithms take into account information about the goal node’s location in the form of a heuristic function. Poole and Mackworth cite the following examples as informed search algorithms:

  1. Informed/Heuristic breadth-first search

  2. Greedy best-first search

  3. A* search

1.4 Depth-first search

Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a graph) and explores as far as possible along each branch before backtracking. A version of depth-first search was investigated in the 19th century by French mathematician Charles Pierre Trémaux as a strategy for solving mazes. 7 8


Figure 1.3:Depth first tree



Class: Search algorithm
Data structure: Graph
Worst-case performance: O(|V| + |E|) for explicit graphs traversed without repetition; O(b^d) for implicit graphs with branching factor b searched to depth d
Worst-case space complexity: O(|V|) if the entire graph is traversed without repetition; O(longest path length searched) = O(b·d) for implicit graphs without elimination of duplicate nodes


1.4.1 Properties

The time and space analysis of DFS differs according to its application area. In theoretical computer science, DFS is typically used to traverse an entire graph, and takes time O(|V | + |E|), 9 where |V | is the number of vertices and |E| the number of edges. This is linear in the size of the graph. In these applications it also uses space O(|V |) in the worst case to store the stack of vertices on the current search path as well as the set of already-visited vertices. Thus, in this setting, the time and space bounds are the same as for breadth-first search and the choice of which of these two algorithms to use depends less on their complexity and more on the different properties of the vertex orderings the two algorithms produce.

For applications of DFS in relation to specific domains, such as searching for solutions in artificial intelligence or web-crawling, the graph to be traversed is often either too large to visit in its entirety or infinite (DFS may suffer from non-termination). In such cases, search is only performed to a limited depth; due to limited resources, such as memory or disk space, one typically does not use data structures to keep track of the set of all previously visited vertices. When search is performed to a limited depth, the time is still linear in terms of the number of expanded vertices and edges (although this number is not the same as the size of the entire graph because some vertices may be searched more than once and others not at all) but the space complexity of this variant of DFS is only proportional to the depth limit, and as a result, is much smaller than the space needed for searching to the same depth using breadth-first search. For such applications, DFS also lends itself much better to heuristic methods for choosing a likely-looking branch. When an appropriate depth limit is not known a priori, iterative deepening depth-first search applies DFS repeatedly with a sequence of increasing limits. In the artificial intelligence mode of analysis, with a branching factor greater than one, iterative deepening increases the running time by only a constant factor over the case in which the correct depth limit is known due to the geometric growth of the number of nodes per level. DFS may also be used to collect a sample of graph nodes. However, incomplete DFS, similarly to incomplete BFS, is biased towards nodes of high degree.

1.4.2 Example

For the following graph:


Figure 1.4:Graph traversal example

A depth-first search starting at the node A, assuming that the left edges in the shown graph are chosen before right edges, and assuming the search remembers previously visited nodes and will not repeat them (since this is a small graph), will visit the nodes in the following order: A, B, D, F, E, C, G. The edges traversed in this search form a Trémaux tree, a structure with important applications in graph theory. Performing the same search without remembering previously visited nodes results in visiting the nodes in the order A, B, D, F, E, A, B, D, F, E, etc. forever, caught in the A, B, D, F, E cycle and never reaching C or G. Iterative deepening is one technique to avoid this infinite loop and would reach all nodes.
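The exact drawing in Figure 1.4 is not reproduced in the text, so the adjacency lists in the sketch below are an assumption reverse-engineered from the stated visit order and from the A, B, D, F, E cycle (edges A→B, A→C, B→D, D→F, F→E, E→A, C→G); with that caveat, this minimal Python sketch reproduces the traversal when previously visited nodes are remembered.

# Assumed adjacency lists consistent with the description above.
graph = {
    "A": ["B", "C"],   # the left edge (to B) is tried before the edge to C
    "B": ["D"],
    "C": ["G"],
    "D": ["F"],
    "E": ["A"],
    "F": ["E"],
    "G": [],
}

def dfs(v, visited=None):
    visited = set() if visited is None else visited
    visited.add(v)                 # remembering visited nodes avoids the A, B, D, F, E cycle
    order = [v]
    for w in graph[v]:
        if w not in visited:
            order += dfs(w, visited)
    return order

print(dfs("A"))                    # ['A', 'B', 'D', 'F', 'E', 'C', 'G']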

1.4.3 Output of a depth-first search

The result of a depth-first search of a graph can be conveniently described in terms of a spanning tree of the vertices reached during the search. Based on this spanning tree, the edges of the original graph can be divided into three classes: forward edges, which point from a node of the tree to one of its descendants, back edges, which point from a node to one of its ancestors, and cross edges, which do neither. Sometimes tree edges, edges which belong to the spanning tree itself, are classified separately from forward edges. If the original graph is undirected then all of its edges are tree edges or back edges.


Figure 1.5:Tree edges

1.4.4 Vertex orderings

It is also possible to use depth-first search to linearly order the vertices of a graph or tree. There are four possible ways of doing this:

  1. A preordering is a list of the vertices in the order that they were first visited by the depth-first search algorithm. This is a compact and natural way of describing the progress of the search, as was done earlier in this chapter. A preordering of an expression tree is the expression in Polish notation.

  2. A postordering is a list of the vertices in the order that they were last visited by the algorithm. A postordering of an expression tree is the expression in reverse Polish notation.

  3. A reverse preordering is the reverse of a preordering, i.e. a list of the vertices in the opposite order of their first visit. Reverse preordering is not the same as postordering.

  4. A reverse postordering is the reverse of a postordering, i.e. a list of the vertices in the opposite order of their last visit. Reverse postordering is not the same as preordering.

For binary trees there are additionally in-ordering and reverse in-ordering. For example, when searching the directed graph below beginning at node A, the sequence of traversals is either A B D B A C A or A C D C A B A (choosing to first visit B or C from A is up to the algorithm). Note that repeat visits in the form of backtracking to a node, to check whether it still has unvisited neighbors, are included here (even if it is found to have none). Thus the possible preorderings are A B D C and A C D B, while the possible postorderings are D B C A and D C B A, and the possible reverse postorderings are A C B D and A B C D.


Figure 1.6:If-then-else control flow graph.

Reverse postordering produces a topological sorting of any directed acyclic graph. This ordering is also useful in control-flow analysis as it often represents a natural linearization of the control flows. The graph above might represent the flow of control in the code fragment below, and it is natural to consider this code in the order ABCD or ACBD but not natural to use the order ABDC or ACDB.
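As a small illustration (an assumption modelled on the if-then-else control-flow graph of Figure 1.6, with A branching to B and C, which both join at D), the sketch below computes a DFS postordering and reverses it, yielding one of the natural orders mentioned above.

# Reverse DFS postordering of a DAG is a topological sorting.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def postorder(graph, v, visited, out):
    visited.add(v)
    for w in graph[v]:
        if w not in visited:
            postorder(graph, w, visited, out)
    out.append(v)                     # v is appended only after all its descendants

def topological_sort(graph):
    out, visited = [], set()
    for v in graph:                   # handles graphs with more than one source
        if v not in visited:
            postorder(graph, v, visited, out)
    return list(reversed(out))        # reverse postordering

print(topological_sort(graph))        # ['A', 'C', 'B', 'D'], one valid topological order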

1.4.5 Complexity

The computational complexity of DFS was investigated by John Reif. More precisely, given a graph G, let O = (v_1, …, v_n) be the ordering computed by the standard recursive DFS algorithm. This ordering is called the lexicographic depth-first search ordering. John Reif considered the complexity of computing the lexicographic depth-first search ordering, given a graph and a source. A decision version of the problem (testing whether some vertex u occurs before some vertex v in this order) is P-complete, meaning that it is “a nightmare for parallel processing”. 10

A depth-first search ordering (not necessarily the lexicographic one), can be computed by a randomized parallel algorithm in the complexity class RNC. As of 1997, it remained unknown whether a depth-first traversal could be constructed by a deterministic parallel algorithm, in the complexity class NC.

1.4.6 Applications

Algorithms that use depth-first search as a building block include:

  1. Finding connected components.

  2. Topological sorting.

  3. Finding 2-(edge or vertex)-connected components.

  4. Finding 3-(edge or vertex)-connected components.

  5. Finding the bridges of a graph.

  6. Generating words in order to plot the limit set of a group.

  7. Finding strongly connected components.

  8. Determining whether a species is closer to one species or another in a phylogenetic tree.

  9. Planarity testing.

  10. Solving puzzles with only one solution, such as mazes. (DFS can be adapted to find all solutions to a maze by only including nodes on the current path in the visited set.)

  11. Maze generation may use a randomized depth-first search.

  12. Finding biconnectivity in graphs.

1.4.7 Python practice

Depth First Traversal (or Search) for a graph is similar to Depth First Traversal of a tree. The only catch here is, unlike trees, graphs may contain cycles (a node may be visited twice). To avoid processing a node more than once, use a boolean visited array.

Example

Input: n = 4, e = 6
0 → 1, 0 → 2, 1 → 2, 2 → 0, 2 → 3, 3 → 3
Output: DFS from vertex 1: 1 2 0 3

Figure 1.7:DFS algorithm example 1
Example

Input: n = 4, e = 6
2 → 0, 0 → 2, 1 → 2, 0 → 1, 3 → 3, 1 → 3
Output: DFS from vertex 2: 2 0 1 3

Figure 1.8:DFS algorithm example 2

As noted above, depth-first search starts at the root node (or some arbitrary node in the case of a graph) and explores as far as possible along each branch before backtracking. So the basic idea is to start from the root or any arbitrary node, mark it, move to an adjacent unmarked node, and continue this loop until there is no unmarked adjacent node. Then backtrack and check for other unmarked nodes and traverse them. Finally, print the nodes in the path.

Below is an implementation of simple Depth First Traversal. The Python implementation uses an adjacency list representation of the graph, stored in a collections.defaultdict of lists.

 
# Python3 program to print DFS traversal
# from a given graph
from collections import defaultdict


# This class represents a directed graph using
# adjacency list representation
class Graph:

    # Constructor
    def __init__(self):
        # default dictionary to store graph
        self.graph = defaultdict(list)

    # function to add an edge to graph
    def addEdge(self, u, v):
        self.graph[u].append(v)

    # A function used by DFS
    def DFSUtil(self, v, visited):
        # Mark the current node as visited
        # and print it
        visited.add(v)
        print(v, end=' ')

        # Recur for all the vertices
        # adjacent to this vertex
        for neighbour in self.graph[v]:
            if neighbour not in visited:
                self.DFSUtil(neighbour, visited)

    # The function to do DFS traversal. It uses
    # recursive DFSUtil()
    def DFS(self, v):
        # Create a set to store visited vertices
        visited = set()

        # Call the recursive helper function
        # to print DFS traversal
        self.DFSUtil(v, visited)


# Driver code

# Create a graph given
# in the above diagram
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)

print("Following is DFS (starting from vertex 2)")
g.DFS(2)
  1. Time complexity: O(V + E), where V is the number of vertices and E is the number of edges in the graph.

  2. Space Complexity: O(V ), since an extra visited array of size V is required.

1.4.8 Handling A Disconnected Graph

Solution

This is handled as a corner case. The code above traverses only the vertices reachable from a given source vertex; in a disconnected graph, not all vertices may be reachable from that vertex. To do a complete DFS traversal of such graphs, run DFS from every vertex that is still unvisited after the previous DFS calls. The recursive helper function remains the same.

Algorithm

  1. Create a recursive function that takes the index of the node and a visited array.

  2. Mark the current node as visited and print the node.

  3. Traverse all the adjacent and unmarked nodes and call the recursive function with the index of the adjacent node.

  4. Run a loop from 0 to the number of vertices; if a node was not visited in a previous DFS, call the recursive function with that node.

1.4.9 Python implementation

The below code explains procedure to handle a disconnected graph.

 
# Python program to print DFS traversal for a complete
# (possibly disconnected) graph
from collections import defaultdict


# This class represents a directed graph using adjacency list representation
class Graph:

    # Constructor
    def __init__(self):
        # default dictionary to store graph
        self.graph = defaultdict(list)

    # Function to add an edge to graph
    def addEdge(self, u, v):
        self.graph[u].append(v)

    # A function used by DFS
    def DFSUtil(self, v, visited):
        # Mark the current node as visited and print it
        visited.add(v)
        print(v, end=" ")

        # recur for all the vertices adjacent to this vertex
        for neighbour in self.graph[v]:
            if neighbour not in visited:
                self.DFSUtil(neighbour, visited)

    # The function to do DFS traversal. It uses the recursive DFSUtil
    def DFS(self):
        # create a set to store all visited vertices
        visited = set()
        # call the recursive helper function starting from every vertex
        # that is still unvisited, one by one
        for vertex in list(self.graph):
            if vertex not in visited:
                self.DFSUtil(vertex, visited)


# Driver code
# create a graph given in the above diagram
print("Following is Depth First Traversal \n")
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)
g.DFS()

Complexity Analysis

  1. Time complexity: O(V + E), where V is the number of vertices and E is the number of edges in the graph.

  2. Space Complexity: O(V ), since an extra visited array of size V is required.

1.5 Breadth-first search

Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next depth level. Extra memory, usually a queue, is needed to keep track of the child nodes that were encountered but not yet explored.

For example, in a chess endgame a chess engine may build the game tree from the current position by applying all possible moves, and use breadth-first search to find a win position for white. Implicit trees (such as game trees or other problem-solving trees) may be of infinite size; breadth-first search is guaranteed to find a solution node if one exists.

In contrast, (plain) depth-first search, which explores the node branch as far as possible before backtracking and expanding other nodes, may get lost in an infinite branch and never make it to the solution node. Iterative deepening depth-first search avoids the latter drawback at the price of exploring the tree’s top parts over and over again. On the other hand, both depth-first algorithms typically get along with far less extra memory than breadth-first search. Breadth-first search can be generalized to graphs, when the start node (sometimes referred to as a ’search key’) is explicitly given, and precautions are taken against following a vertex twice.

BFS and its application in finding connected components of graphs were invented in 1945 by Konrad Zuse, in his (rejected) Ph.D. thesis on the Plankalkül programming language, but this was not published until 1972. It was reinvented in 1959 by Edward F. Moore, who used it to find the shortest path out of a maze, and later developed by C. Y. Lee into a wire routing algorithm (published 1961).


Figure 1.9:Breadth first tree



Class: Search algorithm
Data structure: Graph
Worst-case performance: O(|V| + |E|) = O(b^d)
Worst-case space complexity: O(|V|) = O(b^d)

Table 1.1:Details of BFS

1.5.1 More details

This non-recursive implementation is similar to the non-recursive implementation of depth-first search, but differs from it in two ways:

  1. it uses a queue (First In First Out) instead of a stack and

  2. it checks whether a vertex has been explored before enqueueing the vertex rather than delaying this check until the vertex is dequeued from the queue.

If G is a tree, replacing the queue of this breadth-first search algorithm with a stack will yield a depth-first search algorithm. For general graphs, replacing the stack of the iterative depth-first search implementation with a queue would also produce a breadth-first search algorithm, although a somewhat nonstandard one.

The Q queue contains the frontier along which the algorithm is currently searching. Nodes can be labelled as explored by storing them in a set, or by an attribute on each node, depending on the implementation. Note that the word node is usually interchangeable with the word vertex. The parent attribute of each node is useful for accessing the nodes in a shortest path, for example by backtracking from the destination node up to the starting node, once the BFS has been run, and the predecessor nodes have been set. Breadth-first search produces a so-called breadth first tree. You can see how a breadth first tree looks in the following example.

Example

The following is an example of the breadth-first tree obtained by running a BFS on German cities starting from Frankfurt:


Figure 1.10:An example map of Southern Germany with some connections between cities


Figure 1.11:The breadth-first tree obtained when running BFS on the given map and starting in Frankfurt
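To connect the figures with the implementation details above, here is a small Python sketch of queue-based BFS that records each node's parent so that a shortest path (fewest edges) can be backtracked from the destination to the start, as described earlier; the edge list is a simplified, assumed subset of the map in Figure 1.10, not the full map.

# BFS with parent pointers; the city connections below are illustrative only.
from collections import deque

graph = {
    "Frankfurt": ["Mannheim", "Wuerzburg", "Kassel"],
    "Mannheim": ["Frankfurt", "Karlsruhe"],
    "Wuerzburg": ["Frankfurt", "Nuernberg"],
    "Kassel": ["Frankfurt", "Muenchen"],
    "Karlsruhe": ["Mannheim", "Augsburg"],
    "Nuernberg": ["Wuerzburg", "Muenchen"],
    "Augsburg": ["Karlsruhe", "Muenchen"],
    "Muenchen": ["Kassel", "Nuernberg", "Augsburg"],
}

def bfs_path(start, goal):
    parent = {start: None}                 # also doubles as the explored set
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:           # backtrack from goal to start via parents
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in graph[v]:
            if w not in parent:            # check before enqueueing the vertex
                parent[w] = v
                queue.append(w)
    return None

print(bfs_path("Frankfurt", "Muenchen"))   # ['Frankfurt', 'Kassel', 'Muenchen']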
Practice exercise

Let us see how this algorithm works with an example. Here, we will use an undirected graph with 5 vertices. Refer to 1.12.


Figure 1.12:BFS Iteration -1

We begin from the vertex P: the BFS algorithm starts by putting it in the visited list and putting all its adjacent vertices in the queue. Refer to 1.13


Figure 1.13:BFS Iteration -2

Next, we visit the element at the front of the queue, i.e. Q, and visit its adjacent nodes. Since P has already been visited, we visit R instead. Refer to 1.14


Figure 1.14:BFS Iteration -3

Vertex R has an unvisited adjacent vertex T, so we add it to the rear of the queue and visit S, which is at the front of the queue. Refer to 1.15 & 1.16.


Figure 1.15:BFS Iteration -4


Figure 1.16:BFS Iteration -5

Now, only T remains in the queue, since the only adjacent node of S, i.e. P, has already been visited. We visit it. Refer to 1.17


Figure 1.17:BFS Iteration -6

Since the queue is now empty, we have completed the traversal of the graph. The time complexity of the breadth-first search algorithm is O(V + E), where V is the number of nodes and E is the number of edges. The space complexity of the BFS algorithm is O(V).

1.5.2 Analysis

Time and space complexity

The time complexity can be expressed as O(|V| + |E|), since every vertex and every edge will be explored in the worst case. |V| is the number of vertices and |E| is the number of edges in the graph. Note that O(|E|) may vary between O(1) and O(|V|^2), depending on how sparse the input graph is.

When the number of vertices in the graph is known ahead of time, and additional data structures are used to determine which vertices have already been added to the queue, the space complexity can be expressed as O(|V |), where |V | is the number of vertices. This is in addition to the space required for the graph itself, which may vary depending on the graph representation used by an implementation of the algorithm.

When working with graphs that are too large to store explicitly (or infinite), it is more practical to describe the complexity of breadth-first search in different terms: to find the nodes that are at distance d from the start node (measured in number of edge traversals), BFS takes O(b^(d+1)) time and memory, where b is the “branching factor” of the graph (the average out-degree).

Completeness

In the analysis of algorithms, the input to breadth-first search is assumed to be a finite graph, represented as an adjacency list, adjacency matrix, or similar representation. However, in the application of graph traversal methods in artificial intelligence the input may be an implicit representation of an infinite graph. In this context, a search method is described as being complete if it is guaranteed to find a goal state if one exists. Breadth-first search is complete, but depth-first search is not. When applied to infinite graphs represented implicitly, breadth-first search will eventually find the goal state, but depth first search may get lost in parts of the graph that have no goal state and never return.

1.5.3 BFS ordering

An enumeration of the vertices of a graph is said to be a BFS ordering if it is the possible output of the application of BFS to this graph.

Let G = (V, E) be a graph with n vertices. Recall that N(v) is the set of neighbors of v. Let σ = (v_1, …, v_m) be a list of distinct elements of V. For v ∈ V \ {v_1, …, v_m}, let ν_σ(v) be the least i such that v_i is a neighbor of v, if such an i exists, and ∞ otherwise.

Let σ = (v_1, …, v_n) be an enumeration of the vertices of V. The enumeration σ is said to be a BFS ordering (with source v_1) if, for all 1 < i ≤ n, v_i is the vertex w ∈ V \ {v_1, …, v_{i−1}} such that ν_{(v_1,…,v_{i−1})}(w) is minimal. Equivalently, σ is a BFS ordering if, for all 1 ≤ i < j < k ≤ n with v_i ∈ N(v_k) \ N(v_j), there exists a neighbor v_m of v_j such that m < i.

1.5.4 Applications

Breadth-first search can be used to solve many problems in graph theory, for example:

  1. Copying garbage collection, Cheney’s algorithm

  2. Finding the shortest path between two nodes u and v, with path length measured by number of edges (an advantage over depth-first search)

  3. (Reverse) Cuthill–McKee mesh numbering

  4. Ford–Fulkerson method for computing the maximum flow in a flow network

  5. Serialization/deserialization of a binary tree (vs. serialization in sorted order), allowing the tree to be reconstructed efficiently.

  6. Construction of the failure function of the Aho-Corasick pattern matcher.

  7. Testing bipartiteness of a graph (see the sketch after this list).

  8. Implementing parallel algorithms for computing a graph’s transitive closure.
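The sketch below, referenced from item 7 above, is an assumed example (not from the text) of testing bipartiteness with BFS: vertices are 2-coloured level by level, and the graph is rejected if some edge ends up with both endpoints the same colour.

# Bipartiteness test by BFS 2-colouring.
from collections import deque

def is_bipartite(graph):
    colour = {}
    for source in graph:                       # handle disconnected graphs
        if source in colour:
            continue
        colour[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in colour:
                    colour[w] = 1 - colour[v]  # opposite colour of its parent
                    queue.append(w)
                elif colour[w] == colour[v]:   # same colour across an edge
                    return False
    return True

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}      # 4-cycle: bipartite
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}               # odd cycle: not bipartite
print(is_bipartite(square), is_bipartite(triangle))        # True False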

1.5.5 Breadth First Search using Python

Breadth-First Traversal (or Search) for a graph is similar to Breadth-First Traversal of a tree. The only catch here is, unlike trees, graphs may contain cycles, so we may come to the same node again. To avoid processing a node more than once, we use a boolean visited array. For simplicity, it is assumed that all vertices are reachable from the starting vertex.

For example, in the following graph, we start traversal from vertex 2. When we come to vertex 0, we look for all adjacent vertices of it. 2 is also an adjacent vertex of 0. If we don’t mark visited vertices, then 2 will be processed again and it will become a non-terminating process. A Breadth-First Traversal of the following graph is 2, 0, 3, 1.

 
# Python3 program to print BFS traversal
# from a given source vertex. BFS(s)
# traverses vertices reachable from s.
from collections import defaultdict


# This class represents a directed graph
# using adjacency list representation
class Graph:

    # Constructor
    def __init__(self):
        # default dictionary to store graph
        self.graph = defaultdict(list)

    # function to add an edge to graph
    def addEdge(self, u, v):
        self.graph[u].append(v)

    # Function to print a BFS of graph
    def BFS(self, s):
        # Mark all the vertices as not visited
        visited = [False] * (max(self.graph) + 1)

        # Create a queue for BFS
        queue = []

        # Mark the source node as
        # visited and enqueue it
        queue.append(s)
        visited[s] = True

        while queue:
            # Dequeue a vertex from
            # queue and print it
            s = queue.pop(0)
            print(s, end=" ")

            # Get all adjacent vertices of the
            # dequeued vertex s. If an adjacent
            # vertex has not been visited, then mark it
            # visited and enqueue it
            for i in self.graph[s]:
                if visited[i] == False:
                    queue.append(i)
                    visited[i] = True


# Driver code

# Create a graph given in
# the above diagram
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)

print("Following is Breadth First Traversal"
      " (starting from vertex 2)")
g.BFS(2)

1.6 Iterative deepening (Depth-first search)

In computer science, iterative deepening search or more specifically iterative deepening depth-first search (IDS or IDDFS) is a state space/graph search strategy in which a depth-limited version of depth-first search is run repeatedly with increasing depth limits until the goal is found. IDDFS is optimal like breadth-first search, but uses much less memory; at each iteration, it visits the nodes in the search tree in the same order as depth-first search, but the cumulative order in which nodes are first visited is effectively breadth-first. 11

1.6.1 Properties

IDDFS combines depth-first search’s space-efficiency and breadth-first search’s completeness (when the branching factor is finite). If a solution exists, it will find a solution path with the fewest arcs. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out to be not so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.

The main advantage of IDDFS in game tree searching is that the earlier searches tend to improve the commonly used heuristics, such as the killer heuristic and alpha–beta pruning, so that a more accurate estimate of the score of various nodes at the final depth search can occur, and the search completes more quickly since it is done in a better order. For example, alpha–beta pruning is most efficient if it searches the best moves first. 12

A second advantage is the responsiveness of the algorithm. Because early iterations use small values for d, they execute extremely quickly. This allows the algorithm to supply early indications of the result almost immediately, followed by refinements as d increases. When used in an interactive setting, such as in a chess-playing program, this facility allows the program to play at any time with the current best move found in the search it has completed so far. This can be phrased as each depth of the search corecursively producing a better approximation of the solution, though the work done at each step is recursive. This is not possible with a traditional depth-first search, which does not produce intermediate results.

1.6.2 Analysis

Time complexity

The time complexity of IDDFS in a (well-balanced) tree works out to be the same as breadth-first search, i.e. O(b^d) where b is the branching factor and d is the depth of the goal.

Proof

In an iterative deepening search, the nodes at depth d are expanded once, those at depth d − 1 are expanded twice, and so on up to the root of the search tree, which is expanded d + 1 times. So the total number of expansions in an iterative deepening search is

b^d + 2b^(d−1) + 3b^(d−2) + ⋯ + (d − 1)b^2 + db + (d + 1) = Σ_{i=0}^{d} (d + 1 − i) b^i

where b^d is the number of expansions at depth d, 2b^(d−1) is the number of expansions at depth d − 1, and so on. Factoring out b^d gives

b^d (1 + 2b^(−1) + 3b^(−2) + ⋯ + (d − 1)b^(2−d) + d b^(1−d) + (d + 1)b^(−d))

Now let x = 1/b = b^(−1). Then we have

b^d (1 + 2x + 3x^2 + ⋯ + (d − 1)x^(d−2) + d x^(d−1) + (d + 1)x^d)

This is less than the infinite series

b^d (1 + 2x + 3x^2 + 4x^3 + ⋯) = b^d (Σ_{n=1}^{∞} n x^(n−1))

which converges to

b^d (1 − x)^(−2) = b^d · 1/(1 − x)^2

That is, we have

b^d (1 + 2x + 3x^2 + ⋯ + (d − 1)x^(d−2) + d x^(d−1) + (d + 1)x^d) ≤ b^d (1 − x)^(−2), for |x| < 1

Since (1 − x)^(−2) = (1 − 1/b)^(−2) is a constant independent of d (the depth), if b > 1 (i.e., if the branching factor is greater than 1), the running time of the depth-first iterative deepening search is O(b^d).

1.6.3 Example

For b = 10 and d = 5 the number is

Σ_{i=0}^{5} (5 + 1 − i) 10^i = 6 + 50 + 400 + 3000 + 20000 + 100000 = 123456

All together, an iterative deepening search from depth 1 all the way down to depth d expands only about 11% more nodes than a single breadth-first or depth-limited search to depth d, when b = 10. The higher the branching factor, the lower the overhead of repeatedly expanded states, but even when the branching factor is 2, iterative deepening search only takes about twice as long as a complete breadth-first search. This means that the time complexity of iterative deepening is still O(b^d).
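The arithmetic above can be double-checked with a few lines of Python; the helper function names below are illustrative only.

# Quick numerical check of the expansion counts discussed above.
def iddfs_expansions(b, d):
    # sum over i of (d + 1 - i) * b**i
    return sum((d + 1 - i) * b**i for i in range(d + 1))

def bfs_expansions(b, d):
    # a single breadth-first (or depth-limited) search to depth d
    return sum(b**i for i in range(d + 1))

print(iddfs_expansions(10, 5))                            # 123456
print(bfs_expansions(10, 5))                              # 111111
print(iddfs_expansions(10, 5) / bfs_expansions(10, 5))    # ~1.11, i.e. about 11% more
print(iddfs_expansions(2, 10) / bfs_expansions(2, 10))    # ~2 when the branching factor is 2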

Space complexity

The space complexity of IDDFS is O(d), where d is the depth of the goal.

Proof

Since IDDFS, at any point, is engaged in a depth-first search, it need only store a stack of nodes which represents the branch of the tree it is expanding. Since it finds a solution of optimal length, the maximum depth of this stack is d, and hence the maximum amount of space is O(d). In general, iterative deepening is the preferred search method when there is a large search space and the depth of the solution is not known.

1.6.4 Python practice for Iterative Deepening


The Iterative Deepening Depth-First Search (also ID-DFS) algorithm is an algorithm used to find a node in a tree. This means that given a tree data structure, the algorithm will return the first node in this tree that matches the specified condition. Nodes are sometimes referred to as vertices (plural of vertex) - here, we'll call them nodes. The edges have to be unweighted. The algorithm can also work with (unweighted) graphs rather than just trees if a mechanism to keep track of already visited nodes is added.

Description of the Algorithm

The basic principle of the algorithm is to start with a start node, and then look at the first child of this node. It then looks at the first child of that node (grandchild of the start node) and so on, until a node has no more children (we've reached a leaf node). It then goes up one level, and looks at the next child. If there are no more children, it goes up one more level, and so on, until it finds more children or reaches the start node. If it hasn't found the goal node after returning from the last child of the start node, the goal node cannot be found, since by then all nodes have been traversed.

So far this has been describing Depth-First Search (DFS). Iterative deepening adds to this that the algorithm returns one layer up the tree not only when the node has no more children to visit, but also when a previously specified maximum depth has been reached. Also, if we return to the start node, we increase the maximum depth and start the search all over, until we've visited all leaf nodes (bottom nodes) and increasing the maximum depth won't lead to us visiting more nodes.

Specifically, these are the steps:

  1. For each child of the current node

  2. If it is the target node, return

  3. If the current maximum depth is reached, return

  4. Set the current node to this node and go back to 1.

  5. After having gone through all children, go to the next child of the parent (the next sibling)

  6. After having gone through all children of the start node, increase the maximum depth and go back to 1.

  7. If we have reached all leaf (bottom) nodes, the goal node doesn't exist.
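The steps above map almost directly onto code. Below is a minimal sketch, assuming a simple Node class with a children list (the names here are illustrative; a graph-based Python implementation is given further below in this subsection):

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def dls(node, target, depth):
    # depth-limited DFS: stop descending once the depth budget is used up
    if node.value == target:
        return node
    if depth == 0:
        return None
    for child in node.children:
        found = dls(child, target, depth - 1)
        if found is not None:
            return found
    return None

def iddfs(root, target, max_depth):
    # repeatedly deepen the limit until the target is found or max_depth is reached
    for depth in range(max_depth + 1):
        found = dls(root, target, depth)
        if found is not None:
            return found
    return None

# The tree used in the example below: 0 -> (1, 2), 1 -> (3, 4), 2 -> (5, 6)
root = Node(0, [Node(1, [Node(3), Node(4)]),
                Node(2, [Node(5), Node(6)])])
print(iddfs(root, 6, 3).value)   # prints 6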

Example of the Algorithm

Consider the following tree (see 1.18):

PIC

Figure 1.18:IDDFS

The steps the algorithm performs on this tree if given node 0 as a starting point, in order, are:

  1. Visiting Node 0

  2. Visiting Node 1

  3. Current maximum depth reached, returning

  4. Visiting Node 2

  5. Current maximum depth reached, returning

  6. Increasing depth to 2

  7. Visiting Node 0

  8. Visiting Node 1

  9. Visiting Node 3

  10. Current maximum depth reached, returning

  11. Visiting Node 4

  12. Current maximum depth reached, returning

  13. Visiting Node 2

  14. Visiting Node 5

  15. Current maximum depth reached, returning

  16. Visiting Node 6

  17. Found the node we're looking for, returning

Runtime of the Algorithm

If we double the maximum depth each time we need to go deeper, the runtime complexity of Iterative Deepening Depth-First Search (ID-DFS) is the same as regular Depth-First Search (DFS), since all previous depths added up will have the same runtime as the current depth (1/2 + 1/4 + 1/8 + ... < 1). The runtime of regular Depth-First Search (DFS) is O(|N|) (|N| = number of nodes in the tree), since every node is traversed at most once. The number of nodes is equal to b^d, where b is the branching factor and d is the depth, so the runtime can be rewritten as O(b^d).

Space of the Algorithm

The space complexity of Iterative Deepening Depth-First Search (ID-DFS) is the same as regular Depth-First Search (DFS), which is, if we exclude the tree itself, O(d), with d being the depth, which is also the size of the call stack at maximum depth. If we include the tree, the space complexity is the same as the runtime complexity, as each node needs to be saved.

 
1# Python program for Iterative Deepening DFS (IDDFS) 
2# on a directed graph 
3from collections import defaultdict 
4 
5# This class represents a directed graph using adjacency 
6# list representation 
7class Graph: 
8 
9    def __init__(self,vertices): 
10 
11        # No. of vertices 
12        self.V = vertices 
13 
14        # default dictionary to store graph 
15        self.graph = defaultdict(list) 
16 
17    # function to add an edge to graph 
18    def addEdge(self,u,v): 
19        self.graph[u].append(v) 
20 
21    # A function to perform a Depth-Limited search 
22    # from given source src 
23    def DLS(self,src,target,maxDepth): 
24 
25        if src == target : return True 
26 
27        # If reached the maximum depth, stop recursing. 
28        if maxDepth <= 0 : return False 
29 
30        # Recur for all the vertices adjacent to this vertex 
31        for i in self.graph[src]: 
32                if(self.DLS(i,target,maxDepth-1)): 
33                    return True 
34        return False 
35 
36    # IDDFS to search if target is reachable from v. 
37    # It uses recursive DLS() 
38    def IDDFS(self,src, target, maxDepth): 
39 
40        # Repeatedly depth-limit search till the 
41        # maximum depth 
42        for i in range(maxDepth): 
43            if (self.DLS(src, target, i)): 
44                return True 
45        return False 
46 
47# Create a graph given in the above diagram 
48g = Graph (7); 
49g.addEdge(0, 1) 
50g.addEdge(0, 2) 
51g.addEdge(1, 3) 
52g.addEdge(1, 4) 
53g.addEdge(2, 5) 
54g.addEdge(2, 6) 
55 
56target = 6; maxDepth = 3; src = 0 
57 
58if g.IDDFS(src, target, maxDepth) == True: 
59    print ("Target is reachable from source " + 
60        "within max depth") 
61else : 
62    print ("Target is NOT reachable from source " + 
63        "within max depth")  

1.7 Informed/Heuristic breadth-first search

Also known as Best-first search. Best-first search is a class of search algorithms, which explore a graph by expanding the most promising node chosen according to a specified rule. Judea Pearl described the best-first search as estimating the promise of node n by a “heuristic evaluation function f(n) which, in general, may depend on the description of n, the description of the goal, the information gathered by the search up to that point, and most importantly, on any extra knowledge about the problem domain.”

Some authors have used “best-first search” to refer specifically to a search with a heuristic that attempts to predict how close the end of a path is to a solution (or, goal), so that paths which are judged to be closer to a solution (or, goal) are extended first. This specific type of search is called greedy best-first search or pure heuristic search. Efficient selection of the current best candidate for extension is typically implemented using a priority queue.

The A* search algorithm is an example of a best-first search algorithm, as is B*. Best-first algorithms are often used for path finding in combinatorial search. Neither A* nor B* is a greedy best-first search, as they incorporate the distance from the start in addition to estimated distances to the goal. Using a greedy algorithm, expand the first successor of the parent. After a successor is generated:

  1. If the successor’s heuristic is better than its parent, the successor is set at the front of the queue (with the parent reinserted directly behind it), and the loop restarts.

  2. Else, the successor is inserted into the queue (in a location determined by its heuristic value). The procedure will evaluate the remaining successors (if any) of the parent.

In BFS and DFS, when we are at a node, we can consider any of the adjacent as next node. So both BFS and DFS blindly explore paths without considering any cost function. The idea of Best First Search is to use an evaluation function to decide which adjacent is most promising and then explore. Best First Search falls under the category of Heuristic Search or Informed Search. We use a priority queue to store costs of nodes. So the implementation is a variation of BFS, we just need to change Queue to PriorityQueue. Let us consider the below example.

PIC

Figure 1.19:Informed/Heuristic breadth-first search

1.7.1 Procedure

We start from source “S” and search for goal “I” using given costs and Best First search.

“pq” initially contains “S”. We remove “S” from “pq” and add the unvisited neighbors of “S” to “pq”. “pq” now contains A, C, B (C is put before B because C has a lower cost).

We remove “A” from “pq” and add the unvisited neighbors of “A” to “pq”. “pq” now contains C, B, E, D.

We remove “C” from “pq” and add the unvisited neighbors of “C” to “pq”. “pq” now contains B, H, E, D.

We remove “B” from “pq” and add the unvisited neighbors of “B” to “pq”. “pq” now contains H, E, D, F, G.

We remove “H” from “pq”. Since our goal “I” is a neighbor of “H”, we return.

Below is the implementation of the above idea:

 
1from queue import PriorityQueue 
2v = 14 
3graph = [[] for i in range(v)] 
4 
5# Function For Implementing Best First Search 
6# Gives output path having lowest cost 
7 
8 
9def best_first_search(source, target, n): 
10    visited = [False] * n 
11    visited[source] = True 
12    pq = PriorityQueue() 
13    pq.put((0, source)) 
14    while pq.empty() == False: 
15        u = pq.get()[1] 
16        # Displaying the path having lowest cost 
17        print(u, end=" ") 
18        if u == target: 
19            break 
20 
21        for v, c in graph[u]: 
22            if visited[v] == False: 
23                visited[v] = True 
24                pq.put((c, v)) 
25    print() 
26 
27# Function for adding edges to graph 
28 
29 
30def addedge(x, y, cost): 
31    graph[x].append((y, cost)) 
32    graph[y].append((x, cost)) 
33 
34 
35# The nodes shown in above example(by alphabets) are 
36# implemented using integers addedge(x,y,cost); 
37addedge(0, 1, 3) 
38addedge(0, 2, 6) 
39addedge(0, 3, 5) 
40addedge(1, 4, 9) 
41addedge(1, 5, 8) 
42addedge(2, 6, 12) 
43addedge(2, 7, 14) 
44addedge(3, 8, 7) 
45addedge(8, 9, 5) 
46addedge(8, 10, 6) 
47addedge(9, 11, 1) 
48addedge(9, 12, 10) 
49addedge(9, 13, 2) 
50 
51source = 0 
52target = 9 
53best_first_search(source, target, v)  

Output

0 1 3 2 8 9

1.8 A* Search

A* (“A-star”) is a graph traversal and path search algorithm, which is often used in many fields of computer science due to its completeness, optimality, and optimal efficiency. 13 One major practical drawback is its O(b^d) space complexity, as it stores all generated nodes in memory. Thus, in practical travel-routing systems, it is generally outperformed by algorithms which can pre-process the graph to attain better performance, as well as memory-bounded approaches; however, A* is still the best solution in many cases. Peter Hart, Nils Nilsson and Bertram Raphael of Stanford Research Institute (now SRI International) first published the algorithm in 1968. 14 It can be seen as an extension of Dijkstra’s algorithm. A* achieves better performance by using heuristics to guide its search.

A* was created as part of the Shakey project, which had the aim of building a mobile robot that could plan its own actions. Nils Nilsson originally proposed using the Graph Traverser algorithm for Shakey’s path planning. Graph Traverser is guided by a heuristic function h(n), the estimated distance from node n to the goal node: it entirely ignores g(n), the distance from the start node to n. Bertram Raphael suggested using the sum, g(n) + h(n). Peter Hart invented the concepts we now call admissibility and consistency of heuristic functions. A* was originally designed for finding least-cost paths when the cost of a path is the sum of its costs, but it has been shown that A* can be used to find optimal paths for any problem satisfying the conditions of a cost algebra.

The original 1968 A* paper contained a theorem stating that no A*-like algorithm could expand fewer nodes than A* if the heuristic function is consistent and A*’s tie-breaking rule is suitably chosen. 15 A “correction” was published a few years later[8] claiming that consistency was not required, but this was shown to be false in Dechter and Pearl’s definitive study of A*’s optimality (now called optimal efficiency), which gave an example of A* with a heuristic that was admissible but not consistent expanding arbitrarily more nodes than an alternative A*-like algorithm.

1.8.1 Description

A* is an informed search algorithm, or a best-first search, meaning that it is formulated in terms of weighted graphs: starting from a specific starting node of a graph, it aims to find a path to the given goal node having the smallest cost (least distance travelled, shortest time, etc.). It does this by maintaining a tree of paths originating at the start node and extending those paths one edge at a time until its termination criterion is satisfied.

At each iteration of its main loop, A* needs to determine which of its paths to extend. It does so based on the cost of the path and an estimate of the cost required to extend the path all the way to the goal. Specifically, A* selects the path that minimizes

f(n) = g(n) + h(n)

where n is the next node on the path, g(n) is the cost of the path from the start node to n, and h(n) is a heuristic function that estimates the cost of the cheapest path from n to the goal. A* terminates when the path it chooses to extend is a path from start to goal or if there are no paths eligible to be extended. The heuristic function is problem-specific. If the heuristic function is admissible, meaning that it never overestimates the actual cost to get to the goal, A* is guaranteed to return a least-cost path from start to goal.

Typical implementations of A* use a priority queue to perform the repeated selection of minimum (estimated) cost nodes to expand. This priority queue is known as the open set or fringe. At each step of the algorithm, the node with the lowest f(x) value is removed from the queue, the f and g values of its neighbors are updated accordingly, and these neighbors are added to the queue. The algorithm continues until a removed node (thus the node with the lowest f value out of all fringe nodes) is a goal node. The f value of that goal is then also the cost of the shortest path, since h at the goal is zero in an admissible heuristic.

The algorithm described so far gives us only the length of the shortest path. To find the actual sequence of steps, the algorithm can be easily revised so that each node on the path keeps track of its predecessor. After this algorithm is run, the ending node will point to its predecessor, and so on, until some node’s predecessor is the start node.

As an example, when searching for the shortest route on a map, h(x) might represent the straight-line distance to the goal, since that is physically the smallest possible distance between any two points. For a grid map from a video game, using the Manhattan distance or the octile distance becomes better depending on the set of movements available (4-way or 8-way).

PIC

Figure 1.20:SRI’s Shakey Robot

If the heuristic h satisfies the additional condition h(x) ≤ d(x,y) + h(y) for every edge (x,y) of the graph (where d denotes the length of that edge), then h is called monotone, or consistent. With a consistent heuristic, A* is guaranteed to find an optimal path without processing any node more than once, and A* is equivalent to running Dijkstra’s algorithm with the reduced cost d'(x,y) = d(x,y) + h(y) − h(x).
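The consistency condition is easy to check mechanically on a small graph. Below is a minimal sketch, assuming the graph is stored as an adjacency map of (neighbor, weight) pairs like the one used in the practice code further below (the helper name and the example heuristic are illustrative):

def is_consistent(adjacency_list, h):
    # verify h(x) <= d(x, y) + h(y) for every edge (x, y)
    for x, edges in adjacency_list.items():
        for y, weight in edges:
            if h(x) > weight + h(y):
                return False
    return True

adjacency_list = {'A': [('B', 1), ('C', 3), ('D', 7)],
                  'B': [('D', 5)],
                  'C': [('D', 12)],
                  'D': []}
h = {'A': 1, 'B': 1, 'C': 1, 'D': 0}   # 0 at the goal node D
print(is_consistent(adjacency_list, lambda n: h[n]))   # True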

1.8.2 Complexity

The time complexity of A* depends on the heuristic. In the worst case of an unbounded search space, the number of nodes expanded is exponential in the depth of the solution (the shortest path) d: O(b^d), where b is the branching factor (the average number of successors per state). This assumes that a goal state exists at all, and is reachable from the start state; if it is not, and the state space is infinite, the algorithm will not terminate.

The heuristic function has a major effect on the practical performance of A* search, since a good heuristic allows A* to prune away many of the b^d nodes that an uninformed search would expand. Its quality can be expressed in terms of the effective branching factor b*, which can be determined empirically for a problem instance by measuring the number of nodes generated by expansion, N, and the depth of the solution, then solving

N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d

Good heuristics are those with a low effective branching factor (the optimal being b* = 1). The time complexity is polynomial when the search space is a tree, there is a single goal state, and the heuristic function h meets the following condition:

|h(x) − h*(x)| = O(log h*(x))

where h* is the optimal heuristic, the exact cost to get from x to the goal. In other words, the error of h will not grow faster than the logarithm of the “perfect heuristic” h* that returns the true distance from x to the goal. The space complexity of A* is roughly the same as that of all other graph search algorithms, as it keeps all generated nodes in memory. In practice, this turns out to be the biggest drawback of A* search, leading to the development of memory-bounded heuristic searches, such as Iterative deepening A*, memory bounded A*, and SMA*.
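The defining equation for b* has no closed form, but it can be solved numerically. A minimal sketch (the helper name is illustrative), assuming N + 1 > d + 1 so that b* > 1:

def effective_branching_factor(N, d, tol=1e-9):
    # solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d by bisection on b*
    def total(b):
        return sum(b**i for i in range(d + 1))
    lo, hi = 1.0, float(N)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# e.g. a search that generated N = 52 nodes to reach a solution at depth d = 5
print(round(effective_branching_factor(52, 5), 3))   # about 1.92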

1.8.3 Python practice

This is a direct implementation of A* on a graph structure. The heuristic function is defined as 1 for all nodes for the sake of simplicity and brevity. The graph is represented with an adjacency list, where the keys represent graph nodes, and the values contain a list of edges with the corresponding neighboring nodes. Here you’ll find the A* algorithm implemented in Python:

 
1from collections import deque 
2 
3class Graph: 
4    # example of adjacency list (or rather map) 
5    # adjacency_list = { 
6    # 'A': [('B', 1), ('C', 3), ('D', 7)], 
7    # 'B': [('D', 5)], 
8    # 'C': [('D', 12)] 
9    # } 
10 
11    def __init__(self, adjacency_list): 
12        self.adjacency_list = adjacency_list 
13 
14    def get_neighbors(self, v): 
15        return self.adjacency_list[v] 
16 
17    # heuristic function with equal values for all nodes 
18    def h(self, n): 
19        H = { 
20            'A': 1, 
21            'B': 1, 
22            'C': 1, 
23            'D': 1 
24        } 
25 
26        return H[n] 
27 
28    def a_star_algorithm(self, start_node, stop_node): 
29        # open_list is a list of nodes which have been visited, but whose neighbors 
30        # haven't all been inspected; it starts off with the start node 
31        # closed_list is a list of nodes which have been visited 
32        # and whose neighbors have been inspected 
33        open_list = set([start_node]) 
34        closed_list = set([]) 
35 
36        # g contains current distances from start_node to all other nodes 
37        # the default value (if it's not found in the map) is +infinity 
38        g = {} 
39 
40        g[start_node] = 0 
41 
42        # parents contains an adjacency map of all nodes 
43        parents = {} 
44        parents[start_node] = start_node 
45 
46        while len(open_list) > 0: 
47            n = None 
48 
49            # find a node with the lowest value of f() - evaluation function 
50            for v in open_list: 
51                if n == None or g[v] + self.h(v) < g[n] + self.h(n): 
52                    n = v; 
53 
54            if n == None: 
55                print('Path does not exist!') 
56                return None 
57 
58            # if the current node is the stop_node 
59            # then we begin reconstructing the path from it to the start_node 
60            if n == stop_node: 
61                reconst_path = [] 
62 
63                while parents[n] != n: 
64                    reconst_path.append(n) 
65                    n = parents[n] 
66 
67                reconst_path.append(start_node) 
68 
69                reconst_path.reverse() 
70 
71                print('Path found: {}'.format(reconst_path)) 
72                return reconst_path 
73 
74            # for all neighbors of the current node do 
75            for (m, weight) in self.get_neighbors(n): 
76                # if the neighbor m is in neither open_list nor closed_list 
77                # add it to open_list and note n as its parent 
78                if m not in open_list and m not in closed_list: 
79                    open_list.add(m) 
80                    parents[m] = n 
81                    g[m] = g[n] + weight 
82 
83                # otherwise, check if it's quicker to first visit n, then m 
84                # and if it is, update parent data and g data 
85                # and if the node was in the closed_list, move it to open_list 
86                else: 
87                    if g[m] > g[n] + weight: 
88                        g[m] = g[n] + weight 
89                        parents[m] = n 
90 
91                        if m in closed_list: 
92                            closed_list.remove(m) 
93                            open_list.add(m) 
94 
95            # remove n from the open_list, and add it to closed_list 
96            # because all of its neighbors were inspected 
97            open_list.remove(n) 
98            closed_list.add(n) 
99 
100        print('Path does not exist!') 
101        return None  

Let’s look at an example with the following weighted graph:

PIC

Figure 1.21:A* graph

We run the code as so:

 
1adjacency_list = { 
2    'A': [('B', 1), ('C', 3), ('D', 7)], 
3    'B': [('D', 5)], 
4    'C': [('D', 12)] 
5} 
6graph1 = Graph(adjacency_list) 
7graph1.a_star_algorithm('A', 'D')  
8

And the output would look like:

Path found: [’A’, ’B’, ’D’]
[’A’, ’B’, ’D’]

Thus, the optimal path from A to D, found using A*, is A -> B -> D.

1.9 Branch and bound

Branch and bound (BB, B&B, or BnB) is an algorithm design paradigm for discrete and combinatorial optimization problems, as well as mathematical optimization. A branch-and-bound algorithm consists of a systematic enumeration of candidate solutions by means of state space search: the set of candidate solutions is thought of as forming a rooted tree with the full set at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. Before enumerating the candidate solutions of a branch, the branch is checked against upper and lower estimated bounds on the optimal solution, and is discarded if it cannot produce a better solution than the best one found so far by the algorithm. The algorithm depends on efficient estimation of the lower and upper bounds of regions/branches of the search space. If no bounds are available, the algorithm degenerates to an exhaustive search.

The method was first proposed by Ailsa Land and Alison Doig whilst carrying out research at the London School of Economics sponsored by British Petroleum in 1960 for discrete programming, and has become the most commonly used tool for solving NP-hard optimization problems. The name “branch and bound” first occurred in the work of Little et al. on the traveling salesman problem.

1.9.1 Overview

The goal of a branch-and-bound algorithm is to find a value x that maximizes or minimizes the value of a real-valued function f(x), called an objective function, among some set S of admissible, or candidate solutions. The set S is called the search space, or feasible region. The rest of this section assumes that minimization of f(x) is desired; this assumption comes without loss of generality, since one can find the maximum value of f(x) by finding the minimum of g(x) = −f(x). A B&B algorithm operates according to two principles:

  1. It recursively splits the search space into smaller spaces, then minimizes f(x) on these smaller spaces; the splitting is called branching.

  2. Branching alone would amount to brute-force enumeration of candidate solutions and testing them all. To improve on the performance of brute-force search, a B&B algorithm keeps track of bounds on the minimum that it is trying to find, and uses these bounds to ”prune” the search space, eliminating candidate solutions that it can prove will not contain an optimal solution.

Turning these principles into a concrete algorithm for a specific optimization problem requires some kind of data structure that represents sets of candidate solutions. Such a representation is called an instance of the problem. Denote the set of candidate solutions of an instance I by S_I. The instance representation has to come with three operations:

  1. branch(I) produces two or more instances that each represent a subset of S_I. (Typically, the subsets are disjoint to prevent the algorithm from visiting the same candidate solution twice, but this is not required. However, an optimal solution among S_I must be contained in at least one of the subsets.)

  2. bound(I) computes a lower bound on the value of any candidate solution in the space represented by I, that is, bound(I) ≤ f(x) for all x in S_I.

  3. solution(I) determines whether I represents a single candidate solution. (Optionally, if it does not, the operation may choose to return some feasible solution from among S_I.) If solution(I) returns a solution then f(solution(I)) provides an upper bound for the optimal objective value over the whole space of feasible solutions.

Using these operations, a B&B algorithm performs a top-down recursive search through the tree of instances formed by the branch operation. Upon visiting an instance I, it checks whether bound(I) is greater than an upper bound found so far; if so, I may be safely discarded from the search and the recursion stops. This pruning step is usually implemented by maintaining a global variable that records the minimum upper bound seen among all instances examined so far.

1.9.2 Generic version

The following is the skeleton of a generic branch and bound algorithm for minimizing an arbitrary objective function f . To obtain an actual algorithm from this, one requires a bounding function bound, that computes lower bounds of f on nodes of the search tree, as well as a problem-specific branching rule. As such, the generic algorithm presented here is a higher-order function.

  1. Using a heuristic, find a solution xh to the optimization problem. Store its value, B = f(xh). (If no heuristic is available, set B to infinity.) B will denote the best solution found so far, and will be used as an upper bound on candidate solutions.

  2. Initialize a queue to hold a partial solution with none of the variables of the problem assigned.

  3. Loop until the queue is empty: take a node off the queue; if it represents a single candidate solution with objective value smaller than B, record it as the best solution found so far and update B; otherwise, branch on it and enqueue only those sub-instances whose bound is lower than B (a minimal sketch follows below).

Several different queue data structures can be used. This FIFO queue-based implementation yields a breadth-first search. A stack (LIFO queue) will yield a depth-first algorithm. A best-first branch and bound algorithm can be obtained by using a priority queue that sorts nodes on their lower bound. Examples of best-first search algorithms with this premise are Dijkstra’s algorithm and its descendant A* search. The depth-first variant is recommended when no good heuristic is available for producing an initial solution, because it quickly produces full solutions, and therefore upper bounds.
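Putting the three operations together, the generic loop can be sketched in a few lines of Python (the function and parameter names below are illustrative, not a standard API):

from collections import deque

def branch_and_bound(root, branch, bound, solution, objective, B=float('inf')):
    # Minimize `objective` over the candidate set represented by `root`,
    # using the problem-specific operations branch(I), bound(I), solution(I).
    best = None
    queue = deque([root])            # FIFO queue: breadth-first branch and bound
    while queue:
        instance = queue.popleft()
        if bound(instance) >= B:     # cannot beat the best solution found so far
            continue
        candidate = solution(instance)
        if candidate is not None:    # instance stands for a single feasible solution
            value = objective(candidate)
            if value < B:
                B, best = value, candidate
        else:
            queue.extend(branch(instance))   # split into sub-instances
    return best, B

Swapping the deque for a stack gives the depth-first variant, and a priority queue ordered by bound(I) gives the best-first variant mentioned above.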

1.9.3 Knapsack problem

Branch and bound is an algorithm design paradigm which is generally used for solving combinatorial optimization problems. These problems are typically exponential in terms of time complexity and may require exploring all possible permutations in worst case. The Branch and Bound Algorithm technique solves these problems relatively quickly. Let us consider the 0/1 Knapsack problem to understand Branch and Bound. There are many algorithms by which the knapsack problem can be solved:

  1. Greedy Algorithm for Fractional Knapsack

  2. DP solution for 0/1 Knapsack

  3. Backtracking Solution for 0/1 Knapsack.

0/1 Knapsack using Branch and Bound

Let us consider the below 0/1 Knapsack problem to understand Branch and Bound. Given two integer arrays val[0..n-1] and wt[0..n-1] that represent values and weights associated with n items respectively, find out the maximum value subset of val[] such that the sum of the weights of this subset is smaller than or equal to the Knapsack capacity W.

Let us explore all approaches for this problem.

  1. A Greedy approach is to pick the items in decreasing order of value per unit weight. The Greedy approach works only for the fractional knapsack problem and may not produce the correct result for 0/1 knapsack.

  2. We can use Dynamic Programming (DP) for the 0/1 Knapsack problem. In DP, we use a 2D table of size n × W. The DP solution doesn't work if item weights are not integers.

  3. Since the DP solution does not always work, a solution is to use Brute Force. With n items, there are 2^n candidate solutions to be generated; check each to see if it satisfies the constraint, and save the maximum solution that satisfies the constraint. This solution can be expressed as a tree.

    PIC
    Figure 1.22:0/1 Knapsack
  4. We can use Backtracking to optimize the Brute Force solution. In the tree representation, we can do DFS of tree. If we reach a point where a solution no longer is feasible, there is no need to continue exploring. In the given example, backtracking would be much more effective if we had even more items or a smaller knapsack capacity.

    PIC
    Figure 1.23:0/1 Knapsack

The backtracking based solution works better than brute force by ignoring infeasible solutions. We can do better (than backtracking) if we know a bound on the best possible solution in the subtree rooted at every node. If the best in the subtree is worse than the current best, we can simply ignore this node and its subtrees. So we compute a bound (best solution) for every node and compare the bound with the current best solution before exploring the node. Example bounds used in the below diagram are: A down can give $315, B down can give $275, C down can give $225, D down can give $125 and E down can give $30. The process of computing these bounds is discussed in the implementation below.

PIC

Figure 1.24:0/1 Knapsack

Branch and bound is a very useful technique for searching for a solution, but in the worst case we still need to fully evaluate the entire tree. At best, we only need to fully evaluate one path through the tree and can prune the rest of it.

Implementation

How to find the bound for every node for 0/1 Knapsack? The idea is to use the fact that the Greedy approach provides the best solution for the Fractional Knapsack problem. To check whether a particular node can give us a better solution or not, we compute the optimal (fractional) solution through the node using the Greedy approach. If even this greedy bound is not more than the best solution found so far, then we can't get a better solution through the node.

Complete Algorithm:

  1. Sort all items in decreasing order of ratio of value per unit weight so that an upper bound can be computed using Greedy Approach.

  2. Initialize maximum profit, maxProfit = 0

  3. Create an empty queue, Q.

  4. Create a dummy node of decision tree and enqueue it to Q. Profit and weight of dummy node are 0.

  5. Do following while Q is not empty.

  6. Extract an item from Q. Let the extracted item be u.

  7. Compute profit of next level node. If the profit is more than maxProfit, then update maxProfit.

  8. Compute bound of next level node. If bound is more than maxProfit, then add next level node to Q.

  9. Also consider the case when the next level's item is not included in the solution: add a node to the queue with the level incremented, but with the weight and profit unchanged (i.e., without adding the next level item).

Illustration
1Input: 
2// First thing in every pair is weight of item 
3// and second thing is value of item 
4Item arr[] = {{2, 40}, {3.14, 50}, {1.98, 100}, 
5              {5, 95}, {3, 30}}; 
6Knapsack Capacity W = 10 
7 
8Output: 
9The maximum possible profit = 235 
10 
11Below diagram shows illustration. Items are 
12considered sorted by value/weight.

PIC

Figure 1.25:Implementation of 0/1 Knapsack using Branch and Bound
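The expected output above can be cross-checked with the 2^n brute force mentioned in point 3 earlier. A minimal Python sketch for this particular instance (names are illustrative):

from itertools import combinations

# (weight, value) pairs and the capacity from the illustration above
items = [(2, 40), (3.14, 50), (1.98, 100), (5, 95), (3, 30)]
W = 10

best = 0
for r in range(len(items) + 1):
    for subset in combinations(items, r):
        if sum(w for w, v in subset) <= W:
            best = max(best, sum(v for w, v in subset))
print(best)   # 235, matching the branch and bound result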

Following is C++ implementation of above idea.

1// C++ program to solve knapsack problem using 
2// branch and bound 
3#include <bits/stdc++.h> 
4using namespace std; 
5 
6// Structure for Item which store weight and corresponding 
7// value of Item 
8struct Item 
9{ 
10    float weight; 
11    int value; 
12}; 
13 
14// Node structure to store information of decision 
15// tree 
16struct Node 
17{ 
18    // level --> Level of node in decision tree (or index 
19    //     in arr[] 
20    // profit --> Profit of nodes on path from root to this 
21    //   node (including this node) 
22    // bound ---> Upper bound of maximum profit in subtree 
23    //   of this node/ 
24    int level, profit, bound; 
25    float weight; 
26}; 
27 
28// Comparison function to sort Item according to 
29// val/weight ratio 
30bool cmp(Item a, Item b) 
31{ 
32    double r1 = (double)a.value / a.weight; 
33    double r2 = (double)b.value / b.weight; 
34    return r1 > r2; 
35} 
36 
37// Returns bound of profit in subtree rooted with u. 
38// This function mainly uses Greedy solution to find 
39// an upper bound on maximum profit. 
40int bound(Node u, int n, int W, Item arr[]) 
41{ 
42    // if weight overcomes the knapsack capacity, return 
43    // 0 as expected bound 
44    if (u.weight >= W) 
45        return 0; 
46 
47    // initialize bound on profit by current profit 
48    int profit_bound = u.profit; 
49 
50    // start including items from index 1 more to current 
51    // item index 
52    int j = u.level + 1; 
53    int totweight = u.weight; 
54 
55    // checking index condition and knapsack capacity 
56    // condition 
57    while ((j < n) && (totweight + arr[j].weight <= W)) 
58    { 
59        totweight += arr[j].weight; 
60        profit_bound += arr[j].value; 
61        j++; 
62    } 
63 
64    // If j is not n, include the last item partially for 
65    // upper bound on profit 
66    if (j < n) 
67        profit_bound += (W - totweight) * arr[j].value / 
68                                        arr[j].weight; 
69 
70    return profit_bound; 
71} 
72 
73// Returns maximum profit we can get with capacity W 
74int knapsack(int W, Item arr[], int n) 
75{ 
76    // sorting Item on basis of value per unit 
77    // weight. 
78    sort(arr, arr + n, cmp); 
79 
80    // make a queue for traversing the node 
81    queue<Node> Q; 
82    Node u, v; 
83 
84    // dummy node at starting 
85    u.level = -1; 
86    u.profit = u.weight = 0; 
87    Q.push(u); 
88 
89    // One by one extract an item from decision tree 
90    // compute profit of all children of extracted item 
91    // and keep saving maxProfit 
92    int maxProfit = 0; 
93    while (!Q.empty()) 
94    { 
95        // Dequeue a node 
96        u = Q.front(); 
97        Q.pop(); 
98 
99        // If it is starting node, assign level 0 
100        if (u.level == -1) 
101            v.level = 0; 
102 
103        // If there is nothing on next level 
104        if (u.level == n-1) 
105            continue; 
106 
107        // Else if not last node, then increment level, 
108        // and compute profit of children nodes. 
109        v.level = u.level + 1; 
110 
111        // Taking the current level's item, add the current 
112        // level's weight and value to node u's 
113        // weight and value 
114        v.weight = u.weight + arr[v.level].weight; 
115        v.profit = u.profit + arr[v.level].value; 
116 
117        // If cumulated weight is less than W and 
118        // profit is greater than previous profit, 
119        // update maxprofit 
120        if (v.weight <= W && v.profit > maxProfit) 
121            maxProfit = v.profit; 
122 
123        // Get the upper bound on profit to decide 
124        // whether to add v to Q or not. 
125        v.bound = bound(v, n, W, arr); 
126 
127        // If bound value is greater than profit, 
128        // then only push into queue for further 
129        // consideration 
130        if (v.bound > maxProfit) 
131            Q.push(v); 
132 
133        // Do the same thing, but Without taking 
134        // the item in knapsack 
135        v.weight = u.weight; 
136        v.profit = u.profit; 
137        v.bound = bound(v, n, W, arr); 
138        if (v.bound > maxProfit) 
139            Q.push(v); 
140    } 
141 
142    return maxProfit; 
143} 
144 
145// driver program to test above function 
146int main() 
147{ 
148    int W = 10; // Weight of knapsack 
149    Item arr[] = {{2, 40}, {3.14, 50}, {1.98, 100}, 
150                {5, 95}, {3, 30}}; 
151    int n = sizeof(arr) / sizeof(arr[0]); 
152 
153    cout << "Maximum possible profit = " 
154        << knapsack(W, arr, n); 
155 
156    return 0; 
157}

Output :

1Maximum possible profit = 235

1.9.4 8 puzzle problem

Given a 3 × 3 board with 8 tiles (every tile has one number from 1 to 8) and one empty space, the objective is to place the numbers on the tiles to match the final configuration using the empty space. We can slide the four adjacent (left, right, above, and below) tiles into the empty space.

PIC

Figure 1.26:8 puzzle problem
Methods

  1. DFS (Brute-Force) We can perform a depth-first search on state-space (Set of all configurations of a given problem i.e. all states that can be reached from the initial state) tree.

    PIC
    Figure 1.27:8 puzzle problem - Puzzle 1 (State Space Tree for 8 Puzzle)

    In this solution, successive moves can take us away from the goal rather than bringing us closer. The search of state-space tree follows the leftmost path from the root regardless of the initial state. An answer node may never be found in this approach.

  2. BFS (Brute-Force) We can perform a Breadth-first search on the state space tree. This always finds a goal state nearest to the root. But no matter what the initial state is, the algorithm attempts the same sequence of moves like DFS.

  3. Branch and Bound The search for an answer node can often be speeded by using an “intelligent” ranking function, also called an approximate cost function to avoid searching in sub-trees that do not contain an answer node. It is similar to the backtracking technique but uses a BFS-like search.

    There are basically three types of nodes involved in Branch and Bound

    1. Live node is a node that has been generated but whose children have not yet been generated.

    2. E-node is a live node whose children are currently being explored. In other words, an E-node is a node currently being expanded.

    3. Dead node is a generated node that is not to be expanded or explored any further. All children of a dead node have already been expanded.

    Cost function:

    Each node X in the search tree is associated with a cost. The cost function is useful for determining the next E-node. The next E-node is the one with the least cost. The cost function is defined as

    1   C(X) = g(X) + h(X) where 
    2   g(X) = cost of reaching the current node 
    3          from the root 
    4   h(X) = cost of reaching an answer node from X.

    The ideal Cost function for an 8-puzzle Algorithm :

    We assume that moving one tile in any direction will have a 1 unit cost. Keeping that in mind, we define a cost function for the 8-puzzle algorithm as below:

    1   c(x) = f(x) + h(x) where 
    2   f(x) is the length of the path from root to x 
    3        (the number of moves so far) and 
    4   h(x) is the number of non-blank tiles not in 
    5        their goal position (the number of mis- 
    6        -placed tiles). There are at least h(x) 
    7        moves to transform state x to a goal state

An algorithm is available for getting an approximation of h(x) which is an unknown value. The below diagram shows the path followed by the algorithm to reach the final configuration from the given initial configuration of the 8-Puzzle. Note that only nodes having the least value of cost function are expanded.
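For instance, with the initial and goal configurations used in the Python program below, h(x) = 3 for the initial state, since tiles 6, 7 and 8 are out of their goal positions; the printed solution indeed reaches the goal in three moves.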

PIC

Figure 1.28:8 puzzle problem
 
1# Python3 program to print the path from root 
2# node to destination node for N*N-1 puzzle 
3# algorithm using Branch and Bound 
4# The solution assumes that instance of 
5# puzzle is solvable 
6 
7# Importing copy for deepcopy function 
8import copy 
9 
10# Importing the heap functions from python 
11# library for Priority Queue 
12from heapq import heappush, heappop 
13 
14# This variable can be changed to change 
15# the program from 8 puzzle(n=3) to 15 
16# puzzle(n=4) to 24 puzzle(n=5)... 
17n = 3 
18 
19# bottom, left, top, right 
20row = [ 1, 0, -1, 0 ] 
21col = [ 0, -1, 0, 1 ] 
22 
23# A class for Priority Queue 
24class priorityQueue: 
25 
26    # Constructor to initialize a 
27    # Priority Queue 
28    def __init__(self): 
29        self.heap = [] 
30 
31    # Inserts a new key k 
32    def push(self, k): 
33        heappush(self.heap, k) 
34 
35    # Method to remove minimum element 
36    # from Priority Queue 
37    def pop(self): 
38        return heappop(self.heap) 
39 
40    # Method to know if the Queue is empty 
41    def empty(self): 
42        if not self.heap: 
43            return True 
44        else: 
45            return False 
46 
47# Node structure 
48class node: 
49 
50    def __init__(self, parent, mat, empty_tile_pos, 
51                 cost, level): 
52 
53        # Stores the parent node of the 
54        # current node helps in tracing 
55        # path when the answer is found 
56        self.parent = parent 
57 
58        # Stores the matrix 
59        self.mat = mat 
60 
61        # Stores the position at which the 
62        # empty space tile exists in the matrix 
63        self.empty_tile_pos = empty_tile_pos 
64 
65        # Stores the number of misplaced tiles 
66        self.cost = cost 
67 
68        # Stores the number of moves so far 
69        self.level = level 
70 
71    # This method is defined so that the 
72    # priority queue is formed based on 
73    # the cost variable of the objects 
74    def __lt__(self, nxt): 
75        return self.cost < nxt.cost 
76 
77# Function to calculate the number of 
78# misplaced tiles ie. number of non-blank 
79# tiles not in their goal position 
80def calculateCost(mat, final) -> int: 
81 
82    count = 0 
83    for i in range(n): 
84        for j in range(n): 
85            if ((mat[i][j]) and 
86                (mat[i][j] != final[i][j])): 
87                count += 1 
88 
89    return count 
90 
91def newNode(mat, empty_tile_pos, new_empty_tile_pos, 
92            level, parent, final) -> node: 
93 
94    # Copy data from parent matrix to current matrix 
95    new_mat = copy.deepcopy(mat) 
96 
97    # Move tile by 1 position 
98    x1 = empty_tile_pos[0] 
99    y1 = empty_tile_pos[1] 
100    x2 = new_empty_tile_pos[0] 
101    y2 = new_empty_tile_pos[1] 
102    new_mat[x1][y1], new_mat[x2][y2] = new_mat[x2][y2], new_mat[x1][y1] 
103 
104    # Set number of misplaced tiles 
105    cost = calculateCost(new_mat, final) 
106 
107    new_node = node(parent, new_mat, new_empty_tile_pos, 
108                    cost, level) 
109    return new_node 
110 
111# Function to print the N x N matrix 
112def printMatrix(mat): 
113 
114    for i in range(n): 
115        for j in range(n): 
116            print("%d " % (mat[i][j]), end = " ") 
117 
118        print() 
119 
120# Function to check if (x, y) is a valid 
121# matrix coordinate 
122def isSafe(x, y): 
123 
124    return x >= 0 and x < n and y >= 0 and y < n 
125 
126# Print path from root node to destination node 
127def printPath(root): 
128 
129    if root == None: 
130        return 
131 
132    printPath(root.parent) 
133    printMatrix(root.mat) 
134    print() 
135 
136# Function to solve N*N - 1 puzzle algorithm 
137# using Branch and Bound. empty_tile_pos is 
138# the blank tile position in the initial state. 
139def solve(initial, empty_tile_pos, final): 
140 
141    # Create a priority queue to store live 
142    # nodes of search tree 
143    pq = priorityQueue() 
144 
145    # Create the root node 
146    cost = calculateCost(initial, final) 
147    root = node(None, initial, 
148                empty_tile_pos, cost, 0) 
149 
150    # Add root to list of live nodes 
151    pq.push(root) 
152 
153    # Finds a live node with least cost, 
154    # add its children to list of live 
155    # nodes and finally deletes it from 
156    # the list. 
157    while not pq.empty(): 
158 
159        # Find a live node with least estimated 
160        # cost and delete it form the list of 
161        # live nodes 
162        minimum = pq.pop() 
163 
164        # If minimum is the answer node 
165        if minimum.cost == 0: 
166 
167            # Print the path from root to 
168            # destination; 
169            printPath(minimum) 
170            return 
171 
172        # Generate all possible children 
173        for i in range(4):  # try all four moves (bottom, left, top, right) 
174            new_tile_pos = [ 
175                minimum.empty_tile_pos[0] + row[i], 
176                minimum.empty_tile_pos[1] + col[i], ] 
177 
178            if isSafe(new_tile_pos[0], new_tile_pos[1]): 
179 
180                # Create a child node 
181                child = newNode(minimum.mat, 
182                                minimum.empty_tile_pos, 
183                                new_tile_pos, 
184                                minimum.level + 1, 
185                                minimum, final,) 
186 
187                # Add child to list of live nodes 
188                pq.push(child) 
189 
190# Driver Code 
191 
192# Initial configuration 
193# Value 0 is used for empty space 
194initial = [ [ 1, 2, 3 ], 
195            [ 5, 6, 0 ], 
196            [ 7, 8, 4 ] ] 
197 
198# Solvable Final configuration 
199# Value 0 is used for empty space 
200final = [ [ 1, 2, 3 ], 
201          [ 5, 8, 6 ], 
202          [ 0, 7, 4 ] ] 
203 
204# Blank tile coordinates in 
205# initial configuration 
206empty_tile_pos = [ 1, 2 ] 
207 
208# Function call to solve the puzzle 
209solve(initial, empty_tile_pos, final)  

Output :

1 2 3
5 6 0
7 8 4

1 2 3
5 0 6
7 8 4

1 2 3
5 8 6
7 0 4

1 2 3
5 8 6
0 7 4

1.9.5 Job Assignment Problem

Let there be N workers and N jobs. Any worker can be assigned to perform any job, incurring some cost that may vary depending on the worker-job assignment. It is required to perform all jobs by assigning exactly one worker to each job and exactly one job to each worker in such a way that the total cost of the assignment is minimized.

PIC

Figure 1.29:Job Assignment

Let us explore all approaches for this problem.

Solution 1: Brute Force

We generate all n! possible job assignments, compute the total cost of each, and return the least expensive assignment. Since the solution is a permutation of the n jobs, the complexity of this approach is O(n!).
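For the 4 × 4 cost matrix used in the C++ program further below, this brute force fits in a few lines of Python (a sketch for cross-checking only; names are illustrative):

from itertools import permutations

cost_matrix = [[9, 2, 7, 8],
               [6, 4, 3, 7],
               [5, 8, 1, 8],
               [7, 6, 9, 4]]

# try every assignment of jobs (a permutation) to workers 0..3
best = min(sum(cost_matrix[worker][job] for worker, job in enumerate(perm))
           for perm in permutations(range(4)))
print(best)   # 13, the optimal cost the branch and bound program reports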

Solution 2: Hungarian Algorithm

The optimal assignment can be found using the Hungarian algorithm. The Hungarian algorithm has worst case run-time complexity of O(n3).

Solution 3: DFS/BFS on state space tree

A state space tree is an N-ary tree with the property that any path from the root to a leaf node holds one of many solutions to the given problem. We can perform a depth-first search on the state space tree, but successive moves can take us away from the goal rather than bringing us closer. The search of the state space tree follows the leftmost path from the root regardless of the initial state. An answer node may never be found in this approach. We can also perform a breadth-first search on the state space tree. But no matter what the initial state is, the algorithm attempts the same sequence of moves, just like DFS.

Solution 4: Finding Optimal Solution using Branch and Bound

The selection rule for the next node in BFS and DFS is blind, i.e. it does not give any preference to a node that has a very good chance of getting the search to an answer node quickly. The search for an optimal solution can often be sped up by using an intelligent ranking function, also called an approximate cost function, to avoid searching sub-trees that do not contain an optimal solution. It is similar to a BFS-like search but with one major optimization: instead of following FIFO order, we choose the live node with the least cost. Following the node with the least promising cost does not by itself guarantee an optimal solution, but it provides a very good chance of getting the search to an answer node quickly.

There are two approaches to calculate the cost function: it can be computed row-wise (for each remaining worker, take the cheapest still-unassigned job) or column-wise (for each remaining job, take the cheapest still-unassigned worker).

We will choose the first approach for the solution. Let's take the below example and try to calculate the promising cost when Job 2 is assigned to worker A.

PIC
Figure 1.30:Iter 1

Since Job 2 is assigned to worker A (marked in green), cost becomes 2 and Job 2 and worker A becomes unavailable (marked in red).

PIC
Figure 1.31:Iter 2

Now we assign job 3 to worker B as it has minimum cost from list of unassigned jobs. Cost becomes 2 + 3 = 5 and Job 3 and worker B also becomes unavailable.

PIC
Figure 1.32:Iter 3

Finally, job 1 gets assigned to worker C as it has the minimum cost among unassigned jobs, and job 4 gets assigned to worker D as it is the only job left. The total cost becomes 2 + 3 + 5 + 4 = 14.

PIC
Figure 1.33:Iter 4

Below diagram shows complete search space diagram showing optimal solution path in green.

PIC
Figure 1.34:Full blown process

Below is its C++ implementation.

1// Program to solve Job Assignment problem 
2// using Branch and Bound 
3#include <bits/stdc++.h> 
4using namespace std; 
5#define N 4 
6 
7// state space tree node 
8struct Node 
9{ 
10    // stores parent node of current node 
11    // helps in tracing path when answer is found 
12    Node* parent; 
13 
14    // contains cost for ancestors nodes 
15    // including current node 
16    int pathCost; 
17 
18    // contains least promising cost 
19    int cost; 
20 
21    // contain worker number 
22    int workerID; 
23 
24    // contains Job ID 
25    int jobID; 
26 
27    // Boolean array assigned will contains 
28    // info about available jobs 
29    bool assigned[N]; 
30}; 
31 
32// Function to allocate a new search tree node 
33// Here Person x is assigned to job y 
34Node* newNode(int x, int y, bool assigned[], 
35            Node* parent) 
36{ 
37    Node* node = new Node; 
38 
39    for (int j = 0; j < N; j++) 
40        node->assigned[j] = assigned[j]; 
41    node->assigned[y] = true; 
42 
43    node->parent = parent; 
44    node->workerID = x; 
45    node->jobID = y; 
46 
47    return node; 
48} 
49 
50// Function to calculate the least promising cost 
51// of node after worker x is assigned to job y. 
52int calculateCost(int costMatrix[N][N], int x, 
53                int y, bool assigned[]) 
54{ 
55    int cost = 0; 
56 
57    // to store which jobs are still available 
58    bool available[N]; fill(available, available + N, true); 
59 
60    // start from next worker 
61    for (int i = x + 1; i < N; i++) 
62    { 
63        int min = INT_MAX, minIndex = -1; 
64 
65        // do for each job 
66        for (int j = 0; j < N; j++) 
67        { 
68            // if job is unassigned 
69            if (!assigned[j] && available[j] && 
70                costMatrix[i][j] < min) 
71            { 
72                // store job number 
73                minIndex = j; 
74 
75                // store cost 
76                min = costMatrix[i][j]; 
77            } 
78        } 
79 
80        // add cost of next worker 
81        cost += min; 
82 
83        // job becomes unavailable 
84        available[minIndex] = false; 
85    } 
86 
87    return cost; 
88} 
89 
90// Comparison object to be used to order the heap 
91struct comp 
92{ 
93    bool operator()(const Node* lhs, 
94                const Node* rhs) const 
95    { 
96        return lhs->cost > rhs->cost; 
97    } 
98}; 
99 
100// print Assignments 
101void printAssignments(Node *min) 
102{ 
103    if(min->parent==NULL) 
104        return; 
105 
106    printAssignments(min->parent); 
107    cout << "Assign Worker " << char(min->workerID + 'A') 
108        << " to Job " << min->jobID << endl; 
109 
110} 
111 
112// Finds minimum cost using Branch and Bound. 
113int findMinCost(int costMatrix[N][N]) 
114{ 
115    // Create a priority queue to store live nodes of 
116    // search tree; 
117    priority_queue<Node*, std::vector<Node*>, comp> pq; 
118 
119    // initialize heap to dummy node with cost 0 
120    bool assigned[N] = {false}; 
121    Node* root = newNode(-1, -1, assigned, NULL); 
122    root->pathCost = root->cost = 0; 
123    root->workerID = -1; 
124 
125    // Add dummy node to list of live nodes; 
126    pq.push(root); 
127 
128    // Finds a live node with least cost, 
129    // add its childrens to list of live nodes and 
130    // finally deletes it from the list. 
131    while (!pq.empty()) 
132    { 
133    // Find a live node with least estimated cost 
134    Node* min = pq.top(); 
135 
136    // The found node is deleted from the list of 
137    // live nodes 
138    pq.pop(); 
139 
140    // i stores next worker 
141    int i = min->workerID + 1; 
142 
143    // if all workers are assigned a job 
144    if (i == N) 
145    { 
146        printAssignments(min); 
147        return min->cost; 
148    } 
149 
150    // do for each job 
151    for (int j = 0; j < N; j++) 
152    { 
153        // If unassigned 
154        if (!min->assigned[j]) 
155        { 
156        // create a new tree node 
157        Node* child = newNode(i, j, min->assigned, min); 
158 
159        // cost for ancestors nodes including current node 
160        child->pathCost = min->pathCost + costMatrix[i][j]; 
161 
162        // calculate its lower bound 
163        child->cost = child->pathCost + 
164            calculateCost(costMatrix, i, j, child->assigned); 
165 
166        // Add child to list of live nodes; 
167        pq.push(child); 
168        } 
169    } 
170    } 
171} 
172 
173// Driver code 
174int main() 
175{ 
176    // x-coordinate represents a Worker 
177    // y-coordinate represents a Job 
178    int costMatrix[N][N] = 
179    { 
180        {9, 2, 7, 8}, 
181        {6, 4, 3, 7}, 
182        {5, 8, 1, 8}, 
183        {7, 6, 9, 4} 
184    }; 
185 
186 
187    /* int costMatrix[N][N] = 
188    { 
189        {82, 83, 69, 92}, 
190        {77, 37, 49, 92}, 
191        {11, 69, 5, 86}, 
192        { 8, 9, 98, 23} 
193    }; 
194    */ 
195 
196    /* int costMatrix[N][N] = 
197    { 
198        {2500, 4000, 3500}, 
199        {4000, 6000, 3500}, 
200        {2000, 4000, 2500} 
201    };*/ 
202 
203    /*int costMatrix[N][N] = 
204    { 
205        {90, 75, 75, 80}, 
206        {30, 85, 55, 65}, 
207        {125, 95, 90, 105}, 
208        {45, 110, 95, 115} 
209    };*/ 
210 
211 
212    cout << "\nOptimal Cost is " 
213        << findMinCost(costMatrix); 
214 
215    return 0; 
216}

Output :

     Assign Worker A to Job 1
     Assign Worker B to Job 0
     Assign Worker C to Job 2
     Assign Worker D to Job 3
     
     Optimal Cost is 13
     

1.9.6 Traveling Salesman Problem

Given a set of cities and distance between every pair of cities, the problem is to find the shortest possible tour that visits every city exactly once and returns to the starting point.

PIC

Figure 1.35:TSP

For example, consider the graph shown in Figure 1.35. A TSP tour in the graph is 0-1-3-2-0. The cost of the tour is 10 + 25 + 30 + 15, which is 80.

In the Branch and Bound method, for the current node in the tree, we compute a bound on the best possible solution that we can get if we go down this node. If the bound on the best possible solution itself is worse than the current best (best computed so far), then we ignore the subtree rooted at the node.

Note that the cost through a node includes two costs.

  1. Cost of reaching the node from the root (When we reach a node, we have this cost computed)

  2. Cost of reaching an answer from current node to a leaf (We compute a bound on this cost to decide whether to ignore subtree with this node or not).

In cases of a maximization problem, an upper bound tells us the maximum possible solution if we follow the given node. For example in 0/1 knapsack we used Greedy approach to find an upper bound.

In cases of a minimization problem, a lower bound tells us the minimum possible solution if we follow the given node. For example, in Job Assignment Problem, we get a lower bound by assigning least cost job to a worker. In branch and bound, the challenging part is figuring out a way to compute a bound on best possible solution. Below is an idea used to compute bounds for Traveling salesman problem.

Cost of any tour can be written as below.

1Cost of a tour T = (1/2) * Sum (cost of the two edges 
2                              adjacent to u and in the 
3                              tour T) 
4                   where u ranges over all vertices V 
5For every vertex u, if we consider the two edges through it in T 
6and sum their costs, the overall sum over all vertices would 
7be twice the cost of the tour T (we have considered every edge 
8twice). 
9 
10(Sum of the two tour edges adjacent to u) >= (sum of the two 
11                minimum weight edges adjacent to u) 
12 
13Cost of any tour >= (1/2) * Sum (cost of the two minimum 
14                weight edges adjacent to u) 
15                   where u ranges over all vertices V

For example, consider the above shown graph. Below are minimum cost two edges adjacent to every node.

Node    Least cost edges    Total cost
0       (0, 1), (0, 2)      25
1       (0, 1), (1, 3)      35
2       (0, 2), (2, 3)      45
3       (0, 3), (1, 3)      45

Thus a lower bound on the cost of any tour
      = (1/2) * (25 + 35 + 45 + 45)
      = 75

Now we have an idea about the computation of the lower bound. Let us see how to apply it to the state space search tree. We start enumerating all possible nodes (preferably in lexicographical order).

The first one is Root Node: Without loss of generality, we assume we start at vertex 0 for which the lower bound has been calculated above.

Dealing with Level 2: The next level enumerates all possible vertices we can go to (keeping in mind that in any path a vertex has to occur only once), which are 1, 2, 3, ..., n (note that the graph is complete). Consider that we are calculating for vertex 1. Since we moved from 0 to 1, our tour now includes the edge 0-1. This allows us to make the necessary changes in the lower bound of the root.

Lower bound for vertex 1 =
    Old lower bound - ((minimum edge cost of 0 +
                        minimum edge cost of 1) / 2)
                    + (edge cost 0-1)

How does it work? To include edge 0-1, we add the edge cost of 0-1 and subtract an edge weight such that the lower bound remains as tight as possible, which would be the sum of the minimum edges of 0 and 1 divided by 2. Clearly, the edge subtracted can't be smaller than this.

Dealing with other levels: As we move on to the next level, we again enumerate all possible vertices. For the above case, going further after 1, we check 2, 3, 4, ..., n. Consider the lower bound for vertex 2: as we moved from 1 to 2, we include the edge 1-2 in the tour and update the lower bound for this node.

Lower bound for vertex 2 =
    Old lower bound - ((second minimum edge cost of 1 +
                        minimum edge cost of 2) / 2)
                    + (edge cost 1-2)

Note: The only change in the formula is that this time we have included the second minimum edge cost of 1, because the minimum edge cost has already been subtracted at the previous level.

 
# Python3 program to solve the Traveling Salesman
# Problem using Branch and Bound.
import math

maxsize = float('inf')

# Function to copy the temporary solution
# to the final solution
def copyToFinal(curr_path):
    final_path[:N + 1] = curr_path[:]
    final_path[N] = curr_path[0]

# Function to find the minimum edge cost
# having an end at the vertex i
def firstMin(adj, i):
    min_val = maxsize
    for k in range(N):
        if adj[i][k] < min_val and i != k:
            min_val = adj[i][k]
    return min_val

# Function to find the second minimum edge
# cost having an end at the vertex i
def secondMin(adj, i):
    first, second = maxsize, maxsize
    for j in range(N):
        if i == j:
            continue
        if adj[i][j] <= first:
            second = first
            first = adj[i][j]
        elif adj[i][j] <= second and adj[i][j] != first:
            second = adj[i][j]
    return second

# Arguments:
# curr_bound -> lower bound of the root node
# curr_weight -> weight of the path so far
# level -> current level while moving in the search space tree
# curr_path[] -> where the current solution is stored,
#                later copied to final_path[]
def TSPRec(adj, curr_bound, curr_weight, level, curr_path, visited):
    global final_res

    # Base case: level N means all the nodes have been covered once
    if level == N:
        # Check if there is an edge from the last vertex
        # in the path back to the first vertex
        if adj[curr_path[level - 1]][curr_path[0]] != 0:
            # curr_res is the total weight of the solution we got
            curr_res = curr_weight + adj[curr_path[level - 1]][curr_path[0]]
            if curr_res < final_res:
                copyToFinal(curr_path)
                final_res = curr_res
        return

    # For any other level, iterate over all vertices
    # to build the search space tree recursively
    for i in range(N):
        # Consider the next vertex only if it is not a diagonal
        # entry in the adjacency matrix and not visited already
        if adj[curr_path[level - 1]][i] != 0 and not visited[i]:
            temp = curr_bound
            curr_weight += adj[curr_path[level - 1]][i]

            # The bound is computed differently for level 2
            # than for the other levels
            if level == 1:
                curr_bound -= ((firstMin(adj, curr_path[level - 1]) +
                                firstMin(adj, i)) / 2)
            else:
                curr_bound -= ((secondMin(adj, curr_path[level - 1]) +
                                firstMin(adj, i)) / 2)

            # curr_bound + curr_weight is the actual lower bound
            # for the node we have arrived at. If it is smaller
            # than final_res, we explore the node further
            if curr_bound + curr_weight < final_res:
                curr_path[level] = i
                visited[i] = True

                # Call TSPRec for the next level
                TSPRec(adj, curr_bound, curr_weight,
                       level + 1, curr_path, visited)

            # Otherwise prune the node by resetting all
            # changes to curr_weight and curr_bound
            curr_weight -= adj[curr_path[level - 1]][i]
            curr_bound = temp

            # Also reset the visited array
            visited = [False] * len(visited)
            for j in range(level):
                if curr_path[j] != -1:
                    visited[curr_path[j]] = True

# This function sets up final_path
def TSP(adj):
    # Calculate the initial lower bound for the root node using the
    # formula 1/2 * (sum of first min + second min) over all vertices.
    # Also initialize curr_path and the visited array.
    curr_bound = 0
    curr_path = [-1] * (N + 1)
    visited = [False] * N

    # Compute the initial bound
    for i in range(N):
        curr_bound += (firstMin(adj, i) + secondMin(adj, i))

    # Round off the lower bound to an integer
    curr_bound = math.ceil(curr_bound / 2)

    # We start at vertex 0, so the first vertex in curr_path[] is 0
    visited[0] = True
    curr_path[0] = 0

    # Call TSPRec with curr_weight equal to 0 and level 1
    TSPRec(adj, curr_bound, 0, 1, curr_path, visited)

# Driver code

# Adjacency matrix for the given graph
adj = [[0, 10, 15, 20],
       [10, 0, 35, 25],
       [15, 35, 0, 30],
       [20, 25, 30, 0]]
N = 4

# final_path[] stores the final solution,
# i.e. the path of the salesman
final_path = [None] * (N + 1)

# visited[] keeps track of the already visited
# nodes in a particular path
visited = [False] * N

# Stores the final minimum weight of the shortest tour
final_res = maxsize

TSP(adj)

print("Minimum cost :", final_res)
print("Path Taken : ", end=' ')
for i in range(N + 1):
    print(final_path[i], end=' ')

Output :

Minimum cost : 80
Path Taken : 0 1 3 2 0

Time Complexity: The worst-case complexity of Branch and Bound remains the same as that of brute force, because in the worst case we may never get a chance to prune a node. In practice, however, it performs very well depending on the particular instance of the TSP. The complexity also depends on the choice of the bounding function, as it decides how many nodes get pruned.

Notes

1Maloof, Mark. ”Artificial Intelligence: An Introduction, p. 37”. georgetown.edu. Archived (PDF) from the original on 25 August 2018.

2Optimism of early AI: * Herbert Simon quote: Simon 1965, p. 96 quoted in Crevier 1993, p. 109. Marvin Minsky quote: Minsky 1967, p. 2 quoted in Crevier 1993, p. 109.

3Haenlein, Michael; Kaplan, Andreas (2019). “A Brief History of Artificial Intelligence: On the Past, Present, and Future of Artificial Intelligence”. California Management Review. 61 (4): 5–14.

4Biggs, N.; Lloyd, E.; Wilson, R. (1986), Graph Theory, 1736-1936, Oxford University Press

5Tutte, W.T. (2001), Graph Theory, Cambridge University Press, p. 30, ISBN 978-0-521-79489-3, retrieved 2016-03-14

6Gardner, Martin (1992), Fractal Music, Hypercards, and More: Mathematical Recreations from Scientific American, W. H. Freeman and Company, p. 203

7Even, Shimon (2011), Graph Algorithms (2nd ed.), Cambridge University Press, pp. 46–48, ISBN 978-0-521-73653-4.

8Sedgewick, Robert (2002), Algorithms in C++: Graph Algorithms (3rd ed.), Pearson Education, ISBN 978-0-201-36118-6.

9Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. p.606

10Mehlhorn, Kurt; Sanders, Peter (2008). Algorithms and Data Structures: The Basic Toolbox (PDF). Springer.

11KORF, Richard E. (1985). ”Depth-first iterative deepening”. https://cse.sc.edu/~mgv/csce580f09/gradPres/korfIDAStar1985.pdf

12Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, ISBN 0-13-790395-2

13Russell, Stuart J. (2018). Artificial intelligence a modern approach. Norvig, Peter (4th ed.). Boston: Pearson. ISBN 978-0134610993. OCLC 1021874142

14Hart, P. E.; Nilsson, N. J.; Raphael, B. (1968). ”A Formal Basis for the Heuristic Determination of Minimum Cost Paths”. IEEE Transactions on Systems Science and Cybernetics. 4 (2): 100–107. doi:10.1109/TSSC.1968.300136

15Hart, P. E.; Nilsson, N. J.; Raphael, B. (1968). ”A Formal Basis for the Heuristic Determination of Minimum Cost Paths”. IEEE Transactions on Systems Science and Cybernetics. 4 (2): 100–107. doi:10.1109/TSSC.1968.300136

Chapter 2
Learning systems

2.1 Definitions

The oft quoted and widely accepted formal definition of machine learning as stated by field pioneer Tom M. Mitchell is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

The following is my less formal way to describe machine learning. Machine learning is a subfield of computer science, but is often also referred to as predictive analytics, or predictive modeling. Its goal and usage is to build new and/or leverage existing algorithms to learn from data, in order to build generalizable models that give accurate predictions, or to find patterns, particularly with new and unseen similar data.

2.1.1 What is the learning?

Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values, attitudes, and preferences. 16 The ability to learn is possessed by humans, animals, and some machines; there is also evidence for some kind of learning in certain plants.17 Some learning is immediate, induced by a single event (e.g. being burned by a hot stove), but much skill and knowledge accumulate from repeated experiences. The changes induced by learning often last a lifetime, and it is hard to distinguish learned material that seems to be ”lost” from that which cannot be retrieved.18

Human learning starts at birth (it might even start before) and continues until death as a consequence of ongoing interactions between people and their environment. The nature and processes involved in learning are studied in many fields, including educational psychology, neuropsychology, experimental psychology, and pedagogy. Research in such fields has led to the identification of various sorts of learning. For example, learning may occur as a result of habituation, or classical conditioning, operant conditioning or as a result of more complex activities such as play, seen only in relatively intelligent animals. Learning may occur consciously or without conscious awareness. Learning that an aversive event can’t be avoided nor escaped may result in a condition called learned helplessness. There is evidence for human behavioral learning prenatally, in which habituation has been observed as early as 32 weeks into gestation, indicating that the central nervous system is sufficiently developed and primed for learning and memory to occur very early on in development.19

Play has been approached by several theorists as a form of learning. Children experiment with the world, learn the rules, and learn to interact through play. Lev Vygotsky agrees that play is pivotal for children’s development, since they make meaning of their environment through playing educational games. For Vygotsky, however, play is the first form of learning language and communication and the stage where a child begins to understand rules and symbols. This has led to a view that learning in organisms is always related to semiosis.20 21

Types

  1. Non-associative learning: Non-associative learning refers to ”a relatively permanent change in the strength of response to a single stimulus due to repeated exposure to that stimulus.” This definition exempts changes caused by sensory adaptation, fatigue, or injury. Non-associative learning can be divided into habituation and sensitization.

  2. Active learning: Active learning occurs when a person takes control of his/her learning experience. Since understanding information is the key aspect of learning, it is important for learners to recognize what they understand and what they do not. By doing so, they can monitor their own mastery of subjects. Active learning encourages learners to have an internal dialogue in which they verbalize understandings. This and other meta-cognitive strategies can be taught to a child over time. Studies within metacognition have proven the value of active learning, claiming that the learning is usually at a stronger level as a result. In addition, learners have more incentive to learn when they have control over not only how they learn but also what they learn. Active learning is a key characteristic of student-centered learning. Conversely, passive learning and direct instruction are characteristics of teacher-centered learning (or traditional education).

  3. Associative learning: Associative learning is the process by which a person or animal learns an association between two stimuli or events. In classical conditioning a previously neutral stimulus is repeatedly paired with a reflex-eliciting stimulus until eventually the neutral stimulus elicits a response on its own. In operant conditioning, a behavior that is reinforced or punished in the presence of a stimulus becomes more or less likely to occur in the presence of that stimulus.

2.1.2 What is a system?

A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole. A system, surrounded and influenced by its environment, is described by its boundaries, structure and purpose and expressed in its functioning. Systems are the subjects of study of systems theory. The term ”system” comes from the Latin word systēma, in turn from Greek systēma: “whole concept made of several parts or members, system”, literally “composition”.

Concepts

  1. Environment and boundaries: Systems theory views the world as a complex system of interconnected parts. One scopes a system by defining its boundary; this means choosing which entities are inside the system and which are outside, part of the environment. One can make simplified representations (models) of the system in order to understand it and to predict or impact its future behavior. These models may define the structure and behavior of the system.

  2. Natural and human-made systems: There are natural and human-made (designed) systems. Natural systems may not have an apparent objective but their behavior can be interpreted as purposeful by an observer. Human-made systems are made with variable purposes that are achieved by some action performed by or with the system. The parts of a system must be related; they must be ”designed to work as a coherent entity” otherwise they would be two or more distinct systems.

  3. Theoretical framework: Most systems are open systems, exchanging matter and energy with their surroundings, like a car, a coffeemaker, or Earth. A closed system exchanges energy, but not matter, with its environment, like a computer or the Biosphere 2 project. An isolated system exchanges neither matter nor energy with its environment; a theoretical example of such a system is the Universe.

  4. Process and transformation process: An open system can also be viewed as a bounded transformation process, that is, a black box that is a process or collection of processes that transforms inputs into outputs. Inputs are consumed; outputs are produced. The concept of input and output here is very broad. For example, an output of a passenger ship is the movement of people from departure to destination.

  5. System model: A system comprises multiple views. Man-made systems may have such views as concept, analysis, design, implementation, deployment, structure, behavior, input data, and output data views. A system model is required to describe and represent all these views.

  6. Systems architecture: A systems architecture, using one single integrated model for the description of multiple views, is a kind of system model.

2.2 Goals & Applications

Machine learning is a buzzword for today’s technology, and it is growing very rapidly day by day. We are using machine learning in our daily life even without knowing it such as Google Maps, Google assistant, Alexa, etc. Below are some most trending real-world applications of Machine Learning:

2.2.1 Image Recognition

Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. The popular use case of image recognition and face detection is, Automatic friend tagging suggestion. Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, then we automatically get a tagging suggestion with name, and the technology behind this is machine learning’s face detection and recognition algorithm. It is based on the Facebook project named ”Deep Face,” which is responsible for face recognition and person identification in the picture.

2.2.2 Speech Recognition

While using Google, we get an option of ”Search by voice,” it comes under speech recognition, and it’s a popular application of machine learning. Speech recognition is a process of converting voice instructions into text, and it is also known as ”Speech to text”, or ”Computer speech recognition.” At present, machine learning algorithms are widely used by various applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition technology to follow the voice instructions.

2.2.3 Traffic prediction

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:

  1. Real-time location of the vehicle from the Google Maps app and sensors

  2. Average time taken on past days at the same time of day.

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve the performance.

2.2.4 Product recommendations

Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning. Google understands the user's interest using various machine learning algorithms and suggests products as per customer interest. Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

2.2.5 Self-driving cars

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.

2.2.6 Email Spam and Malware Filtering

Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We always receive an important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is Machine learning. Below are some spam filters used by Gmail:

  1. Content Filter

  2. Header filter

  3. General blacklists filter

  4. Rules-based filters

  5. Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier are used for email spam filtering and malware detection.

2.2.7 Virtual Personal Assistant

We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As the name suggests, they help us in finding the information using our voice instruction. These assistants can help us in various ways just by our voice instructions such as Play music, call someone, Open an email, Scheduling an appointment, etc.

Machine learning algorithms are an important part of these virtual assistants. These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms, and act accordingly.

2.2.8 Online Fraud Detection

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can occur in various ways, such as fake accounts, fake IDs, and money stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; hence the system detects the fraud and makes our online transactions more secure.

2.2.9 Stock Market trading

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

2.2.10 Medical Diagnosis

In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.

2.2.11 Automatic Language Translation

Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; here too machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into a language familiar to us, and this is called automatic translation.

The technology behind the automatic translation is a sequence to sequence learning algorithm, which is used with image recognition and translates the text from one language to another language.

2.3 Characteristics

There are a lot of things to consider while building a great machine learning system. But often it happens that we as data scientists only worry about certain parts of the project. Most of the time that happens to be modelling, but in reality, the success or failure of a Machine Learning project depends on a lot of other factors. It is essential to understand what happens before training a model and after training the model and deploying it in production.

2.3.1 Problem Definition

There are two types of problems: simple and complex. Simple problems are either small or do not require much effort, whereas complex problems are those which need a systematic approach. The approach followed in ML is that each problem is factored into Task, Performance measure, and Experience (TPE). In every ML problem, data is essential for performing the task and is also required to build experience from the database.

How do we define a problem for machine learning? That depends on a lot of factors. Amongst all the elements that we consider, the first one should be to understand how it will benefit the business. That is the holy grail of any data science project. If the project does not help the business, its applicability will be in question. Once we have an idea and have determined business compatibility, a success metric needs to be defined. What does success look like? Is it 90% accuracy, 95% accuracy, or 99% accuracy? The question is acceptability: 70% prediction accuracy may be acceptable, since an average human will not ever surpass that accuracy. At the same time, beware of lofty targets; this is the time to be logical and sensible about how every 1 percent change in accuracy could affect success. For example, for a click prediction or fraud application, a 1% accuracy increase will boost the business bottom line far more than a 1% accuracy increase in review sentiment prediction.

2.3.2 Data

There are several questions which need to be answered at the time of data acquisition and data creation for a machine learning model. The most important question is perhaps: does the model need to work in real time? This type of question determines the type of tools or techniques that are necessary. Suppose the data is real-time; then strategies for streaming such data sets so as to minimize latency while preserving accuracy will be the top priority. The most interesting use case is click prediction systems. These systems need to evaluate user clicks instantaneously and determine the appropriate place for placing the ad. This must happen dynamically, with very precise training algorithms. If the algorithms chosen for such tasks are imprecise, that might lead to loss of revenue.

2.3.3 Evaluation

How will we evaluate the performance of our model? The gold standard here is the train-validation-test split. When we frequently make a train-validation-test split by random sampling, we forget an implicit assumption: data is rarely ever IID (independently and identically distributed). In simple terms, the assumption that each data point is independent of the others and comes from the same distribution is faulty at best, if not downright incorrect. For an internet company, a data point from 2007 is very different from a data point that comes in 2019. They do not come from the same distribution because of a lot of factors, internet speed being the foremost. If you have a cat vs. dog prediction problem, you are pretty much fine with random sampling. But in most machine learning models, the task is to predict the future. In that case you should think about splitting your data using the time variable rather than sampling randomly. For example, for the click prediction problem you can use all your past data up to last month as training data and the data for last month as validation.

The next thing you will need to think about is the baseline model. Let us say we use RMSE as an evaluation metric for our time series models. We evaluated the model on the test set, and the RMSE came out to be 4.8. Is that a good RMSE? How do we know? We need a baseline RMSE. This could come from a currently employed model for the same task, or from some simple model. For a time series model, a baseline to defeat is the last-day prediction, i.e., predicting the number observed on the previous day. For NLP classification models, I usually set the baseline to be the evaluation metric (accuracy, F1, log loss) of a logistic regression model on a CountVectorizer (bag of words).

You should also think about how you will break the evaluation into multiple groups so that your model does not induce unnecessary biases.
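
As a minimal sketch of the two ideas above (a time-based split and a naive previous-day baseline), the snippet below uses a small simulated daily series; the data and the linear-trend model are purely illustrative assumptions, not part of the original text.

# Minimal sketch (illustrative only): a time-based split and a naive
# "previous day" baseline for a simulated daily series.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
days = np.arange(120)                        # 120 consecutive days (hypothetical data)
y = 50 + 0.3 * days + rng.normal(0, 4, 120)  # target with a mild trend plus noise

# Split by time, not randomly: first 90 days for training, last 30 for validation
X_train, X_valid = days[:90].reshape(-1, 1), days[90:].reshape(-1, 1)
y_train, y_valid = y[:90], y[90:]

# Baseline: predict the value observed on the previous day
baseline_pred = y[89:119]
baseline_rmse = np.sqrt(mean_squared_error(y_valid, baseline_pred))

# Candidate model: a simple linear trend on the day index
model = LinearRegression().fit(X_train, y_train)
model_rmse = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))

print("Baseline RMSE:", round(float(baseline_rmse), 2))
print("Model RMSE   :", round(float(model_rmse), 2))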

Last year, Amazon was in the news for a secret AI recruiting tool that showed bias against women. To save our machine learning model from such inconsistencies, we need to evaluate our model on different groups. Maybe our model is not as accurate for women as it is for men because there are far fewer women in the training data. Or maybe a model predicting whether a product will be bought given a view works pretty well for a specific product category and not for others. Keeping such things in mind beforehand, and thinking precisely about what could go wrong with a particular evaluation approach, is something that can definitely help us design a good ML system.

2.3.4 Features

Good Features are the backbone of any machine learning model. And often the part where you would spend the most time. I have seen that this is the part which you can tune for maximum model performance. Good feature creation often needs domain knowledge, creativity, and lots of time. On top of that, the feature creation exercise might change for different models. For example, feature creation is very different for Neural networks vs. XGboost. Understanding various methods for Feature creation is a pretty big topic in itself.

2.3.5 Modeling

Now comes the part we mostly tend to care about. And why not? It is the piece that we end up delivering at the end of the project, and the part for which we have spent all those hours on data acquisition and cleaning, feature creation and whatnot. So what do we need to think about while creating a model? The first question you need to ask yourself is whether your model needs to be interpretable. There are quite a lot of use cases where the business may want an interpretable model. One such use case is attribution modeling, where we define the effect of various advertising streams (TV, radio, newspaper, etc.) on the revenue. In such cases, understanding the response from each advertisement stream becomes essential. If we need to maximize accuracy or any other metric, we will still want to go for black-box models like neural nets or XGBoost. Apart from model selection, there should be other things on your mind too:

  1. Model Architecture: How many layers for NNs, or how many trees for GBT or how you need to create feature interactions for Linear models.

  2. How to tune hyperparameters?: You should try to automate this part. There are a lot of tools in the market for this; I tend to use hyperopt. A minimal grid search sketch is shown after this list as an alternative.
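
The sketch below is an illustrative alternative to hyperopt (an assumption, not the author's setup): a small grid search over SVM hyperparameters with scikit-learn's GridSearchCV, on the iris data used elsewhere in this chapter.

# Minimal sketch: automated hyperparameter tuning with GridSearchCV
# (shown as an alternative to hyperopt; iris data used for illustration)
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))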

2.3.6 Experimentation

Now you have created your model. It performs better than the baseline/your current model. How should we go forward? We have two choices:

  1. Go into an endless loop in improving our model further.

  2. Test our model in production settings, get more insights about what could go wrong and then continue improving our model with continuous integration.

One thing I would also like to stress is continuous integration. If your current model performs better than the existing model, why not deploy it in production rather than running after incremental gains? To test the validity of the assumption that your model is better than the existing one, you can set up an A/B test: some users (the test group) see predictions from your model, while other users (the control group) see the predictions from the previous model. You should always aim to minimize the time to the first online experiment for your model. This not only generates value but also lets you understand the shortcomings of your model through real-time feedback, which you can then work on.

2.4 Design

There is a definition of learning systems in section 2.1. Now we shall see how to design a learning system based on that definition. Let's take a few examples to understand these factors.

2.4.1 Handwriting recognition learning problem

For handwriting recognition learning problem, T (Task), P (Performance Measure), and E (Training Experience) i.e., TPE would be,

Task T: To recognize and classify handwritten words within the given images.

Performance measure P: Total percent of words being correctly classified by the program.

Training experience E: A set of handwritten words with given classifications/labels.

2.4.2 Spam Mail detection learning problem

For a system being designed to detect spam emails, TPE would be,

Task T: To recognize and classify mails into ’spam’ or ’not spam’.

Performance measure P: Total percent of mails being correctly classified as ’spam’ (or ’not spam’ ) by the program.

Training experience E: A set of mails with given labels (’spam’ / ’not spam’).

2.4.3 Checkers learning problem

For a checkers learning problem, TPE would be,

Task T: To play checkers

Performance measure P: Total percent of the game won in the tournament.

Training experience E: A set of games played against itself

If we are able to find the factors T, P, and E of a learning problem, we will be able to decide the following three key components:

  1. The exact type of knowledge to be learned (Choosing the Target Function)

  2. A representation for this target knowledge (Choosing a representation for the Target Function)

  3. A learning mechanism (Choosing an approximation algorithm for the Target Function)

2.4.4 Choosing and representing Target Function

Choosing

Let’s take the example of a checkers-playing program that can generate the legal moves (M) from any board state (B). The program needs only to learn how to choose the best move from among these legal moves. Let’s assume a function NextMove such that:

NextMove: B -> M

Here, B denotes the set of board states and M denotes the set of legal moves given a board state. NextMove is our target function.

Representation

We need to choose a representation that the learning algorithm will use to describe the function NextMove. The function NextMove will be calculated as a linear combination of the following board features:

x1: the number of black pieces on the board

x2: the number of red pieces on the board

x3: the number of black kings on the board

x4: the number of red kings on the board

x5: the number of black pieces threatened by red (i.e., which can be captured on red’s next turn)

x6: the number of red pieces threatened by black

The target function for the above features is as follows.

NextMove = u0 + u1x1 + u2x2 + u3x3 + u4x4 + u5x5 + u6x6

Here u0, u1, ..., u6 are the coefficients (weights) that will be chosen (learned) by the learning algorithm.

Choose a Function Approximation Algorithm

To learn the target function NextMove, we require a set of training examples, each describing a specific board state b and the training value (Correct Move) y for b. The training algorithm learns/approximates the coefficients u0, u1, ..., u6 with the help of these training examples by estimating and adjusting these weights.
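
To make the weight-adjustment idea concrete, here is a minimal sketch that estimates a board's value as the linear combination above and updates u0 to u6 with a simple LMS-style rule; the board features and training values below are made-up illustrations, not data from the text.

# Minimal sketch: learning the coefficients u0..u6 of the linear
# target function with an LMS-style weight update (illustrative data).
import numpy as np

def v_hat(u, x):
    # u[0] is the intercept u0; x holds the six board features x1..x6
    return u[0] + np.dot(u[1:], x)

# Hypothetical training examples: (board features x1..x6, training value)
examples = [
    (np.array([12, 12, 0, 0, 1, 2]),  10.0),
    (np.array([ 8, 11, 1, 0, 3, 0]), -25.0),
    (np.array([10,  6, 2, 1, 0, 4]),  40.0),
]

u = np.zeros(7)        # u0..u6, initially 0
eta = 0.001            # learning rate

for _ in range(1000):  # repeated passes over the training examples
    for x, v_train in examples:
        error = v_train - v_hat(u, x)
        u[0] += eta * error          # adjust the intercept
        u[1:] += eta * error * x     # adjust each weight in proportion to its feature

print("Learned coefficients u0..u6:", np.round(u, 3))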

2.5 Develop

2.5.1 Training data

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. 22 Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. The data used to build the final model usually comes from multiple datasets. In particular, three datasets are commonly used in different stages of the creation of the model.

The model is initially fit on a training dataset, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. 23 24 The model (e.g. a neural net or a naive Bayes classifier) is trained on the training dataset using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training dataset often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training dataset and produces a result, which is then compared with the target, for each input vector in the training dataset. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation.

Subsequently, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset. 25 The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters (e.g. the number of hidden units, layers, and layer widths in a neural network). 26 Validation datasets can be used for regularization by early stopping (stopping training when the error on the validation dataset increases, as this is a sign of overfitting to the training dataset). 27 This simple procedure is complicated in practice by the fact that the validation dataset's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when overfitting has truly begun.

Finally, the test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset. 28 If the data in the test dataset has never been used in training (for example in cross-validation), the test dataset is also called a holdout dataset. The term ”validation set” is sometimes used instead of ”test set” in some literature (e.g., if the original dataset was partitioned into only two subsets, the test set might be referred to as the validation set).

A training dataset is a dataset of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier. For classification tasks, a supervised learning algorithm looks at the training dataset to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using new examples from the held-out datasets (validation and test datasets) to estimate the models accuracy in classifying new data. To reduce the risk of issues such as overfitting, the examples in the validation and test datasets should not be used to train the model. Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general.

Python practice

The foremost step in training data is the train-test split. The Python package (library) sklearn provides a method, train_test_split, to perform this step. This method splits arrays or matrices into random train and test subsets.

 
1>>> import numpy as np 
2>>> from sklearn.model_selection import train_test_split 
3>>> X, y = np.arange(10).reshape((5, 2)), range(5) 
4>>> X 
5array([[0, 1], 
6       [2, 3], 
7       [4, 5], 
8       [6, 7], 
9       [8, 9]]) 
10>>> list(y) 
11[0, 1, 2, 3, 4]  
12

The above code snippet is related to the simulation of data sets. The Python package numpy is the best library for this task. It is possible to simulate a bivariate array with the help of the method arange together with the method reshape. The function range(5) simply hands over a number sequence from 0 to 4.

 
1>>> train_test_split(y, shuffle=False) 
2[[0, 1, 2], [3, 4]]  
3

The train_test_split() function splits the data into two different sets. The set with the most rows can be considered for training and the other for testing. It is also possible to split the data using proportions with the help of the argument test_size.

 
1>>> X_train, X_test, y_train, y_test = train_test_split( 
2...     X, y, test_size=0.33, random_state=42) 
3... 
4>>> X_train 
5array([[4, 5], 
6       [0, 1], 
7       [6, 7]]) 
8>>> y_train 
9[2, 0, 3] 
10>>> X_test 
11array([[2, 3], 
12       [8, 9]]) 
13>>> y_test 
14[1, 4]  
15

The rest of the concepts, such as training, models/functions, algorithms, and parameters, will be handled in forthcoming chapters.

2.5.2 Validation dataset

A validation dataset is a dataset of examples used to tune the hyperparameters (i.e. the architecture) of a classifier. It is sometimes also called the development set or the ”dev set”. An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned above), should follow the same probability distribution as the training dataset.

In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation dataset in addition to the training and test datasets. For example, if the most suitable classifier for the problem is sought, the training dataset is used to train the different candidate classifiers, the validation dataset is used to compare their performances and decide which one to take and, finally, the test dataset is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation dataset functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing. The basic aim of using a validation dataset is model selection.
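
A minimal sketch of validation-based model selection is given below; the candidate classifiers and the 60/20/20 split are illustrative assumptions, using the iris data that also appears in the Python practice that follows.

# Minimal sketch: using a validation set for model selection
# (iris data; the candidate models are illustrative choices)
from sklearn import datasets, svm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "linear SVM": svm.SVC(kernel='linear', C=1),
    "logistic regression": LogisticRegression(max_iter=500),
}

# Train every candidate on the training set, compare them on the validation set
scores = {name: clf.fit(X_train, y_train).score(X_valid, y_valid)
          for name, clf in candidates.items()}
best_name = max(scores, key=scores.get)

print("Validation accuracies:", scores)
print("Selected model:", best_name)
print("Test accuracy of selected model:",
      candidates[best_name].score(X_test, y_test))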

Python practice

The testing set is very critical for validating models; that is why it is also referred to as a validation set. We validate models using accuracy metrics. A few of these metrics are sensitivity, specificity, precision, F1 score, and accuracy as a measure in itself.

The following is an example of using a confusion matrix to evaluate the quality of the output of a classifier on the iris data set. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix, the better, indicating many correct predictions.

The figures show the confusion matrix with and without normalization by class support size (number of elements in each class). This kind of normalization can be interesting in case of class imbalance, to have a more visual interpretation of which class is being misclassified. The following code snippet describes how to load the required packages and the required data set.

 
1 
2import numpy as np 
3import matplotlib.pyplot as plt 
4 
5from sklearn import svm, datasets 
6from sklearn.model_selection import train_test_split 
7from sklearn.metrics import plot_confusion_matrix 
8 
9# import some data to play with 
10iris = datasets.load_iris() 
11X = iris.data 
12y = iris.target 
13class_names = iris.target_names  
14

Now it is time to train the data and build the classifier.

 
1# Split the data into a training set and a test set 
2X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 
3 
4# Run classifier, using a model that is too regularized (C too low) to see 
5# the impact on the results 
6classifier = svm.SVC(kernel='linear', C=0.01).fit(X_train, y_train)  
7

We can print or plot confusion matrix with the following snippet.

 
1np.set_printoptions(precision=2) 
2 
3# Plot non-normalized confusion matrix 
4titles_options = [("Confusion matrix, without normalization", None), ("Normalized confusion matrix", 'true')] 
5for title, normalize in titles_options: 
6    disp = plot_confusion_matrix(classifier, X_test, y_test, display_labels=class_names, cmap=plt.cm.Blues, normalize=normalize) 
7    disp.ax_.set_title(title) 
8 
9    print(title) 
10    print(disp.confusion_matrix) 
11 
12plt.show()  
13

The object disp has the details of confusion matrix which can be printed with the help of the statement print(disp.confusion_matrix). The resultant plot appears as below:

PIC

Figure 2.1:Confusion matrix for iris data set using SVM.

2.5.3 Cross-validation

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. 29 It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set). 30 The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem). 31

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance.

A dataset can be repeatedly split into a training dataset and a validation dataset: this is known as cross-validation. These repeated partitions can be done in various ways, such as dividing into 2 equal datasets and using them as training/validation, and then validation/training, or repeatedly selecting a random subset as a validation dataset. To validate the model performance, sometimes an additional test dataset that was held out from cross-validation is used. In summary, cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance. 32

Types

Two types of cross-validation can be distinguished: exhaustive and non-exhaustive cross-validation. A minimal scikit-learn sketch of the splitters corresponding to these schemes follows the list below.

  1. Exhaustive cross-validation: Exhaustive cross-validation methods are cross-validation methods which learn and test on all possible ways to divide the original sample into a training and a validation set.

    1. Leave-p-out cross-validation: Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated on all ways to cut the original sample into a validation set of p observations and a training set. LpO cross-validation requires training and validating the model C(n, p) times, where n is the number of observations in the original sample and C(n, p) is the binomial coefficient. For p > 1 and even moderately large n, LpO CV can become computationally infeasible. For example, with n = 100 and p = 30, C(100, 30) ≈ 3 × 10^25. A variant of LpO cross-validation with p = 2, known as leave-pair-out cross-validation, has been recommended as a nearly unbiased method for estimating the area under the ROC curve of binary classifiers.

    2. Leave-one-out cross-validation: Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1. The process looks similar to jackknife; however, with cross-validation one computes a statistic on the left-out sample(s), while with jackknifing one computes a statistic from the kept samples only. LOO cross-validation requires less computation time than LpO cross-validation because there are only C(n, 1) = n passes rather than C(n, p). However, n passes may still require quite a large computation time, in which case other approaches such as k-fold cross-validation may be more appropriate.

  2. Non-exhaustive cross-validation: Non-exhaustive cross validation methods do not compute all ways of splitting the original sample. Those methods are approximations of leave-p-out cross-validation.

    1. k-fold cross-validation: In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter. For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0. When k = n (the number of observations), k-fold cross-validation is equivalent to leave-one-out cross-validation. In stratified k-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels. In repeated cross-validation the data is randomly split into k partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.

    2. Holdout method: In the holdout method, we randomly assign data points to two sets d0 and d1, usually called the training set and the test set, respectively. The size of each of the sets is arbitrary although typically the test set is smaller than the training set. We then train (build a model) on d0 and test (evaluate its performance) on d1. In typical cross-validation, results of multiple runs of model-testing are averaged together; in contrast, the holdout method, in isolation, involves a single run. It should be used with caution because without such averaging of multiple runs, one may achieve highly misleading results. One’s indicator of predictive accuracy (F*) will tend to be unstable since it will not be smoothed out by multiple iterations (see below). Similarly, indicators of the specific role played by various predictor variables (e.g., values of regression coefficients) will tend to be unstable. While the holdout method can be framed as ”the simplest kind of cross-validation”, many sources instead classify holdout as a type of simple validation, rather than a simple or degenerate form of cross-validation.

    3. Repeated random sub-sampling validation: This method, also known as Monte Carlo cross-validation, creates multiple random splits of the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over k-fold cross validation) is that the proportion of the training/validation split is not dependent on the number of iterations (i.e., the number of partitions). The disadvantage of this method is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap. This method also exhibits Monte Carlo variation, meaning that the results will vary if the analysis is repeated with different random splits. As the number of random splits approaches infinity, the result of repeated random sub-sampling validation tends towards that of leave-p-out cross-validation. In a stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e. the dependent variable in the regression) is equal in the training and testing sets. This is particularly useful if the responses are dichotomous with an unbalanced representation of the two response values in the data.
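
The scikit-learn splitters corresponding to the schemes above can be sketched as follows; the toy data is for illustration only, and the printed split counts simply show how many train/validation partitions each scheme generates.

# Minimal sketch of the splitters corresponding to the schemes above
# (scikit-learn names; the tiny toy data is only for illustration)
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit

X = np.arange(6).reshape(6, 1)   # six toy observations

schemes = {
    "k-fold (k=3)":            KFold(n_splits=3),
    "leave-one-out":           LeaveOneOut(),
    "leave-p-out (p=2)":       LeavePOut(p=2),
    "Monte Carlo (5 splits)":  ShuffleSplit(n_splits=5, test_size=0.33, random_state=0),
}

for name, cv in schemes.items():
    splits = list(cv.split(X))
    print(f"{name}: {len(splits)} train/validation splits")
    # show the first split of each scheme
    train_idx, valid_idx = splits[0]
    print("   first split -> train:", train_idx, "validation:", valid_idx)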

Python practice

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word ”experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally. Here is a flowchart of typical cross validation work-flow in model training. The best parameters can be determined by grid search techniques.

PIC

Figure 2.2:Cross validation using grid search work-flow.

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Let's load the iris data set to fit a linear support vector machine on it:

 
1>>> import numpy as np 
2>>> from sklearn.model_selection import train_test_split 
3>>> from sklearn import datasets 
4>>> from sklearn import svm 
5 
6>>> X, y = datasets.load_iris(return_X_y=True) 
7>>> X.shape, y.shape 
8((150, 4), (150,))  
9

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

 
1>>> X_train, X_test, y_train, y_test = train_test_split( 
2...     X, y, test_size=0.4, random_state=0) 
3 
4>>> X_train.shape, y_train.shape 
5((90, 4), (90,)) 
6>>> X_test.shape, y_test.shape 
7((60, 4), (60,)) 
8 
9>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train) 
10>>> clf.score(X_test, y_test) 
110.96...  
12

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the data set can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV , the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

  1. A model is trained using k - 1 of the folds as training data;

  2. The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

PIC

Figure 2.3:K-Fold Cross Validation
Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the data set. The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris data set by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

 
1>>> from sklearn.model_selection import cross_val_score 
2>>> clf = svm.SVC(kernel='linear', C=1, random_state=42) 
3>>> scores = cross_val_score(clf, X, y, cv=5) 
4>>> scores 
5array([0.96..., 1. , 0.96..., 0.96..., 1. ])  
6

The mean score and the standard deviation are hence given by:

 
1>>> print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std())) 
20.98 accuracy with a standard deviation of 0.02  
3

There are a number of options which can be used while defining model evaluation rules. Visit https://scikit-learn.org/stable/modules/crossvalidation.html for details.

2.5.4 Concept representation

Concepts in Machine Learning can be thought of as a boolean-valued function defined over a large set of training data. Taking a very simple example, one possible target concept may be to find the days when somebody enjoys his/her favorite sport. We have some attributes/features of the day, such as Sky, Air Temperature, Humidity, Wind, Water and Forecast, and based on these we have a target concept named EnjoySport. We have the following training examples available:









Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

Table 2.1: EnjoySport training examples

Let's design the problem formally with TPE (Task, Performance, Experience):

Problem: Learning the days when someone enjoys the sport.

Task T: Learn to predict the value of EnjoySport for an arbitrary day, based on the values of the attributes of the day.

Performance measure P: Total percent of days (EnjoySport) correctly predicted.

Training experience E: A set of days with given labels (EnjoySport: Yes/No)

Let us take a very simple hypothesis representation which consists of a conjunction of constraints in the instance attributes. We get a hypothesis hi with the help of example i for our training set as below:

hi(x) := <x1, x2, x3, x4, x5, x6>

where x1, x2, x3, x4, x5 and x6 are the values of Sky, AirTemp, Humidity, Wind, Water and Forecast. Hence h1 will look like this (from the first row of the table above):

h1(x = 1) := <Sunny, Warm, Normal, Strong, Warm, Same>

Note: x = 1 represents a positive example.

We want to find the most suitable hypothesis which can represent the concept. For example, say someone called Ramesh enjoys his favorite sport only on cold days with high humidity (this seems independent of the values of the other attributes present in the training examples).

h(x = 1) = <?, Cold, High, ?, ?, ?>

Here ? indicates that any value of the attribute is acceptable. The most general hypothesis will be <?, ?, ?, ?, ?, ?>, where every day is a positive example, and the most specific hypothesis will be <∅, ∅, ∅, ∅, ∅, ∅>, where no day is a positive example. We will discuss the two most popular approaches to find a suitable hypothesis:

  1. Find-S Algorithm

  2. List-Then-Eliminate Algorithm

Find-S Algorithm

Following are the steps for the Find-S algorithm:

  1. Initialize h to the most specific hypothesis in H

  2. For each positive training example,

    1. For each attribute constraint ai in h:

      1. If the constraint ai is satisfied by x,

      2. then do nothing;

      3. else replace ai in h by the next more general constraint that is satisfied by x.

  3. Output hypothesis h
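A minimal Python sketch of Find-S on the EnjoySport examples of Table 2.1 is given below; the list-based encoding of hypotheses is only illustrative.

# Find-S on the EnjoySport data of Table 2.1.
# '?' means any value is acceptable; None stands for the most specific
# constraint (the empty-set symbol used in the text).
examples = [
    (['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'],   'Yes'),
    (['Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'],   'Yes'),
    (['Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'], 'No'),
    (['Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'], 'Yes'),
]

def find_s(examples):
    h = [None] * 6                     # 1. start with the most specific hypothesis
    for x, label in examples:          # 2. consider the positive examples only
        if label != 'Yes':
            continue
        for i, a in enumerate(x):
            if h[i] is None:           # first positive example fixes the attribute value
                h[i] = a
            elif h[i] != a:            # otherwise generalise the constraint to '?'
                h[i] = '?'
    return h                           # 3. output hypothesis h

print(find_s(examples))                # ['Sunny', 'Warm', '?', 'Strong', '?', '?']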

The LIST-THEN-ELIMINATE Algorithm

Following are the steps for the LIST-THEN-ELIMINATE algorithm:

VersionSpace ← a list containing every hypothesis in H

For each training example <x, c(x)>:

Remove from VersionSpace any hypothesis h for which h(x) != c(x). Output the list of hypotheses in VersionSpace.
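A minimal sketch of the elimination step is shown below. Enumerating the full hypothesis space H is omitted; a small hand-picked candidate list stands in for it, and the examples list from the Find-S sketch above is reused.

# LIST-THEN-ELIMINATE: keep only the hypotheses consistent with every example.
def matches(h, x):
    return all(c == '?' or c == a for c, a in zip(h, x))

def list_then_eliminate(version_space, examples):
    for x, label in examples:
        # remove every h for which h(x) != c(x)
        version_space = [h for h in version_space
                         if (label == 'Yes') == matches(h, x)]
    return version_space

# illustrative candidate hypotheses only, not the full space H
H = [
    ['Sunny', 'Warm', '?', 'Strong', '?', '?'],
    ['Sunny', '?',    '?', '?',      '?', '?'],
    ['?',     'Warm', '?', '?',      '?', '?'],
    ['Rainy', '?',    '?', '?',      '?', '?'],
]
print(list_then_eliminate(H, examples))   # the last hypothesis is eliminated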

2.5.5 Function approximation

Function approximation is an instance of supervised learning, the primary topic studied in machine learning, artificial neural networks, pattern recognition, and statistical curve fitting. In principle, any of the methods studied in these fields can be used in reinforcement learning. A function approximation problem asks us to select a function among a well-defined class that closely matches (”approximates”) a target function in a task-specific way. The need for function approximations arises in many branches of applied mathematics, and computer science in particular.

Two major classes of function approximation problems: First, for known target functions approximation theory is the branch of numerical analysis that investigates how certain known functions (for example, special functions) can be approximated by a specific class of functions (for example, polynomials or rational functions) that often have desirable properties (inexpensive computation, continuity, integral and limit values, etc.). Second, the target function, call it g, may be unknown; instead of an explicit formula, only a set of points of the form (x, g(x)) is provided. Depending on the structure of the domain and codomain of g, several techniques for approximating g may be applicable. For example, if g is an operation on the real numbers, techniques of interpolation, extrapolation, regression analysis, and curve fitting can be used. If the codomain (range or target set) of g is a finite set, one is dealing with a classification problem instead.

Function approximation theories

Approximation theory is concerned with how functions can best be approximated with simpler functions, and with quantitatively characterizing the errors introduced thereby. Note that what is meant by best and simpler will depend on the application. A closely related topic is the approximation of functions by generalized Fourier series, that is, approximations based upon summation of a series of terms based upon orthogonal polynomials. One problem of particular interest is that of approximating a function in a computer mathematical library, using operations that can be performed on the computer or calculator (e.g. addition and multiplication), such that the result is as close to the actual function as possible. This is typically done with polynomial or rational (ratio of polynomials) approximations.

  1. Optimal polynomials: Once the domain (typically an interval) and degree of the polynomial are chosen, the polynomial itself is chosen in such a way as to minimize the worst-case error. That is, the goal is to minimize the maximum value of |P(x) − f(x)|, where P(x) is the approximating polynomial, f(x) is the actual function, and x varies over the chosen interval. For well-behaved functions, there exists an Nth-degree polynomial that leads to an error curve oscillating back and forth between +ε and −ε a total of N+2 times, giving a worst-case error of ε. (An Nth-degree polynomial can interpolate N+1 points on a curve.) Such a polynomial is always optimal. It is possible to construct contrived functions f(x) for which no such polynomial exists, but these occur rarely in practice.

    PIC
    Figure 2.4:Function approximation: Error P(x) - f(x) for level polynomial (red), and for purported better polynomial (blue)

    For example, the graphs in Figure 2.4 show the error in approximating log(x) and exp(x) for N = 4. The red curves, for the optimal polynomial, are level, that is, they oscillate between +ε and −ε exactly. Note that, in each case, the number of extrema is N+2, that is, 6. Two of the extrema are at the end points of the interval, at the left and right edges of the graphs.

    To prove this is true in general, suppose P is a polynomial of degree N having the property described, that is, it gives rise to an error function that has N + 2 extrema, of alternating signs and equal magnitudes. The red graph in Figure 2.4 shows what this error function might look like for N = 4. Suppose Q(x) (whose error function is shown in blue) is another N-degree polynomial that is a better approximation to f than P. In particular, Q is closer to f than P for each value xi where an extremum of P − f occurs, so

    |Q(xi) − f(xi)| < |P(xi) − f(xi)|.

    When a maximum of P-f occurs at xi, then

    Q(xi) − f(xi) ≤ |Q(xi) − f(xi)| < |P(xi) − f(xi)| = P(xi) − f(xi),

    And when a minimum of P-f occurs at xi, then

    f(xi) − Q(xi) ≤ |Q(xi) − f(xi)| < |P(xi) − f(xi)| = f(xi) − P(xi).

    So, as can be seen in the graph, [P(x) − f(x)] − [Q(x) − f(x)] must alternate in sign for the N + 2 values of xi. But [P(x) − f(x)] − [Q(x) − f(x)] reduces to P(x) − Q(x), which is a polynomial of degree N. This function changes sign at least N+1 times, so, by the intermediate value theorem, it has N+1 zeroes, which is impossible for a polynomial of degree N.

    The following code snippet shows how to fit a polynomial using the numpy library in Python. It fits a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to the points (x, y) and returns a vector of coefficients p that minimises the squared error, in the order deg, deg-1, ..., 0. The Polynomial.fit class method is recommended for new code as it is more stable numerically; see the documentation of that method for more information.

    The solution minimizes the squared error

    E = sum_{j=0}^{k} |p(x_j) - y_j|^2    (2.1)

    in the equations:

    x[0]**n * p[0] + ... + x[0] * p[n-1] + p[n] = y[0]
    x[1]**n * p[0] + ... + x[1] * p[n-1] + p[n] = y[1]
    ...
    x[k]**n * p[0] + ... + x[k] * p[n-1] + p[n] = y[k]

    The coefficient matrix of the coefficients p is a Vandermonde matrix. polyfit issues a RankWarning when the least-squares fit is badly conditioned. This implies that the best fit is not well-defined due to numerical error. The results may be improved by lowering the polynomial degree or by replacing x by x - x.mean(). The rcond parameter can also be set to a value smaller than its default, but the resulting fit may be spurious: including contributions from the small singular values can add numerical noise to the result.

    Note that fitting polynomial coefficients is inherently badly conditioned when the degree of the polynomial is large or the interval of sample points is badly centered. The quality of the fit should always be checked in these cases. When polynomial fits are not satisfactory, splines may be a good alternative.

     
    1import numpy as np 
    2import warnings 
    3x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0]) 
    4y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0]) 
    5z = np.polyfit(x, y, 3) 
    6z 
    7array([ 0.08703704, -0.81349206,  1.69312169, -0.03968254]) # may vary  
    8

    It is convenient to use poly1d objects for dealing with polynomials:

     
    1p = np.poly1d(z) 
    2p(0.5) 
    30.6143849206349179 # may vary 
    4p(3.5) 
    5-0.34732142857143039 # may vary 
    6p(10) 
    722.579365079365115 # may vary  
    8

    High-order polynomials may oscillate wildly:

     
    1with warnings.catch_warnings(): 
    2    warnings.simplefilter('ignore', np.RankWarning) 
    3    p30 = np.poly1d(np.polyfit(x, y, 30)) 
    4 
    5p30(4) 
    6-0.80000000000000204 # may vary 
    7p30(5) 
    8-0.99999999999999445 # may vary 
    9p30(4.5) 
    10-0.10547061179440398 # may vary  
    11

    Illustration:

     
    1import matplotlib.pyplot as plt 
    2xp = np.linspace(-2, 6, 100) 
    3_ = plt.plot(x, y, '.', xp, p(xp), '-', xp, p30(xp), '--') 
    4plt.ylim(-2,2) 
    5(-2, 2) 
    6plt.show()  
    7

    The plot looks like Figure 2.5:

    PIC
    Figure 2.5:Function approximation for polynomials using numpy.
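    As noted above, splines can be a good alternative when a polynomial fit is unsatisfactory. The following minimal sketch uses SciPy's UnivariateSpline (this assumes SciPy is installed) on the same x and y as the polyfit example.

from scipy.interpolate import UnivariateSpline

# cubic interpolating spline through the sample points (s=0 forces interpolation)
spl = UnivariateSpline(x, y, k=3, s=0)
print(spl(0.5), spl(3.5), spl(4.5))   # evaluate between the sample points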
  2. Chebyshev approximation:

    One can obtain polynomials very close to the optimal one by expanding the given function in terms of Chebyshev polynomials and then cutting off the expansion at the desired degree. This is similar to the Fourier analysis of the function, using the Chebyshev polynomials instead of the usual trigonometric functions. If one calculates the coefficients in the Chebyshev expansion for a function:

    f(x) ≈ sum_{i=0}^{∞} c_i T_i(x)

    and then cuts off the series after the T_N term, one gets an Nth-degree polynomial approximating f(x). The reason this polynomial is nearly optimal is that, for functions with rapidly converging power series, if the series is cut off after some term, the total error arising from the cutoff is close to the first term after the cutoff. That is, the first term after the cutoff dominates all later terms. The same is true if the expansion is in terms of bucking polynomials. If a Chebyshev expansion is cut off after T_N, the error will take a form close to a multiple of T_{N+1}. The Chebyshev polynomials have the property that they are level: they oscillate between +1 and -1 on the interval [-1, 1].

    In the graphs of Figure 2.4, note that the blue error function is sometimes better than (inside of) the red function, but sometimes worse, meaning that it is not quite the optimal polynomial. The discrepancy is less serious for the exp function, which has an extremely rapidly converging power series, than for the log function. Chebyshev approximation is the basis for Clenshaw-Curtis quadrature, a numerical integration technique.
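    The following is a minimal numpy sketch of this idea: a degree-4 fit in the Chebyshev basis to exp(x) on [-1, 1], with samples taken at Chebyshev nodes as a stand-in for truncating the expansion described above.

import numpy as np

xs = np.cos(np.pi * (np.arange(33) + 0.5) / 33)            # Chebyshev nodes on [-1, 1]
cheb = np.polynomial.Chebyshev.fit(xs, np.exp(xs), deg=4, domain=[-1, 1])

test = np.linspace(-1, 1, 5)
print(np.max(np.abs(cheb(test) - np.exp(test))))            # worst error at the test points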

  3. Remez’s algorithm: The Remez algorithm (sometimes spelled Remes) is used to produce an optimal polynomial P(x) approximating a given function f(x) over a given interval. It is an iterative algorithm that converges to a polynomial that has an error function with N+2 level extrema. By the theorem above, that polynomial is optimal. Remez’s algorithm uses the fact that one can construct an Nth-degree polynomial that leads to level and alternating error values, given N+2 test points. Given N+2 test points x1,x2,...xN+2 (where x1 and xN+2 are presumably the end points of the interval of approximation), these equations need to be solved:

    P(x_1) − f(x_1) = +ε
    P(x_2) − f(x_2) = −ε
    P(x_3) − f(x_3) = +ε
    ...
    P(x_{N+2}) − f(x_{N+2}) = ±ε.

    The right-hand sides alternate in sign. That is,

    P_0 + P_1 x_1 + P_2 x_1^2 + P_3 x_1^3 + ... + P_N x_1^N − f(x_1) = +ε
    P_0 + P_1 x_2 + P_2 x_2^2 + P_3 x_2^3 + ... + P_N x_2^N − f(x_2) = −ε
    ...

    Since x1,...,xN+2 were given, all of their powers are known, and f(x1),...,f(xN+2) are also known. That means that the above equations are just N+2 linear equations in the N+2 variables P0,P1,...,PN, and 𝜀 . Given the test points x1,...,xN+2, one can solve this system to get the polynomial P and the number 𝜀.

    The graph below shows an example of this, producing a fourth-degree polynomial approximating e^x over [-1, 1]. The test points were set at −1, −0.7, −0.1, +0.4, +0.9, and 1. Those values are shown in green. The resultant value of ε is 4.43 × 10^−4.

    PIC
    Figure 2.6:Error of the polynomial produced by the first step of Remez’s algorithm, approximating e^x over the interval [-1, 1]. Vertical divisions are 10^−4.

    Note that the error graph does indeed take on the values ±ε at the six test points, including the end points, but that those points are not extrema. If the four interior test points had been extrema (that is, if the function P(x) − f(x) had maxima or minima there), the polynomial would be optimal.

    The second step of Remez’s algorithm consists of moving the test points to the approximate locations where the error function had its actual local maxima or minima. For example, one can tell from looking at the graph that the point at −0.1 should have been at about −0.28. The way to do this in the algorithm is to use a single round of Newton’s method. Since one knows the first and second derivatives of P(x) − f(x), one can calculate approximately how far a test point has to be moved so that the derivative will be zero.

    Calculating the derivatives of a polynomial is straightforward. One must also be able to calculate the first and second derivatives of f(x). Remez’s algorithm requires an ability to calculate f(x), f′(x), and f″(x) to extremely high precision. The entire algorithm must be carried out to higher precision than the desired precision of the result.

    After moving the test points, the linear equation part is repeated, getting a new polynomial, and Newton’s method is used again to move the test points again. This sequence is continued until the result converges to the desired accuracy. The algorithm converges very rapidly. Convergence is quadratic for well-behaved functions: if the test points are within 10^−15 of the correct result, they will be approximately within 10^−30 of the correct result after the next round. Remez’s algorithm is typically started by choosing the extrema of the Chebyshev polynomial T_{N+1} as the initial points, since the final error function will be similar to that polynomial.
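    The first step described above amounts to solving a small linear system. The numpy sketch below does this for N = 4, f = exp and the six test points quoted earlier; it covers only the first Remez iteration, not the full algorithm.

import numpy as np

f = np.exp
N = 4
pts = np.array([-1.0, -0.7, -0.1, 0.4, 0.9, 1.0])    # the N + 2 test points

# unknowns: P0..PN and eps; equations: P(x_i) - f(x_i) = +eps, -eps, +eps, ...
A = np.zeros((N + 2, N + 2))
A[:, :N + 1] = pts[:, None] ** np.arange(N + 1)       # powers x_i^0 ... x_i^N
A[:, N + 1] = (-1.0) ** (np.arange(N + 2) + 1)        # alternating sign column for eps
sol = np.linalg.solve(A, f(pts))

P, eps = sol[:N + 1], sol[N + 1]
print("coefficients:", P)
print("eps:", eps)    # compare with the value of about 4.43e-4 quoted above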

    Notes

    16 Richard Gross, Psychology: The Science of Mind and Behaviour 6E, Hachette UK, ISBN 978-1-4441-6436-7.

    17 Karban, R. (2015). Plant Learning and Memory. In: Plant Sensing and Communication. Chicago and London: The University of Chicago Press, pp. 31-44.

    18 Daniel L. Schacter; Daniel T. Gilbert; Daniel M. Wegner (2011) [2009]. Psychology, 2nd edition. Worth Publishers. p. 264. ISBN 978-1-4292-3719-2.

    19 Sandman, Wadhwa; Hetrick, Porto; Peeke (1997). "Human fetal heart rate dishabituation between thirty and thirty-two weeks gestation". Child Development. 68 (6): 1031-1040.

    20 Sheridan, Mary; Howard, Justine; Alderson, Dawn (2010). Play in Early Childhood: From Birth to Six Years. Oxon: Routledge. ISBN 978-1-136-83748-7.

    21 Sheridan, Mary; Howard, Justine; Alderson, Dawn (2010). Play in Early Childhood: From Birth to Six Years. Oxon: Routledge. ISBN 978-1-136-83748-7.

    22 Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning. 30: 271-274. doi:10.1023/A:1007411609915.

    23 James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. p. 176. ISBN 978-1461471370.

    24 Ripley, Brian (1996). Pattern Recognition and Neural Networks. Cambridge University Press. p. 354. ISBN 978-0521717700.

    25 James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. p. 176. ISBN 978-1461471370.

    26 Ripley, Brian (1996). Pattern Recognition and Neural Networks. Cambridge University Press. p. 354. ISBN 978-0521717700.

    27 Prechelt, Lutz; Geneviève B. Orr (2012-01-01). "Early Stopping - But When?". In Grégoire Montavon; Klaus-Robert Müller (eds.). Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 53-67.

    28 Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12.

    29 Allen, David M (1974). "The Relationship between Variable Selection and Data Agumentation and a Method for Prediction". Technometrics. 16 (1): 125-127. doi:10.2307/1267500. JSTOR 1267500.

    30 Galkin, Alexander (November 28, 2011). "What is the difference between test set and validation set?". Retrieved 10 October 2018.

    31 Cawley, Gavin C.; Talbot, Nicola L. C. (2010). "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" (PDF). Journal of Machine Learning Research. 11: 2079-2107.

    32 Grossman, Robert; Seni, Giovanni; Elder, John; Agarwal, Nitin; Liu, Huan (2010). "Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions". Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool. 2: 1-126.

    Chapter 3
    Artificial Neural Networks

    3.1 Neural Networks

    A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes. 33 Thus a neural network is either a biological neural network, made up of biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1. These artificial networks may be used for predictive modeling, adaptive control and applications where they can be trained via a dataset. Self-learning resulting from experience can occur within networks, which can derive conclusions from a complex and seemingly unrelated set of information.
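    A minimal sketch of this computation, with an assumed weight vector, an assumed bias and a sigmoid activation, might look as follows.

import numpy as np

def neuron(inputs, weights, bias=0.0):
    z = np.dot(inputs, weights) + bias     # linear combination of the weighted inputs
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid keeps the output between 0 and 1

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.8, 0.2, -0.5])))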

    PIC

    Figure 3.1:ANN model. Obtained from Wikipedia.

    3.2 Overview

    A biological neural network is composed of groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons, and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites. Apart from the electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion.

    PIC

    Figure 3.2:Neurons

    Artificial intelligence, cognitive modeling, and neural networks are information processing paradigms inspired by the way biological neural systems process data. Artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks. In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.

    Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions via access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing. Neural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence.

    3.3 History

    The preliminary theoretical base for contemporary neural networks was independently proposed by Alexander Bain (1873) and William James (1890). 34 35 In their work, both thoughts and body activity resulted from interactions among neurons within the brain. For Bain, every activity led to the firing of a certain set of neurons. When activities were repeated, the connections between those neurons strengthened. According to his theory, this repetition was what led to the formation of memory. The general scientific community at the time was skeptical of Bain’s theory because it required what appeared to be an inordinate number of neural connections within the brain. It is now apparent that the brain is exceedingly complex and that the same brain wiring can handle multiple problems and inputs. James’s theory was similar to Bain’s, however, he suggested that memories and actions resulted from electrical currents flowing among the neurons in the brain. His model, by focusing on the flow of electrical currents, did not require individual neural connections for each memory or action.

    C. S. Sherrington (1898) conducted experiments to test James’s theory. He ran electrical currents down the spinal cords of rats. However, instead of demonstrating an increase in electrical current as projected by James, Sherrington found that the electrical current strength decreased as the testing continued over time. Importantly, this work led to the discovery of the concept of habituation. 36

    McCulloch and Pitts (1943) created a computational model for neural networks based on mathematics and algorithms. They called this model threshold logic. The model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence. 37

    In the late 1940s psychologist Donald Hebb created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a typical unsupervised learning rule and its later variants were early models for long term potentiation. These ideas started being applied to computational models in 1948 with Turing’s B-type machines. Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. Farley and Clark (1954) first used computational machines, then called calculators, to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda (1956). 38 39

    Rosenblatt (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Werbos (1975). 40 41

    Neural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert (1969). They discovered two key issues with the computational machines that processed neural networks. The first issue was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks. Neural network research slowed until computers achieved greater processing power. Also key in later advances was the backpropagation algorithm which effectively solved the exclusive-or problem (Werbos 1975). 42 The parallel distributed processing of the mid-1980s became popular under the name connectionism. The text by Rumelhart and McClelland (1986) provided a full exposition on the use of connectionism in computers to simulate neural processes. 43

    3.4 Artificial intelligence

    A neural network (NN), in the case of artificial neurons called artificial neural network (ANN) or simulated neural network (SNN), is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network. In more practical terms neural networks are non-linear statistical data modeling or decision making tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

    An artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. Artificial neurons were first proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, who first collaborated at the University of Chicago. 44

    One classical type of artificial neural network is the recurrent Hopfield network. 45 The concept of a neural network appears to have first been proposed by Alan Turing in his 1948 paper Intelligent Machinery in which he called them ”B-type unorganised machines”. 46

    The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it. Unsupervised neural networks can also be used to learn representations of the input that capture the salient characteristics of the input distribution, e.g., see the Boltzmann machine (1983), and more recently, deep learning algorithms, which can implicitly learn the distribution function of the observed data. Learning in neural networks is particularly useful in applications where the complexity of the data or task makes the design of such functions by hand impractical.

    3.5 Applications

    Neural networks can be used in different fields. The tasks to which artificial neural networks are applied tend to fall within the following broad categories:

    Application areas of ANNs include nonlinear system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, “KDD”), visualization and e-mail spam filtering. For example, it is possible to create a semantic profile of user’s interests emerging from pictures trained for object recognition.

    3.6 von Neumann architecture

    The von Neumann architecture, also known as the von Neumann model or Princeton architecture, is a computer architecture based on a 1945 description by John von Neumann and others in the First Draft of a Report on the EDVAC. 47 That document describes a design architecture for an electronic digital computer with these components: a processing unit containing an arithmetic logic unit and processor registers; a control unit containing an instruction register and a program counter; memory that stores data and instructions; external mass storage; and input and output mechanisms.

    The term “von Neumann architecture” has evolved to mean any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the von Neumann bottleneck and often limits the performance of the system.

    3.7 Turing’s B-type machines

    Also known as unorganized machines, these were introduced in a 1948 report in which Alan Turing suggested that the infant human cortex was what he called an "unorganized machine". 48

    Turing defined the class of unorganized machines as largely random in their initial construction, but capable of being trained to perform particular tasks. Turing’s unorganized machines were in fact very early examples of randomly connected, binary neural networks, and Turing claimed that these were the simplest possible model of the nervous system.

    In his 1948 paper Turing defined two examples of his unorganized machines. The first were A-type machines, these being essentially randomly connected networks of NAND logic gates. The second were called B-type machines, which could be created by taking an A-type machine and replacing every inter-node connection with a structure called a connection modifier, which itself is made from A-type nodes. The purpose of the connection modifiers was to allow the B-type machine to undergo “appropriate interference, mimicking education” in order to organize the behavior of the network to perform useful work. Before the term genetic algorithm was coined, Turing even proposed the use of what he called a genetic search to configure his unorganized machines. Turing claimed that the behavior of B-type machines could be very complex when the number of nodes in the network was large, and stated that the “picture of the cortex as an unorganized machine is very satisfactory from the point of view of evolution and genetics”. Turing had been interested in the possibility of simulating neural systems for at least the previous two years. 2

    3.8 Hebbian Learning

    Also known as Hebbian theory is a neuroscientific theory claiming that an increase in synaptic efficacy arises from a presynaptic cell’s repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. It was introduced by Donald Hebb in his 1949 book The Organization of Behavior. 49 The theory is also called Hebb’s rule, Hebb’s postulate, and cell assembly theory.

    The theory is often summarized as “Cells that fire together wire together.” However, Hebb emphasized that cell A needs to ”take part in firing” cell B, and such causality can occur only if cell A fires just before, not at the same time as, cell B. This aspect of causation in Hebb’s work foreshadowed what is now known about spike-timing-dependent plasticity, which requires temporal precedence.[3]

    The theory attempts to explain associative or Hebbian learning, in which simultaneous activation of cells leads to pronounced increases in synaptic strength between those cells. It also provides a biological basis for errorless learning methods for education and memory rehabilitation. In the study of neural networks in cognitive function, it is often regarded as the neuronal basis of unsupervised learning.
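    A minimal sketch of the Hebbian update, with made-up binary activities, might look as follows; the weight between units that are frequently active together grows fastest.

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(100, 2))   # activities of two presynaptic cells (0/1)
y = x[:, 0]                             # postsynaptic cell driven by the first input

w = np.zeros(2)
eta = 0.1                               # learning rate
for xi, yi in zip(x, y):
    w += eta * xi * yi                  # Hebbian update: delta_w = eta * pre * post

print(w)                                # the weight from input 0 grows the most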

    3.9 Python Practice

    3.9.1 Neural Networks

    An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the brain. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning largely involves adjustments to the synaptic connections that exist between the neurons.

    The brain consists of billions of cells called neurons. These neurons are connected together by synapses, which are nothing but the connections across which a neuron can send an impulse to another neuron. When a neuron sends an excitatory signal to another neuron, this signal is added to all of the other inputs of that neuron. If the total exceeds a given threshold, it causes the target neuron to fire an action signal forward; this is how the thinking process works internally.

    In computer science, we model this process by creating networks on a computer using matrices. These networks can be understood as an abstraction of neurons without all the biological complexities taken into account. To keep things simple, we will model a simple NN with two layers that is capable of solving a linear classification problem.

    Let's say we have a problem where we want to predict the output, given a set of inputs and outputs as training examples, like so:

    PIC

    Figure 3.3:Input sets for NN problem

    Note that the output is directly related to the third column, i.e. the value of input 3 is what the output is in every training example in Figure 3.3. So for the test example the output value should be 1. The training process consists of the following steps:

    1. Forward Propagation: Take the inputs and multiply them by the weights (just use random numbers as weights). Let Y = sum(W_i * I_i) = W_1 I_1 + W_2 I_2 + W_3 I_3. Pass the result through the sigmoid function to calculate the neuron's output. The sigmoid function is used to normalise the result between 0 and 1: 1 / (1 + e^(−Y)).

    2. Back Propagation: Calculate the error, i.e. the difference between the actual output and the expected output. Depending on the error, adjust the weights by multiplying the error with the input and again with the gradient of the sigmoid curve: Weight += (Error × Input) × [Output × (1 − Output)], where Output × (1 − Output) is the derivative of the sigmoid curve.

    Repeat the whole process for a few thousand iterations. Let's code up the whole process in Python. We'll be using the Numpy library to help us with all the calculations on matrices. You'd need to install the numpy library on your system to run the code.

    1from numpy import * 
    2 
    3 
    4 
    5class NeuralNet(object): 
    6    def __init__(self): 
    7        # Generate random numbers 
    8        random.seed(1) 
    9 
    10        # Assign random weights to a 3 x 1 matrix, 
    11        self.synaptic_weights = 2 * random.random((3, 1)) - 1 
    12 
    13    # The Sigmoid function 
    14    def __sigmoid(self, x): 
    15        return 1 / (1 + exp(-x)) 
    16 
    17    # The derivative of the Sigmoid function. 
    18    # This is the gradient of the Sigmoid curve. 
    19    def __sigmoid_derivative(self, x): 
    20        return x * (1 - x) 
    21 
    22    # Train the neural network and adjust the weights each time. 
    23    def train(self, inputs, outputs, training_iterations): 
    24        for iteration in range(training_iterations): 
    25            # Pass the training set through the network. 
    26            output = self.learn(inputs) 
    27 
    28            # Calculate the error 
    29            error = outputs - output 
    30 
    31            # Adjust the weights by a factor 
    32            factor = dot(inputs.T, error * self.__sigmoid_derivative(output)) 
    33            self.synaptic_weights += factor 
    34 
    35        # The neural network thinks. 
    36 
    37    def learn(self, inputs): 
    38        return self.__sigmoid(dot(inputs, self.synaptic_weights)) 
    39 
    40 
    41if __name__ == "__main__": 
    42    # Initialize 
    43    neural_network = NeuralNet() 
    44 
    45    # The training set. 
    46    inputs = array([[0, 1, 1], [1, 0, 0], [1, 0, 1]]) 
    47    outputs = array([[1, 0, 1]]).T 
    48 
    49    # Train the neural network 
    50    neural_network.train(inputs, outputs, 10000) 
    51 
    52    # Test the neural network with a test example. 
    53    print(neural_network.learn(array([1, 0, 1])))

    After 10 iterations our neural network predicts the value to be 0.65980921. This does not look good, as the answer should really be 1. If we increase the number of iterations to 100, we get 0.87680541. Our network is getting smarter! Subsequently, for 10000 iterations we get 0.9897704, which is pretty close and indeed a satisfactory output.

    3.10 Feedforward networks

    A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. 50 As such, it is different from its descendant: recurrent neural networks. The feedforward neural network was the first and simplest type of artificial neural network devised. 51 In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.

    3.10.1 Single-layer perceptron

    The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this kind of activation function are also called artificial neurons or linear threshold units. In the literature the term perceptron often refers to networks consisting of just one of these units. A similar neuron was described by Warren McCulloch and Walter Pitts in the 1940s.

    A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent.
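    A minimal sketch of such a single-layer perceptron trained with the delta rule is shown below; the linearly separable AND function is used as an illustrative target, with activated value 1 and deactivated value -1.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])                # AND with activated 1, deactivated -1

w = np.zeros(2)
b = 0.0
eta = 0.1                                    # learning rate

for _ in range(50):
    for xi, ti in zip(X, t):
        y = 1 if xi @ w + b > 0 else -1      # threshold activation (threshold 0)
        w += eta * (ti - y) * xi             # delta rule: adjust weights by the error
        b += eta * (ti - y)

print([1 if xi @ w + b > 0 else -1 for xi in X])   # [-1, -1, -1, 1]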

    Single-layer perceptrons are only capable of learning linearly separable patterns; in 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function (nonetheless, it was known that multi-layer perceptrons are capable of producing any possible boolean function).

    Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1,1]. This result can be found in Peter Auer, Harald Burgsteiner and Wolfgang Maass ”A learning rule for very simple universal approximators consisting of a single layer of perceptrons”. 52

    A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

    f(x) = 1 / (1 + e^(−x))

    With this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modeling. The logistic function is one of the family of functions called sigmoid functions because their S-shaped graphs resemble the final-letter lower case of the Greek letter Sigma. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

    f′(x) = f(x)(1 − f(x))

    (The fact that f satisfies the differential equation above can easily be shown by applying the chain rule.)

    If the activation function of a single-layer neural network is taken modulo 1, then this network can solve the XOR problem with a single neuron:

    f(x) = x mod 1
    f′(x) = 1

    3.10.2 Multi-layer perceptron

    This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function. However sigmoidal activation functions have very small derivative values outside a small range and do not work well in deep neural networks due to the vanishing gradient problem. Alternatives to sigmoidal activation functions that alleviate the vanishing gradient problems and allow deep networks to be trained have been proposed.

    The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds for a wide range of activation functions, e.g. for the sigmoidal functions.

    Multi-layer networks use a variety of learning techniques, the most popular being back-propagation. Here, the output values are compared with the correct answer to compute the value of some predefined error-function. By various techniques, the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small. In this case, one would say that the network has learned a certain target function. To adjust weights properly, one applies a general method for non-linear optimization that is called gradient descent. For this, the network calculates the derivative of the error function with respect to the network weights, and changes the weights such that the error decreases (thus going downhill on the surface of the error function). For this reason, back-propagation can only be applied on networks with differentiable activation functions.

    In general, the problem of teaching a network to perform well, even on samples that were not used as training samples, is a quite subtle issue that requires additional techniques. This is especially important for cases where only very limited numbers of training samples are available. The danger is that the network overfits the training data and fails to capture the true statistical process generating the data. Computational learning theory is concerned with training classifiers on a limited amount of data. In the context of neural networks a simple heuristic, called early stopping, often ensures that the network will generalize well to examples not in the training set.

    Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. Today, there are practical methods that make back-propagation in multi-layer perceptrons the tool of choice for many machine learning tasks. One also can use a series of independent neural networks moderated by some intermediary, a similar behavior that happens in brain. These neurons can perform separably and handle a large task, and the results can be finally combined.

    3.10.3 Other feedforward networks

    More generally, any directed acyclic graph may be used for a feedforward network, with some nodes (with no parents) designated as inputs, and some nodes (with no children) designated as outputs. These can be viewed as multilayer networks where some edges skip layers, either counting layers backwards from the outputs or forwards from the inputs. Various activation functions can be used, and there can be relations between weights, as in convolutional neural networks. Examples of other feedforward networks include radial basis function networks, which use a different activation function. Sometimes multi-layer perceptron is used loosely to refer to any feedforward neural network, while in other cases it is restricted to specific ones (e.g., with specific activation functions, or with fully connected layers, or trained by the perceptron algorithm).

    3.10.4 Python practice

    The goal of neural networks in general is to approximate some function f. The Universal Approximation Theorem says neural networks have the capacity to accomplish this for a large class of functions. In this case, we seek to approximate the function that generates a binary classification problem. We can model this with the binary random variable Y | X : Ω → {0, 1} with conditional distribution

    p(x) = P(Y = y | X = x) = P(Y = 1 | X = x)^y (1 − P(Y = 1 | X = x))^(1−y)

    which we seek to approximate. Instead of approximating this directly, we divide the function up and approximate each value y takes. Since this is binary classification, we only need to approximate

    P(Y = 1 | X = x) : R^2 → [0, 1].

    Since this is a probability function, we will use relative entropy as a measure of distance, which is typically called a loss function. In order to approximate the relative entropy, we sample i.i.d.

    (y_i, X_i), i = 1, ..., n

    with

    D(p || p̂) = ∫_{R^2} log(p / p̂) p dm

    ≈ sum_{i=1}^{n} log(p(X_i) / p̂(X_i)) p(X_i)

    ≈ sum_{i=1}^{n} log(p(X_i) / p̂(X_i)) (1/n)

    where m is Lebesgue measure. Since p is unknown, we minimize the estimated loss,

    D̂(p || p̂) = − (1/n) sum_{i=1}^{n} log(p̂(X_i))

    = − (1/n) sum_{i=1}^{n} [ y_i log(p̂(X_i)) + (1 − y_i) log(1 − p̂(X_i)) ].

    We define the single hidden layer neural network. Let W denote the weights and b be the intercept or bias term. We use sigmoid activations denoted by σ, but Relu is generally preferred. The hidden layer

    H(X) = σ(X W_h + b_h)

    and output layer

    O(X) = σ(H(X) W_o + b_o)

    The neural network is fit using gradient descent with the gradients computed using algorithmic differentiation. These are implemented naively for simplicity and are not using advanced algorithms as would be in practice. We generate data using Sklearn’s make_circles function, split it into a training and test set, and plot with Matplotlib.

     
    1import numpy as np 
    2import matplotlib.pyplot as plt 
    3from sklearn.datasets import make_circles 
    4 
    5n_samples = 1000 
    6n_features = 2 
    7n_ouputs = 1 
    8X, y = make_circles(n_samples = n_samples, factor = .01, noise = .2) 
    9 
    10n_TRAIN = int(.75 * n_samples) 
    11X_train = X[0:n_TRAIN, :] 
    12y_train = y[0:n_TRAIN] 
    13X_test = X[n_TRAIN:n_samples, :] 
    14y_test = y[n_TRAIN:n_samples] 
    15 
    16fig = plt.figure(figsize=(8, 8)) 
    17plt.scatter(X[:,0], X[:,1], c = y) 
    18plt.xlabel("X1") 
    19plt.ylabel("X2") 
    20plt.savefig("nn_plot.pdf", bbox_inches="tight")  
    21

    PIC

    Figure 3.4:Circular data

    The neural network implementation

     
    1class NN(): 
    2  def __init__(self, n_samples, n_features, n_outputs, n_hidden = 1): 
    3    self.n_samples = n_samples 
    4    self.n_features = n_features 
    5    self.n_hidden = n_hidden 
    6    self.n_outputs = n_outputs 
    7 
    8    self.W_h = np.random.randn(n_features, n_hidden) 
    9    self.b_h = .01 + np.zeros((1, n_hidden)) 
    10    self.W_o = np.random.randn(n_hidden, n_outputs) 
    11    self.b_o = .01 + np.zeros((1, n_outputs)) 
    12 
    13  def sigmoid(self, x): 
    14    return 1/(1 + np.exp(-x)) 
    15 
    16  def loss(self, y, p_pred): 
    17    return -1/y.shape[0] * (np.sum(y * np.log(p_pred) + (1 - y) * (np.log(1 - p_pred)))) 
    18 
    19  def predict(self, X): 
    20    return np.squeeze(np.round(self.forward_prop(X)["O"])) 
    21 
    22  def forward_prop(self, X): 
    23    # Hidden layer 
    24    A_h = X @ self.W_h + self.b_h 
    25    H = self.sigmoid(A_h) 
    26 
    27    # Output layer 
    28    A_o = H @ self.W_o + self.b_o 
    29    O = self.sigmoid(A_o) 
    30    return { 
    31      "A_h": A_h, 
    32      "H": H, 
    33      "A_o": A_o, 
    34      "O": O 
    35    } 
    36 
    37  # This is not a true implementation of backprop 
    38  def backward_prop(self, X, y_, forward): 
    39    one_n = np.ones(self.n_samples) 
    40    y = (y_[np.newaxis]).T # convert to column vector 
    41 
    42    dA_o = (y - forward["O"]) 
    43    dL_dW_o = 1/self.n_samples * forward["H"].T @ dA_o 
    44    dL_db_o = 1/self.n_samples * one_n.T @ dA_o 
    45 
    46    dA_h = (dA_o @ self.W_o.T) * (self.sigmoid(forward["A_h"]) * (1 - self.sigmoid(forward["A_h"]))) 
    47    dL_dW_h = 1/self.n_samples * X.T @ dA_h 
    48    dL_db_h = 1/self.n_samples * one_n.T @ dA_h 
    49 
    50    return { 
    51      "dL_dW_h": dL_dW_h, 
    52      "dL_db_h": dL_db_h, 
    53      "dL_dW_o": dL_dW_o, 
    54      "dL_db_o": dL_db_o 
    55    } 
    56 
    57  def train(self, X, y, learning_rate = .5, max_iter = 1001): 
    58    for i in range(0, max_iter): 
    59      forward_prop_dict = self.forward_prop(X) 
    60      G = self.backward_prop(X, y, forward_prop_dict) 
    61 
    62      # Gradient step 
    63      self.W_h = self.W_h + learning_rate * G["dL_dW_h"] 
    64      self.b_h = self.b_h + learning_rate * G["dL_db_h"] 
    65 
    66      self.W_o = self.W_o + learning_rate * G["dL_dW_o"] 
    67      self.b_o = self.b_o + learning_rate * G["dL_db_o"] 
    68 
    69      if i % 100 == 0: 
    70        print(f"Iteration: {i}, Training Loss: {self.loss(y, np.squeeze(forward_prop_dict['O']))}")  
    71

    We use 10 hidden units in the hidden layer and report the 0-1 accuracy on both the training and test sets.

     
    1nn = NN(n_samples = n_TRAIN, n_features = n_features, n_outputs = n_ouputs, n_hidden = 10) 
    2nn.train(X_train, y_train) 
    3 
    4print("Train accuracy:", 1/X_train.shape[0] * np.sum(nn.predict(X_train) == y_train)) 
    5print("Test accuracy:", 1/X_test.shape[0] * np.sum(nn.predict(X_test) == y_test))  
    6

    Output

    Iteration: 0, Training Loss: 0.7119608071592811
    Iteration: 100, Training Loss: 0.639833196437629
    Iteration: 200, Training Loss: 0.5215796424999427
    Iteration: 300, Training Loss: 0.368528633036507
    Iteration: 400, Training Loss: 0.25561713311943096
    Iteration: 500, Training Loss: 0.1887819942694729
    Iteration: 600, Training Loss: 0.14915555078792483
    Iteration: 700, Training Loss: 0.12416122322345967
    Iteration: 800, Training Loss: 0.10734573047261009
    Iteration: 900, Training Loss: 0.0953975198921105
    Iteration: 1000, Training Loss: 0.08652396074499336
    Train accuracy: 0.9906666666666666
    Test accuracy: 0.992

    3.11 Backpropagation network

    In machine learning, backpropagation is a widely used algorithm for training feedforward neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally. These classes of algorithms are all referred to generically as “backpropagation”. In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input-output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.

    The term backpropagation strictly refers only to the algorithm for computing the gradient, not how the gradient is used; however, the term is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent. Backpropagation generalizes the gradient computation in the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, where backpropagation is a special case of reverse accumulation (or “reverse mode”). The term backpropagation and its general use in neural networks was announced in Rumelhart, Hinton & Williams (1986a), then elaborated and popularized in Rumelhart, Hinton & Williams (1986b), but the technique was independently rediscovered many times and had many predecessors dating to the 1960s. A modern overview is given in the deep learning textbook by Goodfellow, Bengio and Courville (2016).

    3.11.1 Overview

    Backpropagation computes the gradient in weight space of a feedforward neural network, with respect to a loss function. Denote:

    1. x: input (vector of features)

    2. y: target output. For classification, the output will be a vector of class probabilities (e.g., (0.1, 0.7, 0.2)), and the target output is a specific class, encoded by a one-hot/dummy variable (e.g., (0, 1, 0)).

    3. C: loss function or “cost function”. For classification, this is usually cross-entropy (XC, log loss), while for regression it is usually the squared error loss (SEL).

    4. L: the number of layers

    5. W^l = (w^l_jk): the weights between layer l−1 and layer l, where w^l_jk is the weight between the k-th node in layer l−1 and the j-th node in layer l

    6. f^l: activation functions at layer l. For classification the last layer is usually the logistic function for binary classification, and softmax (softargmax) for multi-class classification, while for the hidden layers this was traditionally a sigmoid function (logistic function or others) on each node (coordinate), but today it is more varied, with the rectifier (ramp, ReLU) being common.

    In the derivation of backpropagation, other intermediate quantities are used; they are introduced as needed below. Bias terms are not treated specially, as they correspond to a weight with a fixed input of 1. For the purpose of backpropagation, the specific loss function and activation functions do not matter, as long as they and their derivatives can be evaluated efficiently. Traditional activation functions include but are not limited to sigmoid, tanh, and ReLU. Since then, swish, mish, and other activation functions have been proposed as well. The overall network is a combination of function composition and matrix multiplication:

    g(x) := f^L(W^L f^(L−1)(W^(L−1) ... f^1(W^1 x) ...))

    For a training set there will be a set of input-output pairs, {(x_i, y_i)}. For each input-output pair (x_i, y_i) in the training set, the loss of the model on that pair is the cost of the difference between the predicted output g(x_i) and the target output y_i:

    C(y_i, g(x_i))

    During model evaluation, the weights are fixed, while the inputs vary (and the target output may be unknown), and the network ends with the output layer (it does not include the loss function). During model training, the input-output pair is fixed, while the weights vary, and the network ends with the loss function.

    Backpropagation computes the gradient for a fixed input-output pair (x_i, y_i), where the weights w^l_jk can vary. Each individual component of the gradient, ∂C/∂w^l_jk, can be computed by the chain rule; however, doing this separately for each weight is inefficient. Backpropagation efficiently computes the gradient by avoiding duplicate calculations and not computing unnecessary intermediate values, by computing the gradient of each layer, specifically the gradient of the weighted input of each layer, denoted by δ^l, from back to front.

    Informally, the key point is that since the only way a weight in W^l affects the loss is through its effect on the next layer, and it does so linearly, δ^l are the only data you need to compute the gradients of the weights at layer l, and then you can compute the previous layer δ^(l−1) and repeat recursively. This avoids inefficiency in two ways. Firstly, it avoids duplication because when computing the gradient at layer l, you do not need to recompute all the derivatives on later layers l+1, l+2, ... each time. Secondly, it avoids unnecessary intermediate calculations because at each stage it directly computes the gradient of the weights with respect to the ultimate output (the loss), rather than unnecessarily computing the derivatives of the values of hidden layers with respect to changes in weights ∂a^l_j/∂w^l_jk.

    Backpropagation can be expressed for simple feedforward networks in terms of matrix multiplication, or more generally in terms of the adjoint graph.

    3.11.2 Learning as an optimization problem

    To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuition about the relationship between the actual output of a neuron and the correct output for a particular training example. Consider a simple neural network with two input units, one output unit and no hidden units, in which each neuron uses a linear output (unlike most work on neural networks, in which the mapping from inputs to outputs is non-linear) that is the weighted sum of its inputs.

    Initially, before training, the weights will be set randomly. Then the neuron learns from training examples, which in this case consist of a set of tuples (x1, x2, t) where x1 and x2 are the inputs to the network and t is the correct output (the output the network should produce given those inputs, when it has been trained). The initial network, given x1 and x2, will compute an output y that likely differs from t (given random weights). A loss function L(t, y) is used for measuring the discrepancy between the target output t and the computed output y. For regression problems the squared error can be used as a loss function; for classification the categorical cross-entropy can be used.

    As an example consider a regression problem using the square error as a loss:

    L(t, y) = (t - y)^2 = E,

    where E is the discrepancy or error.

    Consider the network on a single training case: (1,1,0). Thus, the input x1 and x2 are 1 and 1 respectively and the correct output, t is 0. Now if the relation is plotted between the network’s output y on the horizontal axis and the error E on the vertical axis, the result is a parabola. The minimum of the parabola corresponds to the output y which minimizes the error E. For a single training case, the minimum also touches the horizontal axis, which means the error will be zero and the network can produce an output y that exactly matches the target output t. Therefore, the problem of mapping inputs to outputs can be reduced to an optimization problem of finding a function that will produce the minimal error.

    However, the output of a neuron depends on the weighted sum of all its inputs:

    y = x1w1 + x2w2,

    where w1 and w2 are the weights on the connection from the input units to the output unit. Therefore, the error also depends on the incoming weights to the neuron, which is ultimately what needs to be changed in the network to enable learning.

    In this example, upon injecting the training data the loss function becomes

    E = (t - y)^2 = y^2 = (x1 w1 + x2 w2)^2 = (w1 + w2)^2.

    Then, the loss function E takes the form of a parabolic cylinder with its base directed along w1 = -w2. Since all sets of weights that satisfy w1 = -w2 minimize the loss function, in this case additional constraints are required to converge to a unique solution. Additional constraints could either be generated by setting specific conditions on the weights, or by injecting additional training data.

    One commonly used algorithm to find the set of weights that minimizes the error is gradient descent. Backpropagation is used to calculate the direction of steepest descent of the loss function with respect to the present synaptic weights. Then, the weights can be modified along the steepest descent direction, and the error is minimized in an efficient way.
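    A minimal sketch of gradient descent on the loss from this example (the starting weights and learning rate below are arbitrary assumptions): since E = (w1 + w2)^2, the partial derivatives are dE/dw1 = dE/dw2 = 2(w1 + w2), and the weights are repeatedly moved in the opposite direction.

# gradient descent on E = (w1 + w2)^2 for the single training case (1, 1, 0)
w1, w2 = 0.8, -0.3            # arbitrary initial weights (assumption)
lr = 0.1                      # learning rate (assumption)

for step in range(50):
    grad = 2 * (w1 + w2)      # dE/dw1 = dE/dw2 = 2 * (w1 + w2)
    w1 -= lr * grad           # step along the steepest descent direction
    w2 -= lr * grad

print(w1, w2, (w1 + w2) ** 2)  # w1 + w2 tends to 0, so the error tends to 0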

    3.11.3 Loss function

    The loss function is a function that maps values of one or more variables onto a real number intuitively representing some ”cost” associated with those values. For backpropagation, the loss function calculates the difference between the network output and its expected output, after a training example has propagated through the network.

    Assumptions

    The mathematical expression of the loss function must fulfill two conditions in order for it to be usable in backpropagation. The first is that it can be written as an average E = (1/n) Σ_x E_x over error functions E_x, for n individual training examples x. The reason for this assumption is that the backpropagation algorithm calculates the gradient of the error function for a single training example, which needs to be generalized to the overall error function. The second assumption is that it can be written as a function of the outputs from the neural network.

    Example loss function

    Let y, y′ be vectors in R^n.

    Select an error function E(y, y′) measuring the difference between the two outputs. The standard choice is the square of the Euclidean distance between the vectors y and y′:

    E(y, y′) = (1/2) ||y - y′||^2

    The error function over n training examples can then be written as an average of losses over individual examples:

    E = (1/2n) Σ_x ||y(x) - y′(x)||^2
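    A small NumPy check of this average (the targets y(x) and outputs y′(x) below are made up purely for illustration):

import numpy as np

# made-up targets y(x) and outputs y'(x) for n = 2 training examples
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.8, 0.1], [0.3, 0.6]])

n = len(y_true)
E = np.sum((y_true - y_pred) ** 2) / (2 * n)   # E = 1/(2n) * sum_x ||y(x) - y'(x)||^2
print(E)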

    3.11.4 Python practice

    BPN was introduced by Rumelhart, Hinton, and Williams in 1986. The core concept of BPN is to backpropagate, or spread, the error from the units of the output layer to the internal hidden layers in order to tune the weights and ensure lower error rates. It can be seen as fine-tuning the weights of the neural network in each iteration. Proper tuning of the weights ensures a lower loss, which makes the trained neural network more robust and generalizable.

    PIC

    Figure 3.5:Backpropagation procedure

    BPN learns in an iterative manner. In each iteration, it compares its output for the training examples with the actual target labels; a target label can be a class label or a continuous value. The backpropagation algorithm works in the following steps:

    1. Initialize Network: BPN randomly initializes the weights.

    2. Forward Propagate: After initialization, we propagate the inputs in the forward direction. In this phase, we compute the output and calculate the error with respect to the target output.

    3. Back Propagate Error: For each observation, the weights are modified in order to reduce the error using a technique called the delta rule or gradient descent. The weights are modified in the backward direction through all the hidden layers.

    Importing libraries

    Let's import the required modules and libraries such as numpy, pandas, scikit-learn, and matplotlib.

     
    1# Import Libraries 
    2import numpy as np 
    3import pandas as pd 
    4from sklearn.datasets import load_iris 
    5from sklearn.model_selection import train_test_split 
    6import matplotlib.pyplot as plt  
    7

    Load Dataset

    Let's first load the Iris dataset using the load_iris() function of the scikit-learn library and separate it into features and target labels. This data set has three classes: Iris-setosa, Iris-versicolor, and Iris-virginica.

     
    1# Load dataset 
    2data = load_iris() 
    3 
    4# Get features and target 
    5X=data.data 
    6y=data.target  
    7

    Prepare Dataset

    Create dummy variables for class labels using get_dummies() function

     
    1# Get dummy variable 
    2y = pd.get_dummies(y).values 
    3 
    4print(y[:3])  
    5

    Output:

     
    1array([[1, 0, 0], 
    2       [1, 0, 0], 
    3       [1, 0, 0]], dtype=uint8)  
    4

    Split train and test set

    To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

    Let's split the dataset using the function train_test_split(). You basically need to pass three parameters: features, target, and test-set size. Additionally, you can use random_state in order to get the same train and test sets each time.

     
    1#Split data into train and test data 
    2X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20, random_state=4)  
    3

    Initialize Hyperparameters and Weights

    Let's initialize the hyperparameters such as the learning rate, number of iterations, input size, number of hidden-layer neurons, and number of output-layer neurons.

     
    1# Initialize variables 
    2learning_rate = 0.1 
    3iterations = 5000 
    4N = y_train.size 
    5 
    6# number of input features 
    7input_size = 4 
    8 
    9# number of hidden layers neurons 
    10hidden_size = 2 
    11 
    12# number of neurons at the output layer 
    13output_size = 3 
    14 
    15results = pd.DataFrame(columns=["mse", "accuracy"])  
    16

    Let's initialize the weights for the hidden and output layers with random values.

     
    1# Initialize weights 
    2np.random.seed(10) 
    3 
    4# initializing weight for the hidden layer 
    5W1 = np.random.normal(scale=0.5, size=(input_size, hidden_size)) 
    6 
    7# initializing weight for the output layer 
    8W2 = np.random.normal(scale=0.5, size=(hidden_size , output_size))  
    9

    Helper Functions

    Let's create helper functions: sigmoid, mean_squared_error, and accuracy.

     
    1def sigmoid(x): 
    2    return 1 / (1 + np.exp(-x)) 
    3 
    4def mean_squared_error(y_pred, y_true): 
    5    return ((y_pred - y_true)**2).sum() / (2*y_pred.size) 
    6 
    7def accuracy(y_pred, y_true): 
    8    acc = y_pred.argmax(axis=1) == y_true.argmax(axis=1) 
    9    return acc.mean()  
    10

    Backpropagation Neural Network

    In this phase, we will build the backpropagation neural network in three steps: feedforward propagation, error calculation, and backpropagation. We will create a for loop over the given number of iterations that executes these three steps and updates the weights in each iteration.

     
    1for itr in range(iterations): 
    2 
    3    # feedforward propagation 
    4    # on hidden layer 
    5    Z1 = np.dot(X_train, W1) 
    6    A1 = sigmoid(Z1) 
    7 
    8    # on output layer 
    9    Z2 = np.dot(A1, W2) 
    10    A2 = sigmoid(Z2) 
    11 
    12 
    13    # Calculating error 
    14    mse = mean_squared_error(A2, y_train) 
    15    acc = accuracy(A2, y_train) 
    16    results=results.append({"mse":mse, "accuracy":acc},ignore_index=True ) 
    17 
    18    # backpropagation 
    19    E1 = A2 - y_train 
    20    dW1 = E1 * A2 * (1 - A2) 
    21 
    22    E2 = np.dot(dW1, W2.T) 
    23    dW2 = E2 * A1 * (1 - A1) 
    24 
    25    # weight updates 
    26    W2_update = np.dot(A1.T, dW1) / N 
    27    W1_update = np.dot(X_train.T, dW2) / N 
    28 
    29    W2 = W2 - learning_rate * W2_update 
    30    W1 = W1 - learning_rate * W1_update  
    31

    Plot MSE and Accuracy

    Let's plot the mean squared error in each iteration using the pandas plot() function.

     
    1results.mse.plot(title="Mean Squared Error")  
    2

    Output:

    PIC

    Figure 3.6:Mean squared error

    Let's plot the accuracy in each iteration using the pandas plot() function.

     
    1results.accuracy.plot(title="Accuracy")  
    2

    Output:

    PIC

    Figure 3.7:Accuracy
    Predict for Test Data and Evaluate the Performance

    Let's make predictions for the test data and assess the performance of the backpropagation neural network.

     
    1# feedforward 
    2Z1 = np.dot(X_test, W1) 
    3A1 = sigmoid(Z1) 
    4 
    5Z2 = np.dot(A1, W2) 
    6A2 = sigmoid(Z2) 
    7 
    8acc = accuracy(A2, y_test) 
    9print("Accuracy: {}".format(acc))  
    10

    Output:

     
    1Accuracy: 0.8  
    2

    As you can see in the above output, we are getting 80% accuracy on the test dataset.

    3.12 Agents in Artificial Intelligence

    An AI system can be defined as the study of the rational agent and its environment. The agents sense the environment through sensors and act on their environment through actuators. An AI agent can have mental properties such as knowledge, belief, intention, etc.

    3.12.1 What is an Agent?

    An agent can be anything that perceives its environment through sensors and acts upon that environment through actuators. An agent runs in a cycle of perceiving, thinking, and acting. An agent can be:

    1. Human-Agent: A human agent has eyes, ears, and other organs which work as sensors, and hands, legs, and the vocal tract, which work as actuators.

    2. Robotic Agent: A robotic agent can have cameras, an infrared range finder, and NLP for sensors, and various motors for actuators.

    3. Software Agent: A software agent can have keystrokes and file contents as sensory input, act on those inputs, and display output on the screen.

    Hence the world around us is full of agents such as thermostats, cellphones, and cameras; even we ourselves are agents. Before going further, we should first know about sensors, effectors, and actuators.

    PIC

    Figure 3.8:Thermostat
    1. Sensor: Sensor is a device which detects the change in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.

    2. Actuators: Actuators are the component of machines that converts energy into motion. The actuators are only responsible for moving and controlling a system. An actuator can be an electric motor, gears, rails, etc.

    3. Effectors: Effectors are the devices which affect the environment. Effectors can be legs, wheels, arms, fingers, wings, fins, and display screen.

    PIC

    Figure 3.9:Intelligent Agents

    3.12.2 Intelligent Agents

    An intelligent agent is an autonomous entity which acts upon an environment using sensors and actuators to achieve goals. An intelligent agent may learn from the environment to achieve its goals. A thermostat is an example of an intelligent agent. Following are the main four rules for an AI agent:

    Rule 1: An AI agent must have the ability to perceive the environment.

    Rule 2: The observation must be used to make decisions.

    Rule 3: Decision should result in an action.

    Rule 4: The action taken by an AI agent must be a rational action.

    3.12.3 Rational Agent

    A rational agent is an agent which has clear preferences, models uncertainty, and acts in a way that maximizes its performance measure over all possible actions. A rational agent is said to do the right thing. AI is about creating rational agents, drawing on game theory and decision theory, for various real-world scenarios. For an AI agent, rational action is most important because in reinforcement learning the agent gets a positive reward for each best possible action and a negative reward for each wrong action.

    Rationality

    The rationality of an agent is measured by its performance measure. Rationality can be judged on the basis of following points:

    1. Performance measure which defines the success criterion.

    2. The agent's prior knowledge of its environment.

    3. Best possible actions that an agent can perform.

    4. The sequence of percepts.

    3.12.4 Structure of an AI Agent

    The task of AI is to design an agent program which implements the agent function. The structure of an intelligent agent is a combination of architecture and agent program. It can be viewed as:

    Agent = Architecture + Agent program

    Following are the main three terms involved in the structure of an AI agent:

    1. Architecture: Architecture is machinery that an AI agent executes on.

    2. Agent Function: Agent function is used to map a percept to an action.

    3. Agent program: Agent program is an implementation of agent function. An agent program executes on the physical architecture to produce function f.

    3.12.5 PEAS Representation

    PEAS is a type of model on which an AI agent works. When we define an AI agent or rational agent, we can group its properties under the PEAS representation model. It is made up of four words:

    P: Performance measure

    E: Environment

    A: Actuators

    S: Sensors

    Example

    Let's take a self-driving car; its PEAS representation will be:

    Performance: Safety, time, legal drive, comfort

    Environment: Roads, other vehicles, road signs, pedestrian

    Actuators: Steering, accelerator, brake, signal, horn

    Sensors: Camera, GPS, speedometer, odometer, accelerometer, sonar.

    3.12.6 Types of agents

    Agents can be grouped into five classes based on their degree of perceived intelligence and capability. All these agents can improve their performance and generate better actions over time. These are given below:

    1. Simple Reflex Agent

    2. Model-based reflex agent

    3. Goal-based agents

    4. Utility-based agent

    5. Learning agent

    Simple Reflex agent

    The simple reflex agents are the simplest agents. These agents take decisions on the basis of the current percepts and ignore the rest of the percept history. These agents only succeed in a fully observable environment. The simple reflex agent does not consider any part of the percept history during its decision and action process. The simple reflex agent works on a condition-action rule, which means it maps the current state to an action. For example, a room-cleaner agent works only if there is dirt in the room.
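    A minimal sketch of such a condition-action mapping for the room-cleaner example (the percept format and action names are assumptions made for illustration):

def simple_reflex_vacuum_agent(percept):
    """Map the current percept directly to an action; no percept history is kept."""
    location, status = percept        # e.g. ("A", "Dirty")
    if status == "Dirty":             # condition-action rule: dirt -> suck
        return "Suck"
    elif location == "A":             # otherwise move to the other square
        return "Right"
    else:
        return "Left"

print(simple_reflex_vacuum_agent(("A", "Dirty")))   # Suck
print(simple_reflex_vacuum_agent(("B", "Clean")))   # Left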

    Problems for the simple reflex agent design approach:

    1. They have very limited intelligence

    2. They do not have knowledge of non-perceptual parts of the current state

    3. The condition-action rules are mostly too big to generate and to store.

    4. Not adaptive to changes in the environment.

    PIC

    Figure 3.10:Simple reflex agent
    Model-based reflex agent

    The Model-based agent can work in a partially observable environment, and track the situation. A model-based agent has two important factors:

    1. Model: It is knowledge about ”how things happen in the world,” so it is called a Model-based agent.

    2. Internal State: It is a representation of the current state based on percept history.

    These agents have the model, which is "knowledge of the world", and based on the model they perform actions. Updating the agent state requires information about how the world evolves and how the agent's actions affect the world.

    PIC

    Figure 3.11:Model based reflex agent
    Goal-based agents

    Knowledge of the current state of the environment is not always sufficient for an agent to decide what to do. The agent needs to know its goal, which describes desirable situations. Goal-based agents expand the capabilities of the model-based agent by having the "goal" information. They choose an action so that they can achieve the goal. These agents may have to consider a long sequence of possible actions before deciding whether the goal is achieved or not. Such consideration of different scenarios is called searching and planning, which makes an agent proactive.

    PIC

    Figure 3.12:Goal-based agents
    Utility-based agents

    These agents are similar to the goal-based agent but provide an extra component of utility measurement, which makes them different by providing a measure of success in a given state. A utility-based agent acts based not only on goals but also on the best way to achieve them. The utility-based agent is useful when there are multiple possible alternatives and an agent has to choose the best action to perform. The utility function maps each state to a real number to check how efficiently each action achieves the goals.
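    A minimal sketch of this idea (the states, transition model, and utility values below are made up): the agent evaluates the state each available action would lead to and picks the action whose resulting state has the highest utility.

# made-up transition model and utility function, for illustration only
transition = {"stay": "same room", "move": "next room", "charge": "docking station"}
utility = {"same room": 0.2, "next room": 0.6, "docking station": 0.9}

def utility_based_agent(actions):
    # choose the action whose resulting state maximizes the utility function
    return max(actions, key=lambda a: utility[transition[a]])

print(utility_based_agent(["stay", "move", "charge"]))   # charge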

    PIC

    Figure 3.13:Utility based agent
    Learning Agents

    A learning agent in AI is the type of agent which can learn from its past experiences, i.e., it has learning capabilities. It starts to act with basic knowledge and is then able to act and adapt automatically through learning.

    A learning agent has mainly four conceptual components, which are:

    1. Learning element: It is responsible for making improvements by learning from environment

    2. Critic: The learning element takes feedback from the critic, which describes how well the agent is doing with respect to a fixed performance standard.

    3. Performance element: It is responsible for selecting external action

    4. Problem generator: This component is responsible for suggesting actions that will lead to new and informative experiences.

    Hence, learning agents are able to learn, analyze performance, and look for new ways to improve the performance.

    PIC

    Figure 3.14:Learning agent

    3.13 Agent Environment in AI

    An environment is everything in the world which surrounds the agent, but it is not a part of the agent itself. An environment can be described as a situation in which an agent is present. The environment is where the agent lives and operates, and it provides the agent with something to sense and act upon. An environment is mostly said to be non-deterministic.

    3.13.1 Features of Environment

    As per Russell and Norvig, an environment can have various features from the point of view of an agent:

    1. Fully observable vs Partially Observable

    2. Static vs Dynamic

    3. Discrete vs Continuous

    4. Deterministic vs Stochastic

    5. Single-agent vs Multi-agent

    6. Episodic vs sequential

    7. Known vs Unknown

    8. Accessible vs Inaccessible

    Fully observable vs Partially Observable

    If an agent's sensors can sense or access the complete state of the environment at each point in time, then it is a fully observable environment; otherwise it is partially observable. A fully observable environment is easy, as there is no need to maintain an internal state to keep track of the history of the world. If an agent has no sensors at all, then the environment is called unobservable.

    Deterministic vs Stochastic

    If an agent’s current state and selected action can completely determine the next state of the environment, then such environment is called a deterministic environment. A stochastic environment is random in nature and cannot be determined completely by an agent. In a deterministic, fully observable environment, agent does not need to worry about uncertainty.

    Episodic vs Sequential

    In an episodic environment, there is a series of one-shot actions, and only the current percept is required for the action. However, in Sequential environment, an agent requires memory of past actions to determine the next best actions.

    Single-agent vs Multi-agent

    If only one agent is involved in an environment, and operating by itself then such an environment is called single agent environment. However, if multiple agents are operating in an environment, then such an environment is called a multi-agent environment. The agent design problems in the multi-agent environment are different from single agent environment.

    Static vs Dynamic

    If the environment can change itself while an agent is deliberating, then such an environment is called a dynamic environment; otherwise it is called a static environment. Static environments are easy to deal with because an agent does not need to continue looking at the world while deciding on an action. However, for a dynamic environment, agents need to keep looking at the world before each action. Taxi driving is an example of a dynamic environment, whereas crossword puzzles are an example of a static environment.

    Discrete vs Continuous

    If in an environment there are a finite number of percepts and actions that can be performed within it, then such an environment is called a discrete environment; otherwise it is called a continuous environment. A chess game comes under a discrete environment, as there is a finite number of moves that can be performed. A self-driving car is an example of a continuous environment.

    Known vs Unknown

    Known and unknown are not actually features of an environment but describe an agent's state of knowledge to perform an action. In a known environment, the results of all actions are known to the agent, while in an unknown environment the agent needs to learn how it works in order to perform an action. It is quite possible for a known environment to be partially observable and for an unknown environment to be fully observable.

    Accessible vs Inaccessible

    If an agent can obtain complete and accurate information about the environment's state, then such an environment is called an accessible environment; otherwise it is called inaccessible. An empty room whose state can be defined by its temperature is an example of an accessible environment. Information about an event on earth is an example of an inaccessible environment.

    3.14 Turing Test in AI

    In 1950, Alan Turing introduced a test to check whether a machine can think like a human or not; this test is known as the Turing Test. In this test, Turing proposed that a computer can be said to be intelligent if it can mimic human responses under specific conditions. The Turing Test was introduced by Turing in his 1950 paper, "Computing Machinery and Intelligence", which considered the question, "Can machines think?"

    PIC

    Figure 3.15:Turing test in AI

    The Turing test is based on a party game, the "imitation game", with some modifications. This game involves three players: one player is a computer, another player is a human responder, and the third player is a human interrogator, who is isolated from the other two players and whose job is to find out which of the two is the machine. Consider that Player A is a computer, Player B is a human, and Player C is an interrogator. The interrogator is aware that one of them is a machine but needs to identify which one on the basis of questions and their responses.

    The conversation between all players is via keyboard and screen, so the result does not depend on the machine's ability to render words as speech. The test result does not depend on each answer being correct, but only on how closely the machine's responses resemble human answers. The computer is permitted to do everything possible to force a wrong identification by the interrogator.

    The questions and answers can be like:

    Interrogator: Are you a computer?

    PlayerA (Computer): No

    Interrogator: Multiply two large numbers such as (256896489 × 456725896)

    Player A: Pauses for a long time and then gives a wrong answer.

    In this game, if the interrogator is not able to identify which is the machine and which is the human, then the computer passes the test successfully, and the machine is said to be intelligent and able to think like a human.

    In 1991, the New York businessman Hugh Loebner announced a prize competition, offering a $100,000 prize for the first computer to pass the Turing test. However, no AI program has, to date, come close to passing an undiluted Turing test.

    3.14.1 Chatbots to attempt the Turing test

    3.15 The Chinese Room Argument

    There were many philosophers who disagreed with the whole concept of Artificial Intelligence. The most famous argument in this list is the "Chinese Room". In 1980, John Searle presented the "Chinese Room" thought experiment in his paper "Minds, Brains, and Programs", which argued against the validity of Turing's test. According to his argument, "Programming a computer may make it appear to understand language, but it will not produce a real understanding of language or consciousness in the computer." He argued that machines such as ELIZA and PARRY could easily pass the Turing test by manipulating keywords and symbols, but they had no real understanding of language, so this cannot be described as a "thinking" capability like that of a human.

    Features required for a machine to pass the Turing test

    1. Natural language processing: NLP is required to communicate with Interrogator in general human language like English.

    2. Knowledge representation: To store and retrieve information during the test.

    3. Automated reasoning: To use the previously stored information for answering the questions.

    4. Machine learning: To adapt new changes and can detect generalized patterns.

    5. Vision (For total Turing test): To recognize the interrogator actions and other objects during a test.

    6. Motor Control (For total Turing test): To act upon objects if requested.

    Chapter 4
    Natural language processing

    Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of ”understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

    4.1 History

    Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled ”Computing Machinery and Intelligence” which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.

    4.1.1 Symbolic NLP (1950s to early 1990s)

    The premise of symbolic NLP is well-summarized by John Searle’s Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it is confronted with.

    4.1.2 Statistical NLP (1990s to 2010s)

    Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.

    4.1.3 Neural NLP (present)

    In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, for example in language modeling, parsing, and many others. This is increasingly important in medicine and healthcare, where NLP is being used to analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care. 56

    4.1.4 Components of NLP

    There are the following two components of NLP

    4.1.5 Natural Language Understanding (NLU)

    Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotion, relations, and semantic roles. NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.

    NLU involves the following tasks:

    1. It is used to map the given input into a useful representation.

    2. It is used to analyze different aspects of the language.

    4.1.6 Natural Language Generation (NLG)

    Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural language representation. It mainly involves Text planning, Sentence planning, and Text Realization.

    4.1.7 Difference between NLU and NLG



    NLU | NLG

    NLU is the process of reading and interpreting language. | NLG is the process of writing or generating language.

    It produces non-linguistic outputs from natural language inputs. | It produces natural language outputs from non-linguistic inputs.


    Table 4.1:Difference between NLU and NLG

    4.2 Applications of NLP

    The following are the applications of NLP

    4.3 Speech recognition using Python

    I am going to explain how to write a small program that converts speech to text. There are many ways to do this, either using online speech recognition tools like Google Speech Recognition, Google Cloud Speech API, and Wit.ai, or offline tools like CMU Sphinx. I will be using CMU Sphinx because I expect my computer to convert speech even in the absence of an internet connection. The following are the steps for this task.

    1. Install prerequisites: Sphinx needs two libraries, namely swig and pyaudio.

    2. Install speech recognition library: I shall be using Python’s SpeechRecognition library.

    Prerequisites

    Swig and pyaudio are two important libraries that assist in recording audio (speech).

    What is swig

    SWIG is an interface compiler that connects programs written in C and C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code. In addition, SWIG provides a variety of customization features that let you tailor the wrapping process to suit your application. Visit http://www.swig.org/exec.html for more details as how swig helps in developing extension modules using several other scripting languages.

    Anybody can download swig from http://www.swig.org/download.html. There are instructions on how to obtain swig and install it on different operating systems. For instance, swig for Windows is available at http://prdownloads.sourceforge.net/swig/swigwin-4.0.2.zip. Sourceforge is a platform for creating, collaborating on, and distributing open-source and business-related software. Visit https://sourceforge.net/ for more details.

    After downloading swig, just extract it to a folder, preferably on the C: drive in Windows. Then create an environment variable using the "advanced system settings" feature in Windows.

    Now it's time to install the other requirement, i.e., pyaudio. Installing pyaudio is not straightforward in Windows, as pyaudio needs pipwin instead of pip.

     
    1pip install pipwin 
    2pipwin install pyaudio  
    3

    The last requirement is the Python package SpeechRecognition, the main one for recording and translating speech into text.

     
    1pip install SpeechRecognition  
    2

    Now we are ready to write a basic script for speech recognition. The following is the entire code required to listen and print the speech as text.

     
    1import speech_recognition as sr 
    2 
    3# obtain audio from the microphone 
    4r = sr.Recognizer() 
    5with sr.Microphone() as source: 
    6    print("Please wait. Calibrating microphone...") 
    7    # listen for 5 seconds and create the ambient noise energy level 
    8    r.adjust_for_ambient_noise(source, duration=5) 
    9    print("Say something!") 
    10    audio = r.listen(source) 
    11 
    12    # recognize speech using Sphinx 
    13try: 
    14    print("Sphinx thinks you said " + r.recognize_sphinx(audio) + "") 
    15 
    16except sr.UnknownValueError: 
    17    print("Sphinx could not understand audio") 
    18except sr.RequestError as e: 
    19    print("Sphinx error; {0}".format(e))  
    20

    We first import the speech_recognition module as sr and create a Recognizer object r. This object records audio through its .listen() method. Once the audio has been recorded, the statement r.recognize_sphinx(audio) converts the audio, or speech, into text.

    4.4 How to build an NLP pipeline

    There are the following steps to build an NLP pipeline

    1. Step1: Sentence Segmentation: Sentence Segment is the first step for building the NLP pipeline. It breaks the paragraph into separate sentences.

    2. Step2: Word Tokenization: Word Tokenizer is used to break the sentence into separate words or tokens.

    3. Step 3: Stemming: Stemming is used to normalize words into their base or root form. For example, celebrates, celebrated, and celebrating all originate from the single root word "celebrate". The big problem with stemming is that it sometimes produces a root word which may not have any meaning (see the short sketch after this list).

    4. Step 4: Lemmatization: Lemmatization is quite similar to stemming. It is used to group different inflected forms of a word into their lemma. The main difference between stemming and lemmatization is that lemmatization produces a root word which has a meaning.

    5. Step 5: Identifying Stop Words: In English, there are a lot of words that appear very frequently like ”is”, ”and”, ”the”, and ”a”. NLP pipelines will flag these words as stop words. Stop words might be filtered out before doing any statistical analysis.

    6. Step 6: Dependency Parsing: Dependency parsing is used to find how all the words in the sentence are related to each other.

    7. Step 7: POS tags: POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS tag indicates how a word functions, in meaning as well as grammatically, within a sentence. A word can have one or more parts of speech based on the context in which it is used.

    8. Step 8: Named Entity Recognition (NER): Named Entity Recognition (NER) is the process of detecting the named entity such as person name, movie name, organization name, or location.

    9. Step 9: Chunking: Chunking is used to collect individual pieces of information and group them into bigger pieces (chunks) of sentences.
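    As a small illustration of steps 3 and 4 (assuming the NLTK package is installed and its wordnet corpus has been downloaded), a stemmer can return a root that is not a meaningful word, while a lemmatizer returns the dictionary form:

from nltk.stem import PorterStemmer, WordNetLemmatizer   # requires: pip install nltk; nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["celebrates", "celebrated", "celebrating"]:
    # stem (may not be a real word) versus lemma (dictionary form)
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))

    Here the stem is the truncated form "celebr", which has no meaning of its own, whereas the lemma is the dictionary word "celebrate".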

    4.5 Python practice for NLP pipelines

    Python has many libraries for NLP practice and development. Following are a few packages for NLP.

    1. Natural Language Toolkit (NLTK)

    2. gensim

    3. polyglot

    4. textblob

    5. CoreNLP

    6. spaCy

    7. pattern

    8. vocabulary

    9. PyNLPl

    10. quepy

    I shall show how to practice NLP pipelines using spaCy.

    4.5.1 spaCy for NLP

    The main portal for spaCy is https://spacy.io/. It has a lot of information, including documentation and manuals. The website boasts that spaCy is "Industrial-Strength Natural Language Processing". spaCy supports various things, say, multiple languages, machine learning, visualization, packaging, and deployment. Above all, spaCy works with certain standard machine learning libraries such as PyTorch, TensorFlow, etc. There are a number of things in NLP, but for the sake of simplicity, I shall explain only a few steps of the NLP pipeline using spaCy plus native Python code.

    Segmentation

    Sentence segmentation, also simply called segmentation in NLP, is just dividing a given text into sentences. A Doc object's sentences are available via the Doc.sents property. To view a Doc's sentences, you can iterate over Doc.sents, a generator that yields Span objects. You can check whether a Doc has sentence boundaries by calling Doc.has_annotation with the attribute name "SENT_START".

     
    1import spacy 
    2 
    3nlp = spacy.load("en_core_web_sm") 
    4doc = nlp("This is a sentence. This is another sentence.") 
    5 
    6for sent in doc.sents: 
    7    print(sent.text)  
    8

    The output will be as follows:

     
    1This is a sentence. 
    2This is another sentence.  
    3

    There is nothing great about this. The same can be achieved using Python's basic string method split().

     
    1>>> txt = "This is sentence one. This is sentence two." 
    2>>> txt.split(".") 
    3['This is sentence one', ' This is sentence two', ''] 
    4>>> out = txt.split(".") 
    5>>> out[0] 
    6'This is sentence one' 
    7>>> out[1] 
    8' This is sentence two'  
    9

    Tokenization

    Tokenization is the process of demarcating and possibly classifying sections of a string of input words and characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

    As per spaCy’s documentation, the tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string. During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language.

     
    1doc = nlp("Apple is looking at buying U.K. startup for \$1 billion") 
    2 
    3    for token in doc: 
    4        if not token.is\_stop and token.is\_alpha: 
    5            print(token.text)  
    6

    There isn't any special command to create tokens from a given string in spaCy; the nlp() call itself tokenizes the text. The if condition checks for tokens that are alphabetic but not stop words. As a result we get the following output.

     
    1Apple 
    2looking 
    3buying 
    4startup 
    5billion  
    6

    The output is free of stop words such as "is" and "at", as well as other symbols such as the period (.) and the dollar sign ($).

    Lemmatization

    Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.

    In spaCy; the Lemmatizer is a pipeline component that provides lookup and rule-based lemmatization methods in a configurable component. An individual language can extend the Lemmatizer as part of its language data.

     
    1doc = nlp("I was reading the paper.") 
    2    for token in doc: 
    3        print((token.text, token.lemma_))  
    4

    The output is as follows.

     
    1('I', '-PRON-') 
    2('was', 'be') 
    3('reading', 'read') 
    4('the', 'the') 
    5('paper', 'paper') 
    6('.', '.')  
    9

    Three of the words, namely I, was, and reading, were lemmatized.

    Named-entity recognition

    Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

    spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

    A named entity is a "real-world object" that is assigned a name, for example, a person, a country, a product, or a book title. spaCy can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case. Named entities are available as the ents property of a Doc:

     
    1doc = nlp("Apple is looking at buying U.K. startup for \$1 billion") 
    2    for ent in doc.ents: 
    3        print((ent.text, ent.label_))  
    4

    The output is as follows.

     
    1('Apple', 'ORG') 
    2('U.K.', 'GPE') 
    3('$1 billion', 'MONEY')  
    4

    4.5.2 Phases of NLP

    Following are the five phases of NLP.

    1. Lexical and Morphological Analysis: The first phase of NLP is lexical analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.

    2. Syntactic Analysis (Parsing): Syntactic analysis is used to check grammar and word arrangement, and it shows the relationships among the words (a short spaCy sketch follows this list).

    3. Semantic Analysis: Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.

    4. Discourse Integration: Discourse integration means that the interpretation of a sentence depends upon the sentences that precede it and may also invoke the meaning of the sentences that follow it.

    5. Pragmatic Analysis: Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by applying a set of rules that characterize cooperative dialogues.
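    As a small illustration of the syntactic-analysis phase (assuming the en_core_web_sm model used earlier in this chapter is installed), spaCy exposes a part-of-speech tag and a dependency relation on every token:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a startup.")

for token in doc:
    # token text, part-of-speech tag, dependency label, and the head it depends on
    print(token.text, token.pos_, token.dep_, token.head.text)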

    Notes

    33Hopfield, J. J. (1982). ”Neural networks and physical systems with emergent collective computational abilities”. Proc. Natl. Acad. Sci. U.S.A. 79 (8): 2554-2558. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.

    34Bain (1873). Mind and Body: The Theories of Their Relation. New York: D. Appleton and Company.

    35James (1890). The Principles of Psychology. New York: H. Holt and Company.

    36Sherrington, C.S. (1898). ”Experiments in Examination of the Peripheral Distribution of the Fibers of the Posterior Roots of Some Spinal Nerves”. Proceedings of the Royal Society of London. 190: 45-186. doi:10.1098/rstb.1898.0002.

    37McCulloch, Warren; Walter Pitts (1943). ”A Logical Calculus of Ideas Immanent in Nervous Activity”. Bulletin of Mathematical Biophysics. 5 (4): 115-133. doi:10.1007/BF02478259.

    38Farley, B.; W.A. Clark (1954). ”Simulation of Self-Organizing Systems by Digital Computer”. IRE Transactions on Information Theory. 4 (4): 76-84. doi:10.1109/TIT.1954.1057468.

    39Rochester, N.; J.H. Holland, L.H. Habit and W.L. Duda (1956). ”Tests on a cell assembly theory of the action of the brain, using a large digital computer”. IRE Transactions on Information Theory. 2 (3): 80-93. doi:10.1109/TIT.1956.1056810.

    40Rosenblatt, F. (1958). ”The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain”. Psychological Review. 65 (6): 386-408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029.

    41Werbos, P.J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.

    42Minsky, M.; S. Papert (1969). An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.

    43Rumelhart, D.E.; James McClelland (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press.

    44McCulloch, Warren; Pitts, Walter (1943). ”A Logical Calculus of Ideas Immanent in Nervous Activity”. Bulletin of Mathematical Biophysics. 5 (4): 115-133. doi:10.1007/BF02478259.

    45A Hopfield network (or Ising model of a neural network or Ising-Lenz-Little model) is a form of recurrent artificial neural network and a type of spin glass system popularised by John Hopfield in 1982, as described earlier by Little in 1974, based on Ernst Ising's work with Wilhelm Lenz on the Ising model. Hopfield networks serve as content-addressable (”associative”) memory systems with binary threshold nodes. Hopfield networks also provide a model for understanding human memory.

    46Copeland, B. Jack, ed. (2004). The Essential Turing. Oxford University Press. p. 403. ISBN 978-0-19-825080-7.

    47von Neumann, John (1945), First Draft of a Report on the EDVAC (PDF), archived from the original (PDF) on March 14, 2013, retrieved August 24, 2011

    48Turing's 1948 paper has been re-printed as Turing AM. Intelligent Machinery. In: Ince DC, editor. Collected Works of A.M. Turing: Mechanical Intelligence. Elsevier Science Publishers, 1992.

    49Hebb, D.O. (1949). The Organization of Behavior. New York: Wiley & Sons.

    50Zell, Andreas (1994). Simulation Neuronaler Netze [Simulation of Neural Networks] (in German) (1st ed.). Addison-Wesley. p. 73. ISBN 3-89319-554-8.

    51Schmidhuber, Jürgen (2015-01-01). ”Deep learning in neural networks: An overview”. Neural Networks. 61: 85-117. arXiv:1404.7828.

    52Auer, Peter; Harald Burgsteiner; Wolfgang Maass (2008). ”A learning rule for very simple universal approximators consisting of a single layer of perceptrons” (PDF). Neural Networks. 21 (5): 786-795. doi:10.1016/j.neunet.2007.12.036.

    53Koskenniemi, Kimmo (1983), Two-level morphology: A general computational model of word-form recognition and production (PDF), Department of General Linguistics, University of Helsinki

    54Joshi, A. K., Weinstein, S. (1981, August). Control of Inference: Role of Some Aspects of Discourse Structure-Centering. In IJCAI (pp. 385-387).

    55Guida, G.; Mauri, G. (July 1986). ”Evaluation of natural language processing systems: Issues and approaches”. Proceedings of the IEEE. 74 (7): 1026-1035. doi:10.1109/PROC.1986.13580. ISSN 1558-2256. S2CID 30688575.

    56Turchin, Alexander; Florez Builes, Luisa F. (2021-03-19). ”Using Natural Language Processing to Measure and Improve Quality of Diabetes Care: A Systematic Review”. Journal of Diabetes Science and Technology. 15 (3): 553-560.

    Exercises

    1. What is von Neumann architecture? Explain with the help of a neat diagram.

    2. Write about Sherrington's habituation and its significance in Artificial Neural Networks.

    3. What is Turing's B-type machine? Write about its significance in Artificial Neural Networks.

    4. What is Hebbian learning? Write about its significance in Artificial Neural Networks.

    5. Write about the following.

    • Back-propagation networks

    • Feed-forward networks

    6. “Artificial neurons coupled with the theory of connectionism gave rise to the field of artificial neural networks...” - Comment.

    7. Find out different types of agents with their PEAS representation. A few use cases: Medical Diagnosis, Vacuum Cleaner, Part-picking Robot, etc.

    8. Read about speech recognition using Python. Write a small Python script that can convert speech to text. Read a bit of the procedure at https://realpython.com/python-speech-recognition/.

    Chapter 5
    Machine Learning

    5.1 Supervised learning

    Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a ”reasonable” way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.57 58

    To solve a given problem of supervised learning, one has to perform the following steps:

    1. Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, an entire sentence of handwriting or perhaps a full paragraph of handwriting.

    2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.

    3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.

    4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support-vector machines or decision trees.

    5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.

    6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
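    A minimal sketch of these six steps with scikit-learn (the Iris dataset and a decision tree are arbitrary choices made for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# steps 1-3: a labelled data set with a fixed feature representation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# steps 4-5: choose a learning algorithm and run it on the training set
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# step 6: evaluate the learned function on a separate test set
print(accuracy_score(y_test, clf.predict(X_test)))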

    Algorithms

    The most widely used learning algorithms are:

    1. Support-vector machines

    2. Linear regression

    3. Logistic regression

    4. Naive Bayes

    5. Linear discriminant analysis

    6. Decision trees

    7. K-nearest neighbor algorithm

    8. Neural networks (Multilayer perceptron)

    9. Similarity learning

    Applications

    1. Bioinformatics

    2. Cheminformatics

    3. Quantitative structureactivity relationship

    4. Database marketing

    5. Handwriting recognition

    6. Information retrieval

    7. Learning to rank

    8. Information extraction, object recognition in computer vision, optical character recognition, and spam detection

    9. Pattern recognition

    10. Speech recognition

    11. Supervised learning is a special case of downward causation in biological systems

    12. Landform classification using satellite imagery

    5.1.1 Logistic regression

    In statistics, logistic regression is used to model the probability of a certain class or event. I will be focusing more on the basics and implementation of the model and will not go too deep into the math. Logistic regression is similar to linear regression because both involve estimating the values of parameters used in the prediction equation based on the given training data. Linear regression predicts the value of some continuous, dependent variable, whereas logistic regression predicts the probability of an event or class that is dependent on other factors. Thus the output of logistic regression always lies between 0 and 1. Because of this property it is commonly used for classification purposes.

    Logistic Model

    Consider a model with features x1, x2, x3, ..., xn. Let the binary output be denoted by Y, which can take the values 0 or 1. Let p be the probability of Y = 1; we can denote it as p = P(Y = 1). The mathematical relationship between these variables can be denoted as:

    log_b [p / (1 - p)] = β0 + β1x1 + β2x2 + ... + βnxn (5.1)

    Here the term p/(1 - p) is known as the odds and denotes the likelihood of the event taking place. Thus ln(p/(1 - p)) is known as the log odds and is simply used to map the probability, which lies between 0 and 1, to a range between (-∞, +∞). The terms b0, b1, b2, ... are parameters (or weights) that we will estimate during training. So this is just the basic math behind what we are going to do. We are interested in the probability p in this equation. So we simplify the equation to obtain the value of p:

    1. The log term on the LHS can be removed by raising the base b to the power of both sides:

      p / (1 - p) = b^(β0 + β1x1 + β2x2) (5.2)

    2. Now we can easily simplify to obtain the value of p:

      p = b^(β0+β1x1+β2x2) / (b^(β0+β1x1+β2x2) + 1) = 1 / (1 + b^-(β0+β1x1+β2x2)) = S_b(β0 + β1x1 + β2x2). (5.3)

    This actually turns out to be the equation of the Sigmoid Function which is widely used in other machine learning applications. The Sigmoid Function is given by:

    σ(t) = e^t / (e^t + 1) = 1 / (1 + e^-t) (5.4)

    Now we will be using the above derived equation to make our predictions. Before that we will train our model to obtain the values of our parameters b0, b1, b2 ... that result in least error. This is where the error or loss function comes in.

    Loss Function

    The loss is basically the error in our predicted value. In other words it is a difference between our predicted value and the actual value. We will be using the L2 Loss Function to calculate the error. Theoretically you can use any function to calculate the error. This function can be broken down as:

    1. Let the actual value be y_i and the value predicted using our model be denoted as ŷ_i. Find the difference between the actual and the predicted value.

    2. Square this difference.

    3. Find the sum across all the values in training data.

      L = Σ_{i=1}^{n} (y_i - ŷ_i)^2 (5.5)

    Now that we have the error, we need to update the values of our parameters to minimize this error. This is where the learning actually happens, since our model is updating itself based on its previous output to obtain a more accurate output in the next step. Hence with each iteration our model becomes more and more accurate. We will be using the Gradient Descent Algorithm to estimate our parameters. Another commonly used algorithm is the Maximum Likelihood Estimation.

    The Gradient Descent Algorithm

    You might know that the partial derivative of a function at its minimum value is equal to 0. So gradient descent basically uses this concept to estimate the parameters or weights of our model by minimizing the loss function. For simplicity, for the rest of this tutorial let us assume that our output depends only on a single feature x. So we can rewrite our equation as:

\hat{y}_i = p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}    (5.6)

Thus we need to estimate the values of the weights β0 and β1 using our given training data.

1. Initially let β0 = 0 and β1 = 0. Let α be the learning rate. The learning rate controls how much the values of β0 and β1 are updated at each step of the learning process. Here let α = 0.001.

2. Calculate the partial derivatives with respect to β0 and β1. The value of the partial derivative tells us how far the loss function is from its minimum value; it is a measure of how much our weights need to be updated to attain the minimum, or ideally zero, error. If you have more than one feature, you need to calculate the partial derivative for each weight β0, β1, ..., βn, where n is the number of features.

D_{\beta_0} = -2 \sum_{i=1}^{n} (y_i - \hat{y}_i) \, \hat{y}_i \, (1 - \hat{y}_i), \qquad D_{\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat{y}_i) \, \hat{y}_i \, (1 - \hat{y}_i) \, x_i    (5.7)

    3. Next we update the values of b0 and b1:

\beta_0 = \beta_0 - \alpha \, D_{\beta_0}, \qquad \beta_1 = \beta_1 - \alpha \, D_{\beta_1}    (5.8)

    4. We repeat this process until our loss function is a very small value or ideally reaches 0 (meaning no errors and 100% accuracy). The number of times we repeat this learning process is known as iterations or epochs.
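A minimal NumPy sketch of this procedure for a single feature follows; the data, the learning rate and the number of epochs are illustrative only, and b0, b1 stand for β0, β1:

import numpy as np

# Illustrative training data: one feature x and binary labels y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

b0, b1 = 0.0, 0.0      # initial weights (step 1)
alpha = 0.001          # learning rate
epochs = 10000         # number of iterations

for _ in range(epochs):
    # prediction from equation (5.6): sigmoid of the linear term
    y_hat = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    # partial derivatives from equation (5.7)
    d_b0 = -2 * np.sum((y - y_hat) * y_hat * (1 - y_hat))
    d_b1 = -2 * np.sum((y - y_hat) * y_hat * (1 - y_hat) * x)
    # update step from equation (5.8)
    b0 = b0 - alpha * d_b0
    b1 = b1 - alpha * d_b1

# probability of class 1 for a new observation x = 3.5
print(1.0 / (1.0 + np.exp(-(b0 + b1 * 3.5))))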

    Python practice
     
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegressionCV
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegressionCV(cv=5, random_state=0).fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :]).shape
(2, 3)
>>> clf.score(X, y)
0.98...

    5.1.2 K-Nearest Neighbors Algorithm (k-NN)

The k-nearest neighbors algorithm (k-NN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. 59 60 It is used for classification and regression. In both cases, the input consists of the k closest training examples in the data set. The output depends on whether k-NN is used for classification or regression:

    1. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

    2. In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.

    k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales then normalizing the training data can improve its accuracy dramatically. 61 62

    Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data.
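As a rough illustration of both points, feature scaling and inverse-distance weighting can be combined in scikit-learn; the data set and parameter values below are placeholders chosen only for demonstration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize the features, then classify with distance-weighted 5-NN
clf = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))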

Suppose we have pairs (X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n) taking values in \mathbb{R}^d \times \{1, 2\}, where Y is the class label of X, so that X \mid Y = r \sim P_r for r = 1, 2 (with probability distributions P_r). Given some norm \|\cdot\| on \mathbb{R}^d and a point x \in \mathbb{R}^d, let (X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)}) be a reordering of the training data such that \|X_{(1)} - x\| \le \dots \le \|X_{(n)} - x\|.

    Algorithm

    The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

    A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.

    A drawback of the basic ”majority voting” classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. K-NN can then be applied to the SOM.
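A minimal from-scratch sketch of the plain majority-vote classification phase described above, using Euclidean distance and no weighting (the tiny data set is purely illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # distances from the query point to every stored training vector
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # plurality vote among their class labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0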

    PIC

    Figure 5.1:Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
    Python practice

    Demonstrate the resolution of a regression problem using a k-Nearest Neighbor and the interpolation of the target using both barycenter and constant weights.

     
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

# Fit regression model
n_neighbors = 5

for i, weights in enumerate(["uniform", "distance"]):
    knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, color="darkorange", label="data")
    plt.plot(T, y_, color="navy", label="prediction")
    plt.axis("tight")
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.tight_layout()
plt.show()

    PIC

    Figure 5.2:KNN for both Uniform and distance based weights.

    5.1.3 Decision Trees

    Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.

    PIC

    Figure 5.3:Decision Tree Regression
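Figures of this kind can be produced by fitting regression trees of different depths to a noisy sine curve; a minimal sketch along those lines is shown below (the data, depths and colours are arbitrary choices for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))   # add noise to every fifth target

# A shallow tree versus a deeper one: more depth, more if-then-else rules
regr_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
regr_2 = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
plt.scatter(X, y, s=20, color="darkorange", label="data")
plt.plot(X_test, regr_1.predict(X_test), color="cornflowerblue", label="max_depth=2")
plt.plot(X_test, regr_2.predict(X_test), color="yellowgreen", label="max_depth=5")
plt.legend()
plt.show()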

    Some advantages of decision trees are:

    1. Simple to understand and to interpret. Trees can be visualised.

2. Requires little data preparation. Other techniques often require data normalisation, creation of dummy variables and removal of blank values. Note however that this module does not support missing values.

    3. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

4. Able to handle both numerical and categorical data. However, the scikit-learn implementation does not currently support categorical variables. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

    5. Able to handle multi-output problems.

    6. Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

    7. Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

    8. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

    The disadvantages of decision trees include:

1. Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem (a brief sketch follows this list).

    2. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

    3. Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure. Therefore, they are not good at extrapolation.

    4. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

    5. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

    6. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
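As mentioned in the first point above, capping the depth of the tree or the minimum leaf size is a simple way to limit overfitting; the parameter values in this sketch are arbitrary and only illustrate where such constraints are passed:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree versus one restricted in depth and leaf size
deep = DecisionTreeClassifier(random_state=0)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

print(cross_val_score(deep, X, y, cv=5).mean())
print(cross_val_score(pruned, X, y, cv=5).mean())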

    Classification

    DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset. As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples,n_features) holding the training samples, and an array Y of integer values, shape (n_samples,), holding the class labels for the training samples:

     
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

    After being fitted, the model can then be used to predict the class of samples:

     
>>> clf.predict([[2., 2.]])
array([1])

    In case that there are multiple classes with the same and highest probability, the classifier will predict the class with the lowest index amongst those classes. As an alternative to outputting a specific class, the probability of each class can be predicted, which is the fraction of training samples of the class in a leaf:

     
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

    DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0,..., K-1]) classification. Using the Iris dataset, we can construct a tree as follows:

     
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)

    Once trained, you can plot the tree with the plot_tree function:

     
>>> tree.plot_tree(clf)

    PIC

    Figure 5.4:Decision Fit on Iris Data Set

    We can also export the tree in Graphviz format using the export_graphviz exporter. If you use the conda package manager, the graphviz binaries and the python package can be installed with conda install python-graphviz. 63 Alternatively binaries for graphviz can be downloaded from the graphviz project homepage, and the Python wrapper installed from pypi with pip install graphviz. Below is an example graphviz export of the above tree trained on the entire iris dataset; the results are saved in an output file iris.pdf :

     
>>> import graphviz
>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")

    The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:

     
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                      feature_names=iris.feature_names,
...                      class_names=iris.target_names,
...                      filled=True, rounded=True,
...                      special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph
    Regression

    Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class. As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected to have floating point values instead of integer values:

     
>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])

    5.1.4 Support Vector Machines

    Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. The advantages of support vector machines are:

    1. Effective in high dimensional spaces.

    2. Still effective in cases where number of dimensions is greater than the number of samples.

    3. Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

    4. Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

    The disadvantages of support vector machines include:

1. If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.

2. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

    The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.
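As noted among the advantages above, a custom kernel can be supplied to SVC as a Python callable that returns the Gram matrix between two sets of samples; the hand-written linear kernel below is purely illustrative:

import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

def my_linear_kernel(A, B):
    # Gram matrix between the rows of A and the rows of B
    return np.dot(A, B.T)

clf = svm.SVC(kernel=my_linear_kernel)
clf.fit(X, y)
print(clf.predict(X[:3]))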

    Classification

SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset. SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept the parameter kernel, as the kernel is assumed to be linear. It also lacks some of the attributes of SVC and NuSVC, like support_. As with other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of shape (n_samples, n_features) holding the training samples, and an array y of class labels (strings or integers), of shape (n_samples,):

     
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC()

    After being fitted, the model can then be used to predict new values:

     
>>> clf.predict([[2., 2.]])
array([1])

An SVM's decision function (detailed in the Mathematical formulation) depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the attributes support_vectors_, support_ and n_support_:

     
>>> # get support vectors
>>> clf.support_vectors_
array([[0., 0.],
       [1., 1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)

    Regression

    The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression. The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function ignores samples whose prediction is close to their target.

    There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR. As with classification classes, the fit method will take as argument vectors X, y, only that in this case y is expected to have floating point values instead of integer values:

     
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> regr = svm.SVR()
>>> regr.fit(X, y)
SVR()
>>> regr.predict([[1, 1]])
array([1.5])

    5.2 Unsupervised learning

Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world and then generate imaginative content. In contrast to supervised learning (SL), where data is tagged by a human, e.g. as "car" or "fish", UL exhibits self-organization that captures patterns as neuronal predilections or probability densities. The other levels in the supervision spectrum are reinforcement learning, where the machine is given only a numerical performance score as its guidance, and semi-supervised learning, where a smaller portion of the data is tagged. Two broad methods in UL are Neural Networks and Probabilistic Methods. 64

    5.2.1 Approaches

    Some of the most common algorithms used in unsupervised learning include: (1) Clustering, (2) Anomaly detection, (3) Neural Networks, and (4) Approaches for learning latent variable models. Each approach uses several methods as follows:

    1. Clustering methods include: hierarchical clustering, k-means, mixture models, DBSCAN, and OPTICS algorithm

    2. Anomaly detection methods include: Local Outlier Factor, and Isolation Forest

3. Approaches for learning latent variable models such as the Expectation-maximization algorithm (EM), Method of moments, and Blind signal separation techniques (Principal component analysis, Independent component analysis, Non-negative matrix factorization, Singular value decomposition)

    4. Neural Networks methods include: Autoencoders, Deep Belief Nets, Hebbian Learning, Generative adversarial networks, and Self-organizing map

    5.2.2 Clustering

Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Joseph Zubin in 1938 and Robert Tryon in 1939, and was famously used by Cattell beginning in 1943 for trait theory classification in personality psychology. 65 66 67 68

    Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

    Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

    Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology, typological analysis, and community detection. The subtle differences are often in the use of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest.

    5.2.3 Algorithms for clustering

    Clustering algorithms can be categorized based on their cluster model. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.

There is no objectively “correct” clustering algorithm, but, as has been noted, “clustering is in the eye of the beholder.” The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model will generally fail on a data set that contains a radically different kind of model. For example, k-means cannot find non-convex clusters. 69 The following are some of the cluster models on which clustering algorithms are based.

    1. Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.

    2. Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.

    3. Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the expectation-maximization algorithm.

4. Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.

    5. Subspace models: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.

    6. Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

    7. Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.

    8. Signed graph models: Every path in a signed graph has a sign from the product of the signs on the edges. Under the assumptions of balance theory, edges may change sign and result in a bifurcated graph. The weaker ”clusterability axiom” (no cycle has exactly one negative edge) yields results with more than two clusters, or subgraphs with only positive edges.

    9. Neural models: the most well known unsupervised neural network is the self-organizing map and these models can usually be characterized as similar to one or more of the above models, and including subspace models when neural networks implement a form of Principal Component Analysis or Independent Component Analysis.

    A “clustering” is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:

    1. Hard clustering: each object belongs to a cluster or not

    2. Soft clustering (fuzzy clustering): each object belongs to each cluster to a certain degree (for example, a likelihood of belonging to the cluster)

    There are also finer distinctions possible, for example:

1. Strict partitioning clustering: each object belongs to exactly one cluster

2. Strict partitioning clustering with outliers: objects can also belong to no cluster, in which case they are considered outliers

3. Overlapping clustering (also: alternative clustering, multi-view clustering): objects may belong to more than one cluster; usually involving hard clusters

4. Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster

5. Subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap

    5.2.4 K-Means clustering

K-means clustering is also known as centroid-based clustering. In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

    The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximate method is Lloyd’s algorithm, often just referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).
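A bare-bones sketch of Lloyd's algorithm in NumPy is given below; it uses a single random initialization with no k-means++ seeding or restarts, and the small data set is purely illustrative:

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # initialize centroids as k randomly chosen data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
centers, labels = lloyd_kmeans(X, k=2)
print(centers)
print(labels)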

    Most k-means-type algorithms require the number of clusters - k - to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders of clusters (which is not surprising since the algorithm optimizes cluster centers, not cluster borders).

    K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based clustering, and Lloyd’s algorithm as a variation of the Expectation-maximization algorithm for this model discussed below.

    Centroid-based clustering problems such as k-means and k-medoids are special cases of the uncapacitated, metric facility location problem, a canonical problem in the operations research and computational geometry communities. In a basic facility location problem (of which there are numerous variants that model more elaborate settings), the task is to find the best warehouse locations to optimally service a given set of consumers. One may view “warehouses” as cluster centroids and “consumer locations” as the data to be clustered. This makes it possible to apply the well-developed algorithmic solutions from the facility location literature to the presently considered centroid-based clustering problem.

    5.2.5 K-Means clustering using Python

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on training data, and a function that, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.

Clustering algorithms available in scikit-learn include:

    1. K-Means

    2. Affinity propagation

    3. Mean-shift

    4. Spectral clustering

    5. Ward hierarchical clustering

    6. Agglomerative clustering

    7. DBSCAN

    8. OPTICS

    9. Gaussian mixtures

    10. BIRCH

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields. The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μj of the samples in the cluster. The means are commonly called the cluster centroids; note that they are not, in general, points from X, although they live in the same space. The k-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)    (5.9)

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks. 70 The k-means problem is solved using either Lloyd's or Elkan's algorithm. The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations. The worst-case complexity is given by O(n^(k+2/p)) with n = n_samples and p = n_features. In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls into local minima. That's why it can be useful to restart it several times. If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

     
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
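Continuing this example, the within-cluster sum-of-squares from equation (5.9) is exposed as the inertia_ attribute of the fitted estimator, and the n_init parameter controls how many random restarts are performed (the run with the lowest inertia is kept):

>>> kmeans.inertia_  # within-cluster sum-of-squares, equation (5.9)
16.0
>>> KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).inertia_
16.0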

    5.2.6 Hierarchical clustering

    Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

    1. Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

    2. Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

    In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. 71 72

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of \mathcal{O}(n^3) and requires \Omega(n^2) memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity \mathcal{O}(n^2)) are known: SLINK for single-linkage and CLINK for complete-linkage clustering. With a heap, the runtime of the general case can be reduced to \mathcal{O}(n^2 \log n), an improvement on the aforementioned bound of \mathcal{O}(n^3), at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. Except for the special case of single-linkage, none of the algorithms (except exhaustive search in \mathcal{O}(2^n)) can be guaranteed to find the optimum solution. Divisive clustering with an exhaustive search is \mathcal{O}(2^n), but it is common to use faster heuristics to choose splits, such as k-means. 73 74

    In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

    The choice of an appropriate metric will influence the shape of the clusters, as some elements may be relatively closer to one another under one metric than another. For example, in two dimensions, under the Manhattan distance metric, the distance between the origin (0,0) and (0.5, 0.5) is the same as the distance between the origin and (0, 1), while under the Euclidean distance metric the latter is strictly greater. Some commonly used metrics for hierarchical clustering are:



Name                         Formula

Euclidean distance           \|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}
Squared Euclidean distance   \|a - b\|_2^2 = \sum_i (a_i - b_i)^2
Manhattan distance           \|a - b\|_1 = \sum_i |a_i - b_i|
Maximum distance             \|a - b\|_\infty = \max_i |a_i - b_i|
Mahalanobis distance         \sqrt{(a - b)^\top S^{-1} (a - b)}, where S is the covariance matrix

Table 5.1: Distance metrics for HAC.
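For reference, each of these metrics is available in scipy.spatial.distance; the two points below are arbitrary and serve only to show the calls:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))      # Euclidean distance
print(distance.sqeuclidean(a, b))    # squared Euclidean distance
print(distance.cityblock(a, b))      # Manhattan distance
print(distance.chebyshev(a, b))      # maximum distance
# Mahalanobis distance requires the inverse covariance matrix of the data;
# the identity matrix is used here purely as a placeholder.
print(distance.mahalanobis(a, b, np.eye(3)))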
    Python practice

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details. The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy:

    1. Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

    2. Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

    3. Average linkage minimizes the average of the distances between all observations of pairs of clusters.

    4. Single linkage minimizes the distance between the closest observations of pairs of clusters.

    AgglomerativeClustering can also scale to large number of samples when it is used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at each step all the possible merges.

     
>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
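The default linkage above is ward; the other criteria in the list can be selected through the linkage parameter. A small comparison sketch follows (the labels obtained will generally differ between criteria):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Fit the same data with each of the four linkage criteria
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    print(linkage, model.labels_)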

    Plotting Dendrogram

    This example plots the corresponding dendrogram of a hierarchical clustering using AgglomerativeClustering and the dendrogram method available in scipy.

    PIC

    Figure 5.5:Agglomerative dendrogram
     
import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


iris = load_iris()
X = iris.data

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()