Part 1 — A gentle introduction

Image credit: Seanbatty

When most people hear the term Artificial Intelligence they think of the Terminator movies or the general notion of machines that can “think” the way a biological brain does. It’s safe to say that we don’t need to worry about robots enslaving humankind for the foreseeable future, but chances are you’re still curious about what exactly AI is used for in real life and how it works. Some of the foundations of AI include:

Search problems — Used in games, among other areas.
Constraint satisfaction — Some examples could be scheduling a university’s offerings for all teachers, courses, rooms, and equipment or designing a factory floor layout.
Logic and reasoning — Planning the sequence of actions that will achieve a goal such as a plan to take cargo shipments to their destinations in an efficient way.
Inference — Using probability to answer questions given the available evidence.

Building on these foundations we can create systems and applications for computer vision, natural language processing, voice user interfaces, and self-driving cars. Many of these domains have something in common: pattern recognition through time. And we need to do this in noisy environments, meaning we don’t have access to the true information we’re after (such as the word someone meant to say) but instead we can only make inferences from the data available to our sensors (such as an audio signal).

For this type of problem we want to identify some basic units that combine in sequences to form larger units. These can be sounds that form words, which in turn form sentences. Or image sequences that combine to depict sign language words and sentences. It is this last application that we’ll use to dive deeper into AI. We’ll see how we can go about building an American Sign Language recognizer. But first, we’ll go through a simpler example.

A widely used technique for recognizing patterns in this kind of noisy, sequential signal is the Hidden Markov Model, or HMM.

There are different ways to represent HMMs so let’s start with a simple one. Let’s say you’re spending several days inside a lab with no windows, working really hard and you’d like to know whether it’s raining. Not being able to look outside, the only evidence you have access to is whether the advisor coming in every day has an umbrella with her. Let’s designate each day as a state, which could be rainy or not. Similarly, we’ll say the umbrella is the evidence, which is the presence or absence of an umbrella.

We’ll use arrows to show when a node in our diagram influences another. For example, whether it rains today tells us something about the likelihood of getting rain tomorrow and rain influences the probability that the advisor will bring an umbrella that day. So for days 1, 2, and 3 we have:

Figure 1. HMM for rainy days

State zero at the far left side will be useful to us for bookkeeping purposes later so don’t worry about it for now. Remember when we said our observations were noisy? The fact that it rained today doesn’t guarantee that it will tomorrow and the fact that it rained doesn’t guarantee that the advisor will bring an umbrella either. She can forget it at home or she could bring it on a day when it turns out it doesn’t rain after all. That’s where probabilities come in.

Let’s now look at the probabilities. We’re interested in the probability of:

Rain on a given day or time $t$,
Rain on the next day, $t+1$, and
Rain given that it rained the previous day.

In the table below, the second row tells us that the probability that it doesn’t rain ($-r$) tomorrow given that it rained ($+r$) today is 0.3, or 30%.

$R_t$	$R_{t+1}$	$P(R_{t+1} | R_t)$
$+r$	$+r$	$0.7$
$+r$	$-r$	$0.3$
$-r$	$+r$	$0.3$
$-r$	$-r$	$0.7$

Figure 2. Probability distribution for Rain given the previous day’s conditions

Similarly, we can have the following probability distribution for the advisor bringing an umbrella given the weather conditions on day $t$ . For example, there’s a 90% probability of seeing the advisor bring an umbrella when it rained and a 10% probability of her not bringing an umbrella on a rainy day.

$R_t$	$U_t$	$P(U_t | R_t)$
$+r$	$+u$	$0.9$
$+r$	$-u$	$0.1$
$-r$	$+u$	$0.2$
$-r$	$-u$	$0.8$

Figure 3. Probability distribution for Umbrella given Rain
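To make the two tables concrete, here is one way they could be written down in Python. This is only a sketch: the names `TRANSITION`, `EMISSION`, and `rows_are_distributions` are made up for illustration, and nested dictionaries are just one convenient representation. The check at the end confirms that, for each value of the conditioning variable, the probabilities add up to 1.

```python
# Transition model from Figure 2: P(R_{t+1} | R_t).
# Outer key: today's weather; inner key: tomorrow's weather.
TRANSITION = {
    "+r": {"+r": 0.7, "-r": 0.3},
    "-r": {"+r": 0.3, "-r": 0.7},
}

# Sensor model from Figure 3: P(U_t | R_t).
# Outer key: today's weather; inner key: whether we see an umbrella.
EMISSION = {
    "+r": {"+u": 0.9, "-u": 0.1},
    "-r": {"+u": 0.2, "-u": 0.8},
}


def rows_are_distributions(table, tol=1e-9):
    """Return True if every conditional row of the table sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())


print(rows_are_distributions(TRANSITION))
print(rows_are_distributions(EMISSION))
```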

Notice how we don’t really know whether it rains or not but we can make intelligent inferences based on the available evidence, namely the presence or absence of an umbrella. Now, from the arrows in Figure 1 we can see that variables are not directly affected by all the other variables, only by the ones at the other end of an incoming arrow. This is important because we’ll think of the probability of an event as something that happens given an event on which it depends. For example, the probability that there’s an umbrella on day 2 given that it rained on day 2, or the probability that it rained on day 2 given that it did not rain on day 1.

As each day goes by and we see if there is an umbrella that day, we alternate between two events: incorporating new evidence into our knowledge and accounting for the passage of time.

Let’s say we observed an umbrella on day 1 and on day 2. What’s the probability that it rained on day 2? Before you keep on reading try to think about what information you would need to make that calculation.

Let’s settle on a nomenclature to use. We’ll call the rain nodes in Figure 1 our state variables $X$ and the umbrella nodes our evidence variables $e$ . They will all be indexed by day or time frame $t$ . Also, we’ll refer to our belief about variable $X$ at time $t$ as $B(X_t)$ after seeing that day’s evidence. Our belief before seeing the day’s evidence will be $B'(X_t)$ . The day following $t$ will of course be $t+1$ . To wrap things up let’s clarify what we mean by a belief: the probability of an event given the available evidence.

This way our belief about variable $X$ at time $t+1$, before seeing that day’s evidence, is the probability of $X$ at time $t+1$ given evidence $1$ through $t$. This is expressed like this:

$B'(X_{t+1})=P(X_{t+1} | e_{1:t})$

Belief before seeing the evidence
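One standard way to compute this "before the evidence" belief, assuming each day's weather depends only on the previous day's weather as the arrows in Figure 1 suggest, is to sum over every possible state of the previous day: $B'(X_{t+1})=\sum_{x_t} P(X_{t+1} | x_t)B(x_t)$. The sketch below shows that step in Python. The function name `predict` and the starting belief of 80% are illustrative assumptions, not values from the example.

```python
# Transition model from Figure 2: P(R_{t+1} | R_t).
TRANSITION = {
    "+r": {"+r": 0.7, "-r": 0.3},
    "-r": {"+r": 0.3, "-r": 0.7},
}


def predict(belief, transition):
    """Account for one day passing.

    Computes B'(X_{t+1}) = sum over x_t of P(X_{t+1} | x_t) * B(x_t).
    """
    return {
        nxt: sum(transition[cur][nxt] * belief[cur] for cur in belief)
        for nxt in belief
    }


# Hypothetical belief: we are 80% sure that it rained today.
today = {"+r": 0.8, "-r": 0.2}
tomorrow = predict(today, TRANSITION)
print(tomorrow)
```

With these numbers, the chance of rain tomorrow comes out at about 0.62: the belief drifts toward 50/50 as time passes without new evidence.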

Feel free to zoom in or out on your browser (command +, command -) to see the formulas comfortably.

💡 We'll be using Bayes’ Theorem for calculating the probability of an event given that another event happened:

1) $P(A|B)={P(B|A)P(A) \over P(B)}$

ℹ️ Where did that come from? It can be derived from the definition of conditional probability, which tells us in how many cases both events $A$ and $B$ happen out of the cases where $B$ happens:

2) $P(A|B)={P(A \cap B) \over P(B)}$

Similarly, $P(B|A)={P(A \cap B) \over P(A)}$

which means

3) $P(A \cap B)=P(B|A)P(A)$

and substituting in the expression for conditional probability yields Bayes' Theorem:

$P(A|B)={P(B|A)P(A) \over P(B)}$
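As a quick numerical check of the theorem, the sketch below computes the probability of rain given that we saw an umbrella, using the values from Figure 3. The 50% prior chance of rain is an assumption made for this illustration, and the helper name `bayes` is made up.

```python
def bayes(likelihood, prior, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Values from Figure 3, plus an assumed 50% prior chance of rain.
p_u_given_r = 0.9      # P(+u | +r)
p_u_given_not_r = 0.2  # P(+u | -r)
p_r = 0.5              # P(+r), assumed

# P(+u): total probability of seeing an umbrella, rain or no rain.
p_u = p_u_given_r * p_r + p_u_given_not_r * (1 - p_r)

p_r_given_u = bayes(p_u_given_r, p_r, p_u)
print(round(p_r_given_u, 3))  # 0.818
```

Seeing the umbrella raises our estimate of rain from 50% to roughly 82%.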

All the operations going forward are based on Bayes’ Theorem to derive the probability of an event given another and go down the path that leads us to the variables we’re interested in. Figuring out this path takes some intuition but we can also just try all the ways to play with it until we get to where we want to go.

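After the evidence for time frame $t+1$ comes in, we want the probability of $X_{t+1}$ given the new evidence $e_{t+1}$ on top of all the previous evidence $e_{1:t}$. In the steps below, the nested bar in $(e_{t+1}|e_{1:t})$ simply marks that the new evidence arrives after the previous evidence is already known, so the left-hand side is the same quantity as $P(X_{t+1}|e_{1:t+1})$.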

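So from the definition of conditional probability (2), taking $A$ to be $X_{t+1}$ and $B$ to be the new evidence given the previous evidence, we now have: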

$P(X_{t+1}|(e_{t+1}|e_{1:t}))={P(X_{t+1},(e_{t+1}|e_{1:t})) \over P(e_{t+1}|e_{1:t})}$

Probability of the variable after the evidence comes in

At this point we can see that nothing in the denominator depends on the $X$ variable, so we can drop it, with the understanding that this is no longer an equality: the term on the left is proportional to the term on the right, which is what the new $\propto$ symbol means. We can recover actual probabilities at the end by rescaling the results so that they sum to 1:

$P(X_{t+1}|(e_{t+1}|e_{1:t})) \propto P(X_{t+1},(e_{t+1}|e_{1:t}))$

$P(X_{t+1}|(e_{t+1}|e_{1:t})) \propto P(X_{t+1},e_{t+1}|e_{1:t})$

And by the product rule (3):

$P(X_{t+1}|(e_{t+1}|e_{1:t})) \propto P(((e_{t+1}|X_{t+1}),X_{t+1})|e_{1:t})$

$P(X_{t+1}|(e_{t+1}|e_{1:t})) \propto P(e_{t+1}|X_{t+1},e_{1:t})P(X_{t+1}|e_{1:t})$

Now take a look at Figure 1 again. All the previous evidence is independent of the new evidence given the state variable. That is, the old evidence doesn’t affect the new one except indirectly through the $X$ variable at $t+1$ . In other words:

e_{1:t} ⫫ e_{t+1}|X_{t+1}

Conditional independence. Note that the independence symbol ⫫ has higher precedence than the conditional symbol |

That means we can eliminate the old evidence from the expression after the conditional bar. That leaves us with our final formula for calculating the probability that it’s raining today given all the evidence now available:

$P(X_{t+1}|e_{1:t+1}) \propto P(e_{t+1}|X_{t+1})P(X_{t+1}|e_{1:t})$
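The formula above can be sketched in a few lines of Python: multiply each state's prior belief by the probability of the observed evidence in that state, then rescale so the results sum to 1 (this rescaling is what turns $\propto$ back into an equality). The function name `update` and the 50/50 starting belief are assumptions for illustration. To show that the evidence can also lower our estimate, this example uses a day on which no umbrella is seen.

```python
# Sensor model from Figure 3: P(U_t | R_t).
EMISSION = {
    "+r": {"+u": 0.9, "-u": 0.1},
    "-r": {"+u": 0.2, "-u": 0.8},
}


def update(predicted, emission, observation):
    """Fold one observation into the belief.

    Computes P(e | X) * P(X | earlier evidence) for each state,
    then normalizes so that the result is a probability distribution.
    """
    unnormalized = {
        state: emission[state][observation] * p
        for state, p in predicted.items()
    }
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# Assumed belief before seeing the evidence: 50% chance of rain.
predicted = {"+r": 0.5, "-r": 0.5}
posterior = update(predicted, EMISSION, "-u")
print(posterior)
```

With no umbrella in sight, the chance of rain drops from 50% to about 11%.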

Finally we’re ready to calculate the probability distribution over the Rain variable at day 2 when all we have is:

The observation that the advisor brought an umbrella on days 1 and 2.
The probability distribution table for Rain given Rain the day before.
The probability distribution table for Umbrella given Rain.

Since at $t=0$ we have no evidence yet, let’s assume that there’s a 50% chance of rain on the first day. From then on, we can use our formula to derive the probabilities as we gather evidence every day.

So, rounding to three decimal places we find that if we saw the advisor bring an umbrella on days 1 and 2, the probability that it’s raining on day 2 is $0.883$ or $88.3\%$ . That implies a probability of $11.7\%$ that it’s not raining on day 2.
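The whole calculation can be reproduced with a short script. This is a minimal sketch, assuming that each day we first account for the passage of time and then incorporate that day's evidence; the function and variable names are made up for illustration. The probabilities are the ones from Figures 2 and 3, and the prior is the 50% assumed above.

```python
# P(R_{t+1} | R_t) from Figure 2 and P(U_t | R_t) from Figure 3.
TRANSITION = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
EMISSION = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}


def predict(belief):
    """Passage of time: sum over yesterday's states."""
    return {
        nxt: sum(TRANSITION[cur][nxt] * belief[cur] for cur in belief)
        for nxt in belief
    }


def update(predicted, observation):
    """New evidence: weight by P(e | X), then normalize."""
    unnormalized = {s: EMISSION[s][observation] * p for s, p in predicted.items()}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def forward_filter(prior, observations):
    """Return the belief after each day's evidence, one entry per day."""
    belief = dict(prior)
    history = []
    for observation in observations:
        belief = update(predict(belief), observation)
        history.append(belief)
    return history


prior = {"+r": 0.5, "-r": 0.5}            # no evidence yet at t = 0
beliefs = forward_filter(prior, ["+u", "+u"])  # umbrella on days 1 and 2

for day, belief in enumerate(beliefs, start=1):
    print(f"day {day}: P(rain) = {belief['+r']:.3f}")
# day 1: P(rain) = 0.818
# day 2: P(rain) = 0.883
```

Day 1 works out to $0.45 / (0.45 + 0.10) \approx 0.818$. Letting a day pass gives $0.7 \times 0.818 + 0.3 \times 0.182 \approx 0.627$, and the second umbrella brings the belief up to about $0.883$.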

That’s it! This wraps up our introduction to applications of Artificial Intelligence. Next, we’ll apply probabilistic inference with a slightly more complex model to recognize American Sign Language from video frame data.

How does AI work? Part 1