RNNs (recurrent neural networks) handle sequential data, where the order of elements matters. For example, RNNs deal with time-series data, language (sequences of words), and so forth.

Explanations of RNNs usually include many boxes and arrows, which can be confusing. Diagrams proliferate because RNNs come in many different shapes and forms. So, the purpose of this article is to describe the core concepts of RNNs and build up the various diagrams step by step for better understanding.
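Before any diagrams, the core idea fits in a few lines of code: an RNN applies the same weights at every time step, folding each new input into a hidden state. The function name, shapes, and random data below are illustrative, not from any particular library:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                        # initial hidden state
sequence = rng.standard_normal((5, input_size))  # a toy sequence of 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)            # the SAME weights reused each step
```

The weight reuse across time steps is exactly what the repeated boxes in RNN diagrams depict.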

We discuss the following topics:

- Sequential Data
- Simple RNN
- Deeper RNN
- Bidirectional RNN
- RNN Encoder-Decoder

Before discussing sequential…

This article reviews a paper titled: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling by Shaojie Bai, J. Zico Kolter, and Vladlen Koltun.

Before TCNs, we usually reached for RNNs such as LSTMs and GRUs when facing a new sequence modeling task. However, the paper shows that TCNs (Temporal Convolutional Networks) can handle sequence modeling tasks efficiently and even outperform recurrent models on many benchmarks. The authors also demonstrate that TCNs maintain longer effective memory than LSTMs.

We discuss the architecture of TCNs with the following topics:

- Sequence Modeling
- Causal Convolutions
- Dilated Convolutions
- Residual Connections
- Advantages and Disadvantages
- Performance Comparisons
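The building block behind the first two topics, a causal (optionally dilated) convolution, can be sketched in plain NumPy. The helper name and toy kernel below are illustrative, not from the paper; the key property is that output at time t never sees inputs later than t:

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])  # left-pad so no future leaks in
    return np.array([
        sum(kernel[i] * x_padded[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
# With kernel [1, 1] and dilation 2, output[t] = x[t] + x[t-2]
y = causal_dilated_conv1d(x, kernel=np.array([1.0, 1.0]), dilation=2)
print(y)  # [ 0.  1.  2.  4.  6.  8. 10. 12.]
```

Stacking such layers with growing dilation (1, 2, 4, ...) is what gives a TCN its large receptive field.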

Although the…

In this article, we discuss the following:

- Eigen Decomposition
- Singular Value Decomposition
- Pseudo-inverse Matrix

These three subjects are related to each other.

Once we know how Eigen Decomposition works, we can understand how Singular Value Decomposition works. Once we know SVD, we can understand the Pseudo-inverse Matrix.
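That chain of dependencies can be previewed numerically with NumPy (the matrices below are arbitrary examples):

```python
import numpy as np

# Eigen decomposition works on a square matrix: A = V diag(w) V^{-1}
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
w, V = np.linalg.eig(A)
assert np.allclose(V @ np.diag(w) @ np.linalg.inv(V), A)

# SVD works on ANY matrix, even non-square: B = U diag(s) Vt
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(B, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, B)

# The pseudo-inverse is built directly from the SVD: B+ = V diag(1/s) U^T
B_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(B_pinv, np.linalg.pinv(B))
```

Each step in the article's list corresponds to one of these decompositions building on the previous one.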

We discuss the following topics in order:

- Square Matrix
- Eigenvalue and Eigenvector
- Symmetric Matrix
- Eigen Decomposition
- Orthogonal Matrix
- Singular Value Decomposition
- Pseudo-inverse Matrix

Eigen Decomposition works only with square matrices.

As a quick reminder, let’s have a look at what a square matrix is.

In square matrices, the number of rows and the…

In machine learning, we aim to minimize the difference between predicted values and label values. In other words, we want to minimize loss functions.

If a loss function is a simple parabola with one parameter, it has a single minimum, and we can even solve for it with pen and paper.
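For instance, with a hypothetical one-parameter loss (the numbers are made up for illustration):

$$
L(w) = (w - 3)^2, \qquad \frac{dL}{dw} = 2(w - 3) = 0 \;\Rightarrow\; w = 3.
$$

Setting the derivative to zero hands us the minimum directly; real loss surfaces with millions of parameters have no such closed-form solution, which is where gradient-based methods come in.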

We often attribute the success of deep learning algorithms to the increase in computing power. The fact that we can calculate the gradients of deep neural networks so fast made it a lot more practical to train our models using the backpropagation algorithm.

In supervised learning, we identify how each weight in a network contributes to the final loss by using chains of gradients. Once we calculate the partial derivative of the final loss with respect to a weight, we can adjust that weight to reduce its contribution to the loss value. …
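The chain-of-gradients idea can be sketched with a hypothetical one-neuron "network" (all numbers and names below are made up for illustration):

```python
import numpy as np

# Model: y_hat = w2 * tanh(w1 * x), loss = (y_hat - y)^2
x, y = 2.0, 1.0
w1, w2, lr = 0.5, 0.5, 0.1

for _ in range(100):
    # forward pass
    h = np.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # backward pass: chain of gradients from the loss back to each weight
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h                          # dL/dw2 = dL/dy_hat * dy_hat/dw2
    dloss_dw1 = dloss_dyhat * w2 * (1.0 - h ** 2) * x    # chain rule through tanh

    # adjust each weight to reduce its contribution to the loss
    w2 -= lr * dloss_dw2
    w1 -= lr * dloss_dw1
```

Backpropagation in a deep network is this same bookkeeping, applied layer by layer.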

“Artificial Intelligence” is a catch-all term for anything related to …wait for it… “Artificial Intelligence”.

Joking aside, John McCarthy intentionally kept the term broad when proposing a summer research project to discuss so-called “thinking machines”.

In 1955, various research directions existed for controlling machines — for example, cybernetics and automata theory. However, each aimed at a specific approach to machine behavior and did not directly address machine intelligence.

After all, what we now call “Artificial Intelligence” was still a brand new area of research. John McCarthy felt that researchers needed to collaborate and solidify the orientation of the…

Previously, high school student Ken and his math teacher Dr. Demystifier (Dr. D) discussed Bayes' theorem. This time, Lily — a friend of both Ken and Dr. D — challenges them with the Monty Hall problem.

They will discuss the following topics:

- Monty Hall Problem
- Bayesian Solution
- Subjective Prior Belief
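Before the story begins, the problem's famously counterintuitive answer can be checked by simulation (a quick sketch; door numbering and helper names are arbitrary):

```python
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a goat door that is neither the pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
stay_wins = sum(play(False, rng) for _ in range(n)) / n
switch_wins = sum(play(True, rng) for _ in range(n)) / n
print(round(stay_wins, 2), round(switch_wins, 2))  # ≈ 0.33 and ≈ 0.67
```

Switching wins about two-thirds of the time — the result the Bayesian analysis in the story will explain.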

Ken went to a cafe near his high school to buy lunch. He ordered a tall cappuccino and a tuna sandwich and stood by the coffee machine, hearing the sound of steam coming out as if from a bull’s nostrils.

Lily — a barista at the cafe —…

This is a fictional story of a high school student, Ken, and his math teacher, Dr. Demystifier (Dr. D). Ken has just learned Bayes' theorem, but he is completely mystified by the formula, not knowing how to make use of it.

They will discuss the following topics:

- Bayes Theorem Derivation
- Belief Update Once
- Belief Update Twice
- Belief Update Forever
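The "update once, twice, forever" pattern can be previewed with a hypothetical coin example (the probabilities below are made up for illustration):

```python
# Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E),
# where P(E) = P(E|H) P(H) + P(E|not H) P(not H).
# H = "the coin is biased to land heads 80% of the time"; the alternative is fair.
p_heads_given_h = 0.8
p_heads_given_not_h = 0.5

def update(p_h, saw_heads):
    like_h = p_heads_given_h if saw_heads else 1 - p_heads_given_h
    like_not = p_heads_given_not_h if saw_heads else 1 - p_heads_given_not_h
    evidence = like_h * p_h + like_not * (1 - p_h)
    return like_h * p_h / evidence  # the posterior becomes the next prior

p_h = 0.5                           # start undecided
for flip in [True, True, True]:     # observe three heads in a row
    p_h = update(p_h, flip)
print(round(p_h, 3))                # belief in the biased coin has grown past 0.8
```

Each observation feeds the previous posterior back in as the new prior — the "forever" part of the list above.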

Ken asked Dr. D, “May I ask a question about Bayes' theorem?”

Dr. D nodded and said, “Please do.”

Ken continued, “Bayes' theorem says:
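(The excerpt is cut off here; the standard statement Ken is reciting is:)

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$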

Have you ever wondered why we often use the normal distribution?

How do we derive it anyway?

Why do many probability distributions have the exponential term?

Are they related to each other?

If any of the above questions make you wonder, you are in the right place.

I will demystify it for you.

Suppose we want to predict whether the weather at some location will be fine or not.

We use the calculus of variations to optimize **functionals**.

You read that right: functionals, not functions.

But what are functionals? What does a functional really look like?

Moreover, there is this thing called the **Euler-Lagrange equation**.

What is it? How is it useful?

How do we derive such an equation?

If you have any of the above questions, you are in the right place.

I’ll demystify it for you.

Suppose we want to find the shortest path from point A to point B.
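As a preview of where this example leads (a standard result, stated here for orientation): the length of a path $y(x)$ is a functional,

$$
J[y] = \int_{x_A}^{x_B} \sqrt{1 + y'(x)^2}\, dx,
$$

and the Euler-Lagrange equation

$$
\frac{\partial F}{\partial y} - \frac{d}{dx}\frac{\partial F}{\partial y'} = 0
$$

applied to $F = \sqrt{1 + y'^2}$ gives $\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1 + y'^2}}\right) = 0$, so $y'$ is constant — the shortest path is a straight line, as expected.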