Understanding RNN, Deeper RNN, Bidirectional RNN, and RNN Encoder-Decoder (Sequence-to-sequence, aka seq2seq)

RNNs (recurrent neural networks) handle sequential data, where the order of the elements matters. For example, RNNs deal with time-series data, language (sequences of words), and so forth.

Explanations of RNNs usually include a lot of boxes and arrows, which may be confusing. There are so many diagrams because RNNs come in many different shapes and forms. So, the purpose of this article is to describe the core concepts of RNNs and show the various diagrams step by step for better understanding.

We discuss the following topics:

  • Sequential Data
  • Simple RNN
  • Deeper RNN
  • Bidirectional RNN
  • RNN Encoder-Decoder

Sequential Data

Before discussing sequential…


Can CNNs handle sequential data and maintain a more effective history than LSTM?

This article reviews the paper “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling” by Shaojie Bai, J. Zico Kolter, and Vladlen Koltun.

Before TCNs, we often reached for RNNs such as LSTMs and GRUs whenever we faced a new sequence modeling task. However, the paper shows that TCNs (Temporal Convolutional Networks) can handle sequence modeling tasks efficiently and can even outperform canonical recurrent models. The authors also demonstrated that TCNs maintain a longer effective memory than LSTMs.

We discuss the architecture of TCNs with the following topics:

  • Sequence Modeling
  • Causal Convolutions
  • Dilated Convolutions
  • Residual Connections
  • Advantages and Disadvantages
  • Performance Comparisons
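As a rough illustration of the first two architectural ideas in the list, here is a minimal pure-Python sketch of a causal, dilated 1-D convolution. The function names, the kernel values, and the dilation schedule (1, 2, 4) are assumptions made for illustration; the architecture in the paper adds residual connections and other details not shown here.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_i kernel[i] * x[t - i * dilation].

    Inputs before the start of the sequence are treated as zero, so y[t]
    never depends on any x[s] with s > t (that is what "causal" means here).
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation
            if j >= 0:
                total += w * x[j]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack with one convolution per layer (an assumption)."""
    return 1 + (kernel_size - 1) * sum(dilations)


def stack(x, dilations=(1, 2, 4), kernel=(1.0, 1.0)):
    """Apply one causal dilated convolution per dilation value, in order."""
    h = x
    for d in dilations:
        h = causal_dilated_conv1d(h, kernel, dilation=d)
    return h


# Feed an impulse at t = 0 and see how far forward in time it is still "remembered".
impulse = [1.0] + [0.0] * 15
response = stack(impulse)
print(receptive_field(2, [1, 2, 4]), response)
```

With this toy setup, doubling the dilation at each layer makes the history visible to an output grow exponentially with depth, while the number of weights grows only linearly.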

Sequence Modeling

Although the…


Eigen Decomposition, SVD, and Pseudo-inverse Matrix

In this article, we discuss the following:

  • Eigen Decomposition
  • Singular Value Decomposition
  • Pseudo-inverse Matrix

These three subjects are related to each other.

Once we know how Eigen Decomposition works, we can understand how Singular Value Decomposition works. Once we know SVD, we can understand the Pseudo-inverse Matrix.
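For orientation, the chain of ideas can be summarized in standard textbook notation. The symbols below (Q, Λ, U, Σ, V) are conventional choices and are assumptions here, not necessarily the notation used in the article.

```latex
% Eigen decomposition of a diagonalizable square matrix A
% (for a real symmetric A, Q can be chosen orthogonal, so Q^{-1} = Q^{\top}):
A = Q \Lambda Q^{-1}

% Singular value decomposition of any real m x n matrix A,
% with U and V orthogonal and \Sigma diagonal with non-negative entries:
A = U \Sigma V^{\top}

% Pseudo-inverse built from the SVD: \Sigma^{+} is \Sigma transposed,
% with every non-zero singular value replaced by its reciprocal:
A^{+} = V \Sigma^{+} U^{\top}
```

The SVD is linked to the eigen decomposition because the columns of V are eigenvectors of the symmetric matrix A^T A, and the squared singular values are its eigenvalues.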

We discuss the following topics in order:

  • Square Matrix
  • Eigenvalue and Eigenvector
  • Symmetric Matrix
  • Eigen Decomposition
  • Orthogonal Matrix
  • Singular Value Decomposition
  • Pseudo-inverse Matrix

Square Matrix

Eigen Decomposition works only with square matrices.

As a quick reminder, let’s have a look at what a square matrix is.

In square matrices, the number of rows and the…


Understanding SGD, Momentum, Nesterov Momentum, AdaGrad, RMSprop, AdaDelta, and ADAM

In machine learning, we aim to minimize the difference between predicted and label values. In other words, we want to minimize loss functions.

If a loss function is a simple parabola with one parameter, it has a single minimum, and we can even find it with pen and paper.
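To make the parabola case concrete, here is a minimal sketch that compares the pen-and-paper answer with plain gradient descent. The specific loss (w - 3)^2 + 1, the learning rate, and the starting point are made-up values chosen only for illustration.

```python
def loss(w):
    """Hypothetical one-parameter loss: a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0


def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


# Pen-and-paper solution: set dL/dw = 2(w - 3) = 0, which gives w = 3.
w_closed_form = 3.0


def gradient_descent(w0, lr=0.1, steps=200):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_numeric = gradient_descent(w0=-5.0)
print(w_closed_form, w_numeric, loss(w_numeric))
```

For a loss this simple the iterative method is unnecessary, but the same update rule carries over to losses with many parameters, where no closed-form solution is available.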


How Activation Functions Have Evolved

We often attribute the success of deep learning algorithms to the increase in computing power. The fact that we can calculate the gradients of deep neural networks so fast made it a lot more practical to train our models using the backpropagation algorithm.

In supervised learning, we identify how each weight in a network contributes to the final loss by using chains of gradients. Once we calculate the partial derivative of the final loss with respect to each weight, we can adjust each weight to reduce its contribution to the loss value. …
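The "chains of gradients" idea can be sketched on a toy network with one weight per layer. Everything here (the tanh activation, the squared-error loss, the numeric values, and the function names) is an assumption chosen for illustration, not the article's own example.

```python
import math


def forward(w1, w2, x):
    """Tiny two-layer network: y = w2 * tanh(w1 * x)."""
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # network output
    return h, y


def loss_fn(w1, w2, x, target):
    """Squared error between the network output and the target."""
    _, y = forward(w1, w2, x)
    return (y - target) ** 2


def gradients(w1, w2, x, target):
    """Partial derivatives of the loss per weight, built link by link with the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - target)   # loss w.r.t. output
    dL_dw2 = dL_dy * h           # dy/dw2 = h
    dL_dh = dL_dy * w2           # dy/dh = w2
    dh_dz = 1.0 - h ** 2         # derivative of tanh at z = w1 * x
    dL_dw1 = dL_dh * dh_dz * x   # dz/dw1 = x
    return dL_dw1, dL_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = gradients(w1, w2, x, target)

# One gradient step: move each weight against its own partial derivative.
lr = 0.1
new_w1, new_w2 = w1 - lr * g1, w2 - lr * g2
print(loss_fn(w1, w2, x, target), loss_fn(new_w1, new_w2, x, target))
```

Note that the gradient for the first weight passes through the derivative of the activation function; that factor is where the choice of activation function influences training.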


What happened in the first AI boom? How did it end, and why?

“Artificial Intelligence” is a catch-all term for anything related to …wait for it… “Artificial Intelligence”.

Joking aside, John McCarthy intentionally kept the term broad when proposing a summer research project to discuss so-called “thinking machines”.

In 1955, various research directions existed for controlling machines. For example, there were cybernetics and automata theory. However, each pursued a specific approach to machine behavior, and neither directly addressed machine intelligence.

After all, what we now call “Artificial Intelligence” was still a brand new area of research. John McCarthy felt that researchers needed to collaborate and solidify the orientation of the…


The famous Monty Hall Problem requires a non-intuitive solution — revisited in a story of three math fans

Previously, Ken, a high school student, and his math teacher, Dr. Demystifier (Dr. D), discussed the Bayes theorem. This time, Lily, a friend of both Ken and Dr. D, challenges them with the Monty Hall problem.

They will discuss the following topics:

  • Monty Hall Problem
  • Bayesian Solution
  • Subjective Prior Belief

Monty Hall Problem

Ken went to a cafe near his high school to buy lunch. He ordered a tall cappuccino and a tuna sandwich and stood by the coffee machine, hearing the sound of steam coming out as if from a bull’s nostrils.

Lily — a barista at the cafe —…


Do we really need the Bayes theorem? Can we do everything with the conditionals and marginals?

This is a fictional story about Ken, a high school student, and his math teacher, Dr. Demystifier (Dr. D). Ken has just learned the Bayes theorem, but he is completely mystified by the formula and does not know how to make use of it.

They will discuss the following topics:

  • Bayes Theorem Derivation
  • Belief Update Once
  • Belief Update Twice
  • Belief Update Forever

Bayes Theorem Derivation

Ken asked Dr. D, “May I ask a question about the Bayes theorem?”

Dr. D nodded and said, “Please do”.

Ken continued, “The Bayes theorem says:


Understanding the Maximum Entropy Principle

Have you ever wondered why we often use the normal distribution?

How do we derive it anyway?

Why do many probability distributions have the exponential term?

Are they related to each other?

If any of the above questions make you wonder, you are in the right place.

I will demystify it for you.

Fine or Not Fine

Suppose we want to predict whether the weather in some place is fine or not.


How to derive the Euler-Lagrange equation

We use the calculus of variations to optimize functionals.

You read it right: functionals, not functions.

But what are functionals? What does a functional really look like?

Moreover, there is this thing called the Euler-Lagrange equation.

What is it? How is it useful?

How do we derive such an equation?

If you have any of the above questions, you are in the right place.

I’ll demystify it for you.

The shortest path problem

Suppose we want to find the shortest path from point A to point B.

Naoki
