# MNIST Dataset

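### Similar to `digits` in `sklearn`

If you're familiar with `sklearn`, you may know its `digits` dataset, a smaller collection of 8×8 handwritten-digit images:

```
from sklearn.datasets import load_digits
```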

## Problem Definition

*Recognize handwritten digits*

## Data

The MNIST database (link) is a collection of images of handwritten digits.

The training set has 60,000 samples. The test set has 10,000 samples.

The digits are size-normalized and centered in a fixed-size image.

The data page describes how the data was collected. It also reports benchmark results for various algorithms on the test set.

### Load the data

The data is available in the repo's `data` folder. Let's load it using the `keras` library and see how it looks.

```
import numpy as np
import keras
from keras.datasets import mnist
```

```
# Load the datasets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```
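Called without a path, `mnist.load_data()` reads from the `keras` cache directory (downloading the archive if it is missing) rather than from the repo's `data` folder. If you would rather read a local copy directly, here is a minimal sketch that uses only `numpy`. The helper name `load_mnist_npz` and the location `data/mnist.npz` are assumptions for illustration; the array names `x_train`, `y_train`, `x_test`, and `y_test` are the ones used inside the archive that `keras` distributes.

```python
import numpy as np

def load_mnist_npz(path):
    """Read MNIST arrays from a local .npz archive (hypothetical helper)."""
    with np.load(path) as archive:
        train = (archive["x_train"], archive["y_train"])
        test = (archive["x_test"], archive["y_test"])
    return train, test

# Assumed location; adjust to wherever the archive sits in your checkout.
# (X_train, y_train), (X_test, y_test) = load_mnist_npz("data/mnist.npz")
```

The return value mirrors the nested-tuple layout of `mnist.load_data()`, so the rest of the notebook works unchanged with either loader.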

### Basic data analysis on the dataset

```
# What is the type of X_train?
```

```
# What is the type of y_train?
```

```
# Find number of observations in training data
```

```
# Find number of observations in test data
```
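One possible way to approach the four questions above, shown on small stand-in arrays so the snippet runs without downloading anything. The stand-ins are an assumption that mimics the layout `mnist.load_data()` returns (`numpy` arrays of `uint8`); swap in the real `X_train`, `y_train`, and `X_test` when working through the exercise.

```python
import numpy as np

# Stand-in arrays (assumption): same layout as MNIST, far fewer samples.
X_train_demo = np.zeros((6, 28, 28), dtype=np.uint8)
y_train_demo = np.array([5, 0, 4, 1, 9, 2], dtype=np.uint8)
X_test_demo = np.zeros((2, 28, 28), dtype=np.uint8)

print(type(X_train_demo))   # container type of the images
print(type(y_train_demo))   # container type of the labels
print(X_train_demo.dtype)   # element type stored in the array

n_train = len(X_train_demo)     # len() counts along the first axis
n_test = X_test_demo.shape[0]   # equivalent: first entry of .shape
print(n_train, n_test)
```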

```
# Display first 2 records of X_train
```

```
# Display the first 10 records of y_train
```

```
# Find the number of observations for each digit in the y_train dataset
```

```
# Find the number of observations for each digit in the y_test dataset
```
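For the per-digit counts, `numpy` offers at least two routes. The sketch below uses a short stand-in label array (an assumption, not real MNIST labels); replace `labels_demo` with `y_train` or `y_test`.

```python
import numpy as np

# Stand-in labels (assumption); replace with y_train or y_test.
labels_demo = np.array([3, 1, 3, 0, 1, 3, 9], dtype=np.uint8)

# np.unique returns the distinct labels and how often each occurs.
digits, counts = np.unique(labels_demo, return_counts=True)
per_digit = dict(zip(digits.tolist(), counts.tolist()))
print(per_digit)

# np.bincount gives a dense count for every digit 0-9, including zeros.
dense_counts = np.bincount(labels_demo, minlength=10)
print(dense_counts)
```

`np.unique` only lists digits that actually occur, while `np.bincount` with `minlength=10` always has one slot per digit, which makes it easier to compare the training and test distributions side by side.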

```
# What is the dimension of X_train? What does that mean?
```
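For the dimension question, it can help to unpack the shape into named parts. This is a sketch on a stand-in array (an assumption) with the 28 × 28 image size that MNIST uses; the first axis indexes samples, and the other two index pixel rows and columns.

```python
import numpy as np

# Stand-in array (assumption) with MNIST's per-image size of 28 x 28.
images_demo = np.zeros((4, 28, 28), dtype=np.uint8)

n_samples, height, width = images_demo.shape
print(images_demo.ndim)            # number of axes
print(n_samples, height, width)    # samples, pixel rows, pixel columns

# Flatten each image into one row of pixels, a common input shape for models.
flat = images_demo.reshape(n_samples, -1)
print(flat.shape)
```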

### Display Images

Let's now display some of the images and see how they look.

We will be using the `matplotlib` library for displaying the images.

```
from matplotlib import pyplot
import matplotlib as mpl
%matplotlib inline
```

```
# Display the first training record
```

```
fig = pyplot.figure()
ax = fig.add_subplot(1,1,1)
imgplot = ax.imshow(X_train[0], cmap=mpl.cm.Greys)
imgplot.set_interpolation('nearest')
ax.xaxis.set_ticks_position('top')
ax.yaxis.set_ticks_position('left')
pyplot.show()
```

```
# Let's now display the 11th record
```
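A hedged sketch for this exercise: the 11th record sits at index 10, because Python indexing starts at 0. The random stand-in images below are an assumption so the snippet runs on its own; replace `images_demo` with `X_train` to see a real digit.

```python
import numpy as np
from matplotlib import pyplot

# Stand-in images (assumption); replace with X_train.
rng = np.random.default_rng(0)
images_demo = rng.integers(0, 256, size=(12, 28, 28), dtype=np.uint8)

record_number = 11            # 1-based position asked for
index = record_number - 1     # Python indexing is 0-based

fig = pyplot.figure()
ax = fig.add_subplot(1, 1, 1)
imgplot = ax.imshow(images_demo[index], cmap='Greys', interpolation='nearest')
ax.xaxis.set_ticks_position('top')
ax.yaxis.set_ticks_position('left')
pyplot.show()
```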