MNIST + Softmax Regression + Multilayered Convolution Layer + TensorFlow
So I'm attempting to learn some machine learning algorithms + data science, because, well, they are the future. I am a professional amateur, as in I'm a professional at being an amateur at a whole lot of things, but soon...very soon I will master this! I'm alternating between this and a mobile app I'm building called Weeks.
I'm also simultaneously taking a few courses... Harvard's CS109, ColumbiaX: DS101X Statistical Thinking for Data Science and Analytics, and BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark. My approach to learning is tabula rasa, aka blank slate learning, so as not to bring in any biases. Also, my memory sucks and Google is pretty much responsible for my degree anyways (<---Pretty sure every engineer agrees, but few will admit it). So tag along for the ride. It's gonna be rough, but it sure will be pretty...thanks to matplotlib. weeehehehehehehehe
So I'm starting with the Modified National Institute of Standards and Technology (MNIST) database found here. "They" say it's a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:
It also includes labels for each image, telling us which digit it is. For example, the labels for the above images are 5, 0, 4, and 1.
Andddddd Google recently (Nov 2015) open sourced TensorFlow. Think of it as the second-generation, newly improved version of DistBelief, the system that Google Search, Google's speech recognition systems (the "OK, Google" one), Google Photos, Google Maps and StreetView, Google Translate, and YouTube were built on.
TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well....The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
The part that gets me excited is the support for GPUs in the desktop, because this is where everybody will be doing their processing. Well, soon I guess..."GPUs have far more processor cores than CPUs, but because each GPU core runs significantly slower than a CPU core and do not have the features needed for modern operating systems, they are not appropriate for performing most of the processing in everyday computing. They are most suited to compute-intensive operations such as video processing and physics simulations."
Anyways, TensorFlow is cool and that's what I'm going to learn on.
So I'm starting with the tutorial here.
MNIST For ML Beginners
(sounds a lot like me. woohoo)
First I had to install TensorFlow, and I hit quite a few hurdles doing it. I wrecked my conda env, because something broke during the first 'pip install tensorflow' and setuptools got destroyed, killing everything in the env.... sooo dumb. I made a new env and poof, it worked. -____- yay. (Setting up a programming environment is by far my least favorite thing to do. LEAST!)
So I read through the tutorial and copy and paste the code like the good ctrl+c, ctrl+v ninja I am. It all works and I have no idea what I just did. So now that I've gotten it to work, it's time to understand the intricacies. The first hurdle I'm going to attack in this 100 m dash towards understanding MNIST is softmax regression.
If we were playing a game of guessing what softmax regression is, I would say the maximum fluffy (soft, get it? ha ha) regression. I have no idea... (but after I read this, and this, this is what I've got)
So we have to start with the basics... basically machine learning breaks down into:
- Get data
- Train a model on data
- Use trained model to make predictions on new data
- Repeat and tweak till it's good and ready
So it's been shown that the human brain stacks data, in a sense: the first hierarchy of neurons that receives information in the visual cortex is sensitive to specific edges and blobs, while brain regions further down the visual pipeline are sensitive to more complex structures such as faces. This is what we base our models on.
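To make those four steps a bit more concrete, here is a tiny sketch in plain Python. Everything in it is made up for illustration: the toy 1-D data, the function names, and the "model" (it just remembers the average of each class, which is not what the TensorFlow tutorial does). It is only meant to show the shape of the loop.

```python
# Toy stand-in for the four steps; NOT the tutorial's TensorFlow code.

def get_data():
    # Step 1: get data. Hypothetical 1-D points, each with a label of 0 or 1.
    train = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]
    new = [(1.2, 0), (8.8, 1)]
    return train, new

def train_model(train):
    # Step 2: "train" by storing the mean value of each class.
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    # Step 3: predict the class whose mean is closest to x.
    return min(model, key=lambda label: abs(model[label] - x))

train, new = get_data()
model = train_model(train)
predictions = [predict(model, x) for x, _ in new]

# Step 4: measure how good it is, then tweak and repeat if it is not.
accuracy = sum(p == y for p, (_, y) in zip(predictions, new)) / len(new)
print(predictions, accuracy)
```

With real data like MNIST the model and the training step get a lot fancier, but the get data, train, predict, check loop stays the same.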
So in hierarchical feature learning (the art of extracting useful patterns from data), we:
- extract multiple layers of non-linear features
- pass them to a classifier that combines all the features to make predictions.
We stack these non-linear elements into deep hierarchies of features because we cannot learn complex features from just a few layers. The first layer breaks the images down into blobs or edges. Why? Because, mathematically, they contain the most information to extract. Then we transform those first features (edges and blobs) to get more complex features that contain more information to distinguish between classes.
Now enter the problem of vanishing gradients. This is where the gradients become too small to provide a learning signal for very deep layers, making these architectures perform poorly when compared to shallow learning algorithms (such as support vector machines).
Deep learning was born to overcome vanishing gradients so that we can train architectures with dozens of layers of non-linear hierarchical features. In the early 2010s, it was shown that combining GPUs with activation functions that offered better gradient flow was sufficient to train deep architectures without major difficulties. From there the interest in deep learning grew steadily. Deep learning is also associated with detecting very long non-linear time dependencies in sequential data.
Then, to take it a step further, we mimic the brain's network of neurons. An artificial neural network:
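Here is a rough numeric sketch of why gradients vanish. It makes some big simplifying assumptions that are mine, not the tutorial's: every weight is 1, every layer sees the same input, and the gradient reaching the first layer is just the product of each layer's activation derivative. The sigmoid derivative is never bigger than 0.25, so the product shrinks fast. ReLU is used here as one example of an activation with better gradient flow; its derivative is 1 for positive inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its largest value is 0.25, at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def gradient_scale(grad_fn, depth, z):
    # Simplified chain rule: multiply the same derivative once per layer
    # (assumes all weights are 1 and every layer sees the same z).
    return grad_fn(z) ** depth

for depth in (1, 5, 10, 20):
    print(depth,
          gradient_scale(sigmoid_grad, depth, 0.0),
          gradient_scale(relu_grad, depth, 1.0))
```

Even in the best case for the sigmoid (z = 0), ten layers scale the signal by 0.25 to the 10th power, which is less than one in a million, while the ReLU column stays at 1.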
- takes some input data,
- transforms this input data by calculating a weighted sum over the inputs,
- applies a non-linear function to this transformation to calculate an intermediate state.
The three steps above constitute what is known as a layer, and the transformative function is often referred to as a unit. The intermediate states—often termed features—are used as the input into another layer.
Through repetition of these steps, the artificial neural network learns multiple layers of non-linear features, which it then combines in a final layer to create a prediction.
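The three steps of a layer can be sketched in a few lines of plain Python. This is my own minimal illustration, not TensorFlow code: the function name `layer`, the made-up weights, the added bias term, and the choice of tanh as the non-linear function are all assumptions. Nothing is being learned here; it only shows how data flows through stacked layers.

```python
import math

def layer(inputs, weights, biases):
    # One layer: each unit takes a weighted sum of the inputs (plus a bias,
    # which is an addition to the three steps above), then applies a
    # non-linear function (tanh here) to get an intermediate state.
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(math.tanh(weighted_sum))
    return outputs

# Hypothetical input with 3 values, e.g. 3 pixel intensities.
x = [0.5, -1.0, 2.0]

# First layer: 2 units, each with 3 weights (made-up numbers).
hidden = layer(x, [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]], [0.0, 0.1])

# Second layer: the first layer's features are its input.
output = layer(hidden, [[1.0, -1.0]], [0.0])

print(hidden)
print(output)
```

Training is the part that picks good values for the weights; here they are just hard-coded.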
So softmax is part of deep neural networks: it is actually the softmax layer in the neural network. Wow, that's a lot of information just to realize that. The next post will explain what the softmax layer is actually doing.