Softmax Regression (Maths + Neural Networks)
Disclaimer: This is just how I understand softmax regression. You really shouldn't use this as a math textbook, but rather as a way to understand how I approach it. Also, no equations will be presented; those are linked here and here and at the end of the article, for when you hopefully become more curious about it all.
So to understand softmax regression, we should actually take a step back and understand logistic regression. Why? Because softmax regression is just logistic regression generalized from the binary case to multiple classes, i.e. [0, 1] => K classes.
Essentially, it's easier to scale up our understanding from a binary class to multiple classes. This diverges from linear regression in the sense that, for a binary [0, 1] case, it wouldn't make sense to try and fit the data to a line.
You can see from the images how a logistic/sigmoid function would be better at modeling a binary signal. It's an S-shaped function that "squashes" any value into the range [0, 1]. Kinda cool, right? A normalization of sorts. We could get into the maths behind all this, but you can read another website for that. Simply put (maybe too simply), you are searching for parameter values where the probability assigned to the correct label, 1 or 0, is large. Then you use a cost function to measure how well it did. The gradient of the cost function points in the direction of greatest increase, so it's easy to see how an optimization algorithm can take advantage of this: repeatedly take a small step in the opposite direction to minimize the cost.
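To make that concrete, here is a minimal sketch of the sigmoid squash and one gradient-descent step for logistic regression. The toy data, learning rate, and zero-initialized weights are all made-up illustrative choices, not anything from a real dataset:

```python
import numpy as np

def sigmoid(z):
    # S-shaped squash: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary labels (illustrative values)
X = np.array([[0.5, 1.2], [1.5, 0.3], [-0.7, -1.1], [-1.2, 0.4]])
y = np.array([1, 1, 0, 0])
w = np.zeros(2)

# One gradient-descent step on the logistic (cross-entropy) cost.
# The gradient points uphill, so we step in the opposite direction.
lr = 0.1
preds = sigmoid(X @ w)
grad = X.T @ (preds - y) / len(y)
w -= lr * grad
```

After this single step the weights have already moved toward the positive features that co-occur with the label 1.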
Now to extend it to softmax: instead of [0, 1], we have [1, 2, ..., K]. I imagine it like a sphere (might be a poor visual representation, but it works for me), where you are "squashing" the values into panels on the sphere. Then you search for the parameter values where the probability of landing on the correct panel is large, and optimize by following the gradient, the same as in logistic regression.
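A minimal sketch of the softmax squash itself, with made-up scores for three classes (the max-shift is a standard numerical-stability trick; it doesn't change the output):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability, then exponentiate
    # and normalize so the outputs form a probability distribution.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # illustrative class scores
probs = softmax(scores)
```

The higher the score, the larger the slice of probability that class gets, and the slices always sum to one.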
An unusual point about softmax regression is that it has "redundant" parameters. In a nutshell, subtracting a fixed vector from the parameter vectors won't affect the hypothesis's predictions at all. In maths terms, it's overparameterized. In layman's terms, for any solution we get to fit the data, there are a bunch of parameter settings that will give us that same solution. Lots of combinations lead to the same answer, and the predicted probabilities will always add up to one.
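You can check this redundancy in a couple of lines. The scores and the constant below are made-up values; the point is only that shifting every score by the same amount leaves the softmax output untouched:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([3.0, 1.0, -2.0])  # illustrative class scores
c = 5.0                          # any fixed constant

# Subtracting the same constant from every score cancels out in the
# normalization, so the probabilities are identical.
p1 = softmax(z)
p2 = softmax(z - c)
```

The constant appears in both the numerator and the denominator of every class's probability, so it cancels: that's the overparameterization.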
Also, the minimizer of the cost function isn't unique, but the cost function is still convex. All this means is that gradient descent won't run into local optima problems.
Yay for fewer problems!
Also, if you take the square matrix of second-order partial derivatives of the cost function, aka the Hessian, you will run into problems with a straightforward implementation of Newton's method. It's gonna have uber problems. Newton's method is a method for finding successively better approximations to the roots of a real-valued function. You start with a guess which is close enough to the root, then take the derivative and calculate the x-intercept of the tangent line. That x-intercept is usually a better guess, and you iterate until you converge on the answer. But don't do this for softmax regression. Just don't. Because of the redundant parameters, the Hessian is singular/non-invertible, so the Newton update can't be computed. Just solve it another way.
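For the cases where Newton's method does work, here's a tiny sketch of the iterate-on-the-x-intercept idea, using the made-up example f(x) = x² − 2 (whose root is √2):

```python
def newton(f, df, x, steps=10):
    # Repeatedly replace the guess with the x-intercept of the
    # tangent line at the current guess: x_new = x - f(x)/f'(x).
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

# Find the root of f(x) = x^2 - 2 starting from a guess of 1.0
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x=1.0)
```

Note the division by the derivative: in the multivariate version that becomes solving against the Hessian, which is exactly what breaks when the Hessian is singular, as it is for plain softmax regression.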
Softmax might not make sense to you yet, but it does for me now that I've written it out trying to explain it to you.
So, back to the original problem and MNIST.
For MNIST, K=10 for the ten digits, zero to nine. We will give each image a probability of being each digit. I.e. a picture of a 9 might be 80% a nine, 5% an eight because of the top loop, and 1% a zero because of the top loop and messy bottom. It's the classic, perfect example for softmax regression.
- Add parameters for our image to be in a certain number class, i.e. a "0" class or a "1" class.
- Convert them into probabilities.
- Tally up the evidence that an image belongs to a certain class with a weighted sum of its pixel intensities.
- Add in an extra parameter for the bias, i.e. things we know are independent of the image.
- Convert the tallies into prediction probabilities with the softmax function.
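The steps above can be sketched end to end. The pixel values, weights `W`, and bias `b` here are random stand-ins for illustration, not a trained model or real MNIST data:

```python
import numpy as np

K = 10    # digit classes, zero to nine
D = 784   # 28x28 pixel intensities, flattened

rng = np.random.default_rng(0)
x = rng.random(D)                 # stand-in for one MNIST image
W = rng.normal(0, 0.01, (K, D))   # one weight vector per class (untrained)
b = np.zeros(K)                   # per-class bias, independent of the image

# Tally up evidence for each class: weighted sum of pixel intensities + bias
scores = W @ x + b

# Convert the tallies into prediction probabilities with softmax
exps = np.exp(scores - scores.max())
probs = exps / exps.sum()

prediction = int(np.argmax(probs))  # the digit with the highest probability
```

With trained weights, `probs` would be the 80%-nine / 5%-eight style breakdown described above; here it's just the mechanics of the forward pass.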