TensorFlow and Machine Learning Basics

Every TensorFlow Python program generally imports the tensorflow package as tf:

import tensorflow as tf

Tensors

The basic building block in TensorFlow is called, oddly enough, a tensor. The simplest example of a tensor is a constant node, which always emits the same value:

# constant nodes 
node1 = tf.constant(3.0, dtype=tf.float32) 
node2 = tf.constant(4.0) 
# prints 'Tensor("Const:0", shape=(), dtype=float32)
#         Tensor("Const_1:0", shape=(), dtype=float32)'
print(node1, node2)

As you can see, this creates two tensor objects. In order to actually get the values out of them, we must run them through a session:

sess = tf.Session()
print(sess.run([node1, node2])) # prints [3.0, 4.0]

You can use existing tensors to create other tensors. For example, you can make a simple adder that adds the values of two nodes together:

# add the nodes to make a new node 
node3 = tf.add(node1, node2) 
# prints 'Tensor("Add:0", shape=(), dtype=float32)'
print(node3)  
print(sess.run(node3)) # prints 7.0

The next type of tensor is a placeholder. It puts a name on a value that must be specified when the network of tensors is run; in other words, it is an input.

# placeholders - values required as inputs 
a = tf.placeholder(tf.float32) 
b = tf.placeholder(tf.float32)

Using placeholders, we can create an adder that adds values specified at runtime:

# adds its two inputs together 
adder = a + b 
# sets a to 3, b to 4.5, runs the adder, prints 7.5 
print(sess.run(adder, {a: 3, b: 4.5})) 
# sets a to [1,2] and b to [3,4], 
# runs the adder, prints [4. 6.] 
print(sess.run(adder, {a: [1,2], b: [3,4]}))

With this example, I hope you can start to see the power of TensorFlow. Not only does the adder handle simple numbers, it can also add vectors (or even matrices). In machine learning, working with vectors and matrices is a key part of making computations more efficient. In this way, TensorFlow is a lot like Matlab, Octave, and other numerical programming frameworks that easily work with these higher-dimensional representations of data.
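For instance, the very same adder will happily take two 2x2 matrices. This is just a quick illustration using the placeholders defined above (the matrix values here are arbitrary):

# feed 2x2 matrices into the same adder 
# prints the element-wise sum, something like 
# [[ 4.  6.] 
#  [ 8. 10.]] 
print(sess.run(adder, {a: [[1,2],[3,4]], b: [[3,4],[5,6]]}))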

You can wrap the output of the adder with another tensor to easily chain operations together:

# wraps the adder and triples the result 
add_and_triple = adder * 3 
# prints 22.5 (3 + 4.5 = 7.5, 7.5 * 3 = 22.5) 
print(sess.run(add_and_triple, {a: 3, b: 4.5}))

The next type of tensor is called a variable. Developers are quite familiar with variables, but they have a special purpose in TensorFlow. We’ll describe them in the next section.

Linear Regression Example

Let’s build a simple linear regression model:

# variables - initial value and type 
W = tf.Variable([.3], dtype=tf.float32) 
b = tf.Variable([-.3], dtype=tf.float32) 
x = tf.placeholder(tf.float32) 
linearModel = W * x + b # Theta0 * X + Theta1 = y

In this case, we have built a model with one feature, x (the input), and two parameters, W and b (although machine learning scientists like to call these Theta0 and Theta1). The linearModel tensor is the output tensor, i.e., it is the value we are looking for. In order to correctly predict the output value, we need to train the model by adjusting the values of W and b. We give them initial values of 0.3 and -0.3, but during training these values will change. In order to actually do the training, we must first initialize the variables to their default values (we can run the same initializer again if we want to reset the values back to the defaults):

# initialize the variables 
init = tf.global_variables_initializer() 
sess.run(init)
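After running the initializer, the variables actually hold their starting values, which we can confirm by evaluating them (just a sanity check, not a required step):

# W and b now hold their initial values 
# prints something like '[array([ 0.30000001], dtype=float32), 
#                         array([-0.30000001], dtype=float32)]' 
print(sess.run([W, b]))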

Alright. Now in every linear regression problem, we have some input data, often called training data, that specifies what we are trying to calculate with our model. Say we have the following training data:

x_train = [1,2,3,4] 
y_train = [0,-1,-2,-3]

In this case, x_train is a set of values that are sent into the model, whereas y_train is the expected output of the model. Recall from above that our model is basically y = W * x + b. Our goal is to find values for W and b such that when we set x to 1, y will be 0; if x is set to 2, y will be -1, and so on. W and b have initial values of 0.3 and -0.3, so we can just run the model as is with the x_train input and it will spit out a value for each of the inputs:

# run a linear model for all the values of x 
# prints [ 0. 0.30000001 0.60000002 0.90000004] 
print(sess.run(linearModel, {x: x_train}))

The first output value is 0, which is correct. The other three, however, are wrong. So we need to change W and b. In order to train our model, we need to be able to input the expected values, so we define a placeholder to take these in:

# y is our labels which are inputs 
y = tf.placeholder(tf.float32)

In machine learning, these expected outputs are called labels. Some machine learning models don't have labels – you pass in input and what comes out is something you didn't know but are hoping to learn. A good example of this is clustering, where you are trying to divide people up into groups but don't necessarily know in advance what the groups are – you are searching for possible correlations between them that may be hidden.

In order to adjust our parameters W and b, we need something that tells us if we are going in the right direction. This is called a cost or loss function, and it essentially calculates how far off we are from accurate predictions. One typical way of calculating the cost function is to take the difference between the predicted value and the expected value and then square it (the reason we square the value is explained below when we talk about how gradient descent works). When you sum all the squared differences between the predictions and the expected values, you get a single number that gives you an idea of how far off you are. We do this in TensorFlow by first using the square function on the difference between linearModel (which computes the predicted value given x) and y (which holds our expected values):

# compute loss/cost function as square of 
# diff between predictions and labels 
squared_deltas = tf.square(linearModel - y)

Then we sum all the deltas together using the tensorflow reduce_sum function:

# produces the loss function by summing the deltas 
loss = tf.reduce_sum(squared_deltas)
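As a quick aside, it may help to see reduce_sum on its own before we use its output; this is just a throwaway illustration, not part of the model:

# reduce_sum collapses a tensor to a single number 
print(sess.run(tf.reduce_sum([1.0, 2.0, 3.0]))) # prints 6.0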

Don’t get tripped up by the name reduce_sum – this function isn’t trying to minimize the sum, it is reducing a vector down to a scalar by summing all the values together. The resulting tensor, which we called loss, outputs the value that we want to try to make as small as possible. We can compute the current loss value given the input and expected output values:

# print the current error (23.66) given the 
# current values of W and b 
print(sess.run(loss, {x: x_train, y: y_train}))
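If you want to convince yourself where 23.66 comes from, you can redo the arithmetic in plain Python; this check is just for illustration and is not part of the TensorFlow graph:

# predictions with W = 0.3 and b = -0.3 are roughly [0.0, 0.3, 0.6, 0.9] 
preds = [0.3 * xi - 0.3 for xi in x_train] 
# sum of squared differences against y_train, prints roughly 23.66 
print(sum((p - yi) ** 2 for p, yi in zip(preds, y_train)))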

We will ultimately find that the optimal values of W and b are -1 and 1, and we can run our loss function again with these values to see that our cost goes to 0:

# optimal values of W and b are -1 and 1 
sess.run([tf.assign(W, [-1.]), tf.assign(b, [1.])]) 
# prints 0.0 
print(sess.run(loss, {x: x_train, y: y_train}))
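With those values plugged in, the model reproduces the training labels exactly, which is why the loss drops to zero. Running the model one more time makes this easy to see (just a quick check, not a required step):

# with W = -1 and b = 1 the predictions match y_train 
# prints something like [ 0. -1. -2. -3.] 
print(sess.run(linearModel, {x: x_train}))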

Training

Up to this point, all we have really done with TensorFlow is build up a series of calculations based on some inputs. We could have done this in any programming language, but TensorFlow made it relatively simple, even if the inputs are vectors or matrices. Now we get to the real fun of machine learning, where we train the model to find optimal parameters. For this example, we are going to use a method called gradient descent. The way this method works is to take a partial derivative of the loss function in order to determine how to adjust the parameters W and b. If you remember your calculus, a derivative essentially tells you how fast something is changing – acceleration is the derivative of velocity, and velocity is the derivative of position. In this case, the derivative is essentially the slope at a point on the graph of the loss function, which could look something like this:

[Figure: GradientDescent, a plot of a loss function with its lowest points marked in blue]

What we are looking for are the points lowest on the graph, shown in blue, which represent the points where the error is lowest. The gradient descent method works basically the same way you would try to find the fastest way down a mountain: you look around you and follow the slope that leads downwards. The derivative tells gradient descent which way to adjust the value of a given variable so that the overall loss goes down. You take a step, and then you do it again, over and over, until the slope is flat or begins to slope in the wrong direction (i.e., you have hit the bottom). As you can see from the graph above, depending on where you start, you might find a local minimum that isn't the global minimum, so you might have to run the method several times starting at different points to see if you find a better result.

Another factor in this method is the learning rate, which is essentially the size of the step you take each time. There is a tradeoff here: a learning rate that is too small will make finding the minimum slow because you are taking tiny steps, while a learning rate that is too big can step over the minimum and even diverge away from it. Sometimes it is necessary to check while the model is training that the value of the loss function keeps going down, so that you know you are converging on the minimum.

TensorFlow has a builtin implementation of gradient descent and all we need to do is specify the learning rate. This optimizer will go through our tensors and find the variables that need to be adjusted in order to minimize our loss:

# create gradient descent optimizer 
learningRate = 0.01 
optimizer = tf.train.GradientDescentOptimizer(learningRate) 
# construct training function around our loss function 
train = optimizer.minimize(loss)

You might be asking yourself "how do I know what a good learning rate is?". This part is really about math, but essentially each step in gradient descent computes a small adjustment to one of the parameters, and the formula looks roughly like:

new_value = old_value - alpha * 1/m * 
    sum(predicted_value - expected_value)

In this formula, alpha is the learning rate, m is the number of samples we have, and predicted_value and expected_value are the sets of predictions and expected outputs; we take the differences and sum them (for a parameter like W that multiplies an input, each difference also gets multiplied by the corresponding input before being summed). The result is a small value that nudges the parameter we are currently training, and we subtract it so that we move downhill, against the slope. Remember how our loss function squared the differences? The derivative of x^2 is 2x, so the derivative of the squared differences becomes 2 times the differences. That constant, together with the learning rate and the 1/m factor, just controls how big a step we take. If the number of samples you have is large, 1/m will be small, so the learning rate can be larger (which is good, because you have a lot more calculations to do for each step with all your training samples). If the number of samples is small, you will likely need to make the learning rate smaller in order to not miss the minimum.
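To make this concrete, here is a rough sketch of what a single training step boils down to, using tf.gradients to get the same derivatives the optimizer computes internally. Treat it as an illustration rather than something you would normally write yourself (optimizer.minimize does all of this for you, and since our loss is a plain sum there is no explicit 1/m factor here):

# manually compute the gradients of the loss with respect to W and b 
gradW, gradb = tf.gradients(loss, [W, b]) 
# take one gradient descent step by subtracting a small 
# multiple of each gradient from its variable 
manual_step = [tf.assign_sub(W, learningRate * gradW), 
               tf.assign_sub(b, learningRate * gradb)] 
sess.run(manual_step, {x: x_train, y: y_train})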

Now we actually have to run the training, so first we reset the variables back to their initial values:

# reset variables
sess.run(init)

Then we run our training data through the gradient descent a bunch of times (i.e., we take a lot of steps – in this case 1000):

# do 1000 iterations of training 
for i in range(1000): 
  sess.run(train, {x: x_train, y: y_train})
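As mentioned earlier, it can be reassuring to watch the loss shrink while training. One simple, purely optional variation of the loop above prints the loss every so often:

# same training loop, but report the loss every 100 iterations 
for i in range(1000): 
  sess.run(train, {x: x_train, y: y_train}) 
  if i % 100 == 0: 
    print(i, sess.run(loss, {x: x_train, y: y_train}))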

When this is done, we have new values in W and b that should allow us to correctly predict the output based on the input:

# see the final values for W and b 
# prints '[array([-0.9999969], dtype=float32), 
#          array([ 0.99999082], dtype=float32)]' 
print(sess.run([W, b]))

As you can see, we didn't exactly hit the optimal values of -1 and 1, but we are pretty darn close. How close? Let's check the value of the loss function now:

# prints 5.69997e-11
print(sess.run(loss, {x:x_train, y:y_train}))

That’s really small. We could train even more and get the loss smaller, but at this point our accuracy is over 99% and we are only talking about a small fraction of a percent improvement at best.
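And of course, the whole point of training was to be able to predict outputs for inputs the model hasn't seen. Feeding new x values through the trained model works just like before; the printed values will be off from the ideal -4 and -9 by a tiny amount:

# predict y for new inputs using the trained W and b 
# prints approximately [-4. -9.] 
print(sess.run(linearModel, {x: [5, 10]}))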
