Scheme – Where I first learned AI and Machine Learning

Machine Learning has been all the rage lately, but some of us were introduced to it decades ago. I first encountered it in my artificial intelligence class at MIT.

[Image: Patrick Winston]

When I took the class, it was taught by Patrick Winston, who wrote the book that was considered the definitive guide on artificial intelligence at the time. That book is still used in a lot of places, but I’m sure it has been replaced by other books that have expanded on the ideas and include more discussion about modern frameworks that have made this subject more tangible.

The class was taught in a language called Scheme. For those who don’t know Scheme, it is a dialect of LISP (jokingly expanded as Lots of Irritating and Silly Parentheses). LISP and its dialects use parentheses to enclose expressions. The underlying syntax is prefix notation, sometimes known as Polish Notation, but LISP makes the grouping of operators explicit so that programmers don’t have to mentally track the calculation stack. For example, this is an expression written in Polish Notation:

− × ÷ 15 − 7 + 1 1 3 + 2 + 1 1

These expressions are generally evaluated from right to left, and the interpreter uses a stack to do it. The first thing it does is push 1 onto the stack, followed by another 1. Then it encounters the plus, so it pops both 1s off the stack, adds them, and pushes 2 onto the stack. The full evaluation looks like this:

push 1 onto stack [1]
push 1 onto stack [1 1]
pop two operands (1, 1) and add []
push 2 onto stack [2]
push 2 onto stack [2 2]
pop two operands (2, 2) and add []
push 4 onto stack [4]
push 3 onto stack [3 4]
push 1 onto stack [1 3 4]
push 1 onto stack [1 1 3 4]
pop two operands (1, 1) and add [3 4]
push 2 onto stack [2 3 4]
push 7 onto stack [7 2 3 4]
pop two operands (7, 2) and subtract [3 4]
push 5 onto stack [5 3 4]
push 15 onto stack [15 5 3 4]
pop two operands (15, 5) and divide [3 4]
push 3 onto stack [3 3 4]
pop two operands (3, 3) and multiply [4]
push 9 onto stack [9 4]
pop two operands (9, 4) and subtract []
Result is 5
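The right-to-left stack evaluation traced above can be sketched in Python (using * and / in place of × and ÷):

```python
# Evaluate a prefix (Polish notation) expression by scanning
# its tokens right to left with a stack, as traced above.
def eval_prefix(tokens):
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for tok in reversed(tokens):
        if tok in ops:
            a = stack.pop()  # first (left) operand is on top
            b = stack.pop()  # second operand is underneath
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# The expression from the post
expr = "- * / 15 - 7 + 1 1 3 + 2 + 1 1".split()
print(eval_prefix(expr))  # 5.0
```

Note that when scanning right to left, the operand on top of the stack is the *left* operand, which is why `- 7 2` correctly evaluates to 5 rather than -5.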

Developers used to write expressions like this all the time, and it was slow and cumbersome, so LISP uses parentheses to make it clearer what is happening. The above expression could instead be written like this:

(- (× (÷ 15 (- 7 (+ 1 1))) 3) (+ 1 1 2))

At first glance, this may not seem much better, but you can easily see how things get reduced:

(- (× (÷ 15 (- 7 (+ 1 1))) 3) (+ 1 1 2))
(- (× (÷ 15 (- 7 2)) 3) (+ 1 1 2))
(- (× (÷ 15 (- 7 2)) 3) 4)
(- (× (÷ 15 5) 3) 4)
(- (× 3 3) 4)
(- 9 4)
5

Because of this, it was much easier to write and understand a LISP program than the old stack-based expressions.

The above example demonstrates numerical computation in LISP, but the language has functions and data structures as well. Data structures in Scheme are built out of lists. For example, I can define a variable x to be the list of the first five integers (the leading quote tells Scheme to treat the list as data rather than as a function call):

(define x '(1 2 3 4 5))

Accessing the data inside this structure uses two main functions, car and cdr, which return the first element of a list and the rest of the list, respectively. This means:

(car x) -> 1
(cdr x) -> (2 3 4 5)

You can probably see that accessing certain elements of the list can get kind of tricky. For example, to get the 4th element of the list (4), you need to do this:

(car (cdr (cdr (cdr x)))) -> 4

The expression works in the reverse order of how it reads from left to right – you have to strip off 1, 2, and 3 using cdr, and then use car to get the first element of the remaining list (4 5).
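The same chain can be mimicked in Python with indexing and slicing (car and cdr here are just illustrative helper names, not anything standard):

```python
# Python stand-ins for Scheme's car and cdr
def car(lst):
    return lst[0]       # first element

def cdr(lst):
    return lst[1:]      # rest of the list

x = [1, 2, 3, 4, 5]
print(car(x))                 # 1
print(cdr(x))                 # [2, 3, 4, 5]
print(car(cdr(cdr(cdr(x)))))  # 4
```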

Suffice it to say that this language comes from a time when programming languages were much more primitive. There were, however, some concepts in Scheme that are only now reemerging in mainstream programming languages. One such example is the lambda expression. Lambda expressions were introduced to Java in JDK 1.8, but they had been around as a programming construct for decades before that (Scheme inherited them from LISP, which took the idea from the lambda calculus). A lambda expression is essentially an expression that doesn’t evaluate immediately but instead requires values for placeholders to be provided in order to be evaluated. For example, I can define a lambda expression in Scheme and bind it to a variable x:

(define x (lambda (x y) (+ x y)))

This essentially defines an expression that adds its inputs together. It is kind of like defining a function except that it actually gets stored in a variable and can be passed around (a pointer to a function is kind of a good analogy, but lambda expressions are more like data that can be stored on the stack as opposed to just pointing to some instruction in the program data space). You can then evaluate the lambda expression like this:

(x 1 2) -> 3
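For comparison, the same thing in Python, which also has a lambda keyword:

```python
# A lambda bound to a variable, mirroring the Scheme example
add = lambda a, b: a + b
print(add(1, 2))  # 3
```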

So why would anyone do this? A good example is the factory pattern. Say I want to build a custom expression that multiplies one argument by a multiplier that is a parameter to the factory. For example:

(define factory 
  (lambda (multiplier) 
    (lambda (x) (* x multiplier))))

The factory itself is a lambda expression that returns another lambda expression. I could make a triple multiplier and use it like this:

(define tripler (factory 3)) 
    -> (lambda (x) (* x 3))
(tripler 2) -> 6

Or imagine an even more complex example where the factory creates an expression that does any operation with a parameter:

(define factory2 
  (lambda (operator value) 
    (lambda (x) (operator x value))))
(define divideBySix (factory2 / 6)) 
    -> (lambda (x) (/ x 6))
(divideBySix 18) -> 3

This is a simple example, but you can imagine building expressions that compute results based on configuration parameters. In the machine learning world, we are often doing complex computations with parameters that require tuning, so we need a way to define the computations with placeholders that can be adjusted as we train the network to be more accurate.

Python of course allows us to do this in a much more succinct way, but the principles are essentially the same:

def factory2(operator, value):
    def evaluator(input):
        return operator(input, value)
    return evaluator
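To mirror the Scheme divideBySix example, the Python version can be exercised with the standard operator module (the names divide_by_six and tripler are just for illustration; factory2 is repeated here for completeness):

```python
import operator

# factory2 as defined above (parameter renamed to op to avoid
# shadowing the operator module)
def factory2(op, value):
    def evaluator(x):
        return op(x, value)
    return evaluator

# build specialized functions from ready-made operators
divide_by_six = factory2(operator.truediv, 6)
tripler = factory2(operator.mul, 3)

print(divide_by_six(18))  # 3.0
print(tripler(2))         # 6
```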



Diversity in the Workplace

So I saw on CNN today a news article about a manifesto circulating at Google about gender diversity in the workplace.

The basic premise of the manifesto is that men and women are different biologically, and that workplace programs that try to give more jobs to women are unfair because they bias against men who may be more qualified than the women who are chosen. He goes into great detail about how, in his opinion, men are more driven to succeed due to biological factors like testosterone.

One big problem with the manifesto is that it is built on stereotypes. He literally uses the phrase “women in general”, which is dangerous. History is full of examples of oppression based on stereotypes, where every member of a group is labelled inferior because of the actions of a subset of that group. Some individuals in the group may in fact be less skilled or intelligent, but the stereotype also implies that everyone outside the group is superior, which is most certainly not true.

I attended the Massachusetts Institute of Technology, and although there were more men than women, the percentages were reasonably close. The women I studied alongside were as capable of achievement as anyone I have ever met. They were not hindered by the fact that they didn’t have testosterone or a penis hanging between their legs. The notion that some biological factor can limit you is not only insulting, it just doesn’t match up to reality. There are countless examples of people who were limited by some biological factor but overcame those challenges and succeeded to even greater degrees than people who didn’t face them.

There was a similar debate on this years ago when California had affirmative action, which aimed to fill spots on college campuses with students from minority groups. It was argued that minorities were filling spots that should have been given to white students who were more deserving. There is definitely some merit to this argument, but it leaves out some harsh truths about our world – a lot of folks in those minority groups simply don’t have access to the same resources that white people do. I won’t say that every black school is bad or that all Mexican families are poor, but a lot of them are. Programs like affirmative action attempt to compensate for failures of our society in other areas. Is it the right approach? Probably not, but at least it was attempting to address the issue.

I will say that this is where the manifesto has some merit – the solution doesn’t fit the problem. Giving a job to a woman over a man because certain quotas need to be met can be unfair. In an ideal world, people would be hired based only on their abilities to fulfill the duties of the job, but we need to do more than just fix hiring practices. We need to shift as a culture away from assuming that just because someone is of a certain race, sex, political affiliation, etc., that they are inferior or incapable of doing something.

We have made incredible progress in the last 100 years, but we still have so far to go, and in the last few years, it feels like we are going backwards. Muslims are stereotyped as being terrorists, despite the fact that many of the terrorist events that have happened recently were committed by Christians. Blacks are viewed as being violent and criminal, despite the fact that many of them are outstanding citizens while plenty of white folks are out committing crimes. We have let the media distort the truth and control us by generating fear which distracts us from the real problems of this world.

While I can agree that companies may be unfairly filling roles by taking gender into account to fill some quotas, implying that women are biologically unfit for the roles simply goes too far. There can be many reasons why a male candidate ends up as a better fit than a woman, but I simply can’t believe that it would be because the man has a penis and the woman does not.


Brains Work In Mysterious Ways

Yesterday I was talking a lot about neural networks and the brain. Later that night, I came across an article about a funny optical illusion.

When you look at this picture, it looks like two wavy shapes, one inside the other. Would it surprise you to learn that they are actually both circles? Check it out:

If you want, you can take the original image and put it in Paint and draw your own circles if you don’t believe me.

The interesting thing is that even though you now know they are circles, you still really don’t see circles when you look at the original image. This is what makes optical illusions so awesome – they reveal something about how the brain really works. Here is another example:

[Image: cube on checkerboard illusion]

When you look at this image, which appears to be a cube sitting at a 45° angle on the ground, you see that one face of the cube is a dark gray while the other is light gray. Now look at this:


It is the same image, only I have covered the middle region between the faces with a rectangle that has the same shade as the top part of the cube, and now you can clearly see that the two shades of gray are the same! Feel free to copy this one into Paint and try it out for yourself. And just like with the circles, even though you know the shades are the same, you just don’t see it that way.

So what the heck is going on? How is our brain being fooled into seeing things not as they actually are? It turns out that the answer lies in the world of psychology. A number of visual experiments show that the brain takes the information it gets from your eyes and processes it before you actually “see” it.

One interesting example of this phenomenon that we see almost every day is when we look at a TV, computer monitor, or even a movie screen.

[Image: monitor refresh patterns]

The image above depicts different ways that computer monitors update what is on the screen. As you can see, the monitor is actually flashing light at us continually. If we speed up the refresh rate, the brain will eventually fill in the dark periods with the image it previously saw, so that we become unaware there was a gap between the refreshes. Old movie projectors work the same way – what you are actually seeing is a series of pictures (usually 24 per second) being flashed on the screen, but your brain connects the images so that they appear continuous.

Another great example is if you stare at this image of a spiral for a few moments (10 – 15 seconds) and then look at a different part of the screen (like the words above it):

[Image: inward-moving spiral]

Did it look like the words above were expanding? You know that they were not, but your brain made it look that way for a few seconds.

Your brain uses little tricks like this to compensate for motion in what it is seeing, making those images appear more stable than they really are. If you stare at a flowing waterfall for a few moments and then look elsewhere, you will see stationary things appear to drift upward – the classic waterfall illusion.

Most likely, this is a survival mechanism, because as you are scanning around you, looking for predators, it would be a lot harder to spot them if you couldn’t really see them in the blur while your head was turning. So your brain compensates and this allows you to see things more clearly while your head is moving, but when your head stops moving, it takes a moment for your brain to stop compensating, and so you might see effects like the one above.

This just goes to show that your brain processes visual information much in the same way it processes other signals from your other senses. When you hear a sound you have heard before, you can usually identify the type of sound almost immediately without even thinking about it – again, a survival mechanism because sometimes you don’t really have the time to stop and think about whether the sound is a predator or not. How some of this all works is still a mystery, but as I talked about in my last post, it is likely due to connections in the brain that happen so that certain neurons fire given certain inputs so that we can almost immediately perceive threats and react quickly. In a way, we are smart beings because we had to be to survive.

It almost makes you wonder if Idiocracy is becoming true because we simply don’t have to use our brains for survival as much any more. Who really knows.

Explaining Neural Networks

So I’ve been studying and making a lot of noise about machine learning.


I know that a lot of people aren’t as excited about it as I am, but some people definitely are starting to get excited about it, so I’m glad that I’m getting in on it. I wouldn’t exactly say that I’m getting in on it early, because folks at MIT and other universities and even some companies have been doing machine learning for decades.

What is really creating the excitement these days is the frameworks and resources that are making machine learning available to almost everyone. It used to be that models like neural nets were limited to folks with large clusters of machines that were able to do the massive calculations required to train the model. Now with companies like Amazon and Google offering computing power that you can easily take advantage of, almost anyone can train a complicated model to make predictions. Add to that simple language frameworks like TensorFlow and suddenly it seems like everyone is using machine learning.

While using machine learning has become easy, understanding what it is actually doing is still complicated. My current supervisor and I were talking about machine learning, and he raised a great question: how do you explain what the machine learning model actually does to an executive who has no technical background in machine learning? Particularly, how do you explain what is going on in the hidden layers of a neural network?

Let’s start by talking about how neural networks got their name. Obviously it refers to neurons, but why? Picture your brain:

[Image: the human brain]

What the heck is going on in there? Scientists actually don’t entirely know. Studies are still going on to map out different regions of the brain and figure out how it actually all works. We do know, however, that it starts with neurons, which are specialized cells that link together to exchange information:

[Image: a neuron]

Most of us have seen this image before, and we understand that a neuron receives signals through its dendrites and then relays more signals through the axon terminals at the end. We know how these neurons carry information to muscles all throughout our bodies.

The real question is how does the brain do image recognition? When you look at a photo, you can easily spot particular things within it. For example, look at the following photo and find at least four children:

[Image: people in a crowd]

How long did that take you? It might take you a few seconds, but it was pretty fast. How would we train a computer to do this task? Obviously we would first teach the computer to recognize faces, and then we would train the computer to differentiate a young face from an old face. The interesting thing about this problem is that we are even able to detect young faces when they are far away and not clear. For example, near the top is a girl in red who is clearly a child. How can we tell? Because she’s sitting on her father’s shoulders. It was actually some other complex information that clued us in to the fact that she was a child. The other thing to note is that you didn’t look at every single face in the photo. In fact, you most likely didn’t even look at faces to begin with – you probably looked for smaller people or smaller heads. You were able to completely ignore most of the faces in the photo because they didn’t meet some criteria, and that allowed your brain to quickly zero in on the children.

How did we get so good at this? Training. When you were born, you didn’t know how to recognize children in a picture. You didn’t even know what a child was. Your brain starts out with some simple abilities (such as breathing), and the rest pretty much has to be figured out. You literally have to train your brain to do almost everything, just the way you learned to walk, ride a bike, and even speak. It didn’t happen overnight, and in a lot of cases, you had to keep training and correcting for months in order to do it correctly.

So what is happening when we train our brains? There is a lot of studying going on with this, but the evidence points to the brain forming new connections as things are learned. Your brain is literally rewiring itself continually every day. There is information being stored away, and we aren’t quite sure exactly how. This idea of connections, however, is a big part of how neural networks work.

The most basic example of a neural network is one that can identify handwritten digits. There is a freely downloadable data set called the MNIST database, which contains thousands of images of handwritten digits. Many types of machine learning models can be trained to identify the digits, but neural networks allow you to achieve much higher accuracy.

The idea is pretty simple. You take the image and convert it into a series of numbers, each number representing the intensity of a pixel in the image. You feed these numbers into a well trained neural network and out of the network comes a set of numbers, usually 10, of which one will be 1 (or close to 1) and the rest will be 0 (or very close to 0). In between the inputs and the outputs is a set of one or more layers (called hidden layers) that are connected to each other and also to the input and output layers. It basically looks like this:



Notice that each of the inputs is connected to every unit in the hidden layer in the middle, and that each unit in the hidden layer is connected to the output layer. So when you feed the numbers in through the input layer, each of those values gets distributed to every unit in the hidden layer. The units in the hidden layer will combine the values from the various inputs, multiplying each by a distinct weight. The resulting value is then passed to all the units of the next layer. The output layer works the same way, combining the inputs scaled by distinct weights.
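A single forward pass through one hidden layer can be sketched like this (the layer sizes, random weights, and sigmoid activation here are illustrative assumptions, not anything specific to MNIST):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9, 0.3])   # four inputs (pixel intensities)
W1 = rng.normal(size=(3, 4))         # weights: every input feeds every hidden unit
W2 = rng.normal(size=(2, 3))         # weights: every hidden unit feeds every output

hidden = sigmoid(W1 @ x)             # each hidden unit combines all the inputs
output = sigmoid(W2 @ hidden)        # the output layer does the same
print(output.shape)                  # (2,)
```

Each matrix-vector multiplication is exactly the "combine the values from the various inputs, multiplying each by a distinct weight" step described above.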

How do we determine the weights? Training, of course, but it is a complicated process. It actually requires a pass through the network and back. First we initialize the weights randomly (the reason has to do with the algorithm used to adjust the weights – if you start out with the weights the same, the algorithm breaks down and the model doesn’t actually learn). Then we pass input in and do all the weight calculations until the output is computed. Now we can compare the output to what we expect to see, and we compute some error for each of the outputs. For example, when we first feed the network an image of the number 7, we might get some output data like this:

output 0: 0.485673
output 1: 0.234857
output 2: 0.578578
output 3: 0.756385
output 4: 0.583758
output 5: 0.227621
output 6: 0.978373
output 7: 0.376482
output 8: 0.856756
output 9: 0.037573

Clearly, this isn’t right. We then calculate the difference between these values and the expected values, which are all 0 except output 7, which should be 1. This difference is fed backwards through the network through a process called backpropagation. Essentially we have a way of figuring out how much each unit in the previous layer contributed to the error, and therefore we can make a small adjustment that should reduce the error on the next run. We then continue farther back to the layer before the one we just adjusted and determine again how much each unit in that layer contributed to the error calculated from the layer after it.
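Using the sample outputs above, the per-output error for an image of a 7 can be sketched like this:

```python
import numpy as np

# the ten raw outputs from the example above
outputs = np.array([0.485673, 0.234857, 0.578578, 0.756385,
                    0.583758, 0.227621, 0.978373, 0.376482,
                    0.856756, 0.037573])

# expected values: all 0 except output 7, which should be 1
expected = np.zeros(10)
expected[7] = 1.0

# per-output error that backpropagation pushes back through the network
error = expected - outputs
print(round(error[7], 6))   # output 7 needs to go up
print(round(error[6], 6))   # output 6 needs to go down
```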

Once all the weights have been adjusted, we run another input through the network, calculate error and adjust weights. How long it takes to make one pass depends a lot on how many layers there are and how many units are in each layer. Adding more layers and more units will increase the accuracy of the neural network, but it will also increase the amount of time it takes to run all the calculations through the network, compute the error, and adjust the weights. This is a big reason why neural networks fell out of favor when they were first developed but are now starting to become popular again – with the increased parallelism and computing power available today, really big neural networks are able to be trained in a fraction of the time it would have taken years ago.

It is possible that this is how the brain actually works as well. Input is fed into the brain, and at first we can’t make heads or tails of what that input is, but somehow the brain makes adjustments and things start getting clearer. Eventually the brain becomes so well tuned that we can recognize complicated shapes out of a photograph in a really short period of time. We do know that some things must be trained early on, such as language, because if these things are not done early they become hard to learn later in life. It could be because things like language are paramount to a lot of the other learning that we do. When you look at a photo, you not only see shapes, you see things and your brain finds words to associate with those things so that you know what you are looking at.

I believe that we are a long way off from having an artificial intelligence that can really work the same way our brains do, but it could be that we just need to keep on training our networks and making them bigger and bigger. It takes the average person almost two decades to achieve what we consider maturity, so how can we expect a computer to learn the same things in just hours or days? We are still in the pioneer days of artificial intelligence and machine learning (computers as we know them have only been around for a few decades), but I believe that as time goes on, we’ll get closer and closer to building smart machines that think like we do.


Should we be afraid of terminators trying to kill us some day? Who knows. A computer will ultimately just be calculating some path to a goal, and it might not be programmed to look out for ethical boundaries. Having said that, those boundaries are something we learned along the way as we grew up, so we’ll just have to make sure that we train the machines to understand those rules as well.



TensorFlow and Machine Learning Basics

Every TensorFlow Python program generally imports the tensorflow package as tf:

import tensorflow as tf


The basic building block in TensorFlow is called a tensor, oddly enough. The easiest example of a tensor is just a simple constant node which always emits the same value:

# constant nodes 
node1 = tf.constant(3.0, dtype=tf.float32) 
node2 = tf.constant(4.0) 
# prints 'Tensor("Const:0", shape=(), dtype=float32)
#         Tensor("Const_1:0", shape=(), dtype=float32)'
print(node1, node2)

As you can see, what gets created is two tensor objects. In order to actually get the values out of them, we must run them through a session:

sess = tf.Session()
print(sess.run([node1, node2]))  # prints [3.0, 4.0]

You can use existing tensors to create other tensors. For example, you can make a simple adder that adds the values of two nodes together:

# add the nodes to make a new node 
node3 = tf.add(node1, node2) 
# prints 'Tensor("Add:0", shape=(), dtype=float32)'
print(node3)
print(sess.run(node3))  # prints 7.0

The next type of tensor is a placeholder. It puts a name on a value that must be specified when the network of tensors is run – essentially, it is an input.

# placeholders - values required as inputs 
a = tf.placeholder(tf.float32) 
b = tf.placeholder(tf.float32)

Using placeholders, we can create an adder that adds values specified at runtime:

# adder that adds its inputs 
adder = a + b 
# sets a to 3, b to 4.5, runs the adder, prints 7.5 
print(sess.run(adder, {a: 3, b: 4.5})) 
# sets a to [1,2] and b to [3,4], 
# runs the adder, prints [4. 6.] 
print(sess.run(adder, {a: [1,2], b: [3,4]}))

With this example, I hope you can start to see the power of TensorFlow. Not only did I make an adder capable of adding simple numbers, it can also add vectors (or even matrices). In machine learning, working with vectors and matrices is a key part of making computations more efficient. In this way, TensorFlow is a lot like MATLAB, Octave, and other numerical programming frameworks that easily work with these higher-dimensional representations of data.

You can wrap the output of the adder with another tensor to easily chain operations together:

# wraps the adder and triples the result 
add_and_triple = adder * 3 
# prints 22.5 (3 + 4.5 = 7.5, 7.5 * 3 = 22.5) 
print(sess.run(add_and_triple, {a: 3, b: 4.5}))

The next type of tensor is called a variable. Developers are quite familiar with variables, but they have a special purpose in TensorFlow. We’ll describe them in the next section.

Linear Regression Example

Let’s build a simple linear regression model:

# variables - initial value and type 
W = tf.Variable([.3], dtype=tf.float32) 
b = tf.Variable([-.3], dtype=tf.float32) 
x = tf.placeholder(tf.float32) 
linearModel = W * x + b # Theta0 * X + Theta1 = y

In this case, we have built a model with one feature, x (the input), and two parameters W and b (although machine learning scientists like to call these Theta0 and Theta1). The linearModel tensor is the output tensor, i.e., it is the value we are looking for. In order to correctly predict the output value, we need to train the model by adjusting the values of W and b. We give them initial values of 0.3 and -0.3, but during training these values will change. In order to actually do the training, we must first initialize the variables to their default values (we can run the same initialize again if we want to reset the values back to the defaults):

# initialize the variables 
init = tf.global_variables_initializer()
sess.run(init)

Alright. Now in every linear regression problem, we have some input data, often called training data, that specifies what we are trying to calculate with our model. Say we have the following training data:

x_train = [1,2,3,4] 
y_train = [0,-1,-2,-3]

In this case, x_train is a set of values that are sent into the model, whereas y_train is the expected output of the model. Recall from above that our model is basically y = W * x + b. Our goal is to find the values for W and b such that when we set x to 1, y will be set to 0; if x is set to 2, y will be set to -1, and so on. W and b have initial values of 0.3 and -0.3, so we can just run the model as is with the x_train input and it will spit out values for each of the input values:

# run a linear model for all the values of x 
# prints [ 0. 0.30000001 0.60000002 0.90000004] 
print(sess.run(linearModel, {x: x_train}))

The first output value is 0, which is correct. The other three, however, are wrong. So we need to change W and b. In order to train our model, we need to be able to input the expected values, so we define a placeholder to take these in:

# y is our labels which are inputs 
y = tf.placeholder(tf.float32)

In machine learning, expected outputs are called labels. Some machine learning models don’t have labels – you pass in input, and what comes out is something that you didn’t know but are hoping to learn (a good example of this is clustering models, where you are trying to divide people up into groups but you don’t necessarily know what the groups are – you are searching for correlations between them that may be hidden).

In order to adjust our parameters W and b, we need something that tells us if we are going in the right direction. This thing is called a cost or loss function, and it essentially calculates how far off we are from accurate predictions. One typical way of calculating the cost function is to take the difference between the predicted value and the expected value and then to square it (the reason we square the value is explained below when we talk about how gradient descent works). When you sum all the squared differences between the predictions and expected values, you get a single number that gives you an idea of how far off you are. We do this in TensorFlow by first using the square function on the difference between the linearModel (which computes the predicted value given x) and y (which is our expected values):

# compute loss/cost function as square of 
# diff between predictions and labels 
squared_deltas = tf.square(linearModel - y)

Then we sum all the deltas together using the tensorflow reduce_sum function:

# produces the loss function by summing the deltas 
loss = tf.reduce_sum(squared_deltas)

Don’t get tripped up by the name reduce_sum – this function isn’t trying to minimize the sum, it is reducing a vector down to a scalar by summing all the values together. The resulting tensor, which we called loss, outputs the value that we want to try to make as small as possible. We can compute the current loss value given the input and expected output values:

# print the current error (23.66) given the 
# current values of W and b 
print(sess.run(loss, {x: x_train, y: y_train}))
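You can verify the 23.66 by hand with plain Python, outside of TensorFlow:

```python
# current parameter values and training data from above
W, b = 0.3, -0.3
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# sum of squared differences between predictions and labels
loss = sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))
print(round(loss, 2))  # 23.66
```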

We will ultimately find that the optimal values of W and b are -1 and 1, and we can run our loss function again with these values to see that our cost goes to 0:

# optimal values of W and b are -1 and 1 
sess.run([tf.assign(W, [-1.]), tf.assign(b, [1.])]) 
# prints 0.0 
print(sess.run(loss, {x: x_train, y: y_train}))
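Both numbers can be sanity-checked in plain Python. The training data and initial parameter values below (x_train = [1, 2, 3, 4], y_train = [0, -1, -2, -3], W starting at 0.3 and b at -0.3) are assumptions based on the getting-started setup from earlier in the post – they reproduce the 23.66 loss exactly:

```python
# Plain-Python check of the loss values above.
# Assumed from the model setup earlier in the post:
#   W = 0.3, b = -0.3, x_train = [1, 2, 3, 4], y_train = [0, -1, -2, -3]
W, b = 0.3, -0.3
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [0.0, -1.0, -2.0, -3.0]

# linear model predictions: W * x + b
predictions = [W * x + b for x in x_train]

# squared deltas between predictions and labels, summed into one number
loss = sum((p - y) ** 2 for p, y in zip(predictions, y_train))
print(round(loss, 2))  # 23.66

# with the optimal parameters the loss drops to 0
W, b = -1.0, 1.0
loss = sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))
print(loss)  # 0.0
```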


Up to this point, all we have really done with TensorFlow is build up a series of calculations based on some inputs. We could have done this in any programming language, but TensorFlow made it relatively simple, even when the inputs are vectors or matrices. Now we get to the real fun of machine learning, where we train the model to find optimal parameters. For this example, we are going to use a method called gradient descent. This method takes a partial derivative of the loss function in order to determine how to adjust the parameters W and b. If you remember your calculus, a derivative essentially tells you how fast something is changing – acceleration is the derivative of velocity, and velocity is the derivative of position. In this case, the derivative is essentially the slope at a point on the graph of the loss function, which could look something like this:


What we are looking for are the lowest points on the graph, shown in blue, which represent the points where the error is lowest. The gradient descent method basically works the same way you would try to find the fastest way down a mountain. You would look around you and follow the slope that leads downwards. The derivative tells gradient descent which way to adjust the value of a given variable so that the overall loss goes down. You take a step, and then you do it again, over and over, until the slope is flat or begins to slope in the wrong direction (i.e., you have hit the bottom). As you can see from the graph above, depending on where you start, you might find a local minimum that isn’t the global minimum, so you might have to run the method many times starting at different points to see if you find a better result. Another factor in this method is the learning rate, which is essentially the size of the step you take each time. There is a tradeoff here – a learning rate that is too small makes finding the minimum slow because you are taking tiny steps, while a learning rate that is too big can step right over the minimum and even diverge away from it. Sometimes it is necessary to check while the model is training that the value of the loss function keeps going down, so that you know you are converging on the minimum.
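To see the learning-rate tradeoff in action, here is a tiny sketch (not TensorFlow, just plain Python) that runs gradient descent on the one-variable loss (w - 3)^2, whose minimum sits at w = 3 and whose slope is 2 * (w - 3):

```python
# Gradient descent on a toy loss, loss(w) = (w - 3)**2.
# The minimum is at w = 3; the gradient is 2 * (w - 3).
def descend(learning_rate, steps=50, start=0.0):
    w = start
    for _ in range(steps):
        # step downhill against the slope
        w -= learning_rate * 2 * (w - 3)
    return w

print(descend(0.1))    # lands very close to 3
print(descend(0.001))  # too small: still far from 3 after 50 steps
print(descend(1.1))    # too big: overshoots every step and diverges
```

With the 0.1 rate the result lands within a tiny fraction of 3; the tiny rate barely moves in 50 steps, and the large rate ends up thousands of units away from the minimum.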

TensorFlow has a built-in implementation of gradient descent, and all we need to do is specify the learning rate. This optimizer will go through our tensors and find the variables that need to be adjusted in order to minimize our loss:

# create gradient descent optimizer 
learningRate = 0.01 
optimizer = tf.train.GradientDescentOptimizer(learningRate) 
# construct training function around our loss function 
train = optimizer.minimize(loss)

You might be asking yourself “how do I know what a good learning rate is?”. This part is really about math, but essentially each step of gradient descent computes a small correction to apply to one of the parameters, and the formula looks essentially like:

new_value = old_value - alpha * 1/m * 
    sum((predicted_value - expected_value) * input_value)

In this formula, alpha is the learning rate, m is the number of samples we have, and predicted_value and expected_value are the sets of predictions and expected outputs – we take the differences, multiply each one by the corresponding input (for the bias parameter b the input is effectively 1), and sum them. Note the minus sign: we step against the slope, in the downhill direction. Remember how our loss function squared the differences? The derivative of x^2 is 2x, so the derivative of the squared differences is proportional to the differences themselves (the constant factor can be folded into the learning rate). We scale by the learning rate and 1/m so that we only take a small step. If the number of samples you have is large, 1/m will be small, so the learning rate can be larger (which is convenient, because each step already requires a calculation for every one of your training samples). If the number of samples is small, you will likely need to make the learning rate smaller in order to not miss the minimum.
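Here is what that update loop looks like in plain Python for our linear model. The training data and starting values are assumptions carried over from the getting-started setup (x_train = [1, 2, 3, 4], y_train = [0, -1, -2, -3], W = 0.3, b = -0.3), and because our loss is a plain sum of squares rather than an average, there is no 1/m factor in the gradients:

```python
# Plain-Python gradient descent on the linear model W * x + b.
# Data and initial values are assumed from the setup earlier in the post.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [0.0, -1.0, -2.0, -3.0]
W, b = 0.3, -0.3
learning_rate = 0.01

for _ in range(1000):
    # derivatives of the summed squared error:
    #   d/dW = sum(2 * (W*x + b - y) * x)
    #   d/db = sum(2 * (W*x + b - y))
    grad_W = sum(2 * (W * x + b - y) * x for x, y in zip(x_train, y_train))
    grad_b = sum(2 * (W * x + b - y) for x, y in zip(x_train, y_train))
    # step downhill against the slope
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(W, b)  # very close to -1 and 1
```

After 1000 steps this lands on roughly the same values that the TensorFlow run produces.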

Now we actually have to run the training, so first we reset the variables back to their initial values:

# reset variables to their initial values
sess.run(init)

Then we run our training data through the gradient descent a bunch of times (i.e., we take a lot of steps – in this case 1000):

# do 1000 iterations of training 
for i in range(1000): 
    sess.run(train, {x: x_train, y: y_train})

When this is done, we have new values in W and b that should allow us to correctly predict the output based on the input:

# see the final values for W and b 
# prints '[array([-0.9999969], dtype=float32), 
#          array([ 0.99999082], dtype=float32)]' 
print(sess.run([W, b]))

As you can see, we didn’t exactly hit the minimal values -1 and 1, but we are pretty darn close. How close? Let’s check the value of the loss function now:

# prints 5.69997e-11
print(sess.run(loss, {x: x_train, y: y_train}))

That’s really small. We could train even more and get the loss smaller, but at this point our accuracy is over 99% and we are only talking about a small fraction of a percent improvement at best.

Moving on to Machine Learning

It’s been a while since I’ve written, but it isn’t about not having things to talk about. Really, it has been about just finding the time. As a parent, my first priority is my family and my children. At their age, they require a lot of attention, and honestly I think it is important that they get it, because if you don’t give your kids attention, they will find it in other places which you might not approve of.

Anyhow, my day job has certainly kept me busy. For a long time, my focus was largely on Docker and containers. I’m still considered one of the resident subject matter experts, but I moved into a new organization that is more focused on data. Hadoop is going to be a big part of a lot of what I do in the next few months, but in the meantime I’m helping teams modernize their infrastructure, development practices, and frameworks. I’ve become the product owner for a platform that will incorporate rules, machine learning, and lots of workflow management.

Machine Learning has become my new passion, and frankly containers just don’t excite me anymore. It isn’t that containers don’t have value – in fact, there is definitely some possibility that I’ll be returning to them in the future as a method of distributing computing tasks – it’s just that there isn’t as much left to learn there. When I see some new framework around containers, I read about it, but I’ve pretty much absorbed it all within a few minutes. Machine Learning is different. It’s easy to understand on the surface, but when you dive down into the details, things get quite complicated.

Basic machine learning is easy to grasp. One good example is predicting the price of a house based on a few different criteria such as number of rooms, floorspace, location, etc. You could imagine a simple rules based approach based on a table:


This approach certainly can work, but you have to manually adjust the rules as trends change, and it doesn’t quite capture the real correlation between the features (what machine learning folks like to call the inputs) and the resulting price. The prices could actually be the result of a complex combination of the features, something like this:

price = 34 * floorspace + 2700 * numRooms + 818437 * locAvgMonthlyIncome + 7364

I’m just making this up, clearly, but the point is that how certain factors contribute to a price is more than likely something non-trivial. Part of the work of machine learning is to pick a model that can approximate these relationships. Simple options are things like linear regression, and more complex options are things like neural networks. The more complex the option, the more computing power is required to “train” the model.

The example I provided above is an example of linear regression. Machine Learning folks would write it this way:

y = Theta1 * x1 + Theta2 * x2 + Theta3 * x3 + Theta4

The variables x1, x2, and x3 are the features. Theta1, Theta2, Theta3 and Theta4 are parameters that need to be adjusted to produce a value that is approximately the right value for the given inputs. This is what it means to train a model. So how do you adjust the values? Essentially you build another function that estimates how far off your predictions are. The simple method is to take the difference between the guess and the real value and square it:

# sum of squared differences between guesses and real values
error = 0
for y, actualPrice in zip(predictions, actualPrices):
    error = error + (y - actualPrice) ** 2

This gives you an estimate of where you are. This is known as the cost or loss function.
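Putting the model and the cost function together might look like the plain-Python sketch below. The feature values and Theta parameters are made up purely for illustration:

```python
# Sketch of a linear model and its squared-error cost.
# All numbers here are hypothetical.
def predict(thetas, features):
    # y = Theta1*x1 + Theta2*x2 + ... plus a constant term (the last theta)
    return sum(t * x for t, x in zip(thetas, features)) + thetas[-1]

def cost(thetas, samples):
    # sum of squared differences between predicted and actual prices
    return sum((predict(thetas, feats) - actual) ** 2
               for feats, actual in samples)

# hypothetical (floorspace, numRooms) -> price samples
samples = [([1200.0, 3.0], 250000.0), ([2000.0, 4.0], 400000.0)]
thetas = [150.0, 10000.0, 40000.0]  # Theta1, Theta2, constant
print(cost(thetas, samples))  # 400000000.0 (second sample is off by 20000)
```

Training means adjusting the thetas until this cost gets as small as it can.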

The next thing you do is figure out how to adjust the parameters to reduce the error. One common method is called gradient descent. You compute a partial derivative of the cost function, which tells you the slope at each point on a graph like this:

[Image: gradient descent]

The easiest way of thinking of this is to imagine that you are on a mountain top trying to find your way down. You look all around you to find the steepest slope down, and then you take a step. You again look around for the steepest slope down and take a step. As you gradually get closer to the bottom, the slope gets less and less until it hits a point where it starts sloping up. When you hit this point, you have essentially hit a minimum and further adjustment will only increase your error, not decrease it. This is the goal – to find the minimal amount of error given the model.

When you start looking for that minimum error, you have to start somewhere, so you generally pick some values for Theta1, Theta2, etc, which will place you somewhere in the graph. There is some chance that you might hit a local minimum which isn’t the global minimum, so sometimes you have to run the exercise a few times to see if you have hit the real global minimum. Once you determine the optimal values of the parameters, you likely now have a function that will predict the value you are looking for based on the inputs with a reasonable amount of error.
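Running from several starting points and keeping the best result can be sketched like this. The loss function here is a made-up one-variable curve with two dips, so some starting points settle into the shallower local minimum:

```python
# Sketch: restart gradient descent from several points, keep the best.
# The toy loss f(x) = x^4 - 2x^2 + 0.5x has two minima; only one is global.
def f(x):
    return x ** 4 - 2 * x ** 2 + 0.5 * x

def f_prime(x):
    # derivative of f, used as the slope for gradient descent
    return 4 * x ** 3 - 4 * x + 0.5

def descend(start, learning_rate=0.02, steps=500):
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

# run from several starting points and keep the lowest result
candidates = [descend(s) for s in (-2.0, -0.5, 0.5, 2.0)]
best = min(candidates, key=f)
print(best)  # the global minimum, near x = -1.06
```

Starting on the right side of the curve lands in the shallow local dip; only the runs that start far enough left find the true global minimum.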

You can’t be 100% sure, however, based on your known data. Sometimes you never really find a great minimum, and the fit against the training data isn’t very good (the error rate is still high). This is often called underfitting or high bias. Chances are you need different features, more training data, or even a better model with more complex computations. A different problem is when the model fits the training data well but still has high error when used with data outside the training set. This is called overfitting or high variance. You might need fewer features or a simpler model, because your model is just not general enough to make good predictions.

[Image: high bias vs. high variance]

There is a method called regularization that helps prevent overfitting by adding an extra penalty to the cost function so that the model doesn’t just follow the training data values strictly.
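A minimal sketch of this idea (L2 regularization, with made-up numbers) just adds a penalty on the size of the parameters to the usual squared-error cost; the lam value controls how strongly large parameters are punished:

```python
# Sketch of a regularized cost function (L2 / "ridge" style).
# lam is the regularization strength; predictions/actuals are made up.
def regularized_cost(thetas, predictions, actuals, lam):
    error = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    # penalize large parameter values (the constant/bias term is
    # usually left out of this sum)
    penalty = lam * sum(t ** 2 for t in thetas)
    return error + penalty

predictions = [1.0, 2.0, 3.0]
actuals = [1.1, 1.9, 3.2]
print(regularized_cost([4.0, 0.5], predictions, actuals, lam=0.1))
```

With lam = 0 this is just the ordinary cost; a larger lam drags the optimal parameters toward smaller values, which keeps the model from bending itself around every training point.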

The real work of machine learning is to try different configurations of a model so that you end up with a nice fit that has medium bias and medium variance and gives you reasonable error for predicting outputs based on inputs. There are a number of great tools out there, and in my next post I’ll talk about TensorFlow, one of the best frameworks for developing sophisticated machine learning applications that is also easy to use.

Container Architecture

I am transitioning my role in my organization. This isn’t exactly a surprise, since I have always seemed to do more than be just a software developer. Almost from the first job I had out of college I was poking around, learning new technologies, continually trying to find the best way to do things. Along the way I picked up skills from a number of people around me, and I have become a utility player on my teams, especially for figuring out hard problems. I am not trying to boast; there are many folks around me that excel in other areas. I just happen to be particularly talented at figuring stuff out.

Lately I have ventured far off the path of being a development lead and into the world of architecture. This is a different world from software development, sometimes populated by people who are not as technical. The reason is pretty clear: architecture is not as much about the little details that require a depth of understanding as it is about the big picture and understanding how to bring together large systems to process volumes of data and make it perform.

Often this requires knowledge of a number of different areas, from databases to UI to messaging to web services. I have worked in all these areas and I have a wide range of knowledge that I bring to the table. Now I can act as a guide to help developers across different teams pick the right components to build their application.

My first task in this new role will be to help architect a container solution. My work with OpenShift goes a long way to helping drive this home, but there are a wide range of requirements that need to be addressed as part of this solution. Some of these requirements are technical, some of them are regulatory, and some are just for the people that will support and run the platform. It is a lot of fun to be a part of building something that will not only have an impact on our end customers (faster delivery means that we can address potential problems faster and minimize downtime), but will also empower development teams to be more effective and spend less time getting new projects up and running. This blog will feature a lot of information about how we are going to make that happen. Hopefully someone gains some insight from it to help their development process to be more effective.