Moving on to Machine Learning

It’s been a while since I’ve written, but it isn’t for lack of things to talk about. Really, it has just been about finding the time. As a parent, my first priority is my family and my children. At their age, they require a lot of attention, and honestly I think it is important that they get it, because if you don’t give your kids attention, they will find it in other places which you might not approve of.

Anyhow, my day job has certainly kept me busy. For a long time, my focus was largely on Docker and containers. I’m still considered one of the resident subject matter experts, but I moved into a new organization that is more focused on data. Hadoop is going to be a big part of what I do in the next few months, but in the meantime I’m helping teams modernize their infrastructure, development practices, and frameworks. I’ve become the product owner for a platform that will incorporate rules, machine learning, and lots of workflow management.

Machine Learning has become my new passion, and frankly containers just don’t excite me anymore. It isn’t that containers don’t have value (in fact, there is definitely some possibility that I’ll be returning to them in the future as a method of distributing computing tasks); it’s just that there isn’t as much to learn there. When I see some new framework around containers, I read about it, but I’ve pretty much absorbed it all within a few minutes. Machine Learning is different. It’s easy to understand on the surface, but when you dive down into the details, things get quite complicated.

Basic machine learning is easy to grasp. One good example is predicting the price of a house based on a few different criteria such as the number of rooms, floorspace, location, and so on. You could imagine a simple rules-based approach built from a table:

[Table: example rules mapping number of rooms, floorspace, and location to a price]

This approach certainly can work, but you have to manually adjust the rules as trends change, and it doesn’t quite capture the real correlation between the features (what machine learning folks like to call the inputs) and the resulting price. The prices could actually be the result of a complex combination of the features, something like this:

price = 34 * floorspace + 2700 * numRooms + 818437 * locAvgMonthlyIncome + 7364

I’m just making this up, clearly, but the point is that how certain factors contribute to a price is more than likely something non-trivial. Part of the work of machine learning is to pick a model that can approximate these relationships. Simple options are things like linear regression, and more complex options are things like neural networks. The more complex the option, the more computing power is required to “train” the model.

The example I provided above is an example of linear regression. Machine Learning folks would write it this way:

y = Theta1 * x1 + Theta2 * x2 + Theta3 * x3 + Theta4

The variables x1, x2, and x3 are the features. Theta1, Theta2, Theta3, and Theta4 are parameters that need to be adjusted to produce a value that is approximately right for the given inputs. This is what it means to train a model. So how do you adjust the values? Essentially you build another function that estimates how far off your predictions are. The simple method is to take the difference between the guess and the real value and square it:

# predictions holds the model's guesses, actual_prices the real values
error = 0
for y, actual_price in zip(predictions, actual_prices):
    error = error + (y - actual_price) ** 2

This gives you a measure of how far off you are. This is known as the cost or loss function.
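To make that concrete, here is a minimal sketch in Python. Every number and name in it is invented for illustration (just like the made-up formula above); the only point is to show the hypothesis function and the cost side by side.

# Hypothetical training data: (floorspace, numRooms, locAvgMonthlyIncome) -> price.
# These numbers are made up purely for illustration.
training_data = [
    ((1200, 3, 4500), 310000),
    ((1800, 4, 5200), 425000),
    ((900, 2, 3800), 240000),
]

def predict(features, thetas):
    # y = Theta1*x1 + Theta2*x2 + Theta3*x3 + Theta4 (the last theta is the constant term)
    x1, x2, x3 = features
    t1, t2, t3, t4 = thetas
    return t1 * x1 + t2 * x2 + t3 * x3 + t4

def cost(thetas, data):
    # Average of the squared differences between the guess and the actual price.
    error = sum((predict(features, thetas) - actual_price) ** 2
                for features, actual_price in data)
    return error / (2 * len(data))

Dividing by the number of examples just keeps the error on a manageable scale; the plain running sum above works the same way.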

The next thing you do is figure out how to adjust the parameters to reduce the error. One common method is called gradient descent. If you plot the cost against the parameter values, you get a surface like this, and the partial derivatives of the cost function tell you the slope at any point:

[Figure: gradient descent stepping down a bowl-shaped cost surface toward its minimum]

The easiest way of thinking of this is to imagine that you are on a mountain top trying to find your way down. You look all around you to find the steepest slope down, and then you take a step. You again look around for the steepest slope down and take a step. As you gradually get closer to the bottom, the slope gets less and less until it hits a point where it starts sloping up. When you hit this point, you have essentially hit a minimum and further adjustment will only increase your error, not decrease it. This is the goal – to find the minimal amount of error given the model.

When you start looking for that minimum error, you have to start somewhere, so you generally pick some values for Theta1, Theta2, etc, which will place you somewhere in the graph. There is some chance that you might hit a local minimum which isn’t the global minimum, so sometimes you have to run the exercise a few times to see if you have hit the real global minimum. Once you determine the optimal values of the parameters, you likely now have a function that will predict the value you are looking for based on the inputs with a reasonable amount of error.
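Building on the hypothetical sketch above, one step of gradient descent for this linear model might look roughly like this. The learning rate and starting thetas are arbitrary, and a real implementation would normally scale the features first so the steps behave well:

def gradient_step(thetas, data, learning_rate):
    # Nudge each theta against the slope of the cost at the current point.
    m = len(data)
    gradients = [0.0] * len(thetas)
    for features, actual_price in data:
        diff = predict(features, thetas) - actual_price
        inputs = list(features) + [1.0]          # the trailing 1 pairs with the constant theta
        for j, x in enumerate(inputs):
            gradients[j] += diff * x / m
    return [t - learning_rate * g for t, g in zip(thetas, gradients)]

thetas = [0.0, 0.0, 0.0, 0.0]                    # arbitrary starting point
for _ in range(1000):                            # keep stepping while the cost keeps dropping
    thetas = gradient_step(thetas, training_data, learning_rate=1e-8)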

You can’t be 100% sure, however, based on your known data. Sometimes you never really find a great minimum, and the fit against the training data isn’t very good (the error rate is still high). This is often called underfitting or high bias. Chances are that you need different or additional features, or a more flexible model with more complex computations. A different problem is when the model fits the training data well but still has high error when used with data not in the training set. This is called overfitting or high variance. You might need fewer features, more training data, or a simpler model, because your model is just not general enough to make good predictions.

[Figure: underfitting (high bias) versus a good fit versus overfitting (high variance)]
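A rough way to tell which situation you are in is to hold some of your data back from training and compare the error on the data you trained with against the error on the held-back data. A minimal sketch of the idea, reusing the hypothetical cost function and data from earlier:

split = int(len(training_data) * 0.8)
train_set = training_data[:split]                 # fit the thetas on this part only...
validation_set = training_data[split:]            # ...and keep this part out of training

train_error = cost(thetas, train_set)
validation_error = cost(thetas, validation_set)

# High error on both sets points to underfitting (high bias).
# Low training error but much higher validation error points to overfitting (high variance).
print(f"train error: {train_error:.2f}  validation error: {validation_error:.2f}")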

There is a method called regularization that helps prevent overfitting by adding an extra penalty to the cost function for large parameter values, so that the model doesn’t just follow the training data values strictly.
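One common flavor is L2 regularization. Sketched on top of the hypothetical cost function from earlier, it might look like this, where lambda_ is a knob you tune (by convention the constant theta is usually left out of the penalty):

def regularized_cost(thetas, data, lambda_):
    # The original cost plus a penalty that grows as the thetas get large.
    penalty = sum(t ** 2 for t in thetas[:-1])    # skip the constant theta
    return cost(thetas, data) + lambda_ * penalty / (2 * len(data))

The bigger lambda_ is, the more the model is pushed toward small parameter values instead of chasing every point in the training data.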

The real work of machine learning is to try different configurations of a model so that you end up with a fit that balances bias and variance and gives you reasonable error when predicting outputs from inputs. There are a number of great tools out there, and in my next post I’ll talk about TensorFlow, one of the best frameworks for developing sophisticated machine learning applications, and one that is also easy to use.

Container Architecture

I am transitioning my role in my organization. This isn’t exactly a surprise, since I have always seemed to do more than be just a software developer. Almost from the first job I had out of college I was poking around, learning new technologies, continually trying to find the best way to do things. Along the way I picked up skills from a number of people around me, and I have become a utility player on my teams, especially for figuring out hard problems. I am not trying to boast; there are many folks around me that excel in other areas. I just happen to be particularly talented at figuring stuff out.

Lately I have ventured far off the path of being a development lead and into the world of architecture. This is a different world from software development, sometimes populated by people who are not as technical. The reason is pretty clear: architecture is not as much about the little details that require a depth of understanding as it is about the big picture, and about knowing how to bring together large systems to process volumes of data and make them perform.

Often this requires knowledge of a number of different areas, from databases to UI to messaging to web services. I have worked in all these areas and I have a wide range of knowledge that I bring to the table. Now I can act as a guide to help developers across different teams pick the right components to build their application.

My first task in this new role will be to help architect a container solution. My work with OpenShift goes a long way toward driving this, but there are a wide range of requirements that need to be addressed as part of this solution. Some of these requirements are technical, some of them are regulatory, and some come from the people who will support and run the platform. It is a lot of fun to be a part of building something that will not only have an impact on our end customers (faster delivery means that we can address potential problems faster and minimize downtime), but will also empower development teams to be more effective and spend less time getting new projects up and running. This blog will feature a lot of information about how we are going to make that happen. Hopefully someone gains some insight from it that helps make their own development process more effective.

The Next Big Thing

So a long time ago Bill Gates wrote a book called “The Road Ahead”. He was an intelligent and insightful guy and had a lot of good predictions about where technology was going. It was kind of easy for him though, considering that his job was driving a billion dollar corporation in the direction that he thought would best keep his company profitable. That company is still around and is still a leader in some areas, but in a lot of ways it has fallen behind and is no longer driving the vision of where we are going.

Sure, they have neat technologies that come out from time to time, but it is hardly innovative (whether they sell a lot of something is a different question from whether it is innovative). There are a number of things that you could point your finger at, but in large part I blame the leadership. They have made some moves to address that, but Microsoft had to roll out three versions of the Surface tablet and practically give them away to sports teams and broadcasters so that enough people saw the product before it actually became a hit. What is so innovative about it? Not a lot. It is still an expensive tablet that only becomes a laptop when you buy the keyboard at extra cost. There had already been laptops that also worked as tablets, so this was hardly a great leap.

Apple is certainly the company that over the past 30 years has been the standard of innovation. They took a lot of risks and some didn’t pay off at all, but the ones that did really changed the face of technology in our world. The iPod changed everything about the way that we buy and listen to music. The iPhone changed everything about how people communicate over the phone. They haven’t innovated as much since Steve Jobs died, but I am not sure if that is because his direction is gone, or because the company is playing it safer now that it has such a large customer base, or simply because it is harder to be innovative when everyone else has started doing things the same way as you.

One other problem is that it is becoming harder to know which products are going to succeed or fail. When there were fewer tech companies, there was less selection. Customers were also less informed and less knowledgeable, and were more susceptible to whatever marketing had the most visibility. The need for computing devices was a lot smaller, and you didn’t go out and buy one unless you had a specific need. Times have changed: people are now well aware of technology and the different offerings, and marketing isn’t enough to make a product successful even if it is a well-designed product. Personal taste, word of mouth, and first impressions can have as much influence, if not more.

So what does all of this have to do with writing software? Just as the product offerings in technology have become vast and plentiful, so have the options for ways to develop those products. In the beginning you wrote software using the tools that came with the computer, which usually were the same tools used by the developers who wrote the software for the computer. As time went on other tools began to appear as well as other languages, a trend that shows no signs of stopping.

But the capabilities of those languages and tools have converged. There are advantages and disadvantages, but it is becoming rarer to find a language that lacks the capabilities offered by another (you may end up writing more code, but generally you can port code around and not lose functionality). What really differentiates those tools and languages now is the effort that it takes to build something in them. Old-school developers wrote code in bare-bones editors that offered no syntax checking or content assist features. Modern IDEs offer compile checking, completions, and quick templates that enable swift development. Languages that offer built-in constructs for tasks that are typically coded by hand also offer huge time savings.

As development of software became more and more advanced, a lot of issues came up around keeping code simple and flexible when scaling up from developer laptops to full production servers. Resources are limited and so developers often write and test code on machines that are vastly different from the production hardware. The topology may be different and so the software has to be aware of a number of different possible configurations. Problems that were seen in production can’t be reproduced on developer machines and so precious time is wasted trying to locate the cause.

In the 1990s we began to see the rise of virtual machines, which allowed users to run programs built for one operating system on a different operating system. Most of us used this to run Windows on a Macintosh or to run Linux on a PC. Over time these virtual machines became streamlined to the point where companies could easily allocate multiple virtual machines to run their production software on a single physical machine. Technologies emerged that allowed for creating virtual machines from a template, which streamlined the process of setting up environments in which to run software.

Virtual machines have a problem though. They run programs on top of an OS inside the virtual machine, which itself is running on top of another OS. This redundancy is expensive and wasteful. Eventually a more specialized kind of virtualization was developed that removes the guest operating system entirely and instead makes calls directly to the underlying operating system. This is the model that technologies like Docker use for building containers. Docker takes this a step further by letting the software community publish the templates for Docker containers so that you don’t have to start from scratch. Building a web application with a database now becomes as simple as picking the right template and then adding content. Because the application lives in this kind of virtual environment, it is completely portable: it is separated from the real physical hardware it is running on.

Docker, though, is just about defining containers. There is a lot of orchestration that goes on around the containers: building software, testing it, deploying it, starting and stopping it, monitoring it, and diagnosing problems. Companies like Google and Red Hat have addressed these gaps with technologies like Kubernetes and OpenShift. These technologies leverage containers to build service platforms that are turnkey and allow for rapid development that can scale quickly. Scaling up web servers can be as simple as adding more containers. There are still challenges such as data replication and synchronization, but because developers mostly care about what is inside a container and devops mostly cares about what is outside, containers have created an elegant harmony between them. They become less reliant on each other and are not stuck waiting for tasks to be done.

Terms have emerged for this kind of thing, namely IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service). The difference is mostly in the type of resource. IaaS: I want a Linux virtual machine, everything else is on me. PaaS: I want a MySQL database, I am responsible for populating it with data. SaaS: I want a WordPress application. Sometimes you need to combine these, but as software developers, we mostly desire IaaS or PaaS alone, depending on whether we want to install everything or just start out with a good foundation.

Companies like Amazon and Microsoft have built IaaS and PaaS offerings that allow developers to jumpstart projects in a matter of minutes and continue to grow all the way up to full scale enterprise, paying only for the resources that are used. These offerings have been around for many years but they are still in their infancy and a significant portion of the developer community has never worked with them or containers in general. But companies are continuing to look for ways to streamline the development process and containers are starting to become the next big thing.

I am happy to be at the center of that.

Ansible

So containers are just part of the solution that we are building. There is orchestration as well. That is a pretty broad term, though. Different teams manage their applications, workflow, and deployments differently. There is commonality, however, and although there are lots of different terms for it, it is basically like the way a conductor directs an orchestra. Each player has their own duties and part to play, and the conductor doesn’t dictate every note but rather guides the whole. In a large organization there are lots of different parts and pieces that must work in concert. As technology evolves, more and more human processes are being replaced with automation.

One tool we are looking into heavily is called Ansible. This tool is largely about automating the process of installing software onto a machine, starting and stopping processes on that machine, and monitoring their operation. I am not claiming to be an Ansible expert or even experienced with it, so by all means please correct me if I write something incorrect.

This kind of tool is incredibly powerful. I was fortunate enough to have worked with some really talented engineers at Tellme Networks, a platform for writing voice-driven applications, and when the company began to have huge commercial success, we realized that we needed a solution for getting infrastructure up and running faster. Some of this was because there were a lot of different types of servers involved, each with their own software packages and configurations (there were telephony servers for answering calls, reco servers for doing voice recognition, and admin servers for collecting data), but also because we were hosting a live platform that needed to have incredible uptime, and we needed to be able to address issues quickly in order to not disrupt service.

We were ahead of our time in this, and there were not a lot of solutions that could really meet our needs. We were running on Solaris servers at the time (does anyone still use Solaris or HP-UX anymore?) and one of the tools that came with it was called stow. It was a simple tool that allowed you to install a software package by creating softlinks in the places where binaries are usually stored, pointing to directories where the actual files were located. By doing this, you could install and uninstall software in seconds. There was a little more to it (starting up daemon processes, for example), but this was the general idea.

The next piece of the puzzle was what we called the gold.conf. We called it this because it detailed the packages that needed to be installed on a server, so it was the golden master blueprint for the server. We ran a simple Apache server, included as part of the OS image, that could be “blasted” onto the server as a baseline. This server had some mod_perl scripts that exposed a REST API that could be used to set the configuration for that server, push packages to the server, and start and stop services on the server. There was a central repository where the packages were stored, so servers only needed to pull the bits from the repository after startup to initialize themselves. Upgrading a server only required updating the gold.conf so that it would uninstall the old packages, install the new ones, and then restart services.

This was an incredibly effective platform, almost to the point that we could have turned it into a product offering had we wanted to, but we were not in the packaged software business; we were a solution provider, and our teams were organized around that.

Tellme was acquired by Microsoft in 2007, and the platform continued operating. Eventually it was spun off into 24/7, which continues to host the services. I imagine the platform has changed and they may no longer be using gold.conf files, but there is certainly no reason why they could not still be.

I hope that Ansible can be a similar type of solution, where you can basically take a box off the rack, blast some bits onto it, and be up and running in a few minutes.

What Keeps Me In Business

Despite what you may have read or heard, writing good software is not easy. It might not be go-to-MIT hard, but it is pretty hard. How do I know? I haven’t met a whole lot of people who do it as well as I do.

Not that I am the best; there are people out there that are much more talented than me. But for every really good software designer I meet, there are at least 20 or 30 who are much more focused on just getting the job done or earning a paycheck or just don’t really believe that good design is that necessary.

I am not trying to insult anyone. It just has become the norm to drive software out the door faster and faster with less concern about the quality of what is being delivered. We talk about testing and quality and things like that, but when we gather in our agile scrums, we mostly talk about what we did yesterday, what we will do today, and what is left to be done. I rarely hear anything about how someone needs to go back and improve the code. We don’t generally include writing test cases as a separate task; it is just implied that we are going to do it as part of our development.

What we do talk about a lot is defects. A lot. Hardly a week goes by that we aren’t having to fix defects. And these aren’t just defects that are found in testing or QA or UAT. These are defects that were found in production, when our software had bugs that caused real users to have a diminished user experience. Sometimes they can’t get done what they need to and have to wait on us to fix the bug. Mostly it ends up being bugs that users can work around, but it annoys us to no end to be dealing with problems that should have been uncovered by testing or better software design.

The software industry is in a better place in general, because we are getting software out the door faster and we are not throwing away months of our lives only to find that the client no longer wants anything we wrote (anyone who has been on a waterfall project that basically was trashed understands this; one of the biggest wins of agile is that it allows us to fail faster when we are going in the wrong direction). We are able to address problems in a more timely manner, and we are able to show progress even when a feature is far from complete. We aren’t where we want to be, though.

The problem, in my opinion, comes down to money. When I started working and the dot-com boom was in full swing, companies could afford to take their time to build products because investment dollars were easier to come by and there was less pressure to show concrete deliverables (not to mention that software was a more exclusive field and teams were given a lot more flexibility). When the bubble burst, those dollars dried up and investors became a lot more demanding about seeing the fruits of the labor they were investing in. The economy fell into recession and companies became more and more focused on the bottom line. The processes around software development came under heavy scrutiny, and pressure increased to deliver things faster with fewer people (or cheaper people).

Offshoring software development began to rise as companies all around the globe recognized the potential of consulting firms to build software where salaries were much lower. As companies began to see savings, they continued to push for even more cost savings. Ultimately, cost savings turn into higher profits.

In a lot of industries, you can get away with finding cheaper ways to produce the same products. Companies like Nike have been producing products overseas for decades because the quality of shoe you get from a factory where the workers only make a handful of dollars a day wasn’t significantly less than when they were making them domestically with workers earning many times that amount. It simply made economic sense to move the manufacturing overseas.

Software can work like that too. There is plenty of talent to go around, and there certainly is enough work that doesn’t require specialized knowledge. The growth of the internet only made it easier to find people to write the software, as what they didn’t know they could find in forums, tutorials, and online books.

The problem is that software isn’t like making shoes or cars or clothes. In that kind of work, someone has come up with a series of steps you follow and you just fall into a routine that leads to a shoe or a car or a dress. Sometimes there are flaws or mistakes, but with shoes or dresses they are easy to catch, and cars go through rigorous testing because safety is key. Software is more complicated. You aren’t making the same thing over and over. Instead, you are problem solving and writing complicated processes to achieve some goal. There are common patterns, but every problem is different and one function might have to handle dozens of possible inputs that must be processed to compute the output. A lot of the defects that come up are because of simply missing a possible input and thus not computing the right output.

It is not commonly known, but it is actually impossible to write a computer program that can evaluate an arbitrary piece of software and determine whether the code is free of bugs. The informal argument goes like this: take the bug-finding program and run it on itself. If it finds a bug, then by its own verdict it is buggy, and we can’t trust a program meant to find bugs in other programs if it has bugs itself. If it doesn’t find a bug, how do you know for certain that it doesn’t have a bug which prevented it from finding the bug in itself? Either way, we are left in a state where we can’t prove that software is bug free.

If we can’t write a computer program to do it, we certainly can’t expect people to be able to do it. That isn’t to say that we can’t get close. There are situations where the inputs are limited and we can demonstrate that for each input the program produces the expected output. However, there can be subtleties even with limited inputs. What if the code takes dates into account when computing the output? That means we can get different outputs even with the same explicit inputs. If you treat a fragment of code as a black box, you can only test it by repeatedly sending in all possible inputs and verifying the outputs. Without being able to validate the internals, there is always the possibility of an unexpected output.
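As an illustration of that date wrinkle, here is a small hypothetical example (the pricing rule and names are invented purely for illustration): making the hidden input explicit is what lets a test pin the behavior down.

from datetime import date

def order_total(amount, today=None):
    # A made-up rule that quietly depends on "today": the same order can
    # produce different totals on different days.
    today = today or date.today()
    if today.month == 12:                          # hypothetical holiday surcharge
        return round(amount * 1.10, 2)
    return amount

# Passing the date in explicitly turns the hidden input into a visible one,
# so tests always see the same answer for the same inputs.
assert order_total(100.0, today=date(2020, 12, 1)) == 110.0
assert order_total(100.0, today=date(2020, 6, 1)) == 100.0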

What I am getting at here is that there is no such thing as perfect code, nor is there such a thing as a perfect programmer. The best developers in the world have bugs in their code. However, the best programmers in the world are the best because they write their code in such a way as to minimize the possible ways that bugs can be introduced into the code. They know that there are some key strategies that lead to code that is easier to maintain, debug, and change when requirements change.

The real cost of trying to get software cheaper is that firms offering software for less tend to ignore these strategies and aim for getting the job done as fast as possible. They leverage the fact that if software is good enough, you can look like you are producing quality fast when in fact you are creating more work down the road with defects. I won’t even go into the fact that they tend to use strategies that don’t lend themselves to changing requirements. These costs are hard to monetize because they are spread out over time. It is also hard to account for work that could be done when you are instead fixing bugs from the previous work.

So what are the right strategies that great developers use? Here are some of mine:

  1. Clarity: you will not be the only person to touch the code. The more clarity you can give to code, the better. This doesn’t mean comments. Code should be self-documenting. If you find you are writing lots of comments, it means that you need to break things up into smaller pieces that are easier to follow.
  2. Naming: it seems easy enough, but naming can make a big difference. We get lazy and don’t want to type out long names for things, but when I see “itemsInBox” it tells me a lot, whereas “items” or even “i” can leave me guessing what is being referred to.
  3. Duplication: I learned early in my career that nothing makes a piece of software get very complicated like duplication. It starts off easy enough because it is easy to copy and paste, but the copies tend to change so that they work differently and when a bug is found in one you have to fix the bug in all the duplicates. Modularize as much as possible so that common code is centralized and easier to maintain.
  4. Don’t settle for good enough: when I write a piece of code, my first goal is to get it working. But I don’t settle for good-enough code. I look for all the possible ways to make it better. Can I make it clearer what the code does? Do I need to reduce duplication? Do I need to name things better? It is like when you write a term paper or a business proposal: you write it and then refine, refine, refine.
  5. Test everything you can: test often. Don’t just test the business modules; make sure to test utility classes as well. Don’t just test happy paths; test unusual inputs, and never say “I don’t need to test that because it will never happen” (a small example of what I mean follows after this list).
  6. Review your code with someone else: I am not talking about a simple code review; explain your code line by line. Explain the choices you made about why you did one thing first and another thing second. I can’t count how many times I found defects in my code by simply walking through it with someone and explaining it. Often I will think of boundary cases I missed or they will ask questions that make me rethink my approach or even help me realize that there is an easier way.
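
As an entirely invented illustration of strategy 5, here is the kind of small utility and the kind of edge-case tests I have in mind; plain asserts are enough to show the idea.

# A hypothetical little utility that is easy to get subtly wrong.
def items_per_box(total_items, box_capacity):
    if box_capacity <= 0:
        raise ValueError("box_capacity must be positive")
    return -(-total_items // box_capacity)        # ceiling division

def test_items_per_box():
    assert items_per_box(10, 5) == 2              # the happy path
    assert items_per_box(11, 5) == 3              # a leftover item forces an extra box
    assert items_per_box(0, 5) == 0               # the "that will never happen" input
    try:
        items_per_box(10, 0)                      # bad input should fail loudly, not silently
        assert False, "expected a ValueError"
    except ValueError:
        pass

test_items_per_box()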

This is not all of the strategies I use. Veteran developers will tell you that they get a feel for when the code is ready, and it is hard to explain that, but I can tell you that I learned my best strategies by watching other developers code and seeing what worked and what didn’t. What keeps me in business is the fact that I continue to evolve my strategy as time goes on, and I write solid, dependable code that helps my teams succeed. You can’t get quality software by just picking whatever is cheapest. You need to make sure whoever is writing it is really focused on the right strategies.

Leading the Transformation

About six months ago, I moved into the position of a senior software architect, specifically to work on technologies like containers. I have been playing a lead role in driving the adoption of Docker and OpenShift into our organization for running applications inside of containers. There is still a lot of work to be done, mostly in the areas of training materials, videos, and the processes by which developers will transition into the world of containers.

There is more to this job than that, however. As much as some see this as just the adoption of a few new technologies or some new methods of developing software, I see it more as a transformation of everything we do. I was recently reading a book called Leading the Transformation. It was largely about transforming your teams into Agile teams, and how to scale that across large organizations.

Humans by nature are creatures of habit. They learn habits in their early years that will affect how they do things throughout their entire lives, and the earlier something is learned, the harder it is for it to be unlearned. This is as much true for how people do things at home as it is for how they do things at work, and most people don’t like to be told that they are doing it the wrong way.

This is what was ironic about Agile: once people saw its potential for improving software processes, they became more open to changing their organizations, despite their usual resistance to change. Books were bought, consultants were brought in, and the process of the transformation began. Some companies found great success with Agile and have never looked back, but other organizations didn’t quite get the lift that they had expected, and some even went back to more traditional methods like Waterfall.

So what happened? Why did Agile succeed for some and fail for others? There are many factors involved. One is the size of your organization. Agile works extremely well for small companies or where the leaders at the top are close to the teams they manage. It doesn’t work so well when your organization has a deep reporting structure or is geographically split up.

Another issue is generally who has to make changes when adopting Agile. If it is just your development teams (scrum teams, daily standups, etc), it might work for you, but chances are that there is going to be some friction from above as a leadership that was used to giving all of the requirements up front is forced to break those requirements up into pieces that can be delivered over time. There could also be some friction as the teams will require more interaction with the leadership and end-users as features are developed and feedback is required.

The point is that your whole organization (or at least each unit within the organization that is dedicated to a product or service) needs to be a part of the transformation, and everyone needs to change how they do things to make it work. I have worked in companies in the past that struggled with the adoption of Agile because they were simply unwilling, or unaware of the need, to make changes outside of the development teams.

I am fortunate to be in a position where I can help to lead this transformation within my own organization. In addition to my work on software tools like OpenShift, I am also writing blogs and documentation that cover many topics including Agile and will guide the business as we transition to more Agile ways of doing things. I look forward to seeing the fruits of this labor in the coming years.