Practical Futures is a series where members of the development team at Uncorked Studios bring realistic expectations to experiments with emerging technology. First up in the series is our recent foray into deep learning and neural nets.
Hi! I’m Evans, the CTO here at Uncorked.
We design products, and as such our job, collectively, is to be extremely smart and thorough for clients who need to accelerate initial product thinking, leap forward in tech capability, or simply create and ship good product. My job as an executive boils down to spreadsheets, talking at conferences, and guiding the team. I’m a computer scientist and have been an adjunct professor of mobile development, so spreadsheets and hotel breakfasts don’t always fulfill me. The third part (guiding the team) mostly takes care of itself, because my team rules. By and large, I’m hyper-cynical (or is it suspicious?) when it comes to new technologies, and I have a strong academic bent. It’s better to fail for good reasons than to fail because you heard that failure was cool on HN and you think you can re-implement UIButton in a week. This means I’m rarely excited about new technologies. Sometimes it feels like they’re mostly different ways of writing something that isn’t C, but requires a lot of C under the covers to work properly.
What I am excited about (especially in this world of invisi-bubble, bro-tech-tainted, cash-out-fast technology) is Deep Learning. There are lots of flavors and frameworks a person can use to explore this area. I use Tensorflow mainly. But the point of these articles isn’t to debate which of those is best.
(I promise that you’re only about to read “Tensorflow” and “MNIST” a couple of times here).
I started off writing some Convolutional Neural Nets, which are the core technology behind most modern image recognition, as well as things like Prisma, an app you’ve probably seen that makes selfies look like Degas paintings. Convolutional Neural Nets are one of the building blocks of larger, more complex implementations, but rarely an end in themselves. Not to shortchange Recurrent Neural Networks with Long Short-Term Memory, but that’s a topic for another post. What makes all this so exciting to me is that for the first time in my career I’m not only interested in the output of my work, I’m also fascinated by the input.
One reason that I am so fascinated by inputs is that the single most human and fallible part of Machine Learning is the way that we teach an algorithm. The code, the computer, the vectors and matrices that make these inferences and decisions learn the biases consciously or subconsciously taught to them by the humans who decide what they are exposed to while training. The work of Blaise Agüera y Arcas on physiognomy and the ethics of Deep Learning as well as the emerging work from Google’s PAIR initiative on human/AI interaction has had great influence on my approach to learning this technology, and I am driven to do this work with open eyes. If you haven’t already read or seen any of the great thinking on the ethics of machine learning being tossed your way by a concerned news algorithm, I’ve added a few links at the end.
Here’s where this gets really crucial: algorithms are not just horrid approximations of horrid people. Over tens of thousands of examples of subconsciously curated training data, they can be nudged and mutated into carrying new, potentially even more horrid biases that (one hopes) were injected unintentionally.
So, let’s start with training data. What’s training data? When you hear that there’s a new algorithm to find faces in pictures, or one that reads your license plate number when you speed, or one that recognizes what brand of pants you’re wearing while walking around the mall — they all work by “showing” the algorithm thousands and thousands of examples of faces, license plates, and pants. A bunch of (very, very) cool math then works to more or less learn the other very cool math needed to pick those out of a picture.
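Concretely, “showing” an algorithm examples means feeding it input/label pairs. Here’s a miniature, entirely made-up version (tiny pixel lists standing in for real images):

```python
# Training data is just inputs paired with the answers we want learned.
# The pixel values and labels here are invented for illustration.
training_data = [
    ([0.9, 0.8, 0.1, 0.0], "face"),
    ([0.1, 0.0, 0.9, 0.9], "license_plate"),
    ([0.5, 0.5, 0.4, 0.6], "pants"),
]

# A training loop consumes thousands of these pairs; real datasets just
# have much bigger images and far more rows.
inputs = [pixels for pixels, label in training_data]
labels = [label for pixels, label in training_data]
print(len(inputs), labels)
```

That’s the whole idea: the labels are the human half of the bargain, and everything the math learns flows from them.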
A critically important feature of training data is that it’s hard to come by if you want to work with images and you’re new to this, aren’t a giant corporation, or don’t have hundreds of student volunteers at your disposal. (I’ll get to this in a future post, but convolutional neural nets do their coolest shit with images.)
Let’s say you’re Facebook and billions of people have been telling you (for years) “Here’s a picture, and here’s who is in it, and here’s exactly their face.” Or you’re Google and every time someone wants to sign up for an account anywhere on the internet, they get shown a little picture and they have to tell you what letters and numbers are in it. Then you’ve got datasets on datasets about datasets, because we’ve all been building GIANT training datasets for years. It’s pretty cool (if a little frightening), but for now — it’s hard to come up with enough image data to train with as a mere mortal.
Luckily there are a couple of interesting, free, well-tested, well-documented datasets out there that you can start with. A very famous one is MNIST, which contains thousands of examples of hand-written numerical digits. Street View House Numbers (SVHN) is one of my personal favorites: an enormous dataset of house numbers taken from the photos in Google Street View. Earlier, I had done some work using SVHN to train a convolutional neural network to classify and identify digits in images taken in the wild (think address plaques on buildings). This turned out to be way more understandable than the old OpenCV method, so I was hooked.
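Whichever dataset you pick, the images typically get flattened and rescaled before they go anywhere near a network. Here’s a minimal sketch with random numpy arrays standing in for real SVHN/MNIST loading (the shapes and values are invented for illustration):

```python
import numpy as np

# Stand-in for real data: 100 fake 28x28 grayscale images, pixels 0-255.
images = np.random.randint(0, 256, size=(100, 28, 28)).astype(np.float32)

# Flatten each image into a 784-long vector, and rescale pixels to
# [-1, 1] -- the range a generator with a tanh output layer produces,
# so real and generated images are directly comparable.
flat = images.reshape(len(images), -1)
scaled = flat / 127.5 - 1.0

print(scaled.shape)   # one row per image, one column per pixel
```

The rescaling step matters more than it looks: if the real images and the generated images live in different numeric ranges, the Discriminator’s job becomes trivially easy and nothing interesting gets learned.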
Following a tutorial by Siraj Raval (I recommend watching some of these for any skill level) to generate images using a rudimentary GAN and MNIST, I decided to go a little further and worked SVHN into the framework. The results were fascinating. You can watch the algorithm learn what the house numbers look like over time. Here are the first few iterations:
Here’s what they look like at the end of training:
So this was pretty neat! This little GAN could create largely realistic SVHN data by just looking at thousands and thousands of SVHN images. If you look closely or watch the loop repeatedly, what you’re seeing is totally random pixel data quickly beginning to look like house number plaques. The algorithm is comparing random pixels to pictures of real numbers, and figuring out what to modify to make the random pixels look more real. For the purposes of the following experiments, I didn’t tell this GAN *what* it was looking at, so these are sort of *imagined* by the algorithms.
Now: WTF is a GAN? Try to think about it this way: there’s one algorithm generating fake images mathematically, starting with just completely random pixels. This is the Generator. Then there’s an algorithm loading real images from SVHN. Both algorithms take their images to a third algorithm called the Discriminator, which is looking at the images and trying to figure out which of them is real and which is fake. The Generator then takes those results and modifies its math to try to fool the Discriminator, while the Discriminator modifies its math to try to not get fooled, and the whole process starts over again with everyone being slightly better (or sometimes worse, but let’s be patient) at their own piece of this process. The Generator and The Discriminator in a GAN are always fighting each other, and each gets better or worse all the time. This research area is very new and methods to optimize and improve these are being theorized, discovered, tested, accepted, and improved, daily.
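To make that loop concrete, here’s a deliberately tiny sketch of the same adversarial dance in one dimension, with hand-rolled math instead of a framework. The “real data” is just numbers near 3.0, and the Generator learns to forge them. Everything here (the toy models, learning rate, step count) is invented for illustration, not how you’d build an image GAN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generator: G(z) = a*z + b, starts out producing noise around 0.
a, b = 1.0, 0.0
# Discriminator: D(x) = sigmoid(w*x + c), "probability that x is real".
w, c = 0.1, 0.0

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(3.0, 0.5, batch)   # samples of the real data
    z = rng.normal(0.0, 1.0, batch)      # random noise
    fake = a * z + b                     # the Generator's forgeries

    # Discriminator update: push D(real) toward 1, D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w -= lr * np.mean(-(1 - d_real) * real + d_fake * fake)
    c -= lr * np.mean(-(1 - d_real) + d_fake)

    # Generator update: modify its math to make D(fake) go up.
    d_fake = sigmoid(w * fake + c)
    a -= lr * np.mean(-(1 - d_fake) * w * z)
    b -= lr * np.mean(-(1 - d_fake) * w)

# After training, the Generator's output should have drifted toward 3.0.
final_mean = float(np.mean(a * rng.normal(0.0, 1.0, 1000) + b))
print(round(final_mean, 2))
```

Swap “numbers near 3.0” for “32x32 pictures of house numbers” and the linear models for convolutional networks, and that’s the shape of the SVHN experiment above.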
I liked how SVHN turned out. It looks pretty realistic, so I thought we might try it with the Alphabet. Enter Chars74k.
Fear and Loathing in the Input Layer
Chars74k is a very visually interesting alphanumeric image dataset. It includes several thousand images of letters and numbers found in the wild, as well as a big dataset of synthesized alphanumeric images. So I got to work wiring up this data, and the results were unexpected.
Once the data had been parsed (I only used the directory called “GoodImg” because why would I use “BadImg”?) I ran the images through (roughly) the same GAN as the SVHN GAN above. And while the SVHN GAN had tens of thousands of images to learn from, the Chars74k dataset clocks in at under 10,000 images. The effect of this is obvious:
You can see in the above animation that the dataset (despite being GoodImg) doesn’t seem to be able to make up its mind about where it’s going. It does a pretty good job of looking visually interesting, but overall it’s just stabbing in the dark at what it’s trying to look like. The above is over 500 epochs, which was disappointing: the SVHN output was done in about 25 epochs, meaning all of the real images had been compared against the generated images only 25 times. I kept pushing the number of epochs up, hoping something would converge, since more comparison and more modification should mean better results overall. Then I decided to add the synthetic dataset, which isn’t made up of visually interesting pictures from the wild, but of black-and-white images of single characters, like a font sample. Chars74k contains a large synthetic dataset, so it was straightforward to integrate those images into my total training set, and my hope was that this would make the training set large enough to be interesting. Here’s the result of that experiment:
I think visually this is pretty neat, but not nearly as realistic or pleasing as SVHN. On the other hand, this video looks really cool playing in a loop on a big screen. You can see the synthetic alphanumerics (the black “text” on white) trying to turn themselves into letters or numbers, but in most cases just looking like both.
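For reference, an “epoch” in these experiments is just one full shuffled pass over the training set. A minimal sketch of the batching loop (the sizes are invented for illustration):

```python
import numpy as np

dataset = np.arange(1000)   # stand-in for 1,000 training images
batch_size = 100
epochs = 25

batches_seen = 0
for epoch in range(epochs):
    shuffled = np.random.permutation(dataset)   # new order every pass
    for i in range(0, len(shuffled), batch_size):
        batch = shuffled[i:i + batch_size]
        # ...one Generator/Discriminator update per batch goes here...
        batches_seen += 1

print(batches_seen)   # 25 epochs x 10 batches = 250
```

So “500 epochs” means the Generator got 20x more looks at the real data than the SVHN run did, and still couldn’t settle; that’s what a too-small dataset looks like from the training side.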
So I wondered: how would the synthetic dataset (the font set by itself, without the real-world photos) do in this exact same GAN framework? I ran it:
Once again, you can see the system struggling to converge. This surprised me, because the images are all fairly uniform (black on white) and there’s not a lot of the unexpected variation you can see in the “in the wild” alphanumerics. It still looks pretty cool.
So why did I bother doing this? There are a couple of reasons:
Deep Learning is still in its infancy, but far beyond “can this be done.” When estimating and writing an app or a web framework or a high-volume-real-time system architecture, the problem space is fairly finite. We’re in a new era of “we’ll need to experiment and see how good it is,” which is somewhat like growing a crop. You can concept a goal, gather the necessary bits, and work very hard, but you still can’t fully know at the outset exactly what the outcome will be. You will revise.
Datasets are everything. The algorithms in this space seem amazing and interesting and make it seem like there are limitless ways to approach this kind of work. This is true, but for the most part a large, well-crafted dataset is essential to these systems. Creating and curating datasets is going to be a major part of the tech ecosystem in the future, and if you’re not already noticing the time you spend generating training data, you will soon.
I am tired of technology. I am tired of what has become of “tech.” The industry picked up the same cultural bullshit as every other industry and has become a festering cesspool of race bias, misogyny, and anger.* These bad traits end up woven into business models until no one dares approach them for fear of tanking an IPO. Data Science may be the area where we see the hard questions about bias (race, gender, ethnicity, etc.) get explored and dragged into the sunlight. The more people know about dataset bias, the more they’ll take a hard look at the inherent challenges.
The dataset I used here isn’t itself a good example of what I’d call “meaningful bias” in an ethics sense, but it does demonstrate that the data used to train these networks has a very real and significant impact on output quality. If you train with a skewed dataset, you will get skewed results.
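Here’s a hypothetical, stripped-down version of “skewed in, skewed out”: train the laziest possible model on an imbalanced dataset, and what it learns is the skew, not the task. The classes and counts are made up for illustration:

```python
from collections import Counter

# A skewed training set: 95 examples of one class, 5 of the other.
train_labels = ["cat"] * 95 + ["dog"] * 5

# The laziest possible "model": always predict whatever label it saw
# most often during training.
model_prediction = Counter(train_labels).most_common(1)[0][0]

# On a balanced real-world test set, the skew becomes the error.
test_labels = ["cat"] * 50 + ["dog"] * 50
accuracy = sum(model_prediction == y for y in test_labels) / len(test_labels)
print(model_prediction, accuracy)   # cat 0.5
```

Real networks fail in subtler ways than a majority vote, but the mechanism is the same: the model faithfully reproduces whatever imbalance its curators handed it.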
I’ll be writing more on this topic in the coming weeks. There won’t be much code. Programmers are only a small chunk of this story, and lord knows there are better places to get code snippets. Get in touch if any of my methods are overly puzzling.
And when it comes to the rise of Machine Learning, as my amazing coworker Lynsey always says: “Be troubled, but be excited.”
Some essential reading:
*Standard “no, of course not everyone” disclaimer… but come on.