Over the last month I've been working on an ML-powered subreddit called /r/SleevesHidingHands. What started as nothing more than a "what if" idea turned out to be a great learning experience in building a production application backed by a machine learning algorithm.
The idea is to create a subreddit of images posted by a bot account, with every image fitting one very specific niche interest. To be specific: a subreddit that is automatically filled with photos of girls with their sleeves pulled up over their hands. Rather than try to explain it further, I'll have you take a look at the live subreddit at /r/SleevesHidingHands. This is a common pose on sites like Instagram, as can be seen from the many posts on the subreddit so far, which makes it a great candidate both for an image classifier to work on and for a niche-interest subreddit.
Why This Project?
This project came from a humorous conversation I had with a friend. In talking about these very specific niche-interest subreddits, I said that building a niche-interest subreddit community on top of a machine learning algorithm should already be common practice, since in concept it's not hard to do. For those unfamiliar with the idea of niche-interest subreddits, there are many communities out there for strangely specific topics. At the time it surprised me that there wasn't already a subreddit dedicated to photos of this particular pose of girls' hands being covered by their sleeves, because I've seen it so many times over the years.
It didn't take more than a day to draw out a practical approach to making this happen. Each component was identified and defined quickly, from the crawler script retrieving images from the Internet, to the application handling the posting of images (the "bot"), to the classifier that would sort valid photos from noise. Since the idea and the design came about so quickly, this seemed a relatively quick machine learning project to work on - so I set to work on it immediately.
How It Works
I am using the fastai library, which is backed by PyTorch, to train and run a convolutional neural network for classifying the images as either "signal" (photo contains sleeves hiding hands) or "noise" (everything else). I chose this library because I had already done some work with it as part of the fast.ai course on deep learning, which helped get the project to a prototyped state within two days of getting a usable dataset. Since fast iteration was the focus of this project (getting from idea to implementation in as little time as possible), this high-level library was the natural choice. The ConvNet starts from the pre-trained resnet34 weights before being trained on the new dataset. Note that I have not experimented with other pre-trained networks, because my focus has been on the data work and application development, as shown in the "Time Spent on Dataset Curation" section later in this post.
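To make the transfer-learning setup above concrete, here is a minimal sketch using the current fastai API. It is not the project's actual code: the post does not say which fastai version was used (older releases had a different interface), and the folder layout (`signal/` and `noise/` subfolders), the 20% validation split, the image size and the epoch count are all assumptions made for illustration.

```python
from pathlib import Path


def build_learner(data_dir: Path, image_size: int = 224):
    """Build a resnet34-based classifier from one folder per class.

    Assumed (hypothetical) layout: data_dir/signal/*.jpg and data_dir/noise/*.jpg.
    """
    # Imported lazily so this module can be loaded without fastai/PyTorch installed.
    from fastai.vision.all import (
        ImageDataLoaders, Resize, accuracy, resnet34, vision_learner,
    )

    dls = ImageDataLoaders.from_folder(
        data_dir,
        valid_pct=0.2,                 # assumed: hold out 20% for validation
        seed=42,                       # fixed seed so the split is repeatable
        item_tfms=Resize(image_size),  # resize every image to a common size
    )
    # vision_learner starts from ImageNet-pretrained resnet34 weights by default.
    return vision_learner(dls, resnet34, metrics=accuracy)


def train(data_dir: Path, epochs: int = 3):
    """Fine-tune the pretrained network on the signal/noise dataset."""
    learn = build_learner(data_dir)
    # fine_tune first trains the new head with the body frozen,
    # then unfreezes and trains the whole network for `epochs` epochs.
    learn.fine_tune(epochs)
    return learn
```

Calling `train(Path("dataset"))` (a hypothetical path) would return a learner whose reported accuracy is measured on the held-out validation images.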
Before getting to the classifier, my first run of image collection involved downloading ~102,000 images. From these, I reviewed 27,085 images and pulled out 1,009 "signal" images, which took a few days to complete. Since that would be a very small training set for a convolutional neural network, data augmentation became a requirement for avoiding an overfit network.
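The counts above imply a heavily imbalanced dataset. A quick back-of-the-envelope calculation, using only the numbers from this post (the download count is approximate), shows how rare "signal" images were during manual review:

```python
downloaded = 102_000  # approximate number of images downloaded
reviewed = 27_085     # images reviewed by hand
signal = 1_009        # images labelled "signal" during review

reviewed_share = reviewed / downloaded           # fraction of the download reviewed
signal_rate = signal / reviewed                  # fraction of reviewed images that were signal
noise_per_signal = (reviewed - signal) / signal  # noise images seen per signal image found

print(f"Reviewed share of download: {reviewed_share:.1%}")
print(f"Signal rate among reviewed: {signal_rate:.2%}")
print(f"Noise images per signal image: {noise_per_signal:.1f}")
```

In other words, only about 3.7% of the reviewed images were usable, or roughly 26 noise images for every signal image.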
Data augmentation significantly improved the algorithm's performance out of the gate: applying simple transformations to the images (flipping them horizontally, rotating them by 90 degrees, etc.) allowed me to grow the training set of "signal" images to six times its original size. With no other tuning, the classifier runs at ~80% accuracy, which was good enough to make this application practical to implement, knowing that an even better score can be achieved with continued work.
The first batch of the classifier's postable images has been going up several times a day for the last few weeks, and posting will continue for the rest of the year and beyond.
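As an illustration of that kind of augmentation, here is a small dependency-free sketch that turns one image into six variants using flips and 90-degree rotations. The exact set of transforms used in the project is not listed in this post, so this particular combination is an assumption. Images are represented as nested lists of pixel values to keep the example self-contained; a real pipeline would use an image library instead.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return six variants: the original plus five transformed copies (assumed set)."""
    r90 = rotate_90(img)
    r180 = rotate_90(r90)
    r270 = rotate_90(r180)
    return [img, flip_horizontal(img), r90, r180, r270, flip_horizontal(r90)]


sample = [[1, 2], [3, 4]]
variants = augment(sample)
print(len(variants))                # number of variants per source image
print(1_009 * len(variants))        # size of the augmented "signal" set
```

Applied to the 1,009 "signal" images, six variants per image give 6,054 training examples.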
1. Time Spent on Dataset Curation
Guess where most of my time was spent? If you said "data collection and cleaning", you are very correct! Here's a little graphic to put the time spent on each part of this application into perspective.
If you'd prefer, here's a link to the interactive chart on Google Sheets.
Notice from the chart that, by the time the project was up and running, preparing the data had taken more time overall than developing the application and classifier. I point this out not because I find it unusual or disappointing, but to show you that data science problems are all about the data. Model creation and tuning are definitely a piece of the puzzle, but most of your time will be spent making sure the data going in has as little noise as possible, rather than tuning the model that's applied to it. For example, the data augmentation step I described above had a larger impact on the performance of the algorithm than any other tuning step I've applied to the model thus far, meaning that my time working with the data was better spent than time put into tweaking parameters. At this point I've not spent much time tuning the model because so much of the effort on this project has gone into working with the data, and this work continues even today.
2. Closing the Gap Requires Manual Intervention
The application is not a fully automated system. The major touchpoint the application has, and can't do without, is the manual review of the classified images. Because the model is trained on a tiny training set and is therefore overfit to that data, the classifier makes a lot of mistakes on new images. As a result, I have to move the correctly classified images into a specific directory for the bot to pick up and post to Reddit, and remove those that are incorrectly classified, rather than just letting the algorithm post whatever it believes to be correct. This becomes less painful as the algorithm is tuned to be more accurate on unseen data, but it doesn't go away entirely, unless we want to start seeing bad submissions to the subreddit every now and then. If I were to feed the classifier's output into Reddit directly without any manual review, you'd see bad posts at least once a day, so the manual review must remain.
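The review workflow described above can be sketched as three small steps: queue the classifier's positives for review, move the ones a human approves into a separate directory, and let the bot read only from that directory. This is a sketch under stated assumptions rather than the project's code; the function names, the directory names, the 0.5 threshold and the "post in filename order" rule are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict, Optional


def queue_for_review(scores: Dict[Path, float], review_dir: Path,
                     threshold: float = 0.5) -> int:
    """Copy images the classifier scored as likely "signal" into the review queue."""
    review_dir.mkdir(parents=True, exist_ok=True)
    queued = 0
    for image_path, score in scores.items():
        if score >= threshold:  # assumed cut-off, not taken from the post
            shutil.copy2(image_path, review_dir / image_path.name)
            queued += 1
    return queued


def approve(image_path: Path, approved_dir: Path) -> Path:
    """Move an image the human reviewer confirmed into the bot's pickup directory."""
    approved_dir.mkdir(parents=True, exist_ok=True)
    destination = approved_dir / image_path.name
    shutil.move(str(image_path), str(destination))
    return destination


def next_image_to_post(approved_dir: Path) -> Optional[Path]:
    """Return the next approved image in filename order, or None if nothing is waiting."""
    if not approved_dir.is_dir():
        return None
    candidates = sorted(p for p in approved_dir.iterdir() if p.is_file())
    return candidates[0] if candidates else None
```

Keeping the bot pointed at the approved directory only means a misclassified image can never reach Reddit unless a person has moved it there by hand.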
3. Producing a Public-Facing Repository after Working in Jupyter Notebooks
This is kind of a minor point. I've been developing in the PyCharm IDE for years, and only picked up Jupyter notebooks within the last year. Having now worked on both sides, I fully understand why Jupyter notebooks are discouraged when someone is looking to improve their development skills: it is an environment that encourages fast iteration and has fewer guardrails to prevent someone from violating best practices. Making a new cell for a class in the same page, rather than creating a separate module and importing it, is one common difference you can see between these two working environments. The former is great for fast iteration in the moment, but the latter is a better practice for long-term maintainable code. This of course isn't to say that Jupyter notebooks are bad for development - if they were, they wouldn't be so popular for this work. However, it is very apparent to me that working in an IDE like PyCharm makes it a lot easier to follow best practices when developing, and is a contributing factor to the quality of code I write.
The code for this project, in case you're curious, is not yet public. I am working on structuring the application and its classifier into a proper repository for public review, and am introducing proper tests to encourage others to keep their code quality up when working on their own ML projects. (I'm currently at 79% coverage, but am shooting for the lofty 100% line coverage goal by the time I make the repo public.)
- Data science work is time consuming, even when the project requirements are well defined, and you have all the tools to finish the project. The fact is, data can be noisy and there's no better tool out there for cleaning it than human beings. This can be helped along by having others help you clean the data.
- When you have a machine learning algorithm that performs well on live data, the need for human involvement in reviewing the results goes down significantly - but the inverse is equally true. Though this seems an obvious point, it's worth calling attention to when considering the time effort required to finish a project, as well as the time effort to take it to production.
- Working on projects with the mindset of "ship prototypes" has become my favourite approach for the results it has produced, versus the approach of "do it perfectly the first time." This of course does not mean one should do shoddy work or cut corners on a production system. What it does mean is to focus only on what's necessary to get your application up and running, and worry later about the extra features that can be added after launch. Sort out the necessities from the "nice-to-have"s and keep that focus through to the deadline.
- While data augmentation can help in situations where a training dataset is too small for an algorithm's training (or when sourcing more data is costly), it can only take you so far in avoiding an overfit model. There is no substitute for quality data. In the case of this project, my model being trained on a small dataset is not detrimental to the final product of subreddit posts, since I will need to manually review the classified images regardless of how well the algorithm performs. The difference between an optimal and suboptimal algorithm here is the number of misclassifications I need to clear out, which is cheap to deal with, making the improvement of model performance much less important to the continued operation of the application. The learning here is that in some projects, even a suboptimal ML model can be helpful, but this is a very case-by-case situation.
That's what I have for you for today. If you have thoughts, I'm always open to hearing them by email or in a Reddit discussion (to which I'll link once I've posted this up). Until next time, I'm off for a vacation to Slovakia and the Czech Republic! That may be covered in an upcoming post. :)
I could have also enlisted the aid of others to help in sifting through the rest of the images, but data augmentation turned out to be helpful in quickly getting a reasonably-working model without too much extra time being spent on data preparation. ↩︎
For those of you wondering why this is worth doing, given that I could simply crawl the Internet and review the images without a classifier: the classifier acts as a first-layer filter that reduces the amount of noise in the data before I go through and remove the incorrectly classified images, making the process of finding valid images faster by at least a factor of ten. ↩︎