
How I got 10k post karma on reddit with (and without) fast.ai

My reddit profile with 10,088 post karma
Profile image made by AI Gahaku

Back in 2006-2007 my friend and I put together a spreadsheet of 20 or so high-level achievements called “Everything’s a Contest”. This included goals like “Photograph a live grizzly bear in the wild”, “Have something named after you”, and “Get 10,000 (post) karma on Reddit”. Despite our heated discussions about what should be on this list and the criteria for success, neither of us ever did anything substantial to complete any of these goals. In early 2020 I decided to tackle one of these long-standing contests, but this time with AI, since I wanted to see how I could apply AI to more of my problems. I’m a huge fan of fast.ai and appreciate its high-level abstractions and simple interfaces; for someone trying to get into deep learning, I would highly recommend it and the associated courses. This is a post about how I built a bot to gain karma on Reddit with fast.ai.

Approach

Content on Reddit, in general, falls into two categories which you might call original content and found content. Trying to automate the generation of original content would mean implementing something like imgflip’s AI meme generator and posting the results to r/memes. While the memes it generates are an amusing juxtaposition of tropes, they generally aren’t as good as memes made by human users skilled in the art of observational comedy. Try the meme generator out and you’ll see what I mean. At some point, I did try posting one of those just to see how well it did.

Drake meme with text Staying in the Server / Watching Anime
I went through about 100 auto-generated memes before I got this one. This post got ~35 points with the title 'An AI generated this meme. Good to know it's putting its sentience to good use.'

The other approach would be finding content online and posting it. I chose this route for automation because, while generating high-quality content is more interesting, it’s also far more challenging and involved. Rather than being at a disadvantage trying to catch up to the “human quality” of content, a computer has the advantage here, since the primary differentiators between posters are the ability to search through large amounts of content and speed (who finds and posts something first). In particular, I ended up looking at news-based subreddits for a few reasons:

  • Little chance of content being a repost
  • Lots of content being generated frequently
  • Large member bases
  • A lot of the content comes from the same set of known sites

Domains for all articles with a score > 1 from /r/business in the last week as of October 29th, 2020, showing that the top few sites supply the majority of the content.

Why use AI at all though? If I was going to automate posting, why not just create a bot that submits all links and leave it up to the masses to sort my fate? In general, spamming on Reddit is looked down upon, though I believe banning is up to the discretion of the subreddit moderators. While in theory I might have gotten away with it, or used some sort of generic rate-limiting and hoped for the best, I wanted my bot to be more like a productive member of the community submitting thoughtful content, an extension of myself, rather than a spam bot that would be the bane of the community’s existence until forcibly removed.

I searched for and found many good news subreddits but initially targeted /r/business and /r/worldnews, somewhat arbitrarily and somewhat because those areas seemed interesting to me. /r/business had a sizeable community (577k members) and a relatively low frequency of posts for a news subreddit, so it seemed approachable as a first target. I would build a web crawler to watch popular business news sites for new articles, and then leverage an NLP-based article classifier to determine whether an article had a high chance of receiving upvotes.

Implementation

Finding and loading posts

To train the NLP model that classifies articles, I needed the article text and the corresponding Reddit scores. Unfortunately, the official Reddit API limits the amount of historic post data you can retrieve to 1000 items. Luckily there is an alternative: you can grab historic post data from pushshift.io using code I shamelessly adapted from WaterCooler: Scraping an Entire Subreddit (/r/2007scape). The output file of the script contains one JSON object per post, one per line, like the one below:

{"author": "CALIPHATEMEDIA", "author_flair_css_class": null, "author_flair_text": null, "brand_safe": true, "can_mod_post": false, "contest_mode": false, "created_utc": 1514767344, "domain": "caliphatemedia.info", "full_link": "https://www.reddit.com/r/business/comments/7nc5yf/south_korea_to_regulate_bitcoin_trading_further/", "id": "7nc5yf", "is_crosspostable": false, "is_reddit_media_domain": false, "is_self": false, "is_video": false, "locked": false, "num_comments": 0, "num_crossposts": 0, "over_18": false, "parent_whitelist_status": "all_ads", "permalink": "/r/business/comments/7nc5yf/south_korea_to_regulate_bitcoin_trading_further/", "pinned": false, "retrieved_on": 1514841750, "score": 1, "selftext": "", "spoiler": false, "stickied": false, "subreddit": "business", "subreddit_id": "t5_2qgzg", "subreddit_type": "public", "thumbnail": "default", "thumbnail_height": 140, "thumbnail_width": 140, "title": "South korea to regulate bitcoin trading further with tougher measures", "url": "http://www.caliphatemedia.info/2017/12/south-korea-govt-to-introduce-tougher.html", "whitelist_status": "all_ads"}

After generating a file with all the historic posts, I loaded the contents referenced by the “url” field of each JSON line into boilerpipe3, a Python library that simplifies HTML documents down to their primary content text.

boilerpipe3 simplifies this (mostly) down to what we really want, which is 'By Jeffrey Dastin, Akanksha Rana 4 Min Read (Reuters) - Amazon.com INC AMZN.O on Thursday...'

After loading the article text, I did some processing to remove pages that didn’t load properly and to truncate the page text to at most 10k characters.
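
Putting the extraction and cleanup together looks roughly like the sketch below. It uses the Extractor class from the boilerpipe3 package; the 200-character minimum for deciding a page “didn’t load properly” is an assumption for illustration, not the exact heuristic I used.

```python
import json

from boilerpipe.extract import Extractor  # provided by the boilerpipe3 package

MAX_CHARS = 10_000  # truncate very long pages

def extract_article_text(url):
    """Return the main article text for a URL, or None if extraction fails."""
    try:
        text = str(Extractor(extractor="ArticleExtractor", url=url).getText()).strip()
    except Exception:
        return None
    if len(text) < 200:  # assumed cutoff for error pages, paywalls, and empty bodies
        return None
    return text[:MAX_CHARS]

with open("business_posts.jsonl") as posts, open("business_articles.jsonl", "w") as out:
    for line in posts:
        post = json.loads(line)
        text = extract_article_text(post["url"])
        if text is not None:
            out.write(json.dumps({"text": text, "score": post["score"]}) + "\n")
```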

Training the model

I trained the NLP model for /r/business over a few days on approximately 271k articles from 2018-01-01 to mid-April 2020. I used an AWD_LSTM (e.g. language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3) and text_classifier_learner(data, AWD_LSTM, drop_mult=0.5)) for the model, with discrete labels based on the article score: neutral (0-10 points), okay (10-100), good (100-500), and great (500+). The classes are somewhat arbitrary and I’ve changed them around for different iterations and subreddits. The only thing I ended up paying attention to at runtime was the non-neutral class score, so this could have been a binary classifier or, even better, a classifier that treats an individual article as a candidate for multiple subreddits rather than an individual model trained per subreddit. For /r/worldnews I had around double the number of articles. In either case I trained in 100k-article chunks due to memory limitations, reloading the existing model and fine-tuning with the new data each time.
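
For concreteness, here is a rough sketch of that training flow in the fastai v1 style matching the snippets above: fine-tune the language model on the article text, then reuse its encoder for the classifier. The chunked 100k-article loop, unfreezing schedule, and exact hyperparameters are omitted, and the file names are placeholders.

```python
import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

def label_score(score):
    """Bucket a Reddit score into the discrete labels described above."""
    if score >= 500: return "great"
    if score >= 100: return "good"
    if score >= 10: return "okay"
    return "neutral"

df = pd.read_json("business_articles.jsonl", lines=True)
df["label"] = df["score"].apply(label_score)
valid = df.sample(frac=0.1, random_state=42)
train = df.drop(valid.index)

# Fine-tune the pretrained language model on the article text...
data_lm = TextLMDataBunch.from_df(".", train_df=train, valid_df=valid, text_cols="text")
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder("ft_enc")

# ...then train the classifier on the score buckets, reusing the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df(".", train_df=train, valid_df=valid,
                                      text_cols="text", label_cols="label",
                                      vocab=data_lm.train_ds.vocab)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("ft_enc")
learn.fit_one_cycle(4, 1e-2)
learn.export("business_classifier.pkl")
```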

Building the bot

I chose a few sites (Business Insider, Reuters, Bloomberg, CNN, CNBC, BBC, etc.) based on the most frequent domains for /r/business and built a crawler for them using requests and BeautifulSoup. Every minute or so I’d crawl each site root for new links, process the linked pages with boilerpipe, and pass them through the NLP model for scoring. New pages with a non-neutral score greater than 0.25 (an arbitrary threshold I picked for this particular model) would be flagged and emailed to me using AWS SNS. Initially, I relied on filtering these incoming suggestions and submitting the articles myself using the Reddit app; a condensed sketch of the polling loop follows the screenshot below.

An example email I received from my site poller that I called 'Reddit Postmaster'.
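
The loop itself is straightforward. In the sketch below, the site list, the SNS topic ARN, and the way the “neutral” probability is looked up from the learner’s class list are simplifications and assumptions rather than my exact code.

```python
import time
from urllib.parse import urljoin

import boto3
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor
from fastai.text import load_learner

SITES = ["https://www.reuters.com/", "https://www.cnbc.com/"]  # a couple of the frequent /r/business domains
THRESHOLD = 0.25  # arbitrary non-neutral cutoff for this particular model
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:reddit-postmaster"  # hypothetical SNS topic

learn = load_learner(".", "business_classifier.pkl")
sns = boto3.client("sns")
seen = set()

def front_page_links(site):
    """Return the absolute URLs linked from a site's front page."""
    soup = BeautifulSoup(requests.get(site, timeout=10).text, "html.parser")
    return {urljoin(site, a["href"]) for a in soup.find_all("a", href=True)}

def page_text(url):
    """Main article text via boilerpipe, truncated, or None on failure."""
    try:
        return str(Extractor(extractor="ArticleExtractor", url=url).getText())[:10_000]
    except Exception:
        return None

while True:
    for site in SITES:
        for url in front_page_links(site) - seen:
            seen.add(url)
            text = page_text(url)
            if not text:
                continue
            _, _, probs = learn.predict(text)
            non_neutral = 1.0 - probs[learn.data.classes.index("neutral")].item()
            if non_neutral > THRESHOLD:
                sns.publish(TopicArn=TOPIC_ARN,
                            Subject=f"Candidate article ({non_neutral:.2f})",
                            Message=url)
    time.sleep(60)
```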

Automation

At this point my site poller was acting as a personal news aggregator that I used to generate suggestions for things to post. To remove me from this loop, there are a few “last mile” problems to solve. Coding up submission via the Reddit API is easy enough, but I don’t just submit all the articles that get passed to me; I act as a quality filter, choosing not to submit things that don’t seem appropriate. I also come up with an appropriate title for the Reddit post. The page title usually isn’t usable directly as a post title (it often contains redundancies like the site name) and needs to be reformatted, or the title should instead be extracted from the content, for example by picking the most appropriate h1 tag on the page. Rather than refining the model to classify articles more accurately and then coding a new component for extracting the title, I decided to encapsulate these tasks into a Mechanical Turk task and have humans act as the final gatekeeper and title generator. I can then take the results from these Mechanical Turk tasks and submit the articles to Reddit via the Reddit API.
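
The submission step itself is only a few lines with PRAW, the Python Reddit API Wrapper. The credentials below are placeholders, and the approved_title/article_url values are assumed to come from a completed Turk task.

```python
import praw

# Hypothetical credentials; in practice PRAW can also read these from a praw.ini file.
reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    username="...",
    password="...",
    user_agent="reddit-postmaster (by u/your_username)",
)

def submit_article(subreddit_name, title, url):
    """Submit a link post with the Turk-approved title and return its permalink."""
    submission = reddit.subreddit(subreddit_name).submit(title=title, url=url)
    return submission.permalink

# e.g. submit_article("business", approved_title, article_url)
```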

Mechanical Turk

Mechanical Turk is a service for leveraging humans to complete small tasks for your application. It’s great for augmenting AI applications and collecting data for them via labeling tasks, but using it effectively isn’t without difficulties. The important thing to know about Mechanical Turk is that many workers on the platform are (sensibly) optimizing for task completion quantity. When using a custom qualifier for tasks, ensure the qualification you’re looking for is not apparent from the question. Similarly, a bad HIT (human intelligence task) would be one where you ask the user to read an article and check some box if they think it belongs in some category, since a worker optimizing for speed can simply guess. For example, I did this with /r/worldnews candidates to ensure that, among other things, they didn’t pertain to US news; however, the first fully automated submission I made to /r/worldnews was “Johnson & Johnson to stop selling baby powder in the United States”, which was removed after receiving 42 upvotes since US internal news is banned on /r/worldnews. Make sure the work you assign at least appears as though you will look at the individual results (e.g. use text input boxes at least once). Be careful and clear about what you’re asking and reduce the opportunities for human error. Asking the user to supply an article title, while well-intentioned, results in some users writing poor titles rather than copying the article’s own title when they can’t think of a good one. A better task is to ask the user to identify and copy the article title as it appears in the article, since that produces more reliable quality. After hooking up Mechanical Turk tasks to my bot, it was capable of running fully autonomously.
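
For a sense of what creating such a task looks like programmatically, here is a sketch using boto3’s MTurk client with an HTMLQuestion. The HIT layout, reward, and time limits are assumptions for illustration (and a real form would fill in assignmentId from the worker’s URL parameters), not my exact task.

```python
import boto3

# Hypothetical HIT: show the article, ask whether it fits the subreddit,
# and ask the worker to copy the article's title exactly as it appears.
QUESTION_XML = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTML><![CDATA[
    <html><body>
      <form action="https://www.mturk.com/mturk/externalSubmit" method="post">
        <p>Article: <a href="{url}">{url}</a></p>
        <p>Does this article belong in r/{subreddit}? (For r/worldnews it must not be US internal news.)</p>
        <input type="radio" name="fits" value="yes"/> Yes
        <input type="radio" name="fits" value="no"/> No
        <p>Copy the article's title exactly as it appears on the page:</p>
        <input type="text" name="title" size="80"/>
        <input type="hidden" name="assignmentId" value=""/>
        <input type="submit"/>
      </form>
    </body></html>
  ]]></HTML>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

mturk = boto3.client("mturk", region_name="us-east-1")

def create_review_hit(url, subreddit):
    """Create a small HIT asking a worker to vet an article and copy its title."""
    hit = mturk.create_hit(
        Title=f"Check whether a news article fits r/{subreddit} and copy its title",
        Description="Open the linked article, answer one question, and copy its title.",
        Keywords="news, categorization, quick",
        Reward="0.10",                       # assumed per-task reward in USD
        MaxAssignments=1,
        LifetimeInSeconds=3600,
        AssignmentDurationInSeconds=600,
        Question=QUESTION_XML.format(url=url, subreddit=subreddit),
    )
    return hit["HIT"]["HITId"]
```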

A new email letting me know when a post was submitted on my behalf.

Results

All items posted from "Reddit Postmaster".

You may notice that the bot only posts a few things per day at most. In fact, it is probably under-posting. One strategy we could adopt here would be to lower the threshold as time goes on to encourage posting more content, since each post comes with a relatively limited downside (a few downvotes, then obscurity) and a very high upside potential. While we want to avoid being spammy, our behavior is probably a little too conservative.
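
A minimal sketch of that idea, which is entirely hypothetical (the bot as described used a fixed threshold): decay the cutoff toward some floor the longer it goes without posting, and reset it after each submission.

```python
from datetime import datetime, timedelta

BASE_THRESHOLD = 0.25  # starting cutoff, reset after each submission
FLOOR = 0.10           # never go below this
DECAY_PER_DAY = 0.05   # how quickly the bot gets less picky

def current_threshold(last_post_time, now=None):
    """Lower the posting threshold the longer the bot goes without submitting."""
    now = now or datetime.utcnow()
    days_idle = (now - last_post_time) / timedelta(days=1)
    return max(FLOOR, BASE_THRESHOLD - DECAY_PER_DAY * days_idle)
```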

Alignment

The alignment problem (in AI) is when your system performs the way you told it to, for example optimizing upvotes, but does so in a way that goes against what you actually want. Take, for example, the post I referenced earlier about US news that was posted to /r/worldnews and then removed by a moderator. While the post did get some upvotes (~42), it violated my expectation and preference that the bot would follow the rules. Because /r/worldnews isn’t full of US news articles that are downvoted and removed for violating the rules, my bot lacked examples of what it shouldn’t do. We could teach it the rules in the future by augmenting the dataset with rule-violating examples assigned very negative scores. The problem of AI alignment is deeper though: while I did identify one thing I didn’t want the bot to do, there are many unstated things I also don’t want it to do that it might try in order to optimize upvotes (for example posting photoshopped pictures, posting offensive content, or making jokes about my mother). The problem of aligning your bot’s actions with your desires is an open one that requires a lot of careful thought to ensure we use AI responsibly.

The Karma Machine

In total, over 1.5 months of running the system on two subreddits, I was able to garner around 3.7k Reddit karma. Unfortunately, I never got a big hit on /r/worldnews, though I got close on some occasions. For the highly competitive /r/worldnews my Mechanical Turk tasks were taking just a little too long; this could likely have been remedied by increasing the monetary reward associated with each task. The other 6.3k karma came because I got tired of waiting. While I tried many things, what ended up working was digging through “interesting facts” lists and finding things that hadn’t been posted to /r/todayilearned.

Slightly anti-climactic for ~14 years in the making.

Conclusion

Building a Reddit post bot with fast.ai was a fun project; however, a bit of manual effort beat out my months of web crawling and compute time. There is no doubt that I would have easily crossed the 10k mark given enough time running my post bot and expanding it to additional subreddits. Certainly the post bot was a better way to scale my effort, and I learned in the process about the struggles of alignment, human labeling, and training on larger data sets. Hopefully you, as the reader, were able to get a sense of this from what I shared. Feel free to drop me a comment below if you have suggestions for how you would improve the text classifier or approach this problem!

If you liked this post, consider signing up for email notifications for my next one. The views expressed in this article are my own.