Chatbot Scraper: Using (today's) IRC logs as your NLP datasets

Posted on Thu 29 September 2016 in hacking

I dunno about you, but I often find myself bored with NLP (natural language processing) datasets. Too often they are old, built around something I don't find particularly interesting, or something I've already analyzed or used before.

For me, IRC has often been a source of community, fun, sometimes trolliness (is that a word yet?) and, of course, an interesting source of news and help with my work.

Since freenode has many publicly logged channels, I decided to see if I could scrape botbot.me to get more data for NLP fun.

After about a day of tinkering and testing, I present chatbot_scraper. It currently only scrapes the public lists for botbot.me, but if you use a major open-source framework / platform, you'll likely find at least one channel of interest. For me, I'm perusing the docker logs looking for interesting new topics. For you, who knows?! (Although feel free to send interesting things you find!) To get started, take a look at the README.md.

Here is an example run:

python botbot_scraper.py --network_name freenode --chan_name docker --start_date=2016-08-30 --end_date=2016-09-05

For more info, try the help command:

python botbot_scraper.py -h
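
If you're curious what the flags in the example run amount to, here is a rough standalone sketch. It is not the actual code from botbot_scraper.py: the function names (`parse_day`, `build_parser`, `days_between`, `day_urls`) are made up for illustration, and both the inclusive end date and the one-page-per-day URL pattern are assumptions you should check against the site before relying on them.

```python
import argparse
from datetime import datetime, timedelta


def parse_day(text):
    """Turn a YYYY-MM-DD string into a date; argparse reports bad input."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def build_parser():
    # Flag names mirror the example run; everything else is hypothetical.
    parser = argparse.ArgumentParser(description="Sketch of a log-scraper CLI")
    parser.add_argument("--network_name", default="freenode")
    parser.add_argument("--chan_name", required=True)
    parser.add_argument("--start_date", type=parse_day, required=True)
    parser.add_argument("--end_date", type=parse_day, required=True)
    return parser


def days_between(start, end):
    """Yield every date from start to end (ASSUMPTION: end is inclusive)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def day_urls(network, channel, start, end):
    # ASSUMPTION: one log page per day at /<network>/<channel>/<YYYY-MM-DD>/
    return [
        "https://botbot.me/{}/{}/{}/".format(network, channel, day.isoformat())
        for day in days_between(start, end)
    ]


# Same arguments as the example run, passed as a list instead of via the shell.
argv = [
    "--network_name", "freenode",
    "--chan_name", "docker",
    "--start_date=2016-08-30",
    "--end_date=2016-09-05",
]
args = build_parser().parse_args(argv)
urls = day_urls(args.network_name, args.chan_name, args.start_date, args.end_date)
for url in urls:
    print(url)
```

Splitting the range into one request per day keeps each fetch small and makes it easy to resume a long scrape from the last day that succeeded.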

I am hoping to expand it to more public chat logs and possibly even Slack logging (although I'm unsure what Slack's ToS allows; it's probably too restrictive, tbh). That said, let me know if you have suggestions or issues on the issues page, or simply fork and send a pull request!

Cheers and happy bot-ing!