Scraping links from irreal.org

irreal.org is a pretty cool Emacs blog. The mysterious author of irreal.org (who goes by "jcs") has been at it for a long time. There are so many irreal.org posts that it would take a long time to go through all of them.

One day I got the idea to write a Scrapy spider to recursively extract all of the links from irreal.org. jcs frequently links to obscure blogs and such, so there are bound to be some gems in there that I'd otherwise miss.

Today, I got around to doing this, and here are The Results©: the list

A quick skim confirms that it is most certainly a very interesting list. This is only scratching the surface of what you can do with a very simple web crawler. I have since evolved this into a much more powerful web crawler, but that's outside the scope of this article.
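To make "recursively extract all of the links" concrete, here is a minimal sketch of the idea. It is not the actual spider: it uses only the Python standard library instead of Scrapy, and every name in it (`LinkExtractor`, `extract_links`, `crawl`, the `fetch` callable, the `max_pages` limit) is a made-up illustration. The assumed approach is to follow links that stay on the starting host and to collect links that point anywhere else.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" suffixes.
                absolute = urljoin(self.base_url, value)
                self.links.append(urldefrag(absolute).url)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one site; returns the external links found.

    `fetch` is a hypothetical callable that maps a URL to its HTML text,
    so the network layer (and any politeness delay) is left to the caller.
    """
    site_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    external = set()
    pages_fetched = 0

    while queue and pages_fetched < max_pages:
        url = queue.popleft()
        pages_fetched += 1
        for link in extract_links(fetch(url), url):
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if parsed.netloc != site_host:
                external.add(link)  # an outbound link: part of the result
            elif link not in seen:
                seen.add(link)  # same site: visit it later
                queue.append(link)

    return sorted(external)
```

Passing `fetch` in as a parameter keeps the sketch testable without touching the network. In a real run it could wrap `urllib.request.urlopen`, ideally with a delay between requests, while Scrapy takes care of fetching, scheduling, and duplicate filtering on its own.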

1. A Sales Pitch

Here, my goal is to explain why naively scraping links from a website is an interesting project that is worth the time.

For everyday web-searching purposes, most people use search engines that are run by giant corporations. According to conventional wisdom, running a general-purpose web crawler is out of reach of the individual internet user for various reasons. For instance,

"You'll get banned by paranoid webmasters."
"A web crawler requires complicated logic to work."
"You need something more than just a personal computer and home internet connection."
"You need to collect an enormous amount of data to do anything useful."

Thanks to this project, I now know from experience that none of these reasons are valid. You can absolutely do interesting and useful things with your own web crawler.

OK, it's possible, but why would you want to do this? Yes, it turns out that crawling websites is useful for real-world use cases like discovering obscure Emacs blogs. But there's also a deeper reason: running your own web crawler is about sticking it to the man. If you only use a product such as Google search for your web search needs, you're only seeing what Google et al. want you to see, and everybody knows that Google is not shy about injecting its own values and opinions into content.

In my experience, there are a ton of useful things that you can do with the raw HTML data that you can't do with any common search engine. The major search engines haven't substantially changed since the 2000s. People still interact with search engines according to the same simple and boring paradigm. The major search engines don't care about pushing the forefront of technology because that doesn't make money. I've talked to legit information retrieval researchers who share this view. That's why DIY web-crawling makes sense.

ResultsMotivated.com
