Parsing URLs in Python

1. Parsing URLs

Sometimes, you want to know things about URLs. Namely:

Is URL X the same as URL Y?
Is URL X on the same website as URL Y?

Question #1 has a number of pitfalls:

The order of the parameters in the query string
Case sensitivity
Non-ASCII characters (internationalized domain names, percent-encoding) and other scary monsters
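
As a rough illustration of the first two pitfalls, here is a minimal sketch that normalizes a URL with the standard library before comparing. The helper name normalize_url is made up for this example. It only lowercases the scheme and host and sorts the query parameters, and it makes no attempt to handle non-ASCII input.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a comparable form of url (hypothetical helper, not a full canonicalizer)."""
    parts = urlsplit(url)
    # Sort the query parameters so that their order no longer matters.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The scheme and host are case-insensitive; the path is left alone because
    # servers may treat /Page and /page as different resources.
    # Simplification: lowercasing netloc would also lowercase any credentials in it.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, parts.fragment)
    )


url_x = "HTTPS://Example.com/?b=2&a=1"
url_y = "https://example.com/?a=1&b=2"
print(normalize_url(url_x) == normalize_url(url_y))  # True
```

Two URLs that differ only in path case still compare as different here, which is the conservative choice.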

Question #2 is not simple either:

Is subdomain.example.com the same website as example.com?
What about www.example.com and example.com?

To solve these problems, one must understand some basic jargon. Consider the URL https://www.resultsmotivated.example.com/?param1=69&param2=420

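The Fully Qualified Domain Name (FQDN) is www.resultsmotivated.example.com.
- Basically, it's the subdomain(s) plus the domain.
The domain is example.com.
- This is what you buy from the registrar.
- As a rule of thumb, every FQDN that ends with .example.com belongs to the same entity. The rule breaks for suffixes such as co.uk, where the domain you buy has three labels (example.co.uk).
The host is www, the leftmost label of the FQDN.
- This is just a fun fact. It's not important for our purposes.
- Confusingly, URL libraries such as Python's urllib use "hostname" to mean the whole FQDN.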
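
To make the jargon concrete, the following sketch pulls the three pieces out of the example URL using only the standard library. The names url_parts and same_website are invented for this illustration, and the domain is computed with the naive rule of thumb (the last two labels), so treat it as a sketch and not a complete answer to Question #2.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL's host into the FQDN, the domain, and the leftmost label."""
    fqdn = urlsplit(url).hostname or ""
    labels = fqdn.split(".")
    return {
        "fqdn": fqdn,
        # Rule of thumb from the text: the domain is the last two labels.
        "domain": ".".join(labels[-2:]),
        "host": labels[0],
    }


def same_website(url_x, url_y):
    """Treat two URLs as the same website when their domains match."""
    return url_parts(url_x)["domain"] == url_parts(url_y)["domain"]


parts = url_parts("https://www.resultsmotivated.example.com/?param1=69&param2=420")
print(parts["fqdn"])    # www.resultsmotivated.example.com
print(parts["domain"])  # example.com
print(parts["host"])    # www
print(same_website("https://subdomain.example.com/", "https://example.com/"))  # True
```

Under this rule, subdomain.example.com, www.example.com and example.com all count as the same website.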

2. There's nothing special about "www"

Something that surprised me is that there's nothing special about "www". Whether a URL works with a "www" prefix depends only on whether the site operator chose to create DNS records for the "www" subdomain. For example:

https://sling.apache.org/ - a working link.
https://www.sling.apache.org/ - the same URL with "www." added (doesn't work).

Personally, I had the misconception that the "www." prefix was always there, and that people omitted it by convention because it was redundant (or perhaps that "www." was some kind of default). It turns out it doesn't work that way. There's nothing special about "www."; it needs DNS records to work, just like every other subdomain.
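
If you want to check this yourself, one possible approach is to ask the system resolver whether a name has any address records. The sketch below assumes socket.getaddrinfo as the lookup. The function name has_dns_record and its resolver parameter are hypothetical, added so the lookup can be swapped out. The answer depends on your network and on the DNS records at the time you run it, so no output is shown.

```python
import socket


def has_dns_record(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address."""
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Performs real DNS lookups; results may change over time.
    for name in ("sling.apache.org", "www.sling.apache.org"):
        print(name, has_dns_record(name))
```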

3. Some code

Here is a simple Python function which implements these ideas. It relies on the third-party w3lib package for canonicalization.

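import collections
import urllib.parse  # "import urllib" alone does not reliably load urllib.parse

import w3lib.url  # likewise, the url submodule must be imported explicitly

ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")


def split_url(url):
    """
    Given a URL, return a namedtuple with canonical URL, domain and FQDN.
    """
    # Canonicalization sorts query parameters and normalizes encoding.
    canonical = w3lib.url.canonicalize_url(url)
    parsed_uri = urllib.parse.urlparse(canonical)
    # hostname (unlike netloc) drops any port or credentials and is lowercased.
    fqdn = parsed_uri.hostname or ""
    labels = fqdn.split(".")

    if len(labels) < 2:
        raise ValueError("Attempted to get domain of malformed URL.")

    # Rule of thumb: the domain is the last two labels of the FQDN.
    domain = ".".join(labels[-2:])
    return ParsedURL(canonical=canonical, fqdn=fqdn, domain=domain)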
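
One caveat: split_url joins the last two labels, so it would report co.uk as the domain of www.example.co.uk. A possible workaround is sketched below using only the standard library. The set MULTI_LABEL_SUFFIXES is a tiny hand-picked sample for illustration, and registrable_domain is a hypothetical helper. A real solution would consult the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete sample of multi-label public suffixes.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registrable_domain(url):
    """Return the domain part of a URL, allowing for two-label suffixes."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        raise ValueError("Attempted to get domain of malformed URL.")
    # Keep three labels when the last two form a known suffix, otherwise two.
    suffix = ".".join(labels[-2:])
    keep = 3 if suffix in MULTI_LABEL_SUFFIXES and len(labels) >= 3 else 2
    return ".".join(labels[-keep:])


print(registrable_domain("https://www.example.co.uk/"))                  # example.co.uk
print(registrable_domain("https://www.resultsmotivated.example.com/"))   # example.com
```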

ResultsMotivated.com