1. Parsing URLs
Sometimes, you want to know things about URLs. Namely:
- Is URL X the same as URL Y?
- Is URL X on the same website as URL Y?
Question #1 has a number of pitfalls:
- The order of the parameters in the query string
- Case sensitivity
- Weird foreign languages? Scary monsters?
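The first two pitfalls can be sketched with the standard library alone. This is a rough illustration, not a complete canonicalizer: `normalize_url` and `same_url` are hypothetical helper names, and the sketch assumes the URLs carry no `user:password@` part (which would be case-sensitive).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a comparison-friendly form of url (illustrative helper)."""
    parts = urlsplit(url)
    # Scheme and host are case-insensitive; path and query values are not.
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()  # assumes no user:password@ prefix
    # Sort the query parameters so their order no longer matters.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # For http(s), an empty path and "/" name the same resource.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))


def same_url(url_a, url_b):
    """Compare two URLs after normalization."""
    return normalize_url(url_a) == normalize_url(url_b)
```

With this sketch, `https://Example.com/?b=2&a=1` and `https://example.com/?a=1&b=2` compare equal, while `/Page` and `/page` stay distinct because paths are case-sensitive.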
Question #2 is not simple either:
- Is subdomain.example.com the same website as example.com?
- What about www.example.com and example.com?
To solve these problems, one must understand some basic jargon. Consider the URL https://www.resultsmotivated.example.com/?param1=69&param2=420
- The Fully Qualified Domain Name (FQDN) is www.resultsmotivated.example.com.
  - Basically, it's the domain + the subdomain.
- The domain is example.com.
  - This is what you buy from the registrar.
  - As a rule of thumb, every FQDN that ends with example.com belongs to the same entity.
- The host is www.
  - This is just a fun fact. It's not important for our purposes.
2. There's nothing special about "www"
Something that surprised me is that there's nothing special about "www". Whether a URL has a "www" prefix or not depends only on whether or not the site operator chose to create DNS records for the "www" subdomain. For example:
- https://sling.apache.org/ - a working link.
- https://www.sling.apache.org/ - the same URL with "www." added (doesn't work).
Personally, I had the misconception that the "www." prefix was always there, and that the convention of omitting it came about because it was redundant (or perhaps that "www." was the default or something). Turns out it doesn't really work that way. There's nothing special about "www.", and it requires DNS records to work just like every other subdomain.
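One way to see this for yourself is to ask the resolver directly. The sketch below is a minimal check, not a full DNS client: `has_dns_record` is a hypothetical helper name, and the `resolver` parameter exists only so the lookup can be swapped out (for example, in tests).

```python
import socket


def has_dns_record(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address."""
    try:
        # getaddrinfo returns one entry per address found for the name.
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Name resolution failed: no usable record for this exact name.
        return False
```

Calling `has_dns_record("sling.apache.org")` and `has_dns_record("www.sling.apache.org")` performs the same comparison as the two links above. The result depends on whatever records the site operator publishes at the time you run it.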
3. Some code
Here is a simple Python function which implements these ideas.
    import collections
    import urllib.parse

    import w3lib.url

    ParsedURL = collections.namedtuple("ParsedURL", "canonical, domain, fqdn")


    def split_url(url):
        """
        Given a URL, return a namedtuple with canonical URL, domain and FQDN.
        """
        # Normalize the URL first (e.g. sorted query parameters).
        canonical = w3lib.url.canonicalize_url(url)
        parsed_uri = urllib.parse.urlparse(canonical)
        # hostname is already lowercased and, unlike netloc, excludes any port.
        fqdn = parsed_uri.hostname or ""
        splitted = fqdn.split(".")
        if len(splitted) < 2:
            raise ValueError("Attempted to get domain of malformed URL.")
        # Treat the last two labels as the domain.
        domain = ".".join(splitted[-2:])
        result = ParsedURL(canonical=canonical, fqdn=fqdn, domain=domain)
        return result
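To answer the "same website" question from section 1, it is enough to compare domains. Here is a standard-library-only sketch of that comparison, so it runs without w3lib. `registered_domain` and `same_website` are hypothetical names, and the "last two labels" rule is the same rule of thumb used above. It gives the wrong answer for suffixes such as co.uk, where a lookup against the Public Suffix List would be needed.

```python
from urllib.parse import urlsplit


def registered_domain(url):
    """Naive guess at the domain: the last two labels of the hostname."""
    hostname = urlsplit(url).hostname  # lowercased, without any port
    if not hostname or "." not in hostname:
        raise ValueError(f"No usable hostname in {url!r}")
    return ".".join(hostname.split(".")[-2:])


def same_website(url_a, url_b):
    """True if both URLs share the same (naively guessed) domain."""
    return registered_domain(url_a) == registered_domain(url_b)
```

Under this rule, https://www.example.com/ and https://subdomain.example.com/ count as the same website, while https://example.com/ and https://example.org/ do not.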