Recently I changed the permalink structure here at Drinking Caffeine and I’ve learned a few things about Jekyll that are worth sharing. There’s a lot of technical stuff below but the tl;dr is that you need dates in your URLs. I’ve stubbornly resisted that for nearly a year here but I’ve finally changed my mind. This article goes into the reason why, and how you can gracefully transition to a new permalink system without having to compromise all of your old URLs with 301 redirects. The latter would be almost impossible with a WordPress blog, by the way; this is a great example of where a static website is actually more flexible for less effort. Sorry Marco.
Okay, let’s dive in.
The old structure was just the title of the page, with all whitespace converted to dashes, the remaining non-alphanumeric characters removed, and the final result lowercased. In `_config.yml` that looks like this:
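In Jekyll’s permalink syntax, a title-only structure is typically a template built around the `:title` placeholder, such as `/:title/` (an assumed value, shown only for illustration). The slug rule itself can be sketched in a few lines of Python. The function name `slugify` is hypothetical, and dashes are deliberately kept so the whitespace replacements survive the cleanup step.

```python
import re


def slugify(title: str) -> str:
    """Turn a post title into a URL slug using the rule described above."""
    # 1. Convert each run of whitespace into a single dash.
    dashed = re.sub(r"\s+", "-", title.strip())
    # 2. Drop everything that is not a letter, digit, or dash.
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", dashed)
    # 3. Lowercase the result.
    return cleaned.lower()


if __name__ == "__main__":
    print(slugify("Why You Need Dates in Your URLs!"))
```

Two different titles can easily reduce to the same slug under a rule like this, which is where the worry in the next paragraphs comes from.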
In the event that the title was too long and I wanted to change the URL, all I had to do was rename the post file to whatever I wanted. It was great.
One thing constantly worried me in the back of my mind, though: URL collisions. What if I created two posts with the same URL? The answer I tried to pacify myself with was that if you write two things with the same URL, then you’re repeating yourself with no nuance, you have no taste, and shame on you. But then I got to thinking. URL collisions are unlikely if you have 50 posts or even 500 posts. But eventually, if you write enough posts, you’re going to get a collision. Once you can conceive of a scenario where you’ve written two articles that are relatively unrelated but that share the same URL, then it’s time to find a solution.
At a technological level, let’s look at how a URL collision would come about, and whether Jekyll gives us any warning signs that it is happening.
First, what are the constraints of our filenames? We get our answer from the Jekyll docs:
To create a new post, all you need to do is create a file in the `_posts` directory. How you name files in this folder is important. Jekyll requires blog post files to be named according to the following format:
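In the Jekyll documentation that format is given as `YEAR-MONTH-DAY-title.MARKUP`. A rough Python check of that shape follows. The regular expression is an approximation rather than the one Jekyll uses internally, and `parse_post_filename` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Approximation of YEAR-MONTH-DAY-title.MARKUP (not Jekyll's internal regex).
POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.([^.]+)$")


def parse_post_filename(name: str) -> Optional[Tuple[str, str, str]]:
    """Return (date, title_slug, markup), or None if the name does not fit."""
    match = POST_NAME.match(name)
    if match is None:
        return None
    year, month, day, slug, markup = match.groups()
    return f"{year}-{month}-{day}", slug, markup


if __name__ == "__main__":
    print(parse_post_filename("2018-02-04-hello-world.md"))  # made-up filename
    print(parse_post_filename("hello-world.md"))
```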
There’s no leeway in this, but thanks to my custom initialization script for Jekyll, I can set up a post in Terminal with minimal elbow grease:
Which results in the following generated file with the pertinent front matter:
What if it’s tomorrow though (or more likely, 2 years from now), and I decide I want an article with the same headline? The result would be a file with a different name:
But due to my `permalink` setting noted above, there would be a collision in URLs. Both of these files would have this URL:
This means that one of the articles would get shadowed by the other one, and I would never know it because the file names contain unique dates. It’s the “I would never know it” part of this that is so disturbing. In my testing, it’s the most recent article that gets precedence in a collision. As a regular user, the only way you would notice that a collision had occurred would be if you were perusing some ancient archives. This is the sort of thing that keeps a man awake at night worrying.
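To make the failure mode concrete, here is a small Python sketch that groups post filenames by the URL a title-only permalink would give them and reports any URL claimed by more than one file. The filenames in the demo are invented, and treating the URL as simply the part of the filename after the date is a simplification: front matter such as `slug` or `permalink` is ignored.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}-")


def find_collisions(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each colliding URL to the post files that would share it."""
    by_url: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        if not DATE_PREFIX.match(name):
            continue  # not a dated post filename; skip it
        # Strip the date prefix and the extension to get the slug.
        slug = DATE_PREFIX.sub("", name).rsplit(".", 1)[0]
        by_url[f"/{slug}/"].append(name)
    return {url: files for url, files in by_url.items() if len(files) > 1}


if __name__ == "__main__":
    posts = [
        "2018-02-04-hello-world.md",  # hypothetical filenames
        "2019-02-04-hello-world.md",
        "2018-03-01-something-else.md",
    ]
    for url, files in find_collisions(posts).items():
        print(url, "<-", ", ".join(files))
```

Pointing it at a real site would mean passing in the names from `_posts`, for example via `os.listdir`.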
However, there is a command you can run to check for URL collisions: `jekyll doctor` (good luck Googling that). I have to prepend `bundle exec` to it due to some issues I’m still sorting out with a Ruby / gem version mismatch, but here’s what it looks like when I run that example:[1]
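The same check can be scripted. The following is a minimal Python sketch that runs the command and reports pass or fail from its exit status. It assumes that `jekyll doctor` exits non-zero when it reports a problem, which is worth confirming on your own Jekyll version; `site_is_healthy` is a hypothetical name.

```python
import subprocess
import sys
from typing import Sequence

# Invocation taken from the text above; adjust for your own setup.
DOCTOR_CMD = ("bundle", "exec", "jekyll", "doctor")


def site_is_healthy(cmd: Sequence[str] = DOCTOR_CMD) -> bool:
    """Run the check and treat a zero exit status as "no problems found".

    Assumption: the command exits non-zero when it reports a problem.
    """
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    # Echo whatever the tool printed so its report is not lost.
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode == 0
```

A build script could then end with `sys.exit(0 if site_is_healthy() else 1)` to stop a deploy when the check fails.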
This is great, but who wants to run Jekyll’s doctor every time they publish something? Nobody. An alternative is to add this to the CI build process and kill the deploy if it fails. This would be annoying though; you should have 100% confidence that when you commit and push a new Jekyll post, it will deploy successfully if that’s the only change you’ve made since your last push. A CI failure would be an acceptable solution if clean URLs with no dates were such a high priority that it was worth lowering this guarantee to a 99% certainty, but that’s not how I roll. No, the further I went down this path, the clearer it became to me why the vast majority of sites have dates in their URLs.[2] Thus I went to my `_config.yml` file and changed my permalink setting to this:
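A date-based structure in Jekyll is built from placeholders such as `:year`, `:month`, `:day`, and `:title`. The template used below, `/:year/:month/:day/:title/`, is an assumed example rather than the exact setting. The Python sketch only imitates the placeholder substitution, to show why two same-titled posts no longer collide; `expand` and the filenames are hypothetical.

```python
import re

POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.[^.]+$")


def expand(template: str, filename: str) -> str:
    """Fill :year, :month, :day and :title from a post filename."""
    match = POST_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a post filename: {filename}")
    year, month, day, title = match.groups()
    values = {":year": year, ":month": month, ":day": day, ":title": title}
    url = template
    for placeholder, value in values.items():
        url = url.replace(placeholder, value)
    return url


if __name__ == "__main__":
    for name in ("2018-02-04-hello-world.md", "2019-02-04-hello-world.md"):
        # Title-only URLs are identical; dated URLs are distinct.
        print(expand("/:title/", name), expand("/:year/:month/:day/:title/", name))
```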
This created a huge problem though: it had the effect of breaking all of my preexisting posts’ URLs. Even if you don’t think you have enough offsite SEO to make this a big deal, you still need to think about your RSS readers. Since the permalink is the unique identifier for a static site, you’re going to inadvertently create a whole bunch of “new” articles that aren’t actually new.[3]
You need to do something about this. The best solution is to preserve all of the old URLs and only have this change apply to new posts going forward. The way I solved this was by doing a search and replace across all of the files in my `_posts` directory. This was the search term:
And this was the replacement term:
I made this change in Atom and it worked without a hitch. I love this editor so much.
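One way to get the same result without an editor is to give every existing post an explicit `permalink` line in its front matter, so that it keeps its old title-only URL regardless of the site-wide setting. The Python sketch below does that. It is not the search and replace described above: the `/slug/` URL shape, the `pin_permalink` and `pin_all` names, and the assumption that each post begins with a `---` front matter block are all illustrative guesses.

```python
import re
from pathlib import Path

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}-")


def pin_permalink(text: str, slug: str) -> str:
    """Insert `permalink: /slug/` into the front matter unless one exists."""
    lines = text.split("\n")
    if lines[0].strip() != "---":
        return text  # no front matter; leave the file alone
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return text  # unterminated front matter; leave it alone
    if any(line.startswith("permalink:") for line in lines[1:end]):
        return text  # already pinned
    lines.insert(end, f"permalink: /{slug}/")
    return "\n".join(lines)


def pin_all(posts_dir: str = "_posts") -> None:
    """Apply pin_permalink to every dated post file in posts_dir."""
    for path in sorted(Path(posts_dir).iterdir()):
        if path.is_file() and DATE_PREFIX.match(path.name):
            slug = DATE_PREFIX.sub("", path.stem)
            original = path.read_text(encoding="utf-8")
            updated = pin_permalink(original, slug)
            if updated != original:
                path.write_text(updated, encoding="utf-8")


if __name__ == "__main__":
    sample = "---\nlayout: post\ntitle: Hello World\n---\nBody\n"
    print(pin_permalink(sample, "hello-world"))
```

`pin_all` rewrites files in place, so it is best run once, under version control, before any posts are written under the new structure.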
One final word about permalinks in Jekyll. You want to make sure you always have a forward slash at the end of your permalink settings, whether that’s in `_config.yml` or in front matter. If you end in a forward slash, then the URL will work with or without an ending forward slash. If you do not, then it will only work without an ending forward slash.[4] The reason is the difference in the build structure. With a forward slash, a post will get built like this:
Without a forward slash, it will get built like this:
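The two layouts follow from how a permalink maps to an output file. A hedged Python sketch of that mapping: a permalink ending in a slash becomes a directory containing `index.html`, while one without becomes a single `.html` file. The `_site` root is Jekyll’s default output directory, `output_path` is a hypothetical name, and the logic is a simplification of what Jekyll actually does.

```python
def output_path(permalink: str, site_dir: str = "_site") -> str:
    """Approximate where a post with this permalink is written."""
    if permalink.endswith("/"):
        # Directory-style URL: the page lives in its own folder.
        return f"{site_dir}{permalink}index.html"
    if permalink.endswith(".html"):
        # The permalink already names a file.
        return f"{site_dir}{permalink}"
    # Extensionless URL without a trailing slash: a single HTML file.
    return f"{site_dir}{permalink}.html"


if __name__ == "__main__":
    print(output_path("/hello-world/"))
    print(output_path("/hello-world"))
```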
[1] If you want to test out this example, be sure to backdate the `date` in the 2019 file’s front matter (unless, of course, you’re reading this on or after February 4, 2019). If you don’t, then Jekyll will assume that the post is scheduled for a future publish date. Though the URL collision is scheduled, it is not yet live, and since the Jekyll doctor only inspects currently published entities, it will give you false assurance that all is well. In my opinion this isn’t very wise. Jekyll doctor should warn you of approaching icebergs as well as preexisting holes in your ship.
[2] Almost all of the sites that don’t have dates in their post URLs have some sort of unique way of identifying each post. GitHub, for example, has `https://github.com/blog/[unique-id]-[post-name]` as its structure. To me this is inferior, however, because the unique ID does not create any value for the user, while a date in the URL does. At a glance, without looking anywhere on the page for a publish date, you can know when something was published when the URL has a date in it. There’s never a time that’s not helpful.
[3] I’ll grudgingly admit that this is one time that dynamically generated sites shine brighter than static ones. Their unique identifier for posts in RSS is usually the post’s ID in the database, which is unchanging for the lifetime of that post. This means that if you publish something and then decide 30 minutes later that you really need to change the URL for whatever reason, you can do so without worrying that an RSS scraper will consider the two versions as two separate articles. With a static site, unless you go custom with your identifier method, your best bet is just hoping that the RSS scrapers haven’t picked up the new article yet. If you do run into a situation where you need to change the URL after publishing - and that should be a very rare thing indeed, but it has happened to me - then I recommend implementing this Jekyll 301 redirect plugin. It works like a charm.
[4] This discrepancy only holds true on GitHub Pages in my experience. Locally, URLs with or without a forward slash resolve correctly using either setting. This serves as a reminder that just because something works locally doesn’t mean you shouldn’t double check on your production server. The environments are never fully identical.