As I mentioned in a comment, I fixed the symptom quickly, but the root cause of this was surprisingly interesting.
Buckle up, I'm putting on my "back in my day" hat.
Historically, when we've deployed the Q&A application, we've been doing this in two steps. The first step only deploys to https://meta.stackexchange.com/ (MSE) and https://meta.stackoverflow.com/ (MSO). The second step then deploys to the rest of the network.
We called those steps "deploying Meta" and "deploying others".
This allowed us to use MSE and MSO as simple canaries if we wanted to check that things behaved as expected in production. In most cases, we simply deployed both. And now that we're in GCP, this distinction doesn't exist anymore.
The way that this two-tier deployment worked was quite simple: Our data center had eleven primary web servers, numbered NY-WEB01 through NY-WEB11. Each of those web servers ran the identical Q&A application, and each web server was able to serve any site (MSE, MSO, or otherwise). But the load balancer was configured to send MSE and MSO traffic to NY-WEB10 and NY-WEB11, and all other traffic to all the other servers.

That way, the distinction between deploying Meta and deploying others was simply a matter of which servers we deployed the application to in the respective step.
Now, as you may or may not know, we usually serve our static assets (JavaScript, style sheets, images etc.) from a separate domain called sstatic.net. Back in the stone age, this was a best practice; these days probably not so much, and it's more of a historical artifact now.

At some point, we started using a dedicated CDN for those static files, under the domain cdn.sstatic.net. So, a request to sstatic.net would go to our data center, while a request to cdn.sstatic.net would go to a CDN edge node close to you, and the CDN would probably already have that file cached. Only if it didn't, would it then request it once from sstatic.net. (These days with everything behind CloudFlare, the separate domains are yet another historical artifact.)
So far so good. But a lot of the static assets are shared between all the sites. And if we wanted to deploy "Meta" and "others" independently, that independence should include those static files. For that reason, on MSO and MSE we did not use sstatic.net, but instead served the files from the site's local /Content folder.
You can still see that happening: If you inspect this very page, you will see this:
<script src="https://meta.stackoverflow.com/Content/Js/stub.en.js?v=31c1a92afca8"></script>
but if you do the same thing on https://stackoverflow.com/, you'll see this instead:
<script src="https://cdn.sstatic.net/Js/stub.en.js?v=31c1a92afca8"></script>
And this finally brings us to the "page not found" image that this bug report was about. This image can be configured per-site. A lot of sites have a pretty boring default, but some of them have a quirky dedicated image that fits the site. Some examples:
On Server Fault for example, the image is configured as https://sstatic.net/Sites/serverfault/img/spaghetti-networking.jpg (on the static assets domain). But here on MSO, it's configured as /Content/Sites/stackoverflowmeta/img/keyboard-waffles.jpg (served locally as a relative URL).
Bored yet? We're getting close to the bug!
We have code (a method aptly named CDNify()) that changes the domain to cdn.sstatic.net if the value is such an absolute URL, but leaves relative URLs alone.
How does that code check whether it's an absolute or a relative URL? Us being Stack Overflow, you might think that we used a weird regular expression. But no, we actually did it the proper way, by using the framework-provided functionality in System.Uri.
Specifically in this case, we used Uri.TryCreate(path, UriKind.Absolute, out var uri), which returns true if it's an absolute URI, and false if it's not.
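Boiled down, the idea looks roughly like this. (This is a simplified sketch of CDNify(), not our actual implementation; only the method name and the TryCreate call are from the real code.)

    // Simplified sketch: rewrite absolute URLs to the CDN domain, leave
    // everything else (in particular relative URLs like /Content/...) untouched.
    // The real method does more, e.g. it only rewrites our own static domain.
    public static string CDNify(string path)
    {
        // Relative URLs are expected to fail this check and pass through unchanged.
        if (!Uri.TryCreate(path, UriKind.Absolute, out var uri))
        {
            return path;
        }

        // Absolute URLs get pointed at the CDN domain instead.
        return uri.Scheme + "://cdn.sstatic.net" + uri.PathAndQuery;
    }

So https://sstatic.net/Sites/serverfault/img/spaghetti-networking.jpg becomes https://cdn.sstatic.net/Sites/serverfault/img/spaghetti-networking.jpg, while /Content/Sites/stackoverflowmeta/img/keyboard-waffles.jpg is supposed to come out exactly as it went in.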
And this worked perfectly – until we moved to the cloud. What's the difference? In our data center, the application ran on Windows Server. In GCP, it runs in a Linux container. And it turns out that System.Uri behaves differently between the two operating systems. We aren't the first ones running into this.
Unlike on Windows, on Linux /foo/bar is a perfectly valid absolute file path. And because this framework functionality handles URIs, not just URLs (yes, this is the once-in-a-lifetime situation where the difference actually matters!), Microsoft decided that on Linux, Uri.TryCreate should return true for /Content/... and create a file URI.
And that is why the 404 image had a file:// address, as you noticed.
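If you want to see the difference for yourself, a minimal repro looks something like this (the results shown in the comments are what each OS produces):

    using System;

    class UriRepro
    {
        static void Main()
        {
            var path = "/Content/Sites/stackoverflowmeta/img/keyboard-waffles.jpg";
            var isAbsolute = Uri.TryCreate(path, UriKind.Absolute, out var uri);

            // On Windows: "False -> "  (no URI is created)
            // On Linux:   "True -> file:///Content/Sites/stackoverflowmeta/img/keyboard-waffles.jpg"
            Console.WriteLine($"{isAbsolute} -> {uri}");
        }
    }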
My quick fix was to simply change the configuration value to an absolute URL. But I also wanted to make sure that we don't run into this issue anywhere else, so I created a helper that behaves identically across operating systems, plus an automated check that prevents you from using the framework functionality directly. That's why it took me a few days before writing this answer.
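For the curious, the helper essentially amounts to requiring an http(s) scheme on top of the TryCreate check; something along these lines (a sketch of the approach with a made-up name, not the exact code we shipped):

    // Cross-platform check: only http(s) URLs count as "absolute" here.
    // The scheme check is what makes the result identical on Windows and Linux,
    // because the file: URI that Linux creates for "/Content/..." gets rejected.
    public static bool TryCreateAbsoluteWebUri(string path, out Uri uri)
    {
        if (Uri.TryCreate(path, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            return true;
        }

        uri = null;
        return false;
    }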