What's in a Hostname? Using SQL Wildcard Patterns to Quickly Identify System Ownership

NOTE: This post was originally published on djgoosen.blogspot.com Thursday, October 23, 2014.

Somewhere on the way to running many thousands of nodes, internal hostnames can become a problem. Well, not the hostnames themselves (after all, DNS allows up to 253 characters), but figuring out who owns them when multiple teams exist definitely can. Many shops try to solve this particular problem by including the team name in the hostname. However, what happens when a team name changes, but a hostname can’t immediately change with it? Or when node ownership switches to a different team? What can we do to improvise, adapt and overcome our challenges in keeping everyone on the same page?

Structure of a Hostname
Take a hostname FQDN like this one, which probably looks fairly typical at enterprise cloud scale:
<role>.<service>.<location>.<server-team>.local

Which transforms into something like this:
app1.payment.dc1.teamblue.local

Monitoring and Escalations
In most shops, systems monitoring is based on hostname.

Especially in larger shops, there can be multiple server teams who are responsible for different hostnames.

In those shops, nodes will sometimes change owners along service or even just role lines. Or the team names themselves will change in a re-org. Subsequently changing hostnames isn’t always easy or fast.

In most shops which have been around for a while, we are going to be relying heavily on “legacy” solutions like sending alerts to team email addresses; documentation; and institutional knowledge; for the Tier I group or automation to figure out whose on-call to escalate to.

And bear in mind, the server team isn’t always the last team to have to be involved resolving the incident! There can also exist multiple application support teams and database teams too. And they certainly don’t have to have 1:1 alignment with the server teams.

As you might imagine, this problem can lead to multiple misrouted escalations each week, which not only means waking up the wrong on-call people (a morale issue), but more importantly means increasing MTTR (a revenue issue).

Wait, What If…?

Now ideally if we had unlimited resources, we could do things like refactor our monitoring to point to the right team dynamically. Or we could retrofit our CMS to allow us to more easily change hostnames. Or we could include a manifest file on each node to describe attributes like team ownership, which could then be posted to an inventory database, and/or be referenced in said monitoring refactor. All of these are good ideas. But none of them are going to be fast. We need something fast.

SQL Wildcards are Fast

We should try to take an iterative approach to most legacy refactoring. It’s hard to resolve all of the dependencies that might exist before we start overhauling anything. So it’s really OK to start somewhere good and then go somewhere better.

Our starting point is to create a “single source of truth” for our Tier I team.
We want it to be lightweight and extensible. We want it to be based on SQL wildcards rather than being a complete database of instances, so that as node instances are added, the wildcard FQDN pattern continues to be valid. We want it to be a web tool. And when we’re finished writing it, we want to be able to hand it off to the Tier I team to run and improve as they see fit. (As engineers we should always be trying to eliminate ourselves from the equation so we can move onto solving other problems, right?)

The FQDN Search Tool

We publish version 1.0 as a simple PHP search form, with a MySQL backend of tables with wildcard regex patterns.

The backend table rows are populated by the logic that there should always be a “fall-through” resolver team for any given hostname pattern, which points to a team email address (or a relational table with team info):
.*\.teamblue.local teamblue@example.com

Whereas, a more-specific regex could point the payment service to a different team:
.*\.payment.*.\.teamblue.local teamred@example.com

We can even point the app role of the payment service to a third team:
app[0-9].payment.*.\.teamblue.local teamgreen@example.com

The results page will show only the longest, most-specific match, like so:
$raw_results = mysql_query(“SELECT * FROM hostname_wildcard WHERE ‘$query’ REGEXP pattern ORDER BY LENGTH(pattern) DESC LIMIT 1”) or die(mysql_error());

And of course, it’s an HTTP GET request, so that in addition to being a simple URL that anyone (Dev, Product, etc.) can lookup, its resulting page can quickly be parsed via curl plus cut or similar text manipulation tools:
$ curl -s http://localhost/fqdnsearch/search.php?query=app1.payment.nyc.teamblue.local

Version 1.0 also lays the foundation of extensibility beyond the server teams by also adding the application support teams in their own regex table. Once again, longest match wins. We can trivially extrapolate this concept to include the Database, Dev and other teams involved in supporting a service.

In Conclusion

This simple LAMP stack solution takes only a couple of hours to implement and allows us to defer a long-term solution– leveraging a new CMS and a true inventory database with manifest files– until such a project can be approved, prioritized and scheduled. We can use our institutional knowledge to prepopulate the FQDN Search regex tables, and then let the natural feedback loop from owner teams help improve the guesses the tool makes.

In my own shop’s case, within a matter of weeks of the Tier I Team beginning to use our FQDN Search tool several thousand (and counting) possible hostnames (and tons of docs) were replaced by 50 or so of the aforementioned hostname patterns, pointing to seven different systems teams. We saw similar reduction ratios for the app teams. All around, it was a quick win in the longer “series” that is our ongoing DevOps journey.