The Urban Legend of mandb on RHEL

Aka “The Case of mandb’s Missing Daily Scheduled Job”

Lately I’ve been revisiting a number of fundamental RHEL OS parts that I’ve used regularly for, let’s just call it, “a while” now, with a beginner’s mind, seeking to “kind of trust BUT definitely verify” the current textbooks being published, especially when it comes to old-school Linux tools: the tried-and-true, the everybody-knows-that stuff. So whenever I now encounter a declarative statement in a book like the latest RHCSA guide, such as: “…when you use the man -k command, the mandb database is consulted. This database is automatically created through a scheduled job”, I want to know whether that’s still really true, and I ask myself whether it even matches my own experience with the latest and greatest RHEL version I’m working on. Either way, I then take a first-principles approach to make sure I can SEE where and how it all actually happens on my system. For the uninitiated, the mandb program exists to “create or update the manual page index caches”. This particular mandb rabbit hole ended up being fairly long to traverse, but it serves, in this writer’s humble opinion, as a flat-out fantastic reminder of what answering the bedrock question of “how it all really works” for yourself can teach you about discovering your operating system. (Also, this post is not just about operating systems.)

By the way, I should preface the rest of what follows by reemphasizing that gerund, “discovering”. I realize that some advanced readers might arrive at the solution immediately, or within a few troubleshooting steps, where I took several more in between. The counterpoint to that sentiment is that already knowing something is great, but knowing how to know something is perhaps just as important (and intellectually rewarding). It’s what Nobel Prize-winning physicist Richard Feynman called “the pleasure of finding things out.”

The TL;DR is that mandb as a standalone program is no longer part of RHEL’s automatic manpage update puzzle. What it effectively did has been replaced by macros that live on the RHEL system itself in /usr/lib/rpm/macros.d/macros.systemd but which are invoked by dnf, according to its packaging source. The dnf.spec file specifically contains the line %systemd_post dnf-makecache.timer, which invokes at least one of the two related “man-db-cache” systemd units, via what ultimately gets logged by auditd as a /usr/bin/systemctl start man-db-cache-update. This turns out to be super helpful, because it means that dnf is updating the manpage cache and index each and every time it installs a package. But the upshot is that what you would think of as a classic mandb run now only happens on, say, a normal RHEL 9 system boot; whenever you install (or uninstall) a package via dnf; and then, apparently, before shutdown. Basically. If all of that sounds obscure, it is. But it’s how it actually works.
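
If you want to spot-check that chain on your own system before we dig in, a few read-only commands are enough to see the moving pieces (your exact output will vary a bit by release, so I’ll leave it out here):

systemctl list-unit-files | grep man-db        # the “man-db-cache” units mentioned above
systemctl cat man-db-cache-update.service      # what the unit actually runs
rpm -qf /usr/lib/rpm/macros.d/macros.systemd   # which package owns the macro file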

One of the first principles I follow when I am discovering things about software in general: There is no magic. Computers are, generally speaking, deterministic. Operating systems do precisely what their software developers and users tell them to do, whether the humans meant for them to do those precise things or not. And because of that, on any open source distro at least, there is almost always human-readable or decodable-from-binary text related to whatever you are looking for, somewhere on the filesystem.

So anyway, let’s deconstruct the statement I already quoted: “…when you use the man -k command, the mandb database is consulted. This database is automatically created through a scheduled job”.

The end of that first sentence is true. Run man -k curl and you’ll get an answer back. So far, so good.

[root@rhel9 ~]# man -k curl
curl (1)             - transfer a URL
[root@rhel9 ~]#

The second sentence is no longer true at all, and it didn’t “feel” true when I first read it. It certainly used to be that you’d install a program on RHEL and then have to either run mandb manually or wait for mandb to run on a schedule (usually via a daily cron job) to update the index, all so that you could read and search your latest cool program’s manpages. But that hasn’t been the observable fact of the matter for a while now. Yet many docs (including de facto “official” ones) continue to namecheck mandb like it’s still back there cranking away in some dark corner of your root volume. In actual fact, today if you dnf install wget -y and then man -k wget, its manpages are already there for you. With all that said, the disappearing (but still seeming to haunt your OS environment) act of mandb’s daily scheduled job makes a whole lot of sense once you unravel it. It’s just not terribly well documented (if it’s definitively documented anywhere); if any of this is laid out clearly in man sections 1, 5, or 8, I could not find it. And for their part, the new textbooks are probably all still getting mandb wrong, most likely because nobody’s been keeping up with the technical review on this one-of-one-gazillion RHEL topics. But hey, that’s what keeps us humble and hungry to keep learning (and relearning) what we thought we already knew.
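
Here’s roughly that experiment, if you want to reproduce it for yourself (the journalctl line assumes the man-db-cache-update unit we’ll meet below; your timestamps and package versions will obviously differ):

dnf install wget -y
man -k wget                                                         # already indexed; no manual mandb, no waiting for cron
journalctl -u man-db-cache-update.service --since "10 minutes ago"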

#####

[root@rhel9 ~]# systemctl show man-db-cache-update.service -p Requires,Wants,Before,After
Requires=sysinit.target system.slice
Wants=
Before=shutdown.target
After=basic.target sysinit.target systemd-journald.socket system.slice local-fs.target
[root@rhel9 ~]#


#####

[root@rhel9 ~]# systemctl list-units | grep dnf
  dnf-makecache.timer                                                                      loaded active     waiting   dnf makecache --timer
[root@rhel9 ~]#

#####

[root@rhel9 ~]# grep -B 2 -A 2 cron /etc/sysconfig/man-db

# Set this to "no" to disable daily man-db update run by
# /etc/cron.daily/man-db.cron
CRON="yes"

[root@rhel9 ~]#

# Plot twist! `/etc/cron.daily/man-db.cron` doesn’t actually exist by default anymore!
 
[root@rhel9 ~]# stat /etc/cron.daily/man-db.cron
stat: cannot statx '/etc/cron.daily/man-db.cron': No such file or directory
[root@rhel9 ~]#
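
# One way to confirm what the man-db package actually ships (and runs) these days is to ask rpm directly; I'm omitting the output, since file lists can shift between releases, but expect systemd unit files rather than a cron.daily script:

rpm -ql man-db | grep -E 'cron|systemd'    # files the package owns that mention cron or systemd
rpm -q --scripts man-db                    # any install/uninstall scriptlets the package carries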

#####

# Relevant snippet of dnf.spec [https://github.com/rpm-software-management/dnf/blob/master/dnf.spec]:

%post
%systemd_post dnf-makecache.timer

# Translation: this macro, `%systemd_post`, is called to run the dnf-makecache.timer whenever dnf installs a package. It wasn’t always this way. Which is a nice change, but probably the opposite of obvious if you’re simply looking at the filesystem or the, ahem, manpages for any of this.

#####

# Relevant snippet of said macro in `/usr/lib/rpm/macros.d/macros.systemd`:

%systemd_post() \
%{expand:%%{?__systemd_someargs_%#:%%__systemd_someargs_%# systemd_post}} \
if [ $1 -eq 1 ] && [ -x "/usr/lib/systemd/systemd-update-helper" ]; then \
    # Initial installation \
    /usr/lib/systemd/systemd-update-helper install-system-units %{?*} || : \
fi \
%{nil}
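
# By the way, you don't have to take my word for how that macro expands; rpm will expand it for you (handy any time a scriptlet references a macro you haven't met before):

rpm --eval '%systemd_post dnf-makecache.timer'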

# This is what it ends up looking like when you run an `ausearch` against the audit log on a Red Hat Enterprise Linux 9 system. The key line is the EXECVE record:

type=EXECVE msg=audit(1713317136.446:3364): argc=4 a0="/usr/bin/systemd-run" a1="/usr/bin/systemctl" a2="start" a3="man-db-cache-update"

# Full log:

----
time->Wed Apr 17 01:25:36 2024
type=PROCTITLE msg=audit(1713317136.446:3364): proctitle=2F7573722F62696E2F73797374656D642D72756E002F7573722F62696E2F73797374656D63746C007374617274006D616E2D64622D63616368652D757064617465
type=PATH msg=audit(1713317136.446:3364): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=148674748 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1713317136.446:3364): item=0 name="/usr/bin/systemd-run" inode=67512497 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1713317136.446:3364): cwd="/"
type=EXECVE msg=audit(1713317136.446:3364): argc=4 a0="/usr/bin/systemd-run" a1="/usr/bin/systemctl" a2="start" a3="man-db-cache-update"
type=SYSCALL msg=audit(1713317136.446:3364): arch=c000003e syscall=59 success=yes exit=0 a0=55bf12df4f60 a1=55bf12dfc340 a2=55bf12dfc0d0 a3=55bf12dfc6e0 items=2 ppid=89764 pid=89765 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=16 comm="systemd-run" exe="/usr/bin/systemd-run" subj=unconfined_u:unconfined_r:rpm_script_t:s0-s0:c0.c1023 key="mandb-cmd"

A small part of the challenge in tying this man-db cache update back to dnf, which I suspected all along but had a bit of a time nailing down, is that the audit record above logs the (short-lived) calling PID, not the calling program’s name. That meant putting a watch on a ps -ef --forest to confirm that dnf was indeed the owner of that PID while it was doing a package install.
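
For the curious, here’s a sketch of the kind of audit rule and process watch that got me there; the mandb-cmd key matches the key= field in the log above, while the exact paths and intervals are just what I’d reach for first:

auditctl -w /usr/bin/systemd-run -p x -k mandb-cmd      # record every execution of systemd-run
ausearch -k mandb-cmd --start recent                    # pull the matching records (ppid/pid included)
watch -n1 'ps -ef --forest | grep -i dnf'               # catch the short-lived parent PID while dnf runs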

Again, there is no magic.

Fun list of programs, in no particular order, that I got to use in figuring all of this out:

  • auditctl
  • ausearch
  • dnf
  • find
  • journalctl
  • grep
  • man
  • ps
  • rpm
  • strace
  • systemctl

DevOps Engineer Interviews – Example Prescreen Questions

Background

While a DevOps engineer candidate’s aptitude should factor heavily into your team’s recruiting strategy (it deserves at least as much weighting as their experience with any specific tech stack), certain technology knowledge simply has to be there on Day 1. So is it ever too early to start funneling candidates on the hard technical dimensions?

Interviewing is a big investment in time and effort. A great hire can have an immediate and lasting impact on your organization. But so can a misfire, right? For this reason, getting your internal recruiters involved upfront on things like technical questionnaires can save everyone (including the candidate, the recruiter and your team!) a number of cycles down the road.

Is a candidate claiming senior experience with Red Hat-based Linux OSes? Ask the recruiter to verbally prescreen the candidate on the questions below. Because either such a candidate knows how many bits are in a byte, or they’re probably not going to be a fit. The same goes for any scripting language in which they indicate expertise.

You’ll likely want to ask your recruiting colleagues to cover both Linux and the candidate’s preferred scripting language during the prescreen. So in addition to the one for Linux, I’ve included example questionnaires for both Python and Perl below.

Overall, an effective and efficient DevOps interview funnel might look like this:

  1. Resume review [DevOps leader].
  2. Technical prescreen, 10-15 minutes [tech recruiter].
  3. Technical phone interview, 30-45 minutes [senior DevOps team member].
  4. Onsite/video team interviews, 2-3 hours total [various DevOps team members; at least one internal dev/product team customer; and the DevOps leader, who should get preliminary feedback and then take the final interview slot].
  5. Offer.

Suggested Prescreen Passing Thresholds

  • 9-10: senior DevOps engineer candidate.
  • 6-8: midlevel.
  • 5: junior.

The Importance of Verbalizing Concepts

An ideal prescreen question is easy for a non-engineer to ask and easy for an engineer to answer, verbally.

(It’s also a good idea to ask the recruiter to transcribe and share with you the candidate’s answers, just in case the answer given is actually an acceptable alternative.)

Linux Prescreen Questionnaire 

1. On the bash shell, what does $? (“dollar sign question mark”) indicate?  

The exit status of the last command.

2. What’s the easiest way to determine the number of inodes (“eye nodes”) available on a file system? 

df -i

3. What is the process ID of init? 

1

4. What does ldd (“L-D-D”) do? 

list dynamic dependencies

5. What command lists the open ports on a system? 

netstat

6. What is the default run level of a running system (non X Window)? 

3

7. Where are failed login attempts logged by default? 

/var/log/btmp or /var/log/secure

8. What are the default file permissions if the umask is 022? 

644

9. How many bits are in a byte? 

8

10. What port number does DNS use by default? 

53

Python Coding Prescreen Questionnaire 

1. What’s the syntax for importing only the “date” submodule of the “datetime” module?

from datetime import date

2. In terms of error handling, what are the three clauses which can follow a try statement?

except/else/finally

3. Which of these are mutable and which are immutable? lists, sets, tuples.

mutable: lists, sets 

immutable: tuples

4. What is the syntax for declaring a function called “timer”?

def timer():

5. What would print('hi' + 3) produce?

An error.

6. What would print('hi' * 3) produce?

“hihihi” (it will print the word “hi” three times.)

7. What file extension does Python store its bytecode in?

.pyc

8. How can you print only the final element of a list named “groceries”, when you don’t know how many elements there are?

groceries[-1]

9. How can you determine the length of a string named “email”?

len(email)

10. How would you run a Bash command such as “uptime” within the Python interpreter (either single or double quotes are OK in the candidate’s answer)?

import os

os.system('uptime')

…OR…

from os import system

system('uptime')

Perl Coding Prescreen Questionnaire

1. What variable types are there?

Scalar, Array, Hash

2. How do you debug a Perl script on the command line?

use “-d” switch

3. How do you remove the last element of an array?

pop function

4. What switch is used to execute Perl code on the command line?

either “-e” or “-E”

5. What is the operator used for Perl regular expression substitution?

"s", for example "s/foo/bar/"

6. How do you access parameters passed to a subroutine?

"@_"

7. How do you declare a variable when the strict pragma is used?

prefix the variable with “my”

8. What function is used to return a specific non-zero value?

exit, NOT die

9. When you require a module in Perl, do you need to specify the extension?

Yes

10. What is used to execute an external command from a Perl script and capture the output of the command?

backticks, for example: $system_uptime = `uptime`

Level Up Technology Blog Post on Ansible Jira Integration

Check out the Level Up team’s latest Ansible Pro Tips post: “Simplify Your Jira Change Management Processes Via Ansible Playbooks”.

Highlights (bolds are mine):

“A big way in which we typically measure success with DevOps has to do with integration.”

“…there is almost always at least a small amount of coding required to help you maximize your ROI in connecting the technologies you have with whatever you decide to build or buy next. Jira offers a good example of how a small amount of coding can have a big impact.”

“…as a guiding principle, you want your tools to enforce your processes. Not the other way around. If you are using Ansible, there’s a great opportunity to enforce more of your technology and governance processes without anybody having to remember to do anything except the task itself.”

“And this Jira module is flexible in terms of working with whatever your custom issue types and statuses might be as well. But hopefully as you can see, this is the type of straightforward, module-level integration that your organization can start taking advantage of to increase ROI and let the platform do the hard stuff when it comes to change management via Ansible right now, today.”

And Sometimes, DevOps is Literally About Knocking Down Walls

NOTE: This post was originally published on djgoosen.blogspot.com Monday, May 11, 2015.

As we transition our organizations into the DevOps model, it’s helpful to spend at least a little time thinking about the three-dimensional space that our engineering and operations teams are sharing.

We’ll compare two photos of the same place at different moments in time:

The first of these photos was taken at our office recently and implies that there are two teams; the second was taken today and implies that there is really just one team. A wall sends a message, whether we want it to or not. “Tearing down walls” is a common metaphor, but the act of actually, physically tearing one down is also a pretty empowering statement to the entire team to always be thinking and acting as a single, collective, AWESOME unit. The second photo is a work in progress, just as many companies’ DevOps transitions may be. However, it’s undoubtedly forward progress.

Network Performance in the Cloud: How Many Packets Per Second is Too Many?

NOTE: This post was originally published on djgoosen.blogspot.com Saturday, September 13, 2014.

Overview

When it comes to systems performance, there are four classical utilization metric types we can pretty trivially look at: CPU, memory, disk and network. In the cloud, we also have to consider shared resource contention among VM’s on cloud hypervisors, within pools/clusters and within tenants; this can often complicate attempts to diagnose the root cause of slowness and application timeouts. At cloud scale, especially when consuming or managing private cloud KVM/OpenStack, Xen or VMware hypervisor resources, and/or when we have visibility into the aggregate metrics of our clouds, it’s useful to keep our front sights focused on these building block concepts, because they help us simplify the problem domain and rule out components in capacity planning and troubleshooting exercises.
Metrics like CPU idle and physical memory utilization tend to be fairly straightforward to interpret: both are typically reported as percentages relative to 100%. Yes, caveats apply, but just by eyeballing top or a similar diagnostic tool, we usually know if these numbers are “low” or “high”, respectively.
In contrast to the first two, many disk and network metrics tend to be expressed as raw numbers per second, making it a bit harder to tell if we’re approaching the upper limits of what our system can actually handle. A number without context is just a number. What we want to know is: How many is too many?
I’ll save disk performance for a future post. Today we’re going to zoom in on network utilization, specifically packets per second (pps).


Is 40,000 pps a lot? How do we know?
Before we pin down what we actually mean (Is 40,000 pps a lot for a VM? For a hypervisor? For some other type of network device in the cloud?), and before we try to think through all the exceptions and intermediary network devices we might have to traverse, we should stay disciplined and first ask ourselves: What are we always going to be limited by?
There are two basic constraints here, effectively no matter what:

  • Maximum packet size: 1538 bytes
  • Network interface speed (e.g., 1Gbps, 10Gbps)

So if every packet is at that 1538-byte maximum, and a 1Gbps NIC on a system is handling 40,000 pps, the system is pushing (1538 * 40000) * 8 = 492,160,000 bits per second, or ~492Mbps. I think we’ll agree the system is pushing about half of its theoretical maximum speed. (Though in reality, not every packet is created equal. Some are going to be smaller, so 40,000 pps might actually be less than 492Mbps. Maybe a lot less. But it can’t really be more.)


(max packet size in bytes * pps * 8) / NIC speed in bps = the utilization % we’re interested in
A good rule of thumb is that a system with sustained network throughput utilization of > 75% is probably too busy.
So on that basis, 40,000 pps really isn’t too many pps for a single 1Gbps system to handle. It’s even less of a concern for a 10Gbps system.
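
If you’d rather not do that arithmetic in your head, here’s the same rule of thumb as a one-liner (assuming 1538-byte maximum-size packets and a 1Gbps NIC; swap in your own numbers):

$ awk -v pps=40000 -v bytes=1538 -v nic=1000000000 'BEGIN { printf "%.1f%% of the link\n", bytes*pps*8/nic*100 }'
49.2% of the link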


Some Gotchas in the Cloud


Many VM’s running on the same hypervisor
However… 40,000 pps might actually be too many if it’s a single VM running on the same hypervisor as a bunch of other VM’s, especially if they’re all pushing the same sort of pps. We might want to isolate this VM on a hypervisor. Or scale its role horizontally onto other hypervisors. Along those lines, the business might tell us that 40,000 pps’ worth of revenue is too many for a single VM to take with it at the instant it goes down. But then we’re no longer talking about maximum pps, we’re talking about something else.


The underlying hypervisor NIC’s and EtherChannel
At the hypervisor level, the maximum pps really depends on what the aggregate is of all traffic on it; as well as its NIC speed.
Also relevant is the hypervisor’s EtherChannel configuration (we wouldn’t run our production VM on a hypervisor without some type of redundant links), especially in terms of actual maximum throughput (for instance, actual LACP maximum throughput can start to fall when we take into account factors like MAC-based load balancing between the links).
Additionally, implementation choices like Open vSwitch vs. Linux bridging can have an impact on effective pps. On Citrix Xen hypervisors, OVS is a common design choice. My understanding is that the default OVS flow eviction threshold is 2,500. The maximum recommended value appears to be 10,000. Reasonable people appear to disagree about whether this metric is leading, coincident or lagging to the root causes of packet loss, but in my own experience, with higher hypervisor pps we can expect to see dropped packets relative to this metric for our VM.
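
If you want to see or change that threshold on an OVS hypervisor, something along these lines is where I’d look; treat it as a sketch, since the exact knob has moved around between OVS releases (older builds exposed flow-eviction-threshold, newer ones call it flow-limit) and your bridge will have its own name:

ovs-vsctl list Open_vSwitch                                               # inspect current datapath tuning
ovs-vsctl set bridge xenbr0 other-config:flow-eviction-threshold=10000   # XenServer-era syntax; xenbr0 is just an example bridge name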


What else do we have to think about in the cloud?
Assuming every networking component in our cloud is configured correctly, here are some other factors that can come into play, in no particular order:

  • Firewalls: Obviously these have a maximum pps too. In aggregate, we might be constrained by firewall pps at our tenant’s boundary, or even closer.
  • Load balancers: Things like licensing limits; request throttling; and their own NIC/EtherChannel pps; can all influence our cloud VM’s effective pps.
  • Broadcast storms: Other cloud systems in the same subnet as our VM or hypervisor can saturate the network.
  • UDP vs. TCP: if we’re testing maximum pps using a UDP tool, we’re likely to experience more packet loss and thus a smaller perceived maximum.
  • Outbound internet bandwidth: guaranteed and burst rates will apply here.
  • Jumbo frames: These have 9000-byte MTU’s, so especially in our storage and database tiers, we’ll make sure we remember which MTU to use in our calculations.
  • sysctl network parameter tuning: These are beyond the scope of this post, but they can definitely impact the VM’s network performance.

Level Up Technology Blog Post on Custom ansible-lint Rules

Inspired by a great audience question at our last meetup… check out the Level Up team’s latest blog post: “Express Your Team’s Infrastructure Code Standards Using Custom ansible-lint Rules”.

A few highlights:

  • “The Ansible by Red Hat project currently maintains ansible-lint as a way to communicate its summary view of global default best practices and “syntactic sugar” to the wider community”
  • “Granted, your team may not always end up agreeing with ansible-lint’s default worldview. Which is totally fine. There are both runtime (the -x flag) and config file options available to exclude rules either by the rule ID or any matching tag. If and when you encounter false-positives, or have other reasons to want to reduce warning “noise” in your linting efforts, you can simply share your ansible-lint config files somewhere like GitHub that the rest of your team can consume as well.”
  • “ansible-lint becomes super-charged when you combine its default rules with custom rules created by you and your team”
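
Regarding that -x flag in the second highlight, here’s a tiny, hedged illustration (the rule IDs below are placeholders; run ansible-lint -L against your installed version to see the real ones):

$ ansible-lint -L                                      # list the rule IDs/tags your version knows about
$ ansible-lint -x no-changed-when,role-name site.yml   # skip specific default rules for this run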

A Change Approval Process at DevOps Speed

NOTE: This post was originally published on djgoosen.blogspot.com Monday, December 21, 2015.

Nobody launches a startup thinking about change management.
Everybody eventually realizes they needed it yesterday.
Still, change management can be a formidable topic. If and when we figure out what we want to do (drawing from the best parts of ITIL, Agile/Lean, etc.), it can be hard to champion it, because even in the most supportive of environments, we’re asking DevOps engineers to add to their already-overloaded plates, and we’re asking business stakeholders to be patient with us while we adopt and build it into existing workflows.

Trying to Talk Ourselves Out of Change Management is Easy

We may procrastinate or downright talk ourselves out of doing it. In fact, let’s try:
  • Managing change takes time. Sometimes managing a change takes longer than the change itself.
  • And… engineers usually make safe and successful changes. Managers usually approve their engineers’ changes somehow. The business usually knows what’s going on.
  • And… the change management processes that some companies implement are basically, um, not great. They’re often a grab-bag of intranet docs, post-it notes, “institutional knowledge” and mental reminders. They’re basically not even processes per se, as much as they are proclamations and wishful thinking on the part of management.

Our Tools Should Enforce the Team Processes We Want to Have

Today we’re going to take it as a given that we’ve already had our organizational change management “epiphany”, and focus on building a lightweight yet effective change approval process, which is probably the first thing we can start doing if we aren’t doing much else to formally manage change yet.  
Ground rules for a change approval process that goes at DevOps speed:

  1. Don’t reinvent wheels. If we already have a solution like JIRA in place, we’ll use that.
  2. Only managers should be able to approve or deny a request.
  3. Auto-emails should happen, but be sent to only the people who actually need to see them.
  4. Changes involving product downtime should probably be subject to multiple approvals in sequence (e.g., DevOps manager → Product Manager → CTO).
  5. If we have to document much beyond stating, “This is where you submit change requests for approval”, we probably want to iterate over how the tool enforces the process again. Because a good tool will funnel the engineer to the desired result every time.

Again, JIRA can handle most of this pretty trivially in a given project’s workflow. 
OK, so now we have the beginnings of a change approval process. But is it one that lets our DevOps go sufficiently fast? Our engineers are here to solve problems, and as managers we’re here to solve theirs. One problem we can largely solve for them is reducing a simple change request (e.g., routine tasks or anything low-risk, zero-downtime and self-explanatory) to a one-line call to a Python script.
The simpler the process, the more often it will be followed consistently.
So is this bash command simple enough? For zero-downtime changes at least, we’ll hope so:
$ ./create_change_req.py -u <my username> -a <my mgr's username> -s "Upgrading server1" -de "More details about upgrading server1" -dt 2016-01-31T21:30:00.0-0800 -du 2h
^-- This will live in our engineers’ bash histories. It’s not the prettiest string, but it’s definitely effective at requesting approval for a change with one line.
^-- BTW, the datetime format above is Atlassian’s default, given here with the Pacific Standard Time Zone offset.


How Much of the Request Process Can We Automate?

Here’s how that Python script might look (let’s treat this as the example code it is, OK?):
#!/usr/bin/env python
# create_change_req.py - create a JIRA Change Request
from jira import JIRA
import argparse, getpass

# Gather values from args
parser = argparse.ArgumentParser()
parser.add_argument('-u', '--username', help='Username for authentication and reporter fields')
parser.add_argument('-a', '--assignee', help='Manager username for assignee field')
parser.add_argument('-s', '--summary', help='Change summary')
parser.add_argument('-l', '--link', help='Optional: Related JIRA/Confluence URL')
parser.add_argument('-de', '--description', help='Optional: Change description')
parser.add_argument('-dt', '--datetime', help='format: 2015-12-31T21:30:00.0-0800')
parser.add_argument('-du', '--duration', help='format: 2h')
args = parser.parse_args()

# Authenticate to Jira
jira = JIRA('https://<JIRA FQDN>', basic_auth=('%s' % args.username, getpass.getpass()))

# Create issue
# Fields are visible by looking at https://<JIRA FQDN>/rest/api/2/issue/CM-1/editmeta
issue = jira.create_issue(
    project = 'CM',
    issuetype = {'name': 'Change Request'},
    customfield_11203 = {'id': '11739'},  # 11739 means 'Infrastructure Change'. This could be another arg if we had several types and wanted to add it.
    reporter = {'name': '%s' % args.username},
    assignee = {'name': '%s' % args.assignee},
    summary = '%s' % args.summary,
    description = '%s' % args.description,
    customfield_11201 = '%s' % args.link,
    customfield_11200 = '%s' % args.datetime,
    customfield_11204 = '%s' % args.duration
)

# Change status to Approval Requested. Transition id's are visible by editing your workflow.
jira.transition_issue(issue, '151')

print("\nChange Request: https://<JIRA FQDN>/browse/" + issue.key)


Wrapping Up

Running this Python script might not be the most wildly popular practice our DevOps engineers will ever adopt. But after they’ve used this script for a while, they won’t be able to imagine not having it. Teams never want less automation, and we’re probably not going to be asking anyone to slow down anytime soon.

When Tweets are a KPI: Tweetdeck as Monitoring Dashboard

NOTE: This post was originally published on djgoosen.blogspot.com Wednesday, November 26, 2014.

With Black Friday and Cyber Monday fast approaching, this post is probably timely.

At B2C product launch (or any other time you can expect your business to be trending in the Twittersphere), tweets effectively become a sort of key performance indicator for customer engagement, UX and the like. In such cases, Tweetdeck is pretty great. We can just Chromecast it on a big screen in our war room (whether alongside other dashboards using something like “Revolver – Tabs”, or standalone); it’s an instant and ongoing conversation piece.

Tweetdeck adds data points (and maybe even a little humor?) to any command center.

If we prefer, we can block images. Chrome Preferences… > Privacy | Content settings… > Images | Manage exceptions… > Hostname pattern: tweetdeck.twitter.com Behavior: Block. This lets everyone focus on the 140 characters or less themselves, whether good, bad or indifferent (and let’s be honest, there are no indifferent tweets). Of course, if someone tweets a screenshot of an error message, we can click it and not have Chrome block that, because of the more-specific pattern we opted for above. Blocking images provides a minimalist interface for the auto-refreshing cascades of our “qualitative KPI’s”.

In all seriousness, we use Tweetdeck as more of a real-time “heat check” than as an actual indicator of problems (hopefully we already spend enough time and dollars on monitoring/alerting/analytics to know what we need to know when we need to know it– leveraging Metrilyx/OpenTSDB and Splunk among others). But since customer experience is everything and social currency means so much in today’s ecomm environment, Tweetdeck gives us just a little bit more confidence that all is well with our stack on the days that really, really count.

What’s in a Hostname? Using SQL Wildcard Patterns to Quickly Identify System Ownership

NOTE: This post was originally published on djgoosen.blogspot.com Thursday, October 23, 2014.

Somewhere on the way to running many thousands of nodes, internal hostnames can become a problem. Well, not the hostnames themselves (after all, DNS allows up to 253 characters), but figuring out who owns them when multiple teams exist definitely can. Many shops try to solve this particular problem by including the team name in the hostname. However, what happens when a team name changes, but a hostname can’t immediately change with it? Or when node ownership switches to a different team? What can we do to improvise, adapt and overcome our challenges in keeping everyone on the same page?

Structure of a Hostname
Take a hostname FQDN like this one, which probably looks fairly typical at enterprise cloud scale:
<role>.<service>.<location>.<server-team>.local


Which transforms into something like this:
app1.payment.dc1.teamblue.local

Monitoring and Escalations
In most shops, systems monitoring is based on hostname.

Especially in larger shops, there can be multiple server teams who are responsible for different hostnames.

In those shops, nodes will sometimes change owners along service or even just role lines. Or the team names themselves will change in a re-org. Subsequently changing hostnames isn’t always easy or fast.

In most shops which have been around for a while, we are going to be relying heavily on “legacy” solutions, like sending alerts to team email addresses, documentation and institutional knowledge, for the Tier I group (or automation) to figure out whose on-call engineer to escalate to.

And bear in mind, the server team isn’t always the last team that has to be involved in resolving the incident! There can also be multiple application support teams and database teams. And they certainly don’t have to have 1:1 alignment with the server teams.

As you might imagine, this problem can lead to multiple misrouted escalations each week, which not only means waking up the wrong on-call people (a morale issue), but more importantly means increasing MTTR (a revenue issue).

Wait, What If…?

Now ideally if we had unlimited resources, we could do things like refactor our monitoring to point to the right team dynamically. Or we could retrofit our CMS to allow us to more easily change hostnames. Or we could include a manifest file on each node to describe attributes like team ownership, which could then be posted to an inventory database, and/or be referenced in said monitoring refactor. All of these are good ideas. But none of them are going to be fast. We need something fast.

SQL Wildcards are Fast

We should try to take an iterative approach to most legacy refactoring. It’s hard to resolve all of the dependencies that might exist before we start overhauling anything. So it’s really OK to start somewhere good and then go somewhere better.

Our starting point is to create a “single source of truth” for our Tier I team.
We want it to be lightweight and extensible. We want it to be based on SQL wildcards rather than being a complete database of instances, so that as node instances are added, the wildcard FQDN pattern continues to be valid. We want it to be a web tool. And when we’re finished writing it, we want to be able to hand it off to the Tier I team to run and improve as they see fit. (As engineers we should always be trying to eliminate ourselves from the equation so we can move onto solving other problems, right?)

The FQDN Search Tool

We publish version 1.0 as a simple PHP search form, with a MySQL backend of tables with wildcard regex patterns.

The backend table rows are populated according to the rule that there should always be a “fall-through” resolver team for any given hostname pattern, which points to a team email address (or a relational table with team info):
.*\.teamblue.local   teamblue@example.com

Whereas, a more-specific regex could point the payment service to a different team:
.*\.payment.*.\.teamblue.local teamred@example.com


We can even point the app role of the payment service to a third team:
app[0-9].payment.*.\.teamblue.local teamgreen@example.com

The results page will show only the longest, most-specific match, like so:
$raw_results = mysql_query("SELECT * FROM hostname_wildcard WHERE '$query' REGEXP pattern ORDER BY LENGTH(pattern) DESC LIMIT 1") or die(mysql_error());
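
For completeness, here’s a minimal sketch of the backing table that query assumes; the table and pattern column names come from the query above, while the schema details and the email column are my own guesses:

$ mysql -u fqdnsearch -p fqdnsearch <<'SQL'
CREATE TABLE hostname_wildcard (
  pattern    VARCHAR(255) NOT NULL,  -- regex matched via REGEXP; longest match wins
  team_email VARCHAR(255) NOT NULL   -- owning (or fall-through) team address
);
INSERT INTO hostname_wildcard VALUES
  ('.*\\.teamblue.local', 'teamblue@example.com'),
  ('.*\\.payment.*\\.teamblue.local', 'teamred@example.com');
SQL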

And of course, it’s an HTTP GET request, so that in addition to being a simple URL that anyone (Dev, Product, etc.) can look up, its resulting page can quickly be parsed via curl plus cut or similar text manipulation tools:
$ curl -s http://localhost/fqdnsearch/search.php?query=app1.payment.nyc.teamblue.local 
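
And a hedged sketch of that parsing step, assuming the results page prints the owning team’s email address somewhere in its output (grep -o standing in for cut here):

$ curl -s 'http://localhost/fqdnsearch/search.php?query=app1.payment.nyc.teamblue.local' \
    | grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'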

Version 1.0 also lays the foundation for extensibility beyond the server teams, by adding the application support teams in their own regex table. Once again, longest match wins. We can trivially extrapolate this concept to include the Database, Dev and other teams involved in supporting a service.

In Conclusion

This simple LAMP stack solution takes only a couple of hours to implement and allows us to defer a long-term solution (leveraging a new CMS and a true inventory database with manifest files) until such a project can be approved, prioritized and scheduled. We can use our institutional knowledge to prepopulate the FQDN Search regex tables, and then let the natural feedback loop from owner teams help improve the guesses the tool makes.

In my own shop’s case, within a matter of weeks of the Tier I team beginning to use our FQDN Search tool, several thousand (and counting) possible hostnames (and tons of docs) were replaced by 50 or so of the aforementioned hostname patterns, pointing to seven different systems teams. We saw similar reduction ratios for the app teams. All around, it was a quick win in the longer “series” that is our ongoing DevOps journey.