You think you’ve got a lot to read for your work? A client who’s been having latency trouble with their e-commerce server sent me a database log file yesterday that contained more than 1.3 million lines of text.
Now as much as I’d just love to curl up in front of the fireplace and slowly savor the whole thing (the prose is truly a thing of beauty), I don’t think that’s realistic. Instead, I chose to use some of the Linux command line’s powerful text manipulation tools to filter out the information I needed. I can’t resist sharing the way I approached the data problem, and I’ll get to that in just a minute or two. But this also got me thinking a bit about how we process the oceans of information that often characterize our professional lives.
Technology Information Overload: am I part of the problem?
Last week I had a very pleasant conversation with my publisher at Manning Publications. I had just signed a contract with Manning to write a book called “Learn Bash in a Month of Lunches” (which later changed to “Learn Amazon Web Services in a Month of Lunches“), and the two of us were speaking for the first time. I mentioned that, for strategic reasons, some technology publishers try to put out books in the 700-800-page range – sometimes whether or not the content requires it. I’m not sure this is a conscious process, but I can say that I’ve seen some books that are way longer than they need to be.
My own recent “Teach Yourself Linux Administration” book is less than a third the length of any of its competitors. That’s not to say that the other books in this genre aren’t worth reading – the ones I’ve seen most definitely are – but that it’s also possible that you can plan a book to focus on just the critical information that your readers will almost always need, and guide them to research some of the less common details on their own. You know, using that Internet thingy that they all seem to have.
In other words: short is the new long.
Now, let’s get back to my data problem.
Filtering large data stores
The log file contains records of requests against a large MySQL database. I quickly noticed that the second line of each record had a “Query_time:” value, usually of around 0.5 seconds or lower. I did, however, see some values that were higher than 1.0 seconds, and I thought I should try to find out just how high they went. Rather than visually scanning through screen after screen of text hoping to find something, I turned the problem over to cat (the Linux tool for reading text streams) and grep (for filtering).
Here’s how it looked:
cat mysql-slow.log | grep Query_time
Now, of course, all that will do is print all the many thousands of lines in the file containing the string “Query_time”. I only wanted those lines with a value above, say, 3.0. I’m sure there are more straightforward ways of doing this, but I can’t see why I shouldn’t just run a second grep filter against my initial results. Now that will be a bit more complicated since, by default, grep interprets the dot (.) as a metacharacter and not as a simple dot. To get around that problem, I’ll run grep with the -F (fixed strings) argument, and have it search for all occurrences of the number 3 followed by a dot (anything between 3.0 and 3.999, in other words).
cat mysql-slow.log.2016-03-15 | grep Query_time | grep -F '3.'
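Incidentally, you could probably get the same result without -F by escaping the dot with a backslash, and anchoring the match on the field name keeps a stray “3.” somewhere else on a line from sneaking in. Here’s a rough variation; it assumes the value directly follows the “Query_time:” label with a single space, which may not hold for every log format:

cat mysql-slow.log.2016-03-15 | grep 'Query_time: 3\.'

Either way, you should end up with the same set of matching lines.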
That gave me some useful output, and it was something I could repeat for 4-, 5-, and 6-second responses (besides 1 and 2, of course). There were still way too many results to read, but I could at least use them to browse comfortably through the source records within the file and look for any patterns that might point to a recurring problem. Unfortunately (or perhaps fortunately), I couldn’t see any obvious issues that would explain all the delays.
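By the way, if all I had wanted was the single worst response time in the whole file, sorting on the value itself would probably have gotten me there in one shot. Here’s a quick sketch that assumes the Query_time value is the third whitespace-separated field on those lines:

cat mysql-slow.log.2016-03-15 | grep Query_time | awk '{print $3}' | sort -n | tail -1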
I then wondered just how many instances there were of each delay block. That was easy: simply pipe my grep output to wc, which will count the number of lines (and words and characters – but I didn’t care about those).
cat mysql-slow.log.2016-03-15 | grep Query_time | grep -F '3.' | wc
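Since the line count was the only one of wc’s three numbers I actually cared about, adding the -l flag would have trimmed the output down to just that:

cat mysql-slow.log.2016-03-15 | grep Query_time | grep -F '3.' | wc -l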
Here’s what I got:
Response time | Number of instances (over 24 hours)
1-2 seconds   | 7,076
2-3 seconds   | 2,697
3-4 seconds   | 591
4-5 seconds   | 329
5-6 seconds   | 91
6-7 seconds   | 56
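As an aside, rather than repeating that grep-and-count cycle once per digit, the whole breakdown could probably be produced in a single pass with awk. This is just a sketch, again assuming the Query_time value is the third whitespace-separated field on each of those lines:

cat mysql-slow.log.2016-03-15 | grep Query_time | awk '{ buckets[int($3)]++ } END { for (b in buckets) printf "%d-%d seconds: %d\n", b, b+1, buckets[b] }' | sort -n

Run against the same log, that should reproduce the table above (plus a line for the sub-one-second requests).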
Those look like really big numbers. Taken together, they mean that 10,840 requests could not be completed in less than one second. But of course, I needed to put that into context: how many requests were there in total on that day? That was easy: just grep for all the Query_time lines and count them with wc.
cat mysql-slow.log.2016-03-15 | grep Query_time | wc
That one came to 333,257, meaning the slow responses accounted for just over 3% of the total requests, which, in the context of the problems the site had been experiencing, wasn’t significant.
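If you want to double-check that arithmetic, a one-line awk calculation will do it (the two totals are just the numbers from the steps above):

awk 'BEGIN { printf "%.2f%%\n", 100 * 10840 / 333257 }'

That prints 3.25%, comfortably in “just over 3%” territory.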