Wednesday, September 23, 2009

Reinventing the Wheel - Perfmon Analyzer Notes

Can't sleep. Maybe if I take some notes on the app I've been thinking about, it'll leave my head.

So I spent some time today writing Perl scripts to help me out with some perfmon data. Customer sent me about 200 files with pretty much every Windows counter, each file about 70MB. That's a metric fuckton of stuff to wade through. So the scripts I wrote:
  1. Call relog to get rid of everything but the counters I want
  2. Call relog to consolidate all the records that appear to come from one server into a single file (okay, 1 and 2 really happen in the same loop. Who's counting?)
  3. Convert over to a per-server CSV file for easy Excelling
  4. Collapse into a couple of Excel files with some extra columns to make it easy to do PivotTable-type stuff
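The relog plumbing in those steps can be sketched as command builders (Python here rather than the Perl I actually used; the counter paths and file names below are placeholders, not the customer's actual set). relog takes multiple input logs at once, `-cf` points at a file listing the counters to keep, `-f` picks the output format, and `-o` names the output:

```python
# Placeholder counter list -- swap in whatever you actually care about.
COUNTERS = [
    r"\PhysicalDisk(*)\Disk Reads/sec",
    r"\PhysicalDisk(*)\Disk Writes/sec",
    r"\PhysicalDisk(*)\Avg. Disk Bytes/Read",
    r"\PhysicalDisk(*)\Avg. Disk Bytes/Write",
]

def relog_merge_filter_cmd(blg_inputs, counter_file, blg_out):
    """Build a relog command that merges several input logs into one binary
    log, keeping only the counters listed (one per line) in counter_file.
    Covers steps 1 and 2 in a single pass, same as the script."""
    return ["relog", *blg_inputs, "-cf", counter_file, "-f", "BIN", "-o", blg_out]

def relog_to_csv_cmd(blg_in, csv_out):
    """Build a relog command that converts a binary log to CSV (step 3)."""
    return ["relog", blg_in, "-f", "CSV", "-o", csv_out]
```

On a Windows box with relog on the PATH, `subprocess.run(cmd, check=True)` would actually execute each of these; the counter file is just COUNTERS written out one per line.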
All of that is cool. But there are still things I want to do with the data. For each server, I want to:
  • Find average IOPs
  • Find peak (top value, average of the top 2%, 4%, etc.) - Note: there's probably a proper statistical term for what I'm thinking of here. Get out your damned statistics book.
  • I think the other thing I want is the percentage of operations that land some number of standard deviations above average. Maybe I just want some metric of how "bursty" the system is.
  • All the other basic stuff - What percent is reads versus writes, what will that look like after a RAID penalty, how big are my average reads and writes. No reason it couldn't try to fit that data to some patterns and maybe guess at an IO profile.
  • Kill the fly that's found my monitor in the dark. Stupid horse farm.
  • Generate a pretty graph of the above
  • Do all of the above for both on-hours and off-hours work, and maybe separately for a backup window
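A minimal sketch of the per-server math above (helper names are my own; the "average of top 2%" is basically a tail mean above the 98th percentile, and the write penalties assumed are the usual 4x for RAID 5, 2x for RAID 1/10):

```python
from math import ceil
from statistics import mean, pstdev

def peak_mean(samples, top_frac=0.02):
    """Mean of the top `top_frac` of samples -- the 'average of top 2%' idea.
    Roughly a tail mean above the (1 - top_frac) percentile."""
    k = max(1, ceil(len(samples) * top_frac))
    return mean(sorted(samples, reverse=True)[:k])

def burst_fraction(samples, n_sigma=2):
    """Fraction of samples more than n_sigma standard deviations above the
    mean -- one candidate metric for how 'bursty' the system is."""
    m, s = mean(samples), pstdev(samples)
    if s == 0:
        return 0.0
    return sum(1 for x in samples if x > m + n_sigma * s) / len(samples)

def backend_iops(read_iops, write_iops, write_penalty=4):
    """Front-end IOPs translated to back-end disk IOPs after a RAID write
    penalty (assumed: 4 for RAID 5, 2 for RAID 1/10)."""
    return read_iops + write_iops * write_penalty
```

The on-hours/off-hours/backup-window split then falls out for free: filter the sample list by timestamp before feeding it to these.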
For the entire set of servers, and any subset of servers I choose (say, a SQL cluster), I want to:
  • Find the various IOP values above. There's probably a way to apply Erlang-style analysis and say "I want only a 1% chance of having peak IOPs above this given SAN capacity"
  • Associate the servers above with amounts of storage, and graph IOPs versus server and metaLUN size. Ideally, this results in a pretty 1/x graph and helps me easily identify flash and SATA candidates
  • I killed that fly. Woohoo!
  • Heck, there's no reason that the system couldn't take a swag at trying to devise a basic LUN layout.
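The "1% chance of exceeding capacity" idea can be sketched without any real Erlang queueing math, just as an empirical percentile over the fleet-wide time series (function names are my own invention):

```python
from math import ceil

def fleet_series(per_server_series):
    """Sum per-server IOPs sample-by-sample into one fleet-wide series.
    Assumes the per-server series are already aligned on timestamps."""
    return [sum(vals) for vals in zip(*per_server_series)]

def capacity_for_overflow(series, overflow_prob=0.01):
    """Smallest capacity such that only `overflow_prob` of the observed
    intervals exceed it -- i.e., the (1 - overflow_prob) empirical quantile.
    Not actual Erlang analysis, just percentile math on the history."""
    ordered = sorted(series)
    idx = min(len(ordered) - 1, ceil((1 - overflow_prob) * len(ordered)) - 1)
    return ordered[idx]
```

Plotting `capacity_for_overflow` per subset against each subset's allocated storage is the IOPs-versus-metaLUN-size graph; the servers way out on the IOPs axis are the flash candidates, the ones way out on the capacity axis are the SATA candidates.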
All of the above is entirely possible with Excel, but it would take an extraordinary amount of time. There's no reason it couldn't be automated. For that matter, all of the above assumes that the data has been ingested into a SQL database, which means I could normalize between perfmon, iostat, and whatever other stuff may be out there.
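One possible shape for that database, sketched with sqlite3 for illustration (the table and column names are my own; the point is that a `source` column on the counter table is all it takes to hold perfmon and iostat side by side):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE server  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE counter (id INTEGER PRIMARY KEY,
                      source TEXT,   -- 'perfmon', 'iostat', whatever else
                      name   TEXT);
CREATE TABLE sample  (server_id  INTEGER REFERENCES server(id),
                      counter_id INTEGER REFERENCES counter(id),
                      ts TEXT, value REAL);
""")
conn.execute("INSERT INTO server(name) VALUES ('SQL01')")
conn.execute("INSERT INTO counter(source, name) VALUES ('perfmon', 'Disk Transfers/sec')")
conn.execute("INSERT INTO sample VALUES (1, 1, '2009-09-23T02:00:00', 450.0)")

# Per-server averages become a one-line query instead of an Excel session.
avg = conn.execute("""SELECT AVG(value) FROM sample s
                      JOIN server sv ON sv.id = s.server_id
                      WHERE sv.name = 'SQL01'""").fetchone()[0]
```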

Since it's all in a big ol' database, that opens the doors to larger sets of statistics over time. No reason I wouldn't keep EVERYTHING in there.

Some of this stuff - the LUN layout, the IO profile - could take some work. Most of it is just combing some datasets and doing basic math.

So the thing is - Surely this has been done to death a thousand times before. Where is this application?
