Wednesday 14 March 2012

RRDTool, and C# .NET bindings

RRDTool is pretty much ubiquitous in Linux/Unix cluster environments – or at least, it is in those I’ve worked with. It provides simple but highly reliable round-robin database services into which you can dump pretty much any metric you like. It’ll keep a running dataset at the granularity and duration you define; let you export data in chart and XML form; and even perform some useful analysis.

These days I’m typically working in Windows environments, and most clients I work for need some sort of dashboard to aggregate information about their clusters and grids. For a long time RRDTool was a pain to build on Windows, and there was no single .NET wrapper. The front runner – NHawk – actually spawned rrdtool processes rather than wrapping the underlying libs directly.

Now it turns out that someone’s contributed .NET bindings to the trunk of RRDTool. I figured I’d see if I could get it working, on 64bit Win7 (although actually x86 is fine for my app), under Visual Studio 2010.

I checked trunk out from SVN and took a look.

The first thing is to set up the dependencies – I tried to take a short cut and begin with a GTK+ package, but found it’s best to start with the files mentioned explicitly in the WIN32_BUILD_TIPS file.

There were a few apparently non-serious build errors in the rrdlib project. In several places explicit casts from void* were required on the results of malloc and realloc. Also, some switch statements skipped variable initialization; the latter can be fixed just by enclosing the case body in a block.
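To make those two fixes concrete, here is a minimal sketch. It assumes the errors came from the sources being compiled as C++ (plain C accepts the implicit void* conversion), and the function names copy_label and scale_value are hypothetical, not taken from the RRDTool source.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the RRDTool source: shows the explicit
   cast on malloc's void* result that a C++ compiler insists on. */
char *copy_label(const char *src)
{
    size_t len = strlen(src) + 1;
    char *dst = (char *) malloc(len);   /* cast required when built as C++ */
    if (dst != NULL)
        memcpy(dst, src, len);
    return dst;                         /* caller frees the copy */
}

/* Hypothetical function: each case that declares and initializes a
   variable is wrapped in braces, so a jump to a later case label can
   no longer skip over that initialization. */
int scale_value(int kind, int value)
{
    switch (kind) {
    case 0: {
        int factor = 10;                /* scoped to this block only */
        return value * factor;
    }
    case 1: {
        int offset = 5;                 /* scoped to this block only */
        return value + offset;
    }
    default:
        return value;
    }
}
```

The cast is harmless when the same file is compiled as C, so the change should be safe on other platforms too.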

Running through the tutorial with rrdtutorial raised one problem – an error claiming “No positional legend found”. The error was generated in rrd_graph_helper.c. The commit log showed that this file was a work in progress, so I updated to the preceding revision, rebuilt, and it apparently worked fine.

After that I built the .NET bindings – basically fine, although I wrangled everything to x86 in the end just to get started. But then – nothing was being exported. The lib solution refers to a .def file, but the file itself doesn’t appear to exist any more. So: I wrapped a copy of the public function declarations in #ifdef WIN32, and prefixed them with __declspec(dllexport), then rebuilt the .NET binding solution and: success.
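For anyone hitting the same wall, the export fix looks roughly like the sketch below. It is a variation on what I did rather than the exact change: the RRD_EXPORT macro and the demo_add function are hypothetical stand-ins, not names from the RRDTool headers, and the macro expands to nothing off Windows so the same header still builds elsewhere.

```c
/* RRD_EXPORT is a hypothetical macro name, not from the RRDTool headers.
   On Windows it marks a function for export from the DLL; elsewhere it
   expands to nothing. */
#if defined(_WIN32) || defined(WIN32)
#define RRD_EXPORT __declspec(dllexport)
#else
#define RRD_EXPORT
#endif

/* Hypothetical declaration standing in for one of the library's
   public functions. */
RRD_EXPORT int demo_add(int a, int b);

/* The definition repeats the macro so declaration and definition agree. */
RRD_EXPORT int demo_add(int a, int b)
{
    return a + b;
}
```

With the functions marked this way the linker emits the exports without needing a .def file at all.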

All looks good to go. Next step is to grab metrics from Windows performance counters in a C# app and see what I can present through RRDTool.

Saturday 10 March 2012

FEEE

It was like Christmas again: new blades arrived to add to our UAT and Production environments. And to start with, everything seemed rosy: twice the number of cores, and the installation of our server code went pretty smoothly.

Then I spun everything up, and watched all our processes die. No meaningful error messages, just some flurries of forlorn my-child-went-away pleas for help in our workflow manager logs.

Stumped, I wondered if I should be looking at the “socket forcibly closed” exceptions I could see that indicated my-child-went-away; but no. The child processes were just dying, and these errors were just an artefact. Later, the workflow manager eventually timed out the children, noting that they’d already finished with error code -254667849. Or something similar; the set varied.

I fired up eventvwr and there it was – a veritable storm of .NET framework errors in the application logs, with two characteristically repeated, one of which was:

.NET Runtime version 2.0.50727.3607 - Fatal Execution Engine Error (7A09795E) (80131506)

What followed turned out to be several hours of searching and getting rapidly downhearted. The whole of the internet seemed to have seen this very exception, and it seemed critically linked to either trying to run a process as a user that had no associated user profile on the box, or to vague “problems in the .NET 3.5 SP1 on Win 2k3 64bit boxes” that no-one – except one particularly determined individual and his team – seemed to get to the bottom of.

I checked: the user had an associated profile. I went to get some coffee.

I had to get a clearer picture of where our workers were dying. I installed the Debugging Tools for Windows and SysInternals suite on the box and then stopped.

The child processes were dying, but they were being spun up by a workflow manager. How was I going to get windbg to attach if I couldn’t spin up the process myself? They died immediately – there’s no way I’d be able to attach in time.

Luckily, Chris Oldwood knew the way. The gflags app in the Debugging Tools set lets you configure a lot about the debug environment – and even lets you specify, for a given app (under the Image File tab), a debugger to launch it under – like c:\program files (x86)\debugging tools for windows (x86)\windbg.exe, for example.

It even worked! I pushed some work through the system and watched windbg attach, and let it run through to the exception. !analyze -v took an age – over 10 minutes – without actually giving me any information, so I used procdump to make a full dump, and analyzed that on my own machine.

procdump -ma TaskEngine.exe d:\some\temp\directory\TaskEngine.dmp

There, !analyze -v brought up two things: one, that it was dying while trying to make a SQL Server connection, and two, that it might be having trouble associating a user context with the login.

Independent tools run from that box also failed to connect to the database. The database was definitely up – I had to talk to it to push work through the system. The native client installed on the box was the same version as that installed on the existing servers.

Something’s clearly missing, and as yet, I don’t know what.