I've been faced with a bit of a problem lately: how do you test the quality of results produced from a changing dataset? It's easy to write unit tests for the components of an application and make sure they're producing the expected results, but how do you test end to end? Andraz from Zemanta asked the same question last year: how do you test a complex system that is trying to mimic being smart?
When you have new content in the system, you get completely new related stories and you have to go back and have a human judge them. There is the expansion of the evaluation data - as you add new tests you generally can't send them through previous versions of your algorithms, since that would be prohibitively expensive. And there are statistics that hardly give you an overview of what exactly your changes caused, just a few final numbers. And then there is the problem of pipelining the processing. Even if you improve the first stage, end results might be worse, since you've already adapted the second stage to the previous first one. So you need to actually evaluate each part of the system in isolation and then together. In the end you find that you spend a disproportionate amount of time evaluating even the smallest changes. So you are in danger of just skipping that evaluation, which naturally you shouldn't.
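The stage-isolation point is worth making concrete. Here's a minimal sketch of the idea, with entirely hypothetical `retrieve` and `rank` stages standing in for the real pipeline: each stage gets checked against a frozen fixture on its own, and then the composed pipeline gets checked end to end, because a stage tuned against the old version of its upstream can make an upstream improvement look like a regression.

```python
def retrieve(query, index):
    """Stage 1 (hypothetical): return ids of documents containing the query term."""
    return [doc_id for doc_id, text in index.items() if query in text]

def rank(candidates, index, query):
    """Stage 2 (hypothetical): order candidates by how often the term appears."""
    return sorted(candidates, key=lambda d: -index[d].count(query))

# A small frozen fixture, so results are judged against known inputs.
index = {
    1: "python testing python",
    2: "testing search quality",
    3: "unrelated content",
}

# Stage 1 in isolation: did retrieval find the right candidate set?
assert set(retrieve("testing", index)) == {1, 2}

# Stage 2 in isolation: given a *fixed* candidate set (not stage 1's
# live output), does ranking order it as expected?
assert rank([1, 2], index, "python") == [1, 2]

# End to end: the composed pipeline on the same fixture.
assert rank(retrieve("python", index), index, "python") == [1]
```

The fixture is the crucial part: it's the one input you do control, which is exactly what a live, changing index takes away from you.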
The fundamental problem you run up against is that the index is constantly changing, and it's meant to change, so it's hard to automatically test the output without a clear idea of what's going in. It's also difficult to get an accurate picture of how small changes in code affect the general results if you're just using a testing index with a small dataset.

One way to approach it is to gauge result quality by measuring user interaction. Basically, there are things users do when they get the results they're expecting, and things they do when they haven't found what they were looking for. So if you can measure how they're reacting, you can get an idea of quality. At the moment I'm a lone developer putting all the data into the index, and I have a good idea of what I should be seeing out of it if it's actually working. Soon though, as we get a prototype system developed, we'll be rolling it out to a few more internal beta users, and we won't be in control of the inputs or outputs anymore. So soon enough we're going to be faced with trying to measure the quality of the results we're giving users in a dynamic system - expect to hear much more about this as we go on.