Validating Financial Data with PDL
My job involves financial market data. A lot of financial market data. I take the market data from various sources and store it in a database for later analysis.
Being a programmer/analyst and not a mathematician with a Ph.D. in finance, my use for time series analytics falls into the "ensure correct data is being collected" category. But even then, some basic statistical analysis helps me preserve quality historical data for later use.
PDL is perfect for doing these kinds of calculations very quickly. Combined with PDL::Finance::TA, all the hard work is already done, and all I need to do is wire it all up.
Let's take a large set of random numbers. If our random number generator were perfect, we would expect the set to be evenly distributed, because each possible value is exactly as likely as any other value. If the values instead follow a normal distribution (the usual bell curve) and we calculate the standard deviation (stddev, a measurement of how dispersed the data set is), we would expect 99.7% of the points to be within 3 standard deviations of the mean (average).
So, if we write a test that checks whether a new (completely random) point is within 3 stddev, there is a 0.3% chance that the point will fail our test. If we bump that to 4 stddev, we should expect about 99.994% of the points to pass the test and about 0.006% to fail (1 of every 15,787). If I collect 500,000 (completely random) points in a day, then about 32 of them will fail our test.
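The arithmetic behind those percentages can be checked with a few lines of Perl. This is only a sketch: it assumes the points are normally distributed, the helper name tail_probability is made up for illustration, and POSIX::erfc needs Perl 5.22 or later.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: the chance that a normally distributed point lands
# more than $k standard deviations from the mean, on either side.
sub tail_probability {
    my ($k) = @_;
    return POSIX::erfc( $k / sqrt(2) );
}

my $points_per_day = 500_000;
for my $k ( 3, 4 ) {
    my $p = tail_probability($k);
    printf "%d stddev: %.4f%% fail, about 1 in %.0f, about %.0f per day\n",
        $k, 100 * $p, 1 / $p, $points_per_day * $p;
}
```

Raising the threshold from 3 to 4 stddev cuts the failure rate from roughly 1 in 370 to roughly 1 in 15,787, which is what keeps the daily list of false alarms short enough to review by hand.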
So I create a time series of random points. Then I create a new time series of the 30-day standard deviation of the original series. Then I compare the two and see which points are outliers.
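use PDL;
use PDL::Finance::TA;
# 5000 uniformly random points between 0 and 50
my $ts = random( 5000 ) * 50;
# Rolling standard deviation over a 30-point window, scaled by 1 deviation
my $stddev = ta_stddev( $ts, 30, 1 );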
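The comparison step could look something like the sketch below. To keep it runnable with plain PDL it does not call ta_stddev; instead a made-up helper, rolling_mean_sd, builds the rolling mean and stddev with PDL's stats function. The test data, the window of 30 and the threshold of 4 are illustrative assumptions, not values from a real feed.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical helper (not part of PDL): rolling mean and stddev over the
# $window points ending at each index. Entries before the first full
# window are left at zero.
sub rolling_mean_sd {
    my ( $x, $window ) = @_;
    my $n    = $x->nelem;
    my $mean = zeroes($n);
    my $sd   = zeroes($n);
    for my $i ( $window - 1 .. $n - 1 ) {
        my $lo = $i - $window + 1;
        # stats() returns the mean first and the N-1 deviation second.
        my ( $m, $s ) = stats( $x->slice("$lo:$i") );
        $mean->slice("($i)") .= $m;
        $sd->slice("($i)")   .= $s;
    }
    return ( $mean, $sd );
}

# Made-up data: a smooth wave with one bad tick injected at index 100.
my $series = 100 + sin( sequence(200) / 5 );
$series->slice('(100)') .= 150;

my ( $window, $k ) = ( 30, 4 );
my ( $mean, $sd ) = rolling_mean_sd( $series, $window );

# Only compare where a full window exists.
my $from     = $window - 1;
my $distance = abs( $series - $mean );
my $limit    = $k * $sd;
my $outliers = which( $distance->slice("$from:-1") > $limit->slice("$from:-1") ) + $from;

print "outlier indices: $outliers\n";
```

Note that each window here includes the point being tested, so a large spike inflates its own stddev. Testing against the window that ends just before the point would be stricter.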
Market data is not completely random; it's stochastic, which I interpret to mean "given value A1, the next value A2 will be somewhere between A1 +/- B". It's predicting (guessing) "B" that earns quants the big bucks. But, over the entire set of data, I know each previous value of B, which is the difference between A1 and A2, or the rate of change between two points. What I really want to know is whether the rate of change from A1 to A2 appears abnormal, say, if it's more than 4 stddev from the mean.
So I take my time series, create a new time series that is the rate of change for each point in the previous series, create another new time series that is the 30-day stddev of the previous time series, and then compare the rate of change with the stddev to see which ones are outliers.
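Here is a minimal sketch of that pipeline in plain PDL. The helper name change_outliers and the sample data are made up for illustration, and the stddev comes from PDL's stats function rather than ta_stddev, so treat it as one possible wiring rather than the definitive one.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical helper (not part of PDL): return the indices of points whose
# change from the previous point is more than $k stddev away from the mean
# of the $window changes that came before it.
sub change_outliers {
    my ( $ts, $window, $k ) = @_;

    # Rate of change: element i holds ts[i+1] - ts[i].
    my $change = $ts->slice('1:-1') - $ts->slice('0:-2');

    my @flagged;
    for my $i ( $window .. $change->nelem - 1 ) {
        my ( $lo, $hi ) = ( $i - $window, $i - 1 );
        my ( $mean, $sd ) = stats( $change->slice("$lo:$hi") );
        my $distance = abs( $change->at($i) - $mean );
        # Change i describes the move into point i + 1.
        push @flagged, $i + 1 if $distance > $k * $sd;
    }
    return @flagged;
}

# Made-up data: a smooth wave with a single bad tick at index 120.
my $prices = 100 + sin( sequence(200) / 5 );
$prices->slice('(120)') .= $prices->at(120) + 10;

my @suspect = change_outliers( $prices, 30, 4 );
print "suspect points: @suspect\n";
```

A single bad tick shows up twice in this scheme: once for the jump into it and once for the jump back out. If the trailing window is completely flat its stddev is zero, so any movement at all after a flat stretch gets flagged.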
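Finally, I should also make sure that my source is still updating, as it is very rare for most series to hold the same value twice in a row, let alone for an entire week. So let's check for flatness by using stddev: if the rolling standard deviation drops to zero, every value in the window is identical and the source has probably stopped updating.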
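A flatness check along those lines might be sketched as follows. The helper name flat_windows, the window of 5 points and the tolerance are assumptions chosen for the example; a real feed would need a window that matches how often it normally updates.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical helper (not part of PDL): return every index at which the
# last $window points have a standard deviation of (almost) zero.
sub flat_windows {
    my ( $ts, $window, $tolerance ) = @_;
    my @flat;
    for my $i ( $window - 1 .. $ts->nelem - 1 ) {
        my $lo = $i - $window + 1;
        my ( $mean, $sd ) = stats( $ts->slice("$lo:$i") );
        # A small tolerance guards against floating-point rounding.
        push @flat, $i if $sd <= $tolerance;
    }
    return @flat;
}

# Made-up data: a steadily rising series that gets stuck from index 10 to 16.
my $feed = sequence(20);
$feed->slice('10:16') .= 42.5;

my @stuck = flat_windows( $feed, 5, 1e-9 );
print "flat windows end at: @stuck\n";
```

The first flat window is only reported once the repeated value has filled the whole window, so the window size also sets how long a stale source goes unnoticed.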
PDL and TAlib make this all incredibly simple, so I can get on with my real work (fragging lamers in Quake).