Validating Financial Data with PDL


My job involves financial market data. A lot of financial market data. I take the market data from various sources and store it in a database for later analysis.

Being a programmer/analyst and not a mathematician with a Ph.D. in finance, my use for time series analytics falls into the "ensure correct data is being collected" category. But even then, some basic statistical analysis helps me preserve quality historical data for later use.

PDL is perfect for doing these kinds of calculations very quickly. Combined with PDL::Finance::TA, all the hard work is already done, and all I need to do is wire it all up.

Let's take a large set of random numbers. If our random number generator were perfect, we would expect the set to be evenly distributed, because each possible value is exactly as likely as any other. Many real-world measurements instead cluster around their average in a normal (bell-curve) distribution. For data like that, if we calculate the standard deviation (stddev, a measurement of how dispersed the data set is), we would expect about 99.7% of the points to fall within 3 standard deviations of the mean (average).

So, if we write a test that checks whether a new (completely random) point is within 3 stddev, there is a 0.3% chance that the new point will fail our test. If we bump that to 4 stddev, we should expect about 99.994% of the points to pass the test and about 0.006% to fail (1 of every 15,787). If I collect 500,000 (completely random) points in a day, then about 32 of them will fail our test.
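The arithmetic above can be checked with a few lines of Perl. This is a minimal sketch that assumes the points are normally distributed and that `erfc` is importable from POSIX (Perl 5.22 or later); `tail_probability` is a hypothetical helper name, not part of any library.

```perl
use strict;
use warnings;
use POSIX qw(erfc);    # C99 math functions, assumed available (Perl 5.22+)

# Two-sided tail probability: the chance that a normally distributed
# point lands more than $k standard deviations away from the mean.
sub tail_probability {
    my ($k) = @_;
    return erfc( $k / sqrt(2) );
}

my $points_per_day = 500_000;
for my $k ( 3, 4 ) {
    my $p = tail_probability($k);
    printf "%d stddev: %.4f%% fail, 1 in %.0f, about %.0f per day\n",
        $k, $p * 100, 1 / $p, $p * $points_per_day;
}
```

Raising the threshold from 3 to 4 stddev cuts the expected false alarms per day from over a thousand to a few dozen, which is why the wider band is the more practical test at this volume.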

So I create a time series of random points. Then I create a new time series of the 30-day standard deviation of the original series. Then I compare the two and see which points are outliers.

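use PDL;
use PDL::Finance::TA;

# 5,000 uniformly distributed random points between 0 and 50
my $ts = random( 5000 ) * 50;
# Rolling standard deviation over a 30-point window
my $stddev = ta_stddev( $ts, 30, 1 );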
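The snippet above stops before the comparison step. One way to finish it, sketched here with core PDL only so it does not depend on the exact output shape of `ta_stddev`, is to compute the mean and stddev of the points before each point and flag anything too far from that mean. The helper names `rolling_stats` and `outliers`, the use of `grandom` (normally distributed values), and the injected bad tick are all assumptions made for illustration.

```perl
use strict;
use warnings;
use PDL;

# Mean and sample stddev of the $window points *before* each point,
# so a spike does not inflate its own threshold.
sub rolling_stats {
    my ( $series, $window ) = @_;
    my $n    = $series->nelem;
    my $mean = zeroes($n);
    my $sd   = zeroes($n);
    for my $i ( $window .. $n - 1 ) {
        my ( $m, $prms ) =
            $series->slice( ( $i - $window ) . ':' . ( $i - 1 ) )->stats;
        $mean->set( $i, $m->sclr );
        $sd->set( $i, $prms->sclr );
    }
    return ( $mean, $sd );
}

# Indexes of points more than $limit stddev away from the rolling mean.
sub outliers {
    my ( $series, $window, $limit ) = @_;
    my ( $mean, $sd ) = rolling_stats( $series, $window );
    my $flag = abs( $series - $mean ) > $limit * $sd;
    $flag->slice( '0:' . ( $window - 1 ) ) .= 0;    # no history yet
    return which($flag);
}

my $ts = grandom(5000) * 5 + 50;    # normally distributed test data
$ts->set( 2500, 500 );              # inject one obviously bad tick
print "Outliers at: ", outliers( $ts, 30, 4 ), "\n";
```

Because the stddev is estimated from only 30 points, a few ordinary points may also be flagged on random data; the injected tick at index 2500 should always be among the results.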

Market data is not completely random; it's stochastic, which I interpret to mean "given value A1, the next value A2 will be somewhere between A1 +/- B". Predicting (guessing) "B" is what earns quants the big bucks. But over the entire set of data, I know each previous value of B, which is the difference between A1 and A2, or the rate of change between two points. What I really want to know is whether the rate of change from A1 to A2 appears abnormal, say, if it's more than 4 stddev from the mean.

So I take my time series, create a new time series that is the rate of change for each point in the previous series, create another new time series that is the 30-day stddev of the previous time series, and then compare the rate of change with the stddev to see which ones are outliers.
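Those three steps might look something like the following with core PDL. This is only a sketch: it takes "rate of change" to mean the plain difference between consecutive points, uses a 30-point window, and `abnormal_changes` is a hypothetical name. The test data is a random walk with one artificial jump added at index 600.

```perl
use strict;
use warnings;
use PDL;

# Returns the indexes (into the original series) of points whose change
# from the previous point is more than $limit stddev away from the mean
# of the $window changes before it.
sub abnormal_changes {
    my ( $series, $window, $limit ) = @_;

    # Step 1: the change between each pair of consecutive points
    my $change = $series->slice('1:-1') - $series->slice('0:-2');

    my @hits;
    for my $i ( $window .. $change->nelem - 1 ) {
        # Step 2: mean and stddev of the preceding $window changes
        my ( $mean, $sd ) =
            $change->slice( ( $i - $window ) . ':' . ( $i - 1 ) )->stats;

        # Step 3: compare this change against the rolling band
        push @hits, $i + 1
            if abs( $change->at($i) - $mean->sclr ) > $limit * $sd->sclr;
    }
    return @hits;
}

# A random walk around 100 that jumps by 25 at index 600
my $prices = 100 + cumusumover( grandom(1000) * 0.5 )
           + 25 * ( sequence(1000) >= 600 );
print "Abnormal changes at: @{[ abnormal_changes( $prices, 30, 4 ) ]}\n";
```

Testing the change rather than the value itself means a series that drifts slowly to a new level is left alone, while a sudden jump of the same size is flagged.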

Finally, I should also make sure that my source is still updating. For most series, it is very rare to see the same value twice in a row, let alone for an entire week. A run of identical values has a stddev of zero, so let's check for flatness by using stddev.
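A flatness check along those lines could be as small as this. It is a sketch using core PDL; `is_flat`, the window length, and the tolerance are assumptions, and the window is assumed to hold at least two points and be no longer than the series.

```perl
use strict;
use warnings;
use PDL;

# True when the last $window points have (almost) no spread, which
# suggests the source has stopped updating.
sub is_flat {
    my ( $series, $window, $tolerance ) = @_;
    $tolerance //= 1e-9;    # allow for floating-point rounding
    my $recent = $series->slice( -$window . ':-1' );
    my ( $mean, $sd ) = $recent->stats;
    return $sd->sclr < $tolerance ? 1 : 0;
}

my $stale = pdl( 10.1, 10.3, 10.2, 10.2, 10.2, 10.2, 10.2 );
my $live  = pdl( 10.1, 10.3, 10.2, 10.4, 10.1, 10.5, 10.2 );
print "stale series is flat: ", is_flat( $stale, 5 ), "\n";
print "live series is flat:  ", is_flat( $live,  5 ), "\n";
```

The window length sets how long a feed may stay silent before it is reported, so a thinly traded instrument would probably need a longer window than a busy one.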

PDL and TA-Lib make this all incredibly simple, so I can get on with my real work (fragging lamers in Quake).

Announcing Statocles


Static site generators are popular these days. For small sites, the ability to quickly author content using simple tools is key, and lower-cost (even free) hosting, often without any dynamic capabilities, helps keep to a budget. For larger sites, the ability to serve content quickly and cheaply is beneficial, and since most pages are read far more often than they are written, generating a full web page to store on the filesystem can improve performance (and lower costs).

I like the convenience of using GitHub Pages to host project-oriented websites. The project itself is already on GitHub, so why not keep the website closely tied to it so it doesn't get out-of-date? For an organization like the Chicago Perl Mongers, GitHub can even host custom domains, allowing easy collaboration on websites.

It's through the Chicago.PM website that I was introduced to Octopress, a blogging engine built on Jekyll. It's through using Octopress that I decided to write my own static site generator, Statocles.


Managing SQL Data with Yertl


Originally posted on -- Managing SQL Data with Yertl

Every week, I work with about a dozen SQL databases. Some are Sybase, some MySQL, some SQLite. Some have different versions in dev, staging, and production. All of them need data extracted, transformed, and loaded.

DBI is the clear choice for dealing with SQL databases in Perl, but there are a dozen lines of Perl code in between me and the operation that I want. Sure, I've got modules and web applications and ad-hoc commands and scripts that perform certain individual tasks on my databases, but sometimes those things don't quite do what I need right now, and I just want something that will let me execute whatever SQL I can come up with.

Yertl (ETL::Yertl) is a shell-based ETL framework. It's under development (as is all software), but included already is a small utility called ysql to make dealing with SQL databases easy.


Manage Boilerplate with Import::Base


Originally posted as: Manage Boilerplate with Import::Base on

Boilerplate is everything I hate about programming:

  • Doing the same thing more than once
  • Leaving clutter in every file
  • Making it harder to change things in the future
  • Eventually blindly copying without understanding (cargo-cult programming)

In an effort to reduce some of my boilerplate, I wrote Import::Base, a module to collect and import useful bundles of modules, removing the need for long lists of use ... lines everywhere.
