Wednesday, September 12, 2007

When algorithms work better than you dream...

So, Timepedia is building a time machine, right? It sounds pretentious, but for us it's really a geeky moniker of love for our project. After all, is Google's "search engine" really an "engine"? How much horsepower does it have? :)

One part of Timepedia is already familiar to readers of this blog: Chronoscope. With Chronoscope, we are attempting to build an open platform of visualization tools for time-oriented data, in much the same way that Google Maps and Google Earth deal with spatial data.

However, what good is a time machine if you don't know where to go, or don't understand what you're looking at? Timepedia has another platform, aimed at data mining time-related information, called Everett (owned and implemented by another Timepedia founder, Mat). Everett is a collection of many algorithms for both data mining and forecasting, some of them bleeding-edge academic research. When we started, we weren't sure which of them would work, or how well. We only knew that they had promising features, so Everett was less of an end-user product and more of a research platform.

One of the tools in Everett is an algorithm that lets us find hidden recurring patterns in data, even in the presence of noise or scaling. Last week, we tested the algorithm on real-life data for the first time, and had one of those "holy cow!" moments where your own code surprises you. Those don't occur too often for me personally.
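To make "recurring patterns, even in the presence of noise or scaling" a bit more concrete, here is a toy sketch. It is not Everett's algorithm (this post doesn't describe how that works); it is a plain brute-force stand-in that z-normalizes every window so that offset and scale drop out, then looks for the closest pair of non-overlapping windows. The names `znorm` and `find_motif`, the window length, and the synthetic series are all made up for illustration.

```python
import math
import random

def znorm(window):
    """Rescale a window to zero mean and unit variance; None if it is flat."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std < 1e-12:
        return None  # a flat window has no shape to compare
    return [(x - mean) / std for x in window]

def find_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    length-m windows under z-normalized Euclidean distance (brute force)."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i, a in enumerate(windows):
        if a is None:
            continue
        for j in range(i + m, len(windows)):  # skip overlapping windows
            b = windows[j]
            if b is None:
                continue
            d = math.dist(a, b)
            if d < best[0]:
                best = (d, i, j)
    return best

# Synthetic series: random background with one shape planted twice,
# the second time shifted, scaled 5x, and slightly noisy.
rng = random.Random(0)
shape = [0, 1, 3, 6, 3, 1, 0, -2]
series = [rng.uniform(-1, 1) for _ in range(60)]
series[10:18] = shape
series[40:48] = [5 * x + 100 + rng.uniform(-0.05, 0.05) for x in shape]

distance, first, second = find_motif(series, len(shape))
print(first, second, round(distance, 3))
```

The nested loop is quadratic in the series length, so this is only good for toy inputs; it is nowhere near what you would want for an 18,000 point series served interactively.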

To give you an example, I fed Everett an 18,000 data point series of federal funds rates over the last few decades, and it identified a pattern that occurred 3 times in history. Visualizing this in another tool we call Timelord (a Chronoscope married to Everett and other server-side services), I was puzzled as to the significance of these three sequences. My co-founder Shawn spent about an hour Googling until he found the correlation: these sequences corresponded to international financial/currency crises (such as the Mexican currency crisis), in which the Fed was forced to take action. The lead-up to the crises appeared identical each time. A fluke? It sure as hell was interesting.

I was worried it was a fluke, so I tried something more mundane: a time series of unemployment benefit expenditures in Indiana. Once again, Everett identified a series of puzzling repetitive sequences. What were they? The dates looked very familiar: 1980-81, 1990-91, 2000-01... were they recessions? To check, I used Timelord to overlay the National Bureau of Economic Research's official measure of economic expansions and contractions, and sure enough, these patterns intersected with NBER recessions. One other interesting property stood out: the patterns returned preceded the recessions. That is, Everett was showing us a pattern that leads up to a recession.
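The overlay check can also be done mechanically. The sketch below is only an illustration of the idea, not Timelord code: `leads` and its `max_gap_days` cutoff are invented here, the recession intervals are NBER peak and trough months pinned to the first of the month, and the pattern windows are placeholders, not Everett's actual output.

```python
from datetime import date

def leads(pattern, recession, max_gap_days=365):
    """True if the pattern starts before the recession does and ends no more
    than max_gap_days before the recession starts (or runs into it)."""
    p_start, p_end = pattern
    r_start, _r_end = recession
    return p_start < r_start and (r_start - p_end).days <= max_gap_days

# NBER recessions (peak month to trough month), first of the month.
recessions = [
    (date(1980, 1, 1), date(1980, 7, 1)),
    (date(1981, 7, 1), date(1982, 11, 1)),
    (date(1990, 7, 1), date(1991, 3, 1)),
    (date(2001, 3, 1), date(2001, 11, 1)),
]

# Placeholder pattern windows, made up purely to exercise the check.
patterns = [
    (date(1989, 10, 1), date(1990, 6, 1)),
    (date(1995, 1, 1), date(1995, 9, 1)),
    (date(2000, 6, 1), date(2001, 2, 1)),
]

leading = [p for p in patterns if any(leads(p, r) for r in recessions)]
for start, end in leading:
    print(start, end)
```

A cutoff like `max_gap_days` is a judgement call: set it too generously and almost any window "leads" some recession, which is why eyeballing the overlay is still worthwhile.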

How cool is that? Ambition got the best of me, and I went for broke: I tried a historical time series of average hurricane strength (Saffir-Simpson scale), as well as a yearly count. There appears to be good evidence that a 40-60 year hurricane cycle exists, and I was hoping that Everett could find these patterns, but alas, it did not.

Still, the initial results are promising, and we hope that Everett will give average users an ability to query time in ways that have not been previously available.

So, if you're wondering why I haven't released Chronoscope yet, it's because I've been working on integrating Timelord with Everett. :)

-Ray
P.S. Timelord is another GWT application, making it our 4th major GWT application. Everett is C++ coupled via JNI to a Java/GWT RPC interface, since performance is absolutely critical in Everett.

5 comments:

Corbin said...

Have any papers been published about the Everett pattern-finding algorithm?

It sounds fascinating.

Ray Cromwell said...

The algorithm being used isn't ours. It was published in the last few years and presented at KDD'07. I hope I didn't convey the impression that we had invented it (we've only tweaked some of its parameters), since that would be a serious disservice to its inventor, who I think deserves credit for blazing a revolutionary trail in time series analysis. We will fully disclose all of the relevant citations and the inner workings of Everett when Timepedia launches. -Ray

Laszlo Kozma said...

What you mention is pretty standard stuff in data mining. It's called 'frequent episode discovery', and it's used in all kinds of industrial monitoring systems and in telecom.

Unknown said...

Can I use this algorithm to predict lottery numbers? :)

Ray Cromwell said...

Laszlo,
We are not using FED; we are using a new motif discovery algorithm. FED is far, far too slow for our purposes: we need to detect motifs in datasets 10k or 100k points long, with millisecond response times and oodles of concurrent web users.

We're also not interested in the rules output that FED gives you. Instead, we overlay the historical timelines and news feeds that best fit the discovered motifs, so that human beings can make a judgement call and write an interpretation.