Thursday, September 13, 2007

Knowledge extraction with 2 dimensional regular expressions?

Some of my best and worst ideas come when I'm sitting on the toilet. I pretty much have an informal bookshelf next to it, which I use to refresh and reinforce my memories of galois theory, probability, or in this case, automata from time to time. I'm not getting any younger (almost 36) -- it's been ages since college -- so whatever time I can use to reinforce those neurons and keep those skills from slipping away is golden.

So, there I was, automata book in hand, thinking about how I could simplify and modularize, Tardis, Timepedia's web crawler, automated knowledge discovery extractor, and temporal query engine. One part of what Tardis does is look for time related information in documents, and extract timeseries and timeline information.

How it does this exactly, I won't say at this time, but it uses a large battery of heuristics and other classical algorithms. The problem is, this logic is cemented inside of Java code, a sort of implicit 'expert system' with a huge number of cases, and complex interactions. Extending this to support newly discovered rules is hard enough with the source code in hand, but impossible for the end user.

Wouldn't a DSL be better? But if so, which one? An expert system like CLIPS/JESS? Those risk alot of complexity as well as scalability issues. If this was simple text extraction, I'd give the users the ability to author custom regexp match/rewrite rules for particular documents, sort of a mini-online-SED. As you can imagine, a lot of time related information is two dimensional in nature, stored in documents in tabular form, so ordinary regexp's don't capture the problem.

Aha! What about two dimensional regular expressions? Do they even exist? By the time I had gotten off the toilet, I had already devised replacements for regular expression concatenation operators in the form of row/column concatenation operators. I knew however that I was not the first to think of this, so a quick Google Scholar search led me to this paper: Two-Dimensional Languages (1997), Dora Giammarresi, Antonio Restivo, Handbook of Formal Languages

At first glance, this looks like a fertile field, previously applied for recognizing graphical features in pictures. Could I extend it to my needs? More on that later. First, let's look at an example.

As you may recall, a classical regular expression consists of operations like concatenation, union, and Kleene star. Concatenation is usually represented by placing two symbols next to one another, such as "ab" which means "a + b" where + is the concatenation operation. Union is typically represented via the pipe character "|", so "a|b" means "a union b". Kleene star of course is the union of all possible concatenations, e.g. "(a|b)*" = e, a, b, a+b, a+a, b+b, aa+b, ab+a, etc.

The two dimensional analogues of these would row concatenation, column concatenation, and row/column Kleene star operators. Let us denote column concatenation as "+", and row concatenation as "/", so "a + b" recognizes "ab" as long as they are in the same row, and "a/b" recognizes "ab" as long as they are in the same column. Column-wise Kleene star as "+*" and row wise Kleene star as "/*".

Now let's try a problem, recognize the set of all "chessboards", encoded with the symbols 'b' for black, and 'w' for white squares. Here's an example chessboard:


b w b w b w b w
w b w b w b w b
b w b w b w b w
w b w b w b w b
...


First, we match the first row with "(b + w)+*". Next, we match the second row with "(w + b)+*", and then we concatenate these two expressions row wise, to form

"(b + w)+* / (w + b)+*"

This says match "bwbwbw..." followed on the next row by "wbwbwb...". Now all we've got to do is repeat this pattern row rise.

"( (b + w)+* / (w + b)+) )/*"

This will match the above chessboard pattern. I realize it may be confusing to represent column wise concatenation via '+' instead of the empty string as is traditional, but I wanted to make the separate row and column operators explicit and equal in importance.

Now, I was delighted that something like this could be done, and furthermore that theoretical results for NFAs, DFAs, and closure properties existed for such languages, but I still faced the problem that knowledge extraction seems to require processing values semantically, not just lexically, so I tabled it as just a nifty and cool area of research, but with no application to what we were doing.

That was, until I showed it to Mat, who came up with an additional step that would enabled 2D regexps such as those described above to be used by Tardis for time related extraction. And that story, I will leave for a future article. :) The lesson I learned however, is not to give up on an idea that may seem impractical until you've got input from other people.

In the mean time, check out the paper I linked.

-Ray

2 comments:

Cristian Strat said...

I fail to understand how you can call this thing a 2-dimensional regular expression.

The example you provided treats the "text" as a continuous, 1D, stream of bytes. Just insert \n in the regular expression to mark the end of a row and you'll be matching chess boards immediately.


Try to design a 2D regular expression that matches rectangular areas of vowels. Make sure to test it with multiple rectangles that intersect column-wise, but not row-wise.

Ray Cromwell said...

It is certainly a 2-dimensional regular expression, the fact that the language it is recognizing (l in L such that l is a chessboard) may be recognized by 1-dimensional regular expression does not detract from that, anymore than if I design a context free grammar to recognize the language "abababab", does not mean the grammar isn't a CFG, it just means the language "abababab" doesn't require a CFG to accept it.

Perhaps it's a poorly chosen trivial example, but it's not my example, it's the first example given in the paper I cited. There are other more complex examples in the paper if you take a look.