Plumbing Life's Depths - Research in Action 2012

Spent the afternoon at Research in Action (at the University of Toronto). Highlights:

A big classical (i.e. right out of the textbook) neural network used for road detection in satellite imagery. Okay results, with huge amounts of computing power involved. I'm not sure it's really all that much better than a hand-coded solution, but it did seem to work for the (fairly heavily constrained) environment in which the presenter was working.
Three modelling projects, as in formalized computer modelling, particularly for requirements gathering. Very abstract, requiring huge buy-in from the client, and seemingly very much about working around dysfunctional organizations and/or a lack of vision. Upside: there are plenty of dysfunctional organizations and not much vision to go around. The tools they were developing seemed fine if you were committed to the formal modelling approach, but none of them seemed to make that approach good; they just tweaked the ways in which it was bad. My gut reaction was: if I'm in a project using these tools, I want *out*.
N-dimensional compression applied to WikiLeaks word-frequency analysis. Basically, map documents into 2D space based on similarity, using an algorithm that keeps "close" things close and makes "distant" things repel, to produce an interesting, but not necessarily good, map. There really didn't seem to be much focus on whether what was produced was actually valuable, i.e. whether it made the document-space more useful for a user who wanted to explore it (the project seemed to lack a human fitness function). That said, the algorithm got me thinking about how I would do such dimensional compression.
A neat little trick for proximity/similarity search in (sparse) high-dimensional space: divide each B-bit vector search key into M = Distance+1 bit-blocks of roughly B/(Distance+1) bits each. Two keys that differ in at most Distance bits must agree exactly on at least one block, so you do M exact-match queries on bit-sequences, each O(1), and then a linear scan over the linked records to keep those which are within the desired distance. The approach had a few scaling artifacts that seemed likely to limit real-world applicability. Basically, as you increase B (more data about each document), your efficiency becomes increasingly dependent on the target distance: a larger distance (i.e. a fuzzier search) means you need a higher M to be able to get exact matches on bit strings, which means more O(1) queries and more linear scanning on (much) larger sets of matching records. Still, an interesting approach, particularly if you're looking for extremely similar records.
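
The bit-block trick in the last highlight can be sketched in a few lines of Python. This is only my reading of the idea, not the presenters' code: the names (split_blocks, BlockIndex) are made up for illustration, keys are assumed to be fixed-width integers compared by Hamming distance, and each block gets a plain in-memory dict as its exact-match table.

```python
from collections import defaultdict


def split_blocks(key, total_bits, num_blocks):
    """Split a total_bits-wide integer into num_blocks contiguous bit-blocks.

    Blocks are taken from the low-order bits upward; when total_bits is not
    evenly divisible, the first few blocks are one bit wider.
    """
    base, extra = divmod(total_bits, num_blocks)
    blocks = []
    shift = 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((key >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


class BlockIndex:
    """Hypothetical index answering 'all keys within max_distance bits of q'."""

    def __init__(self, total_bits, max_distance):
        self.total_bits = total_bits
        self.max_distance = max_distance
        # Pigeonhole: with max_distance + 1 blocks, two keys that differ in
        # at most max_distance bits must share at least one identical block.
        self.num_blocks = max_distance + 1
        self.tables = [defaultdict(list) for _ in range(self.num_blocks)]

    def add(self, key):
        blocks = split_blocks(key, self.total_bits, self.num_blocks)
        for table, block in zip(self.tables, blocks):
            table[block].append(key)

    def search(self, query):
        # One exact-match lookup per block gathers the candidate records.
        candidates = set()
        blocks = split_blocks(query, self.total_bits, self.num_blocks)
        for table, block in zip(self.tables, blocks):
            candidates.update(table.get(block, ()))
        # Linear scan over the candidates keeps only the true near matches.
        return sorted(
            k for k in candidates
            if bin(k ^ query).count("1") <= self.max_distance
        )


index = BlockIndex(total_bits=8, max_distance=1)
for key in (0b00001111, 0b00001110, 0b11110000):
    index.add(key)
print(index.search(0b00001111))  # prints [14, 15]
```

Note how the scaling worry shows up directly: max_distance fixes the number of tables, and at a given key width more tables means shorter blocks, so each exact-match bucket holds more records for the linear scan to wade through.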

All in all, an interesting intellectual exercise. I didn't come away with much that seems particularly applicable to my day-to-day work, but I did at least get to toy with some interesting problems.

Written by Mike on April 1, 2012 in Snaking.
