Seem to need hidden-markov-models for text extraction...

So in order to "seed" listener with text-as-it-would-be-spoken for coding, I've built up a tokenizer that parses a file and attempts to produce a model of what would have been said to dictate that text. The idea is to generate a few hundred megabytes of sample statements that can then be used to build a "python coding" or "javascript coding" language model. Thing is, this is actually a pretty grotty/nasty problem, particularly when dealing with "run together" words such as `asstring` or `mkdtemp`. You can either be very strict and only allow specifically defined words (missing out on a lot of statements), or you can attempt to pull the words apart.

If you attempt to pull apart the words you run into dragons. Any big English dictionary has almost all short character sequences defined, so "shutil" can be broken up into "shu" and "til" rather than "sh" and "util"... and really there's nothing wrong with the "shu" "til" guess other than that it's not statistically likely.
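To make the ambiguity concrete, here is a minimal sketch that lists every way to split a run-together word using a dictionary. The `all_splits` helper and the tiny `TOY_VOCABULARY` are made up for illustration; they are not part of the actual tokenizer, and a real English dictionary would produce far more candidates.

```python
def all_splits(word, vocabulary):
    """Return every way to split `word` into pieces found in `vocabulary`."""
    if not word:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in vocabulary:
            # Keep this prefix and recurse on whatever is left over.
            for rest in all_splits(word[end:], vocabulary):
                results.append([head] + rest)
    return results


# Hypothetical stand-in for "any big English dictionary".
TOY_VOCABULARY = {"sh", "shu", "til", "util"}

print(all_splits("shutil", TOY_VOCABULARY))
# [['sh', 'util'], ['shu', 'til']]
```

Both answers are valid as far as the dictionary is concerned; nothing here says which one a person would actually have meant.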

Interestingly, this is a big part of voice dictation: guessing what sequence of actions generated a sequence of observed events. When doing voice dictation you're asking "what spoken words would have generated this sequence of phonemes", while in corpus preparation you're trying to guess "what set of target words and combining rules would have generated this sequence of characters".

I'm beginning to think that I should use the same basic strategy as I used for translating IPA to ARPAbet, namely do a very strict pass to produce a statistical guess as to which words are common/likely, then use that in a second pass where we attempt to guess the splits. So if we see the words OpenGL and Context dozens of times, then when we do the next pass and see OpenGLContext we will likely read it as "OpenGL" "Context", not "Open" "GLC" "on" "text".
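A rough sketch of that two-pass idea, assuming the strict pass has already produced word counts. Note that this is a plain unigram model with a dynamic-programming search, not a full hidden Markov model, and the `best_split` function, the `max_len` cutoff, and the counts are all invented for illustration.

```python
import math
from collections import Counter


def best_split(word, counts, max_len=12):
    """Most probable split of `word` under a unigram model, or None."""
    total = sum(counts.values())
    lowered = word.lower()
    # best[i] holds (log-probability, pieces) for the best split of lowered[:i],
    # or None if no split of that prefix uses only counted words.
    best = [(0.0, [])] + [None] * len(lowered)
    for end in range(1, len(lowered) + 1):
        for start in range(max(0, end - max_len), end):
            piece = lowered[start:end]
            if best[start] is None or piece not in counts:
                continue
            score = best[start][0] + math.log(counts[piece] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[-1][1] if best[-1] else None


# Made-up counts standing in for the output of the strict first pass.
strict_pass_counts = Counter(
    {"opengl": 40, "context": 35, "open": 5, "glc": 1, "on": 20, "text": 10}
)

print(best_split("OpenGLContext", strict_pass_counts))
# ['opengl', 'context']
```

Because every extra piece multiplies in another probability below one, splits into a few common words beat splits into many rare ones, which is the behaviour we want here.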

Or maybe I should just go with strictness in all things and explicitly ask the user to deal with all unrecognized words (or at least just make a guess and keep those words separate). I only see a few in any given project anyway. If you could just say "say that word as 'open g l context'", that is, spell out the words you would expect to say to get the result, then you could re-run the extraction in strict mode and get exactly what you wanted.
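The strict-with-corrections approach might look something like the sketch below. The `strict_tokens` function and the shape of the `corrections` mapping are assumptions for illustration, not the tokenizer's real interface.

```python
def strict_tokens(identifier, known_words, corrections):
    """Spoken words for `identifier`, or None if the user must be asked."""
    key = identifier.lower()
    if key in corrections:
        # The user already told us how they would say this one.
        return corrections[key].split()
    if key in known_words:
        return [key]
    return None  # unrecognized: report it rather than guess


# Hypothetical user-supplied corrections and strict word list.
corrections = {"openglcontext": "open g l context"}
known_words = {"context", "open"}

print(strict_tokens("OpenGLContext", known_words, corrections))
# ['open', 'g', 'l', 'context']
print(strict_tokens("mkdtemp", known_words, corrections))
# None
```

Anything that comes back as `None` would go onto the list of words the user is asked to spell out before the extraction is re-run.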


  1. anatoly techtonik on 08/20/2014 12:14 a.m.

    I think that simple deterministic models like Markov's are useless for any kind of human activity and hit their limits long ago. I suggest you try Theano and deep belief networks.

  2. Mike Fletcher on 08/21/2014 10:16 p.m.

    Hmm, that seems a bit too involved for the tokenizer. This is just a simple tool that generates a corpus as input to the language model compilation. While it would be nice to have a reasonable guess at combined words, I'm leaning toward just punting and telling the user to enter any mis-guessed combined word as a correction.
