If you attempt to pull apart the words, you run into dragons. Any big English dictionary defines almost every short character sequence, so "shutil" can be broken up into "shu" and "til" rather than "sh" and "util"... and really there's nothing wrong with the "shu" "til" guess other than that it's not statistically likely.
Interestingly, this is a big part of voice dictation: guessing what sequence of actions generated a sequence of observed events. When doing voice dictation you're asking "what spoken words would have generated this sequence of phonemes", while in corpus preparation you're trying to guess "what set of target words and combining rules would have generated this sequence of characters".
I'm beginning to think I should use the same basic strategy I used for translating IPA to ARPABet: do a very strict first pass to produce a statistical estimate of which words are common and likely, then use that to inform a second pass where we attempt to guess. If we see the words OpenGL and Context dozens of times, then when the next pass hits OpenGLContext we'll likely read it as "OpenGL" "Context", not "Open" "GLC" "on" "text".
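A minimal sketch of what that second pass might look like, assuming the strict first pass has already produced a frequency table (the function name, the scoring-by-summed-frequency shortcut, and the example counts are all my assumptions, not the actual implementation; a real version would likely use log-probabilities):

```python
def segment(identifier, freq):
    """Split an identifier into the highest-scoring sequence of
    known words, scoring each word by its first-pass frequency.
    Crude dynamic-programming sketch: summed raw frequency stands
    in for a proper log-probability score."""
    text = identifier.lower()
    n = len(text)
    # best[i] = (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(float("-inf"), None)] * n
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in freq and best[j][1] is not None:
                score = best[j][0] + freq[word]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]  # None if no full segmentation exists

# Hypothetical frequencies as collected by a strict first pass.
freq = {"opengl": 24, "context": 31, "open": 5, "on": 40, "text": 12}
print(segment("OpenGLContext", freq))  # → ['opengl', 'context']
```

Because "glc" never showed up in the strict pass, the "Open" "GLC" "on" "text" reading can't even be assembled here; with a richer dictionary it could be, and the frequency weighting is what would push the guess toward "OpenGL" "Context".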
Or maybe I should just go with strictness in all things and explicitly ask the user to deal with any unrecognized words (or at least make a guess and keep those words separate). I only see a few in any given project anyway. If you could just say "pronounce that word as 'open g l context'", that is, spell out the words you would expect to say to get the result, then you could re-run the extraction in strict mode and get exactly what you wanted.
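The strict variant could be as simple as a user-supplied override table consulted before anything else; this is just an illustration of the idea, with made-up names and entries:

```python
# Hypothetical override table: each unrecognized token mapped to
# the words the user would actually say when dictating it.
overrides = {
    "OpenGLContext": "open g l context",
    "shutil": "sh util",
}

def spoken_form(token, overrides):
    """Return the dictated words for a token, preferring the user's
    explicit spelling-out; in strict mode anything not covered is
    kept separate as a single unresolved token."""
    if token in overrides:
        return overrides[token].split()
    return [token]  # leave it for the user to resolve later

print(spoken_form("OpenGLContext", overrides))  # → ['open', 'g', 'l', 'context']
print(spoken_form("mystery_word", overrides))   # → ['mystery_word']
```

Re-running the extraction would then just replay these mappings, so the output is exactly what the user asked for rather than a statistical guess.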