Since the point of Listener is to allow for coding, the language model needs to reflect how one would dictate code. Further, to get reasonable accuracy, I'm hoping to focus the language models tightly, so that your language model for project X reflects the identifiers (and operators, etc.) you are using in that project. To kick-start that process, I'm planning to run a script over each project's source and generate from it a language model in which we guess what you would have said to produce that source.
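To make the idea concrete, here is a minimal sketch of that kind of guessing pass, assuming the project is Python source. It is not Listener's actual code: the names `SPOKEN_OPERATORS`, `split_identifier` and `guess_dictation` are hypothetical, and the spoken form picked for each operator is just one arbitrary choice.

```python
import io
import re
import tokenize

# Hypothetical spoken forms for a few operators (one arbitrary choice each).
SPOKEN_OPERATORS = {
    "(": "open-paren",
    ")": "close-paren",
    ":": "colon",
    "=": "equals",
    ",": "comma",
    ".": "dot",
}


def split_identifier(name):
    """Split snake_case and CamelCase names into lower-case words."""
    words = []
    for part in name.split("_"):
        # Runs of capitals (HTTP), capitalised or plain words (Server, beer), digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def guess_dictation(source):
    """Return a guessed list of spoken words for a piece of Python source."""
    spoken = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            spoken.extend(split_identifier(tok.string))
        elif tok.type == tokenize.OP:
            # Operators without a known spoken form are passed through unchanged.
            spoken.append(SPOKEN_OPERATORS.get(tok.string, tok.string))
        # Strings, numbers, comments and whitespace tokens are ignored here.
    return spoken
```

Feeding every file in a project through something like `guess_dictation` and counting the resulting word sequences would give the raw material for a project-specific language model.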
That immediately runs into issues. Do you think of '(' as open-paren, open-bracket, start-call, or start-tuple? Do you have ten different ways of saying X, each of which is reasonable and valid? Do you spell out initialisms or pronounce them as a word? Does an unknown word represent a shortening of an existing word, an acronym or initialism, multiple words run together, or something else? As we get more "smarts" at a higher level, a lot of those default translations might go away, so you're more likely to say "class beer googles" and automatically get the cap-camel-case BeerGoogles name. For now I'm mostly punting: the coding translation will be "my way or the highway" until I figure out something smart.
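As a rough illustration of the "smarts" direction, a higher-level rule could turn the words after "class" into a cap-camel-case name. This is only a sketch under that one assumption; `interpret_class_phrase` is a hypothetical helper, not something Listener provides.

```python
def interpret_class_phrase(spoken):
    """Turn 'class beer googles' into 'class BeerGoogles'; leave other phrases alone."""
    words = spoken.split()
    if len(words) > 1 and words[0] == "class":
        # Join the remaining words, capitalising the first letter of each.
        name = "".join(word.capitalize() for word in words[1:])
        return "class " + name
    return spoken


print(interpret_class_phrase("class beer googles"))  # class BeerGoogles
```

A real version would need one rule like this per construct (functions, variables, constants), each with its own naming convention.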