Pernicious Influence of Python (Still can't sleep...)


As expected back when we were solving the Pernicious Influence of Puzzles, writing a script to find all the words in a given word-list that can be made from the letters of a target word was a pretty trivial endeavour. Applied to our TRANSIT challenge, the computer came up with a few we missed...

The ones I feel bad about:
artist
stain
taint
taints
titans

Ah, oh well, should have thought of it, but no big deal:
tats (verb form of tat, which we got)

And the ones I feel very little shame about at all (the first three I've at least heard of, but the rest could be made-up words for all I really know):
strati (plural of stratus, as in stratification)
stria (as in striations, grooves or channels)
tarsi (plural of tarsus, the ankle bones)
ti (diatonic note; also an Asian shrub)
tis (plural of ti, the Asian shrub)

I was actually expecting far more missed words. I'm sure if I were to use the "huge" word-lists (I used the common word-lists in Aspell) we'd get more.

Anyway, for the curious, the code used is below...
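
In current Python, a minimal sketch of that approach (not necessarily letter-for-letter the original script) looks roughly like this: filter the word-list with a translate-style "delete the target's letters" pass, then do the more expensive letter-count check only on the survivors. The function name and word-list path here are just for illustration.

    from collections import Counter

    def find_subwords(target, wordlist_path="/usr/share/dict/words", min_len=2):
        """Find dictionary words spellable from the letters of ``target``."""
        target = target.lower()
        # Table that deletes every letter of the target; a word that translates
        # to the empty string uses only the target's letters (maybe too often).
        delete_target = str.maketrans("", "", target)
        target_counts = Counter(target)
        matches = set()
        with open(wordlist_path) as words:
            for line in words:
                word = line.strip().lower()
                if not (min_len <= len(word) <= len(target)):
                    continue
                if word.translate(delete_target):
                    continue  # uses a letter the target doesn't have at all
                # Expensive check: no letter used more often than the target has it.
                if not Counter(word) - target_counts:
                    matches.add(word)
        return sorted(matches)

    if __name__ == "__main__":
        print("\n".join(find_subwords("transit")))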

Comments

  1. Alan Green on 11/22/2004 1:58 p.m.


    I wrote something similar a while ago, but I must remember your trick with translate for next time.

    http://cardboard.nu/blog/2003_02_11/instant_programming_with_pytho.html

  2. Mike Fletcher on 11/24/2004 3:23 p.m.


    [A reader (in email) asked me to explain the algorithm choice somewhat more.]

    Time to check all words is:
    time to load all words from files (no dependence on the target word)
    time to run string.translate on all words (no dependence on the target word)
    time to check the length of words (constant-time)

    string.translate is roughly linear: it simply takes each source character and either substitutes a destination character or drops it. The translation arrays are 256-byte lookup arrays; they are different for different target words, but their effect in terms of processing required is (at this level of consideration) the same. Basically it's linear in the total size of the words in the dictionary, and that's a constant value.

    With that, you only have to run the expensive checks for those words which are composed solely of the characters in the target word, which will be a tiny fraction of the total. That fraction depends on the word chosen (though not on its length, just on the number and length of the dictionary words composed of its characters).

    There's a dependence on the number of distinct characters in the word (i.e. statistically there will be more dictionary words formed from the letters of a word with more distinct characters), but even if your source word has *all* letters in it, you're *still* constant-time, just with a larger constant. The time to do the expensive check for every word is constant for a given dictionary set (well, actually slightly better than constant, since it can short-circuit as soon as it determines that a word uses a letter too many times); it's just slower than the string.translate check, so the running time approaches a maximum of doing the expensive check for every word as the target word approaches containing every printable character. That maximum is constant regardless of what word you use.

    With the permutation approach you would wind up with some combinatoric number of candidate strings to check (hmm, it's been so long since I did factorials that I can't even come up with the proper equation; it's N choose M, anyway), so you'd need to construct a table with all words in memory (approximately as long a process as loading and string.translate-ing them, but with lots of extra memory-allocation overhead, since they're large tables) and *then* start searching for all the candidates. You have the large constant overhead and *then* you start into the combinatoric checks which, though individually fast, aren't free and will multiply rapidly with word length. If your word had all possible characters in it (the same worst case as above), you'd be doing something like 64-choose-N checks, which IIRC is a really big number.
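
    (To put a rough number on that: the snippet below, which is just an illustration and not part of the original script, counts the distinct candidate strings a brute-force permutation approach would have to look up for a single seven-letter word like TRANSIT. It works out to several thousand dictionary lookups for that one word.)

        from itertools import permutations

        def candidate_count(word, min_len=2):
            """Count the distinct strings the permutation approach would
            generate and then have to look up in the dictionary."""
            word = word.lower()
            candidates = set()
            for length in range(min_len, len(word) + 1):
                candidates.update("".join(p) for p in permutations(word, length))
            return len(candidates)

        print(candidate_count("transit"))  # several thousand for one 7-letter word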

  3. Brian St. Pierre on 11/30/2004 11:01 a.m.


    I just wrote something similar. I began with a combinatorial solution like what you used, but I stumbled across an alternative that is quite a bit faster, though not as precise.

    The fastest method is to use Python to build a couple of regular expressions: the first is of the form "^(transi)?(transi)?...", where ... is the () expression repeated for the number of letters in the word. Pass this to egrep and you will get a list of words that *might* fit your seed word. You then need to filter out words that have, for example, two Rs where there aren't two Rs in your seed word. (The reason is that, to use TRANSIT as an example, you have two Ts in your seed word, so a candidate word with three Ts doesn't fit.) A second egrep expression of the form "([ransi]).*\1" can pick out most of these, but might not catch everything -- you need to manually pick through the list... not really ideal.

    The second method is pure Python and based on the same idea -- this is what I ended up using. First, load your dictionary. As you're loading, strip and lowercase each word, and filter out words that contain letters that aren't in your "alphabet" (the set of letters in your seed word). Build the regexp as above. Now, for every word in the dictionary, test it against the regexp. If it matches, verify that the word can be built using the alphabet (i.e. doesn't use a letter more often than the seed does).

    I haven't taken measurements, but this method feels quite a bit faster than the combinatorial alternative for the problem I was working on.

  4. Mike Fletcher on 11/30/2004 12:52 p.m.


    Your approach probably would be faster than the *combinatorial* approach, but I think you'll find it's slightly slower than the original non-combinatorial approach embodied in the code above, mostly because that one has no need to store non-matching words in memory, nor, indeed, to do any regex matching.

    You still need to do the expensive "are we using a letter too many times" check for each word composed solely of characters in the seed word; you're just using a regex instead of string.translate for the "composed solely of characters in the word" test.

    The regex might have a slight advantage in that it doesn't *necessarily* require a copy of the string, but I'd imagine the considerable overhead of creating the regex match object (which has to store all the sub-group matching records) would likely outweigh that benefit.
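
    (For concreteness, my reading of Brian's second, pure-Python method boils down to something like the sketch below. The names and the word-list path are mine rather than his actual code, and I'm treating the "(transi)?" groups as one-character classes.)

        import re
        from collections import Counter

        def regex_subwords(seed, wordlist_path="/usr/share/dict/words", min_len=2):
            """Rough sketch of the regex-filter-then-verify idea."""
            seed = seed.lower()
            alphabet = "".join(sorted(set(seed)))
            # Cheap pass: composed only of the seed's letters, no longer than the seed.
            pattern = re.compile("^[%s]{%d,%d}$" % (re.escape(alphabet), min_len, len(seed)))
            seed_counts = Counter(seed)
            hits = []
            with open(wordlist_path) as words:
                for line in words:
                    word = line.strip().lower()
                    if not pattern.match(word):
                        continue
                    # Expensive pass: no letter used more often than the seed has it.
                    if not Counter(word) - seed_counts:
                        hits.append(word)
            return sorted(set(hits))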

  5. Mike Fletcher on 01/12/2005 9:16 p.m.


    Hmm, turns out Brian was right (according to a poster on the Python list): regexen do appear to be faster than translate at doing the initial filtering... I'm not clear whether the regex was run once per word or once over the entire file, but oh well.
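
    (If anyone wants to reproduce the comparison, a rough harness along these lines will do it in current Python. It's just an illustration, not the poster's benchmark, and the word-list path is an example.)

        import re
        import timeit

        with open("/usr/share/dict/words") as f:
            words = [w.strip().lower() for w in f]
        target = "transit"

        # Modern spelling of the translate trick: delete the target's letters.
        delete_target = str.maketrans("", "", target)
        pattern = re.compile("^[transit]+$")

        def translate_filter():
            return [w for w in words if not w.translate(delete_target)]

        def regex_filter():
            return [w for w in words if pattern.match(w)]

        for fn in (translate_filter, regex_filter):
            print(fn.__name__, timeit.timeit(fn, number=10))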
