February 17, 2014

Anagrams/C++

This weekend I decided to refresh on C++ in preparation for some interviews. I tried to pick a project that I could code in roughly one day. It should not only have me review C++ but force me to solve some problems and implement some data structures I don’t necessarily use every day.

Idea #

Anagrams allows you to find anagrams in the English dictionary for any given set of characters, then prints them out in alphabetical order, while displaying some stats.

Goals #

My goal is not to build something that just works, but to actually build a decent solution that’s efficient and lightweight, while exploring a few things. This is neither the quickest to implement nor the most efficient, so please view this post more as a study.

Problems #

Here are some very basic to average problems that building this program entails:

Storing an English language dictionary that the program can read
Storing the dictionary in memory without hogging too much space
Getting all the character permutations of a string
Checking if a word is in the dictionary efficiently
Printing out validated anagrams in alphabetical order

Traps #

There’s actually a couple traps that are not necessarily obvious immediately:

A set of characters may have several permutations that are the same, they should be ignored to provide a more efficient solution, to count the actual number of anagrams and not display redundant results.
Some English words may be the substring of longer words, e.g.: compute is a substring of computer and they are both English words. Depending on the implementation of the dictionary, it may cause problems.
The given characters or the words in the dictionary may contain capital letters. I think anagrams should be case insensitive so that may be an issue.

Implementation #

Steps #

The program reads the dictionary file and loads it into memory
The user given strings are each permuted into all possible permutations, discarding duplicates
Each permutation is checked to find out if it is in the dictionary
Results are sorted and displayed
Statistics counted as we go

Dictionary file #

I have a UNIX-based laptop, which is great because it contains a text-format English dictionary file. Most likely you can find it at /usr/share/dict/words. Make a copy with:

cat /usr/share/dict/words > words.txt

This dictionary contains one word per line, sorted in alphabetical order.

Dictionary data structure #

Initially I thought I would simply read the dictionary file and keep a hash table of all the words. C++ provides a data structure called map, which is pretty much a key/value store. However one part of me thinks that this is too easy and also a few Google searches show that it takes a lot of memory and is rather slow. The regular map retrieves values in O(log(n)) and the unordered_map structure does it in O(1) meaning it does it at the same speed regardless of the size of the dictionary. The bummer is that while O(1) is a constant, it actually happens to be a pretty big number.

Another solution is to build a two-dimensional array. In C++ a character is 1 byte, the longest English word (in a main-stream dictionary) is 45 characters and we have about 240,000 words. This would take around 45x1x240000 bytes of memory, or 10MB. Unfortunately an unexpected issue comes up: many words start by the same letter, radix and so on so it takes several iterations before one can figure out if a word exists. Look-up time thus increases as the dictionary grows.

The third option is a tree. Trees are often used for string search, notably TrieTrees for auto-completion. I don’t really need a TrieTree because I only care to know if a word exists and not so much to know which other words share the same radix (Unless I use the tree for finding valid permutations, which is not my plan). There’s a bunch great things a tree provides for this specific case:

Nodes can be shared by several words, reducing memory use. With about 240,000 words in my dictionary, I get around 700,000 nodes (as opposed to 10 million cells required by a two-dimensional array implementation)
Sorting is done automatically when nodes are added
Figuring out if a word exists or adding a word to the dictionary takes O(n) where n is the number of characters in the string

Tree retrieval time #

If a node represents a character, then every node will have many children (all the words that have the same radix). Since there’s 26 letters in our alphabet (though my implementation allows using characters beyond that), any node may have up to 26 children. Worst case scenario, I would have to check the 26 children, then the 26 sub-children etc. so the word “zzzz” would take 26⁴ checks to find. This is terrible, so there’s a way around. Characters are representable as integers, by their ASCII code. This is why I decided to make the tree nodes store their children in arrays, where the index in the array is the ASCII code of the character belonging to the child stored at that index. Lucky thing is that the alphabet comes in ascending order of ASCII codes, so I can simply iterate once over the children to obtain a sorted list of words.

Word existence #

An issue came up with the tree implementation: it is not immediately clear if a word is in the dictionary if it can be retrieved. The string “compu” is not a word, but it does exist in the tree because it’s the radix to words such as “compute” or “computer”. I can’t just check for leaves either because “compute” is a real word and it’s the radix of “computer” therefore it’s not a leaf. My solution is to simply have each node remember whether there is a word that ends there.

getopt() #

I wanted to say a word about getopt(). It’s a great function that allows you to parse command line flags. It comes with <unistd.h> (check the manpage for getopt). Most examples I usually find online show poor code that iterates over all the arguments in a bad attempt to figure out if any flags were entered. I strongly recommend getopt for single character flags, check out getopt_long for string flags.

Statistics #

How many ways can you permute a string with n characters? it’s simple it’s factorial of n (n!). Now how many distinct permutations are there? it depends on how many distinct characters there are. To be more precise, lets say the string is n characters long and that one character shows up n₁ times, another character shows up n₂ times and so on. The total number of distinct permutations is:

$\text{Permutations} = \frac{n!}{n_1! n_2! n_{...}!}$

A simple implementation of the factorial function in C is as follows:

int factorial(int n)
{
    int ret = 1;
    for(int i = 1; i <= n; ++i)
        ret *= i;
    return ret;
}

It is iterative, which may seem less fancy but uses less memory than a recursive function.

Permuting the string #

Next is the task of outputting all the permutations of the given string, discarding the duplicates and testing them to find out if they are English words. I didn’t want to think of a way to discard the duplicates, nor to sort them so I decided to use my dictionary data structure. By putting all the permutations into a new fresh dictionary, they get automatically sorted and duplicates don’t get saved. Here’s some of my code:

Dictionary* getWordPermutations(char* word)
{
    Dictionary* dict = new Dictionary();
    getNextPermutations(dict, 0, strlen(word), word);
    return dict;
}

void getNextPermutations
    (Dictionary* dict, int start, int len, char* word)
{
    if(word){
        if(start == len-1){
            dict->addWord(word);
        }else{
            for(int i=start; i<len; i++){
                swap(word+start, word+i);
                getNextPermutations(dict, start+1, len, word);
                swap(word+start, word+i);
            }
        }
    }
}

The first method getWordPermutations is essentially a driver for the second method getNextPermutations, which calls itself recursively. I use a swap function which swaps the values of two given pointers. When calling swap(word+start, word+i) I add the word pointer to an integer, which returns the pointer to the character I want to swap.

Validating #

Getting the words out of the dictionary in sorted order is slightly more difficult than it looks. The hard part is not sorting the words, they’re already sorted. The problem is that a word can end at any node, even a non-leaf node; thankfully we can assume that every leaf node is indeed the end of a word. To solve this I wrote an iterator method (check out my source code on BitBucket, it’s kinda long). Once I can get all the words out and that coincidentally they’re sorted alphabetically, all that remains is testing each of them against the English dictionary.

Source #

Download the source code from BitBucket:
https://bitbucket.org/tlextrait/anagrams

Usage:
./anagrams [-h] [-s] [-w] WORDS... 
-h Display help 
-s Display statistics for each word
-w Count and display the number of words in the dictionary

Kudos