Anagrams/C++

This weekend I decided to refresh on C++ in preparation for some interviews. I tried to pick a project that I could code in roughly one day. It should not only have me review C++ but force me to solve some problems and implement some data structures I don’t necessarily use every day.

Idea #

Anagrams allows you to find anagrams in the English dictionary for any given set of characters, then prints them out in alphabetical order, while displaying some stats.

screenshot2.png

Goals #

My goal is not to build something that just works, but to actually build a decent solution that’s efficient and lightweight, while exploring a few things. This is neither the quickest to implement nor the most efficient, so please view this post more as a study.

Problems #

Here are some very basic to average problems that building this program entails:

Traps #

There’s actually a couple traps that are not necessarily obvious immediately:

Implementation #

Steps #

Dictionary file #

I have a UNIX-based laptop, which is great because it contains a text-format English dictionary file. Most likely you can find it at /usr/share/dict/words. Make a copy with:

cat /usr/share/dict/words > words.txt

This dictionary contains one word per line, sorted in alphabetical order.

Dictionary data structure #

Initially I thought I would simply read the dictionary file and keep a hash table of all the words. C++ provides a data structure called map, which is pretty much a key/value store. However one part of me thinks that this is too easy and also a few Google searches show that it takes a lot of memory and is rather slow. The regular map retrieves values in O(log(n)) and the unordered_map structure does it in O(1) meaning it does it at the same speed regardless of the size of the dictionary. The bummer is that while O(1) is a constant, it actually happens to be a pretty big number.

Another solution is to build a two-dimensional array. In C++ a character is 1 byte, the longest English word (in a main-stream dictionary) is 45 characters and we have about 240,000 words. This would take around 45x1x240000 bytes of memory, or 10MB. Unfortunately an unexpected issue comes up: many words start by the same letter, radix and so on so it takes several iterations before one can figure out if a word exists. Look-up time thus increases as the dictionary grows.

The third option is a tree. Trees are often used for string search, notably TrieTrees for auto-completion. I don’t really need a TrieTree because I only care to know if a word exists and not so much to know which other words share the same radix (Unless I use the tree for finding valid permutations, which is not my plan). There’s a bunch great things a tree provides for this specific case:

Tree retrieval time #

If a node represents a character, then every node will have many children (all the words that have the same radix). Since there’s 26 letters in our alphabet (though my implementation allows using characters beyond that), any node may have up to 26 children. Worst case scenario, I would have to check the 26 children, then the 26 sub-children etc. so the word “zzzz” would take 264 checks to find. This is terrible, so there’s a way around. Characters are representable as integers, by their ASCII code. This is why I decided to make the tree nodes store their children in arrays, where the index in the array is the ASCII code of the character belonging to the child stored at that index. Lucky thing is that the alphabet comes in ascending order of ASCII codes, so I can simply iterate once over the children to obtain a sorted list of words.

Word existence #

An issue came up with the tree implementation: it is not immediately clear if a word is in the dictionary if it can be retrieved. The string “compu” is not a word, but it does exist in the tree because it’s the radix to words such as “compute” or “computer”. I can’t just check for leaves either because “compute” is a real word and it’s the radix of “computer” therefore it’s not a leaf. My solution is to simply have each node remember whether there is a word that ends there.

getopt() #

I wanted to say a word about getopt(). It’s a great function that allows you to parse command line flags. It comes with <unistd.h> (check the manpage for getopt). Most examples I usually find online show poor code that iterates over all the arguments in a bad attempt to figure out if any flags were entered. I strongly recommend getopt for single character flags, check out getopt_long for string flags.

Statistics #

How many ways can you permute a string with n characters? it’s simple it’s factorial of n (n!). Now how many distinct permutations are there? it depends on how many distinct characters there are. To be more precise, lets say the string is n characters long and that one character shows up n1 times, another character shows up n2 times and so on. The total number of distinct permutations is:

equation

A simple implementation of the factorial function in C is as follows:

int factorial(int n)
{
    int ret = 1;
    for(int i = 1; i <= n; ++i)
        ret *= i;
    return ret;
}

It is iterative, which may seem less fancy but uses less memory than a recursive function.

Permuting the string #

Next is the task of outputting all the permutations of the given string, discarding the duplicates and testing them to find out if they are English words. I didn’t want to think of a way to discard the duplicates, nor to sort them so I decided to use my dictionary data structure. By putting all the permutations into a new fresh dictionary, they get automatically sorted and duplicates don’t get saved. Here’s some of my code:

Dictionary* getWordPermutations(char* word)
{
    Dictionary* dict = new Dictionary();
    getNextPermutations(dict, 0, strlen(word), word);
    return dict;
}

void getNextPermutations
    (Dictionary* dict, int start, int len, char* word)
{
    if(word){
        if(start == len-1){
            dict->addWord(word);
        }else{
            for(int i=start; i<len; i++){
                swap(word+start, word+i);
                getNextPermutations(dict, start+1, len, word);
                swap(word+start, word+i);
            }
        }
    }
}

The first method getWordPermutations is essentially a driver for the second method getNextPermutations, which calls itself recursively. I use a swap function which swaps the values of two given pointers. When calling swap(word+start, word+i) I add the word pointer to an integer, which returns the pointer to the character I want to swap.

Validating #

Getting the words out of the dictionary in sorted order is slightly more difficult than it looks. The hard part is not sorting the words, they’re already sorted. The problem is that a word can end at any node, even a non-leaf node; thankfully we can assume that every leaf node is indeed the end of a word. To solve this I wrote an iterator method (check out my source code on BitBucket, it’s kinda long). Once I can get all the words out and that coincidentally they’re sorted alphabetically, all that remains is testing each of them against the English dictionary.

Source #

Download the source code from BitBucket:
https://bitbucket.org/tlextrait/anagrams

Usage:
./anagrams [-h] [-s] [-w] WORDS... 
-h Display help 
-s Display statistics for each word
-w Count and display the number of words in the dictionary
 
13
Kudos
 
13
Kudos

Now read this

Prime Numbers

Prime numbers are fun and it’s even more fun to come up with ways of finding them. People interested in prime numbers are usually interested in these areas of research: Finding algorithms and formulas that yield a lot of prime numbers... Continue →