Binary Tree Exercise

A binary tree is made of nodes. Each node is the root of a sub-tree. Each node may have children called "left" and "right". Each node contains some data.

Terminology: this assignment is about binary trees only. The terms "tree" and "node" shall both refer to "binary tree".

For this assignment, start with the data being simply a word (a string).

Given some text (in a file or a string), break it into words. Convert each word to all lowercase letters. (A pre-normalized copy of the Gettysburg Address is given in ——-). Create a node with the first word of the document. Refer to it (store it in a variable) as "tree", for it shall be the root. For each succeeding word, find or insert a node in the tree with the word in it. This process builds the tree. We practiced this on the board on Friday, 10.21.
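The find-or-insert step might be sketched as below. This is only one possible shape, not the required one: the names `find_or_insert` and `build_tree`, the 63-letter word limit, and the choice to treat any run of letters as a word are all assumptions made for illustration. It passes a pointer to the link being followed, so the empty tree and the first word need no special case.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    struct node *left, *right;
    char *word;                     /* the node owns this copy of the word */
};

/* Find the node holding `word` in the tree that *link points at, or create
   it there if it is missing. Returns the node, or NULL if out of memory. */
struct node *find_or_insert(struct node **link, const char *word)
{
    if (*link == NULL) {            /* empty sub-tree: the word belongs here */
        struct node *n = malloc(sizeof *n);
        if (n == NULL) return NULL;
        n->word = malloc(strlen(word) + 1);
        if (n->word == NULL) { free(n); return NULL; }
        strcpy(n->word, word);
        n->left = n->right = NULL;
        *link = n;
        return n;
    }
    int cmp = strcmp(word, (*link)->word);
    if (cmp == 0) return *link;     /* found an existing node */
    /* A sub-tree IS a tree: recurse into the side where the word belongs. */
    return find_or_insert(cmp < 0 ? &(*link)->left : &(*link)->right, word);
}

/* Break `text` into words (runs of letters), lowercase each one, and build
   a tree from them. Letters past the 63rd of a word are dropped (assumption). */
struct node *build_tree(const char *text)
{
    struct node *tree = NULL;
    char word[64];
    size_t len = 0;
    for (const char *p = text; ; p++) {
        if (isalpha((unsigned char)*p)) {
            if (len < sizeof word - 1)
                word[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {          /* a word just ended: store it */
                word[len] = '\0';
                find_or_insert(&tree, word);
                len = 0;
            }
            if (*p == '\0') break;  /* end of text */
        }
    }
    return tree;
}
```

Calling `build_tree("Four score AND one")` should give a root holding "four" with "and" on its left and "score" on its right. The sketch never frees its nodes, which is tolerable for a short test program.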

Recursion is your friend. A sub-tree IS a tree.

Create a function which prints out the tree. Run some tests where you process some text and dump out the resulting tree. For starters, make a simple text representation of the tree and its linkage. E.g.

    1: four     /3    \2
    2: score    /0    \0
    3: and      /0    \0

meaning that the root is node #1, its right child is node #2 and its left child is node #3. In later revisions, you can play around with trying to show the tree structure.
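One way to get small tags like these is sketched below. It assumes that all nodes are taken from a fixed array (`pool`, `MAX_NODES`, `tag` and `dump` are names invented here), so a node's tag is simply its position in the array plus one, and 0 stands for "no child". It also assumes the words are strings that outlive the tree.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 1000

struct node {
    struct node *left, *right;
    const char *word;               /* borrowed: the caller keeps it alive */
};

static struct node pool[MAX_NODES]; /* nodes live here, in creation order */
static int used = 0;

/* Tag of a node: 1 for the first node created, 0 for "no node". */
int tag(const struct node *n)
{
    return n == NULL ? 0 : (int)(n - pool) + 1;
}

/* Find the node holding `word`, or take a fresh one from the pool. */
struct node *find_or_insert(struct node **link, const char *word)
{
    if (*link == NULL) {
        if (used == MAX_NODES) return NULL;     /* pool exhausted */
        struct node *n = &pool[used++];
        n->left = n->right = NULL;
        n->word = word;
        return *link = n;
    }
    int cmp = strcmp(word, (*link)->word);
    if (cmp == 0) return *link;
    return find_or_insert(cmp < 0 ? &(*link)->left : &(*link)->right, word);
}

/* One line per node, in creation order: "tag: word /left \right". */
void dump(FILE *out)
{
    for (int i = 0; i < used; i++)
        fprintf(out, "%d: %-10s /%d \\%d\n", tag(&pool[i]), pool[i].word,
                tag(pool[i].left), tag(pool[i].right));
}
```

Because every node sits in the same array, subtracting the array's address from a node's address is well defined in C, which is not guaranteed for nodes obtained from separate `malloc` calls.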

Create a function which returns the size (the number of nodes) of each tree. Re-run your tests, showing the size of each of your trees (each node IS a tree). Be sure to include a text with some repeat words.

    "four score and one score is one hundred" =>
    @  word       left   right   size
    1: four       /3     \2      6
    2: score      /4     \0      4
    3: and        /0     \0      1
    4: one        /5     \0      3
    5: is         /6     \0      2
    6: hundred    /0     \0      1
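The size follows directly from "a sub-tree IS a tree". A minimal sketch, assuming a node layout with `left`, `right` and `word` fields (the name `tree_size` is made up here):

```c
#include <stddef.h>

struct node {
    struct node *left, *right;
    const char *word;
};

/* Number of nodes in the tree rooted at t. An empty tree has size 0;
   otherwise count this node plus the sizes of its two sub-trees. */
size_t tree_size(const struct node *t)
{
    if (t == NULL) return 0;
    return 1 + tree_size(t->left) + tree_size(t->right);
}
```

Applied to each node of the six-word example, this should reproduce the size column shown in the table.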

C programmers: if you malloc the first node, called "tree", first, then usually you can use it as a base address and subtract it from each node's address to produce a smaller number for display. I find it fairly easy to keep straight tags from 1 to a hundred, but much harder to remember or even distinguish tags from 3256768 to 3266464, for example.

Augment your definition of a node to include a tally. As you build a tree, keep a tally of how many times each word is observed. Alter your output to show tallies:

    @  word       left   right   size   count
    1: four       /3     \2      6      1
    2: score      /4     \0      4      2
    3: and        /0     \0      1      1
    4: one        /5     \0      3      2
    5: is         /6     \0      2      1
    6: hundred    /0     \0      1      1

NB: As soon as the data is more than one thing, it starts to make sense for the data itself to be a struct, without left and right. Thus the "universal" definition of a tree node in C could be

    struct node {
        struct node *left, *right;
        struct data *data;
    };

For this exercise, you MAY simplify your life and just put the fields of data directly into your node definition. That is, homework does not have to produce generically useful code.
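For those who want to try the "universal" layout, here is a rough sketch of tallying with the data kept in its own struct. The contents of `struct data` and the function name `tally` are assumptions for illustration, and the words are assumed to be strings that outlive the tree.

```c
#include <stdlib.h>
#include <string.h>

struct data {
    const char *word;               /* borrowed: the caller keeps it alive */
    int count;                      /* how many times the word was seen */
};

struct node {
    struct node *left, *right;
    struct data *data;
};

/* Record one observation of `word`: bump the tally if the word is already
   in the tree, otherwise insert it with a tally of 1.
   Returns the node, or NULL if out of memory. */
struct node *tally(struct node **link, const char *word)
{
    if (*link == NULL) {
        struct node *n = malloc(sizeof *n);
        struct data *d = malloc(sizeof *d);
        if (n == NULL || d == NULL) { free(n); free(d); return NULL; }
        d->word = word;
        d->count = 1;
        n->left = n->right = NULL;
        n->data = d;
        return *link = n;
    }
    int cmp = strcmp(word, (*link)->data->word);
    if (cmp == 0) {                 /* seen before: just count it */
        (*link)->data->count++;
        return *link;
    }
    return tally(cmp < 0 ? &(*link)->left : &(*link)->right, word);
}
```

Only the comparison and the counting touch `struct data`, so the linking code would stay the same if the data later grew more fields.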

Create a function which traverses the tree and outputs the words in alphabetical order. E.g.: and four hundred is one score
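Alphabetical order falls out of an in-order traversal: everything in the left sub-tree, then the node itself, then everything in the right sub-tree. A minimal sketch with two hypothetical functions, one that prints and one that collects the words into an array (handy for checking results in a test):

```c
#include <stddef.h>
#include <stdio.h>

struct node {
    struct node *left, *right;
    const char *word;
};

/* Print the words of t in alphabetical order, one per line. */
void print_sorted(const struct node *t, FILE *out)
{
    if (t == NULL) return;
    print_sorted(t->left, out);     /* all smaller words first */
    fprintf(out, "%s\n", t->word);
    print_sorted(t->right, out);    /* then all larger words */
}

/* Store the words of t in out[], in alphabetical order, starting at
   index n. Returns the index just past the last word stored. The caller
   must make out[] large enough to hold every word in the tree. */
size_t in_order(const struct node *t, const char **out, size_t n)
{
    if (t == NULL) return n;
    n = in_order(t->left, out, n);
    out[n++] = t->word;
    return in_order(t->right, out, n);
}
```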

Create a function which outputs a histogram of the depths of the nodes. E.g.

    depth   # nodes
    1       1
    2       2
    3       1
    4       1
    5       1

Create a function which displays the shape of a tree. It's hard to make it pretty if all you have is text, so don't waste a lot of time. I'd go further, and opine that text is an insufficient medium to draw very pretty trees. Laying the tree on its side is practical:

        /---< and
    ---< four
        \            /---< hundred
        \        /---< is
        \    /---< one
        \---< score

Your display need not be even that "fancy" to be useful while testing the next part.
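Both the histogram and a plain sideways display can be sketched as short recursive walks that carry the current depth along. The names and the `MAX_DEPTH` limit below are assumptions, and the display is deliberately simpler than the sample: it indents each level by four columns and draws no connecting lines.

```c
#include <stdio.h>

#define MAX_DEPTH 64

struct node {
    struct node *left, *right;
    const char *word;
};

/* Add 1 to hist[d] for every node at depth d. Call with depth = 1 for the
   root and a zeroed hist[]. Nodes deeper than MAX_DEPTH are not counted. */
void depth_histogram(const struct node *t, int depth, int hist[MAX_DEPTH + 1])
{
    if (t == NULL || depth > MAX_DEPTH) return;
    hist[depth]++;
    depth_histogram(t->left, depth + 1, hist);
    depth_histogram(t->right, depth + 1, hist);
}

/* Tree on its side: the left sub-tree is printed above the node and the
   right sub-tree below it. Call with depth = 0 for the root. */
void print_sideways(const struct node *t, int depth, FILE *out)
{
    if (t == NULL) return;
    print_sideways(t->left, depth + 1, out);
    fprintf(out, "%*s---< %s\n", 4 * depth, "", t->word);
    print_sideways(t->right, depth + 1, out);
}
```

Since the sideways display is an in-order walk, the words come out in alphabetical order from top to bottom, which gives a quick sanity check on the tree.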

Create a function which takes a possibly badly-balanced tree and produces a balanced- or nearly-balanced tree. (Unless there are 2^n - 1 words, the tree cannot be perfectly balanced.)
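One straightforward approach, sketched here with invented names, is to list the nodes in alphabetical order in a temporary array and then relink them, taking the middle node of each range as the root of that range. This is a sketch under those assumptions rather than the only way to do it. It reuses the existing nodes but needs extra memory proportional to the size of the tree.

```c
#include <stdlib.h>

struct node {
    struct node *left, *right;
    const char *word;
};

size_t tree_size(const struct node *t)
{
    return t == NULL ? 0 : 1 + tree_size(t->left) + tree_size(t->right);
}

/* In-order walk: store the nodes of t in v[] starting at index n. */
size_t flatten(struct node *t, struct node **v, size_t n)
{
    if (t == NULL) return n;
    n = flatten(t->left, v, n);
    v[n++] = t;
    return flatten(t->right, v, n);
}

/* Relink the nodes v[lo] .. v[hi-1] into a tree rooted at the middle one. */
struct node *rebuild(struct node **v, size_t lo, size_t hi)
{
    if (lo >= hi) return NULL;
    size_t mid = lo + (hi - lo) / 2;
    v[mid]->left = rebuild(v, lo, mid);
    v[mid]->right = rebuild(v, mid + 1, hi);
    return v[mid];
}

/* Returns the root of the rebalanced tree. If the temporary array cannot
   be allocated, the tree is returned unchanged. */
struct node *balance(struct node *tree)
{
    size_t n = tree_size(tree);
    if (n == 0) return NULL;
    struct node **v = malloc(n * sizeof *v);
    if (v == NULL) return tree;
    flatten(tree, v, 0);
    struct node *root = rebuild(v, 0, n);
    free(v);
    return root;
}
```

For the six-word example this should put "is" at the root, with "four" and "score" as its children, for a depth of 3 instead of 5.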

Can you think of a way of producing the nearly balanced tree "in place"? That is, the resulting tree contains the same nodes, but possibly connected differently.

Bonus: (not easy at all) can you think of a way of balancing a tree which uses only a constant amount of extra memory regardless of the size of the tree? Which only uses Order(lg n) extra memory?
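One known technique for the constant-memory case is the Day-Stout-Warren algorithm; it is not necessarily the answer intended here. The sketch below follows its usual two phases: rotate the tree into a right-leaning chain (a "vine"), then fold the vine back up with repeated left rotations. Function names are invented for illustration. It uses no recursion and no array, only a few pointers and one temporary node on the stack.

```c
#include <stddef.h>

struct node {
    struct node *left, *right;
    const char *word;
};

/* Phase 1: rotate right until no node has a left child, leaving a sorted
   chain hanging off pseudo_root->right. Returns the number of nodes. */
int tree_to_vine(struct node *pseudo_root)
{
    struct node *tail = pseudo_root, *rest = tail->right;
    int count = 0;
    while (rest != NULL) {
        if (rest->left == NULL) {       /* already on the vine: step down */
            tail = rest;
            rest = rest->right;
            count++;
        } else {                        /* rotate the left child above rest */
            struct node *up = rest->left;
            rest->left = up->right;
            up->right = rest;
            rest = up;
            tail->right = up;
        }
    }
    return count;
}

/* Do `count` left rotations, one at every other node down the right spine. */
void compress(struct node *pseudo_root, int count)
{
    struct node *scanner = pseudo_root;
    for (int i = 0; i < count; i++) {
        struct node *child = scanner->right;
        scanner->right = child->right;
        scanner = scanner->right;
        child->right = scanner->left;
        scanner->left = child;
    }
}

/* Phase 2 included: returns the root of the rebalanced tree. */
struct node *balance_dsw(struct node *tree)
{
    struct node pseudo_root = { NULL, tree, NULL };
    int size = tree_to_vine(&pseudo_root);
    int full = 1;                       /* becomes the largest power of 2 <= size + 1 */
    while (full * 2 <= size + 1) full *= 2;
    int leaves = size + 1 - full;       /* nodes on the partly filled bottom level */
    compress(&pseudo_root, leaves);
    size -= leaves;
    while (size > 1) {
        size /= 2;
        compress(&pseudo_root, size);
    }
    return pseudo_root.right;
}
```

On the six-word example this should end with "is" at the root and a depth of 3.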

A nice write-up should include some examples of the tests and their results as you went along. An easy way to slam that is to paste a test and its output onto your writeup file each time you see some new test working.

When things seem to work on small tests, use the Gettysburg Address. Is the balanced tree for the Gettysburg Address unique?

Questions:

Given N different words, how deep must the tree be? What is the worst case depth?
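A sketch of the counting argument, assuming (as in the depth histogram above) that the root is at depth 1. A tree of depth d has at most 2^(k-1) nodes at depth k, so:

```latex
\[
N \;\le\; \sum_{k=1}^{d} 2^{k-1} \;=\; 2^{d} - 1
\quad\Longrightarrow\quad
d_{\min} = \left\lceil \log_2 (N+1) \right\rceil ,
\qquad
d_{\max} = N .
\]
```

For example, N = 6 words need a depth of at least 3, since a tree of depth 2 holds only 3 nodes. The worst case, depth N, occurs when every node has at most one child, for instance when the words arrive already in alphabetical order.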

Harder question: how deep is the expected tree with N randomly presented words? That is, the words arrive in random order. Hint: how can you define "average" in this domain? If you do your math on paper, just hand it to me: don't bother trying to cram it into Moogle.

Please put your stuff in a directory including a clue to your name. E.g., if I were working with Izad, we might call our directory "RikIzad5", "RSIK5", "izik5" or even "RI5". Then, sitting in the PARENT directory of your RI5 directory, tar (preferred) or zip it up and submit.

If several of you call it "p5", or "tree.c", I can waste a bunch of time accidentally overwriting. I don't want to do that again. Using separate directories cooks that problem.

Put your screenshots and writeup in your folder too. And special test data. Be sure to state what platform you have tested on. If you've run it only on a Windows, I want to know that so I don't spend much time trying to run it on a Unix. Note for Squeak, Java and Python users: C programs are often not readily portable between platforms.

Binary Tree, Example beginning C code

Rik's sample output, via Squeak: