Good Recipe for Hash

Hash tables, hash indices, Dictionaries among Smalltalk Collections, indices in databases... these are ideas which get used a LOT. This assignment will explore some of the core principles.

We have visited hash functions a couple of times before: Out on the lawn when people went to the chalk-mark matching their bottle cap... and Using a character as a direct-index into a table of Morse-code times.

The problem normally boils down to this:

You have a bunch of data items. Perhaps student records. They do NOT conveniently have unique small integers on a subcutaneous chip. You'd like to be able to retrieve data items. How about a linear search? Just keep the data in an array and search by comparing the key data to #1's, #2's, ... until found. That would take an average of N/2 comparisons. (I know: there are fewer than a million records, so B.F.D.) Just imagine that you will some day be working with BIG data sets, or that slowing things down by 3 orders of magnitude will be enough to get your boss's attention.

If instead, you could compute some index based on some unchanging feature of the data, an index which spreads the items out, you could improve the chances that you find an item in the first place you look. If you find the record in the first or second or third place you look, that's pretty good. If you have to compare against half the records of people who have ever attended the college, that's pretty sad.

This homework assignment is to experiment with what makes a good hash function and a good algorithm vs. what works poorly.

Let us start out with some mythical sample data. We want to get the sandbox set up for experimenting. //I'd rather not be this pedantic, but it's my job.//

Let StudentRecord be as this C struct, almost the same as in an earlier assignment:

```c
struct StudentRecord {
    char  givenName[26];
    char  surName[26];
    int   SID;
    char  phone[12];
    float height;     // cm
    int   birthday;   // {0..366}  Yes, people are born on February 29th.
    int   hashVal;    // I don't need programPoints;
    int   rehashVal;  // I don't need examPoints;
    int   compares;   // We don't need participationPoints;
};
// Rik commandeered those last three variables and uses them
// to make the output show us what's going on.
```

```c
struct StudentRecord *studTable;
int studTableSize = 42;   // 100 is too big for early investigations: it takes too much screen space
```

```c
void initializeStudTable(void) {
    int byteSize = studTableSize * sizeof(struct StudentRecord);
    studTable = malloc(byteSize);   // consider calloc instead, so empty slots start zeroed
    printf("studTable byte size=%i @%p\n", byteSize, (void *)studTable);
}
```

Make sure the studTable is dynamically allocated (as in the example above), as later we will want to re-allocate it to a larger size. But please do start with size 42 so that I can compare my results with all the student projects.

Make a function "dumpStudents" which dumps the filled contents of the studTable. The first test will be disappointing, as there are no students in there yet. Perhaps you don't want to print empty slots.
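A minimal sketch of how dumpStudents might look, assuming empty slots are marked by SID == 0 (calloc, or a memset after malloc, makes that hold) and using a trimmed-down struct for brevity:

```c
#include <stdio.h>
#include <stdlib.h>

struct StudentRecord {
    char givenName[26];
    char surName[26];
    int  SID;
    char phone[13];   /* one byte wider than the assignment's [12], so 12 chars + NUL fit */
};

struct StudentRecord *studTable;
int studTableSize = 42;

/* Print only the occupied slots; an SID of 0 means "empty". */
void dumpStudents(void) {
    printf("Students:\n");
    for (int i = 0; i < studTableSize; i++)
        if (studTable[i].SID != 0)
            printf("%d: %s %s   %d    %s\n", i,
                   studTable[i].givenName, studTable[i].surName,
                   studTable[i].SID, studTable[i].phone);
}
```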

Quick, make a function "atIndexPutStudent(...)" which takes an index and a temporary StudentRecord whose values get copied into the fields of the indexed StudentRecord. This allows you to load some test data quickly:

```c
int main(void) {
    int actualIndex;
    initializeStudTable();
    dumpStudents();   // expect none
    actualIndex = atIndexPutStudent(13,
        studentWithValues("Tom", "Mix", 12345678, "360-360-3603", 185, 277, 0, 0, 0));
    actualIndex = atIndexPutStudent(29,
        studentWithValues("Tom", "Thumb", 54352671, "360-123-4567", 83, 177, 0, 0, 0));
    dumpStudents();
    return 0;
}
```

which gives:

```
Students:
13: Tom Mix   12345678    360-360-3603
29: Tom Thumb   54352671    360-123-4567
```

These StudentRecords hold only a little bit of data about each student, but they are already clumsy enough to show that we don't want to be shoveling the data around; instead we move pointers or indices. The case can be even more compelling when the real data lives on a database and gets bulk-loaded. We don't need to worry about height or points w.r.t. hashing functions. (Why?) So I don't really care if you display those or not.
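A sketch of the two helpers the main example above assumes. The field-copying guard (copyField) is my addition, and I've widened phone to 13 bytes so a 12-character phone number keeps its terminating NUL:

```c
#include <string.h>

struct StudentRecord {
    char  givenName[26];
    char  surName[26];
    int   SID;
    char  phone[13];   /* assumption: one wider than the assignment's [12] */
    float height;
    int   birthday;
    int   hashVal, rehashVal, compares;
};

struct StudentRecord *studTable;
int studTableSize = 42;

/* Copy a string into a fixed-size field, always NUL-terminating. */
static void copyField(char *dst, size_t cap, const char *src) {
    strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
}

/* Build a temporary record from literal values. */
struct StudentRecord studentWithValues(const char *given, const char *sur,
        int sid, const char *phone, float height, int birthday,
        int hashVal, int rehashVal, int compares) {
    struct StudentRecord s;
    copyField(s.givenName, sizeof s.givenName, given);
    copyField(s.surName,   sizeof s.surName,   sur);
    s.SID = sid;
    copyField(s.phone, sizeof s.phone, phone);
    s.height = height;
    s.birthday = birthday;
    s.hashVal = hashVal;
    s.rehashVal = rehashVal;
    s.compares = compares;
    return s;
}

/* Copy the temporary record into the indexed slot; return the index used. */
int atIndexPutStudent(int index, struct StudentRecord temp) {
    studTable[index] = temp;
    return index;
}
```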

In earlier assignments, we figured out how to read carefully formatted data from a file. Use techniques you learned then to QUICKLY make a file reader which reads one student record per line and, using the current variation of hashing, fills a record in the table with the data.

```
For each line of data:
    Read values from the line into the fields of a temporary StudentRecord.
    Based on one (or several) of those data, compute a hash index.
    Invoke atIndexPutStudent to fill the "permanent" record.
```

Notice that atIndexPutStudent might return a DIFFERENT index, in case the first-choice slot was taken. Something special must happen if no suitable slot can be found.
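One shape the hashing insert could take, sketched with stand-ins: hash1 (first letter of the given name) and rehash1 (step by 1) would be swapped out per experiment. Per the counting rule later in the assignment, the empty-slot check is free and each occupied slot costs one comparison:

```c
#include <stdlib.h>

struct StudentRecord {
    char givenName[26];
    char surName[26];
    int  SID;
    int  compares;
};

struct StudentRecord *studTable;
int studTableSize = 42;
int totalCompares = 0;

/* Stand-ins for whichever hash/rehash pair is under test. */
int hash1(struct StudentRecord *s)   { return s->givenName[0] % studTableSize; }
int rehash1(struct StudentRecord *s) { (void)s; return 1; }   /* variant 1: next slot */

/* Probe until an empty slot; return the slot used, or -1 if none found.
   Note: a step sharing a factor with studTableSize visits only part of
   the table -- one reason some rehash choices perform poorly. */
int hashPutStudent(struct StudentRecord temp) {
    int index = hash1(&temp);
    int step  = rehash1(&temp) % studTableSize;
    for (int tries = 0; tries < studTableSize; tries++) {
        if (studTable[index].SID == 0) {        /* empty check: free */
            studTable[index] = temp;
            return index;
        }
        totalCompares++;                        /* occupied: an SID got compared */
        index = (index + step) % studTableSize;
    }
    return -1;   /* no suitable slot: the "something special" case */
}
```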

Please do run some very simple versions. In particular, I want to see that you have investigated the performance of the re-hashing algorithm in each of the following:
 * 1) Look in the next slot (increment by 1) until you find the one you are looking for or an empty slot
 * 2) Add 7 as a rehash (jump that many spaces when needed).
 * 3) Add 35 as a rehash
 * 4) Add 11 as a rehash
 * 5) Add the first hash index as a rehash (jump that many spaces).
 * 6) Compute a second number from the data using one of the other hash functions.
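One possible shape for the six variants, as a sketch: each takes the record and its first-hash value, so variants 5 and 6 have what they need. otherHash is my placeholder for "one of the other hash functions". Note that variant 5 yields a step of 0 whenever the first hash is 0, so it never advances; that's the flavor of function which works great in one role and fails miserably in the other:

```c
struct StudentRecord { char givenName[26]; int SID; };

/* Placeholder second hash for variant 6 (sum of letters in the given name). */
int otherHash(struct StudentRecord *s) {
    int sum = 0;
    for (unsigned char *p = (unsigned char *)s->givenName; *p; p++) sum += *p;
    return sum;
}

int rehashNext (struct StudentRecord *s, int h) { (void)s; (void)h; return 1;  } /* 1 */
int rehashBy7  (struct StudentRecord *s, int h) { (void)s; (void)h; return 7;  } /* 2 */
int rehashBy35 (struct StudentRecord *s, int h) { (void)s; (void)h; return 35; } /* 3 */
int rehashBy11 (struct StudentRecord *s, int h) { (void)s; (void)h; return 11; } /* 4 */
int rehashFirst(struct StudentRecord *s, int h) { (void)s; return h;           } /* 5 */
int rehashOther(struct StudentRecord *s, int h) { (void)h; return otherHash(s);} /* 6 */
```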

And of course, the initial hash function matters as well. Investigate at least the following (arithmetic, of course, is done modulo TableSize):
 * 1) ASCII value of the first letter of the first (given) name.
 * 2) Count of letters in the first name.
 * 3) Sum of the ASCII values of the letters in the given name.
 * 4) Sum of the ASCII values of the letters in both names.
 * 5) The SID
 * 6) (SID mod 28) * 2
 * 7) Sum of digits in the phone number

Be sure to measure performance by counting the number of comparisons needed to process the data set. Checking if a slot is empty? I'll give you that for free. Checking the SID of the StudentRecord in a slot bumps the count. So, if the first place you look is empty, insert the record and don't bump the counter. If you find a StudentRecord matching the SID you are looking for in the first slot you look in, that counts as one (1) comparison. If it's not there, but in the second place you look, that's two (2). Nothing deep about this: just be sure to do it, as this is the central issue under investigation in this assignment. How many comparisons for the whole batch?
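The counting rule above, applied to a lookup sketch (the hash index and step are supplied by the caller; the function name and counter parameter are my own):

```c
struct StudentRecord { int SID; };

struct StudentRecord *studTable;
int studTableSize = 42;

/* Return the slot holding sid, or -1 if not found.  *compares is bumped
   once per SID comparison, never for hitting an empty slot. */
int findStudentBySID(int sid, int firstIndex, int step, int *compares) {
    int index = firstIndex % studTableSize;
    for (int tries = 0; tries < studTableSize; tries++) {
        if (studTable[index].SID == 0)
            return -1;                 /* empty slot: free check, and not found */
        (*compares)++;
        if (studTable[index].SID == sid)
            return index;
        index = (index + step) % studTableSize;
    }
    return -1;
}
```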

I strongly suggest that you make the choice of algorithm parameter-driven. You may, for example, have a global variable which sets which of several hash functions to use, and another which sets which of several rehash functions to use. Consider it part of the assignment to figure out how to pass (or set) functions as data in your language. A switch statement (in this innermost loop) is a fall-back, but it will cost you. I know function variables can be done in every language in which people have been submitting homework.
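In C, "functions as data" means function pointers. A sketch of the parameter-driven selection, with two toy hash functions standing in for the real collection:

```c
struct StudentRecord { char givenName[26]; int SID; };

int studTableSize = 42;

int hashFirstLetter(struct StudentRecord *s) { return s->givenName[0]; }
int hashSID        (struct StudentRecord *s) { return s->SID; }

typedef int (*HashFn)(struct StudentRecord *);

/* One big collection of same-signature functions, selected by a small integer. */
HashFn hashFns[] = { hashFirstLetter, hashSID };
int whichHash = 0;   /* global knob: which hash function to use */

int hashOf(struct StudentRecord *s) {
    return hashFns[whichHash](s) % studTableSize;   /* no switch in the inner loop */
}
```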

Here are some sample datasets of StudentRecords which provide decent tests. You might prefer hashing these Klingon students. All samples are available in a .tar file and a .zip.

For full credit on this assignment, you must produce a writeup including results, showing how altering the algorithms gives different results. Of course, include legible source code.

PS: All those functions? Notice that you can simplify your life as a programmer by making one big collection of functions (which all have the same signature) and selecting which ones to use with an ordered pair of small integers. Who knows: you might observe something if you use a "dumb" function in the wrong role. For example, I know a function which works great as a first hash, but can fail miserably as a re-hash function. I do hope you noticed the hints to create hash/rehash functions of your own design. Once you get your framework set up, it should be fairly easy to include other functions.

Also, answer these questions:
 * 1) Why do some combinations perform relatively poorly?
 * 2) How does spare table space matter? That is, how does performance degrade as the table fills up? A graph of results would be nice.
 * 3) What makes for a smart table size, perhaps based on the number of Students?
 * 4) How full should you let it get before re-allocating (and re-hashing all the data)? Would your answer change if a comparison cost you $5?
 * 5) If you implemented "growing the table" (a fine idea), how expensive was it (how many comparisons)?
 * 6) Who is easier to hash: Klingons or Students? Guess why.