UMBC CMSC202, Computer Science II, Fall 1998, Sections 0101, 0102, 0103, 0104

25. Hash Tables

Thursday December 03, 1998


Assigned Reading: 10.7-10.9

Handouts (available on-line):

Programs from this lecture:

Hash Tables

Topics Covered:

We revisit some issues about the new and delete operators.
- First, despite what some manuals might say, when we use the new operator with the SGI CC compiler, the operator returns NULL if it cannot allocate that amount of memory. (See test program and sample run.)
- We can define a new operator in a class. We discussed overloading the new operator in Lecture 22. When new is used to dynamically allocate space for an object, the new operator in that object's class is used if one is defined. You can also define a new[] operator, which is used when an array of objects is dynamically allocated. We can also define analogous delete and delete[] operators.
- We looked at a simple program that uses a Test class that has new, new[], delete and delete[] defined. (See program and sample run.) Some noteworthy points:
  - The Test constructors and Test destructors are still called as before. The class-defined new and delete operators simply allocate and deallocate memory and do not have to deal with constructors and destructors. (This is good.)
  - If you mistakenly use delete instead of delete [] to destroy an array of objects, the compiler only generates code to destroy the first element of the array. (The C++ standard leaves the result of this mismatch undefined, so other compilers may behave differently.)
  - Note that the array of 10 Test items takes 56 bytes and not 40 bytes. Since each Test object only uses 4 bytes, the extra 16 bytes must be used to store the number of objects in the array and the size of each object.
- We update our BString class and our GenList template class to have new and delete operators.
  - BString header file and implementation.
  - GenList header file and implementation. The remove() function in this version of GenList returns the number of items removed. We need this later.
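As a rough illustration of the points above, here is a minimal sketch of a class that defines its own new, new[], delete and delete[]. It is not the Test class from the lecture: the counters, the member names and the use of malloc and free are all assumptions made so that the calls can be observed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for the lecture's Test class. The static counters
// exist only so we can see which functions were called.
class Test {
public:
    Test() : value(0) { ++constructed; }
    ~Test() { ++destroyed; }

    // Used for `new Test`. It only obtains raw memory; the compiler
    // still calls the constructor afterwards.
    static void* operator new(std::size_t bytes) {
        ++singleAllocs;
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) { std::free(p); }

    // Used for `new Test[n]`. The requested size may be larger than
    // n * sizeof(Test) because the compiler can add bookkeeping space.
    static void* operator new[](std::size_t bytes) {
        ++arrayAllocs;
        lastArrayBytes = bytes;
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p) { std::free(p); }

    static int constructed;
    static int destroyed;
    static int singleAllocs;
    static int arrayAllocs;
    static std::size_t lastArrayBytes;

    int value;
};

int Test::constructed = 0;
int Test::destroyed = 0;
int Test::singleAllocs = 0;
int Test::arrayAllocs = 0;
std::size_t Test::lastArrayBytes = 0;
```

Allocating an array with `new Test[10]` and then inspecting `Test::lastArrayBytes` shows how many bytes a particular compiler requests for the array; the overhead beyond `10 * sizeof(Test)` varies from compiler to compiler.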
Hash Tables: we have looked at a few data structures: arrays, linked lists and binary search trees. Each of these has advantages and disadvantages. In a sorted array, we can use binary search to find an item in O(log n) time. However, inserting into a sorted array takes O(n) time (linear time). In an unsorted linked list, search takes linear time, but insertion can be done in constant time (just insert at the front or the back). Using a sorted linked list increases the time it takes to insert an item without making a big improvement in the search time, since both operations now take linear time. A binary search tree allows you to insert, delete and search in O(log n) time, provided the tree stays reasonably balanced. A hash table allows you to insert, delete and search in constant time on average. So, if the only operations you need to support are insert, delete and search, a hash table offers many advantages.
An example: suppose that you are the UMBC registrar and you want to store and retrieve student records based upon the student's social security number (ssn). There is an easy way to do this quickly: simply create a huge array of records indexed from 0 to 999,999,999. To retrieve a student's record, simply use his/her social security number as the index. The only disadvantage of this method is that it uses too much memory. As an alternative, we can use just the last 4 digits of a student's social security number. Then we would only need 10,000 entries. The disadvantage here is that there are more than 10,000 students at UMBC, so many students would have to use the same index. To solve this problem, we keep a linked list at each entry. For example, if two students have social security numbers that end in 6666, then the 6666 entry of the table is a linked list with the two students' records.
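The registrar example can be sketched in a few lines. This is only an illustration, not the code from the lecture: the Record struct, the RegistrarTable class and the use of std::list and std::vector in place of the course's own list class are assumptions.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical record type standing in for a real student record.
struct Record {
    long ssn;
    std::string name;
};

const int TABLE_SIZE = 10000;   // one entry per possible last-4-digit value

// The index is the last four decimal digits of the ssn.
int lastFourDigits(long ssn) {
    return static_cast<int>(ssn % TABLE_SIZE);
}

class RegistrarTable {
public:
    RegistrarTable() : slots(TABLE_SIZE) {}

    // Constant-time insert at the front of the list for this index.
    void insert(const Record& r) {
        slots[lastFourDigits(r.ssn)].push_front(r);
    }

    // Walks only the one list that the ssn maps to.
    // Returns a null pointer if no record has this ssn.
    const Record* find(long ssn) const {
        const std::list<Record>& chain = slots[lastFourDigits(ssn)];
        for (std::list<Record>::const_iterator it = chain.begin();
             it != chain.end(); ++it) {
            if (it->ssn == ssn) return &*it;
        }
        return nullptr;
    }

    // Number of records sharing this ssn's index.
    std::size_t chainLength(long ssn) const {
        return slots[lastFourDigits(ssn)].size();
    }

private:
    std::vector<std::list<Record> > slots;
};
```

Two made-up numbers such as 123456666 and 987656666 both map to entry 6666, so after inserting both, the list at that entry holds two records and a search has to compare at most two ssn's.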
Here we have the main ideas of a hash table. The hash table is an array of linked lists. The key used for hashing is the student's ssn. The hash function takes the key and transforms it into a legal index value for the hash table. In this example, the hash function simply takes the ssn and removes the first 5 digits. Ideally, a hash function would evenly distribute the keys in the hash table. That way, each linked list in the hash table would be relatively short. When two keys hash to the same index value, the situation is called a collision. With 12,000 students and an ideal hash function, each linked list in the hash table would only have 1 or 2 elements. Thus, searching, inserting and deleting from this hash table would take constant time.
So, is taking the last 4 digits of the ssn a good hash function? It is theoretically possible that next year every entering freshman has the same last 4 digits in their ssn. Then, our hash table would simply be an unsorted linked list and the performance of search would be poor. However, our experience with ssn's tells us that the chance of this happening is small. The design of a good hash table depends on having a good hash function. There are schemes for picking provably good hash functions which would be discussed in an algorithms class, not here.
One disadvantage of using the last 4 digits of a ssn as a hash function is that we are not able to control the size of our hash table very well. If UMBC's enrollment increased to 20,000, our only choice is to use 5 digits of the ssn and have a table of size 100,000. Another hash function we can use is to take the ssn and take its remainder modulo some prime number N. That would leave us with a value between 0 and N-1. If we have a hash table with N entries, then this value can be used directly as the index into the hash table.
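A minimal sketch of the remainder-based hash function follows. The function name hashSsn and its parameter names are made up for illustration, and the sketch assumes the ssn is non-negative and that the caller has already chosen a prime table size.

```cpp
// Maps an ssn to an index in the range 0 .. tableSize - 1.
// Assumes ssn >= 0 and tableSize > 0; choosing tableSize to be prime
// is the caller's responsibility.
int hashSsn(long ssn, int tableSize) {
    return static_cast<int>(ssn % tableSize);
}
```

Unlike the last-4-digits scheme, the table size here is not tied to a power of 10. For an enrollment of 20,000 we could, for example, pick the prime 20,011 rather than jumping to 100,000 entries.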
We implement a hash table as an array of linked lists. (See the header file hash.h and implementation.) Each linked list is a list of StudentRecord using the latest version of the templated GenList class. The StudentRecord class is straightforward (header file and implementation).
The implementation of the HashTable class is relatively simple. The most complicated function is the constructor. Here we first look for a prime number greater than or equal to the parameter size. Recall that our hash function simply takes a student's ssn, divides it by the size of the table and uses the remainder as the index for the table. Choosing a prime number for the table size tends to reduce the number of collisions. A little number theory (Bertrand's Postulate) tells us that there is always a prime number between size and 2 * size. The HashTable class does not have a default constructor. Each time you create a HashTable you must specify the size of the table.
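The prime search in the constructor might look something like the sketch below. The lecture's actual code is in the linked implementation file, so the function names isPrime and nextPrime and the choice of trial division are assumptions.

```cpp
// Trial division: try every divisor d with d * d <= n.
// Writing the bound as d <= n / d avoids overflow in d * d.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Returns the smallest prime that is greater than or equal to size.
int nextPrime(int size) {
    int n = (size < 2) ? 2 : size;
    while (!isPrime(n)) {
        ++n;
    }
    return n;
}
```

Because a prime always exists between size and 2 * size, the loop in nextPrime stops before the table grows to more than twice the requested size.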
Otherwise, the HashTable member functions are straightforward. In the Insert function for example, we simply compute the hash table index, and append the item to the list.
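To make the shape of the class concrete, here is a simplified sketch. It is not the hash.h from the lecture: it stores bare ssn values instead of StudentRecord objects, uses std::list in place of GenList, and the member names are assumptions. It does mirror three details described above: there is no default constructor, insert appends to the list at the computed index, and remove reports how many items were removed.

```cpp
#include <cstddef>
#include <list>
#include <vector>

class HashTable {
public:
    // No default constructor: the caller must supply a size.
    // The actual table size is the first prime >= size.
    explicit HashTable(int size) : buckets(nextPrime(size)) {}

    int tableSize() const { return static_cast<int>(buckets.size()); }

    // Compute the index, then append to that list.
    void insert(long ssn) { buckets[index(ssn)].push_back(ssn); }

    // Search only the one list the key hashes to.
    bool contains(long ssn) const {
        const std::list<long>& chain = buckets[index(ssn)];
        for (std::list<long>::const_iterator it = chain.begin();
             it != chain.end(); ++it) {
            if (*it == ssn) return true;
        }
        return false;
    }

    // Returns the number of items removed.
    int remove(long ssn) {
        std::list<long>& chain = buckets[index(ssn)];
        std::size_t before = chain.size();
        chain.remove(ssn);
        return static_cast<int>(before - chain.size());
    }

private:
    static bool isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d <= n / d; ++d)
            if (n % d == 0) return false;
        return true;
    }
    static int nextPrime(int size) {
        int n = (size < 2) ? 2 : size;
        while (!isPrime(n)) ++n;
        return n;
    }
    // Assumes ssn >= 0.
    std::size_t index(long ssn) const {
        return static_cast<std::size_t>(ssn % tableSize());
    }

    std::vector<std::list<long> > buckets;
};
```

Constructing `HashTable t(10)` in this sketch gives a table with 11 entries, since 11 is the first prime at or above 10.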
We test the HashTable class with two main programs. The first main program is a trivial test of the HashTable member functions. (See sample run.) The second program inserts random ssn's into the hash table to test the number of collisions. The sample run shows that the average number of collisions is fairly predictable. With a good hash function, we can control the average number of collisions by adjusting the size of the table.
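An experiment in the spirit of the second test program can be sketched as follows. This is not the lecture's program: the Stats struct, the simulate function, the fixed seed and the definition of a collision (an insertion that lands in an entry that already holds a key) are all assumptions. Only the chain lengths are tracked, not the records themselves.

```cpp
#include <random>
#include <vector>

// Summary of one simulated run.
struct Stats {
    int keys;          // number of ssn's inserted
    int tableSize;     // number of entries in the table
    int collisions;    // insertions that hit a non-empty entry
    int longestChain;  // length of the longest list
};

// Inserts `keys` random 9-digit ssn's into a table of `tableSize`
// entries, using ssn % tableSize as the hash function.
Stats simulate(int keys, int tableSize, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<long> randomSsn(0L, 999999999L);
    std::vector<int> chain(tableSize, 0);   // chain[i] = list length at entry i

    Stats s = {keys, tableSize, 0, 0};
    for (int i = 0; i < keys; ++i) {
        int idx = static_cast<int>(randomSsn(gen) % tableSize);
        if (chain[idx] > 0) ++s.collisions;
        ++chain[idx];
        if (chain[idx] > s.longestChain) s.longestChain = chain[idx];
    }
    return s;
}
```

Calling simulate with the same number of keys and several different table sizes gives a feel for the trade-off: the average list length is simply keys divided by tableSize, so a larger table spends more memory to buy shorter lists.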


Last Modified: 22 Jul 2024 11:28:48 EDT by Richard Chang
