DKVS: Distributed Key-Value Store --- Hashtables

Introduction

The aim of this week's work is:

  1. become fully acquainted with the project framework and concepts: start by reading the main description file;
  2. and create data structures and functions for local hash tables.

Provided material

In your group's GitHub repository, you will find the following files in provided/:

  • hashtable.h: header file for hash tables (to be developed this week);
  • util.h and util.c: macros and miscellaneous functions; you do not have to use them, but might if appropriate (have a look to see if some may be useful);
  • error.h and error.c: error code and messages;
  • a Makefile containing useful rules and targets;
  • some material to use ours and develop your own unit tests in tests/unit/.

Implementing the hashtable

Data structure

As a reminder, a hashtable is a data structure that allows (quasi-)constant-time access (O(1)) to a value using the "hash value" of an access key:

A hashtable binds a key to a value using a hashing function
Figure: a hashtable binds a key to a value using a hashing function. (CC) Jorge Stolfi@WikiMedia

The concrete representation of the hashtable that we propose to implement in this project will look like this:

Representation of a hashtable with collision management by chaining
Figure: representation of a hashtable with collision management by chaining lists. (CC) Jorge Stolfi@WikiMedia

In hashtable.h, define the types for keys (dkvs_key_t and dkvs_const_key_t) and for values (dkvs_value_t and dkvs_const_value_t) as char* and const char* respectively.

NOTE about the ownership of table contents: since the type of stored values (dkvs_value_t) is a pointer, the question arises as to the ownership of hashtable contents (i.e. do the elements contained in the table have to be "hidden" or can they be shared externally?)

The point of view chosen in this project is that the hashtable is the sole owner of its contents. Added values (pointed values) must therefore be copied (instead of simply copying the pointer).
[end of note]

Then "define" the type bucket_t, which corresponds to one entry in the hashtable. We here quoted the word "define" as strictly speaking all we need here is a predeclaration (typedef) to a struct to be defined later. You can freely choose the struct name here (all you will need to do will be to be coherent with that name in the future). More details in the Collision management section below.

Finally define the type Htable_t which corresponds to a dynamically allocated hashtable containing size "cells".

size is a characteristic of the hashtable (indicating its size in terms of the number of possible positions; note that the actual number of entries contained in the table may be different: some positions are empty (e.g. 002 above), some keys may share the same position (e.g. "John Smith" and "Sandra Dee" above; this is referred to as "collision").

Then define the following functions to perform a set of standard operations on hashtables (prototypes in hashtable.h, implementations in hashtable.c):

  • Htable_t* Htable_construct(size_t) which takes a number of buckets as a parameter and returns a newly-created hashtable (of the corresponding size);

  • void Htable_free_content(Htable_t*) which frees the memory of all the contents stored in the hashtable passed in parameter.

  • void Htable_free(Htable_t**) which frees the contents memory stored with the hashtable pointed by the parameter, as well as the pointed Htable_t itself.

  • int Htable_add_value(Htable_t*, dkvs_const_key_t, dkvs_const_value_t) which takes a hashtable as argument as well as a key and a value and stores the value associated with the key in the hashtable; if the table is NULL, it should return ERR_INVALID_ARGUMENT, if everything is working normally, it should return ERR_NONE (the various errors and tool functions can be found in error.h); although this would be perfectly feasible, we've decided for this project not to accept NULL values, i.e. if the dkvs_const_value_t passed is NULL, this function also returns ERR_INVALID_ARGUMENT; moreover, don't forget that the content of the value must be copied (and not just a copy of the pointer); see also the Collision management section below;

Hashing function

To calculate the hash value of a given key (`key` below) for a hashtable with `size` "cells", we ask you to use the following function:
/** ----------------------------------------------------------------------
 ** Hash a string for a given hashtable size.
 ** See http://en.wikipedia.org/wiki/Jenkins_hash_function
 **/
size_t hash_function(dkvs_const_key_t key, size_t size)
{
    if ((key == NULL) || (size == 0))
        return SIZE_MAX;

    size_t hash = 0;
    const size_t key_len = strlen(key);
    for (size_t i = 0; i < key_len; ++i) {
        hash += (unsigned char) key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash % size;
}
  • dkvs_value_t Htable_get_value(const Htable_t*, dkvs_const_key_t) which takes a hashtable as argument as well as a key and returns the associated value if it exists or NULL if it doesn't; it also returns NULL in the event of an incorrect parameter.

Don't hesitate to create other utility functions if necessary!

And don't forget to properly and incrementally test the functionality of your hashtable for several combinations of sizes, keys and values before going on to the next step (see tests below). This will save you a lot of trouble...

Collision management

For adding the value (function Htable_add_value()): as always, if the key already exists in the hashtable, the new value received must overwrite the old one already stored; on the other hand, if it's a new key whose hash function gives a collision (same hash value) with a key already stored, then the new key must be added to the chained list of key-value pairs in the corresponding "cell".

Similarly, when searching for a value for a key (Htable_get_value()), you'll now have to browse the chained list of key-value pairs in the target "cell".

As usual, don't hesitate to modularize your code and create other utility functions if necessary! In this spirit, you could _for example:

  1. define a kv_pair_t type that represents a pair (key, value), i.e. a key and a value; although not strictly necessary (we can do without it; it's not as good, but we can), this type could be useful when you need to manipulate keys and their associated values as a whole; if you do so, of course, define the kv_pair_free() function as well;
  2. define in hashtable.c (yes, yes, .c) the struct type associated to (aliased by) the former typedefed bucket_t; we indeed recommend you to define it in hashtable.c and not in .h as this type is strictly internal, there's no reason to export it via the .h file; only its pre-declaration has to be in the .h (as you already did); this is an interesting pattern to (1) increase independence (here between hashtable internals, in the .c and external usage); and (2) hide implementation details (for whatever reason).

You can also, if necessary, create associated tool functions.

Example and tests

Testing by hand

We provide you with a unit-test-hashtable.c file which we strongly advise you to copy into your done directory so as to edit it and add your own tests: replace the line

puts("Write your tests here and DELETE this puts");

with whatever you feel is necessary.

Note: don't edit the unit-test-hashtable.c file in the provided, as it will be modified (and overwritten) by us for our own purposes. [end of note]

For example, you could test that adding a value to a key results in the correct value when it is requested again:

Htable_t table;

dkvs_const_key_t key = "foo";
dkvs_const_value_t set_value = "bar";
Htable_add_value(&table, key, set_value);

dkvs_value_t read_value = Htable_get_value(table, key);

ck_assert_str_eq(set_value, read_value);

Note: this example isn't necessarily perfect, or even complete; feel free to adapt it as you want.
[end of note]

For testing purposes, we use the Check library. To test whether two strings are equal, use the ck_assert_str_eq "function" (see also test.h).

To perform link editing with the Check library, you need to add the -lcheck -lm -lrt -pthread options; for example:

cc unit-test-hashtable.o hashtable.o error.o -lcheck -lm -lrt -pthread -o unit-test-hashtable

On some architectures, you also need to add the -lsubunit library.

Provided tests

We provide you with a very few tests to run against your code by using make check, both unit tests (testing functions one by one) and end-to-end tests (testing the whole executable at once).

We strongly advise you to complete them by adding you own tests for edge cases (see next subsection).

Personal unit tests

As we move forward with the project, it is important that you can write your own tests, to complete the provided ones. You can find those in provided/tests/unit/. Before adding new tests, don't forget to copy the provided/test/ directory in done/test/. You will also need to modify the TEST_DIR variable in the Makefile.
Note: don't forget to never push modifications in the provided/ directory.

We strongly advise you to edit these files (in done/test/) to add your own tests, or even to create new ones as you move forward. This can be done quite simply by adding your own values or lines of code to the tests already provided, or by copying this file and drawing inspiration from it (don't forget to update the tests' Makefile accordingly). You don't need to understand everything in this file, at least not initially, but it is important you start to get familiar with its content.

For those who want to go further, the main test functions available in the environment we use (Check) are described over there: https://libcheck.github.io/check/doc/check_html/check_4.html#Convenience-Test-Functions. For example, to test whether two int are equal, use the ck_assert_int_eq macro: ck_assert_int_eq(a, b).

We have also defined the following "functions" in test.h:

  • ck_assert_err(int actual_error, int expected_error): assert that actual_error is expected_error;
  • ck_assert_err_none(int error): assert that error is ERR_NONE;
  • ck_assert_invalid_arg(int error): assert that error is ERR_INVALID_ARGUMENT (i.e. correspond to the return code of a function which received a invalid argument; see error.h);
  • ck_assert_ptr_nonnull(void* ptr): assert that ptr is not NULL;
  • ck_assert_ptr_null(void* ptr): assert that ptr is NULL.

Finally, we'd like to remind you that just because 100% of the tests provided here pass doesn't mean you'll get 100% of the points. Firstly, because these tests may not be exhaustive (it's also part of a programmer's job to think about tests), but also and above all (as indicated on the page explaining the project grading scale), because we attach great importance to the quality of your code, which will therefore be evaluated by a human review (and not blindly by a machine).

Final word

It's up to you to organize the group work as best you can, according to your objectives and constraints; but remember to divide the task properly between the two members of the group. If you haven't already read it in full, we recommend you read the end of the foreword page.

And also, do not forget to fill (and add+push) your time.csv file (see end of the foreword page).