ImgFS: Image-oriented File System --- general description

Outline of this document

Project background
Project goals
General description

Context

The aim of this project is to have you develop a large program in C on a "system" theme. The framework chosen this year is the construction of a command-line utility to manage images in a specific format file system, inspired by the one used by Facebook. For your information, Facebook's system is called "Haystack" and is described in the following paper: https://www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf.) You are not required to read this paper as part of the course (it's just for information) because, obviously, we'll be implementing a simplified version of this system. All the basic concepts required for this project are introduced here in a simple way, assuming only standard "user" knowledge of a computer system.

Social networks have to manage hundreds of millions of images. The usual file systems (such as the one used on your hard disk, for example) have efficiency problems with such large numbers of files. Furthermore, they aren't designed to handle the fact that we want to have each of these images in several resolutions, e.g. very small (icon), medium for a quick preview and normal size (original resolution).

In the "Haystack" approach, several images are contained in a single file. What's more, different resolutions of the same image are stored automatically. This single file contains both data (the images) and metadata (information about each image). The key idea is that the image server has a copy of this metadata in memory, to enable very rapid access to a specific photo in the right resolution.

This approach has a number of advantages: firstly, it reduces the number of files managed by the operating system; secondly, it elegantly implements two important aspects of image database management:

Automatic management of multiple resolutions;
The ability to not duplicate identical images submitted under different names (for example from two different users). This optimization is incredibly useful for any social network.

This deduplication is done using a "hash" function, which summarize a binary content (an image in our case) into a signature much smaller. Here, we will use the "SHA-256" function, which produces a 256 bits signature, and has the useful property that it is collision resistant: it is almost impossible for two different contents to have the same signature. In this project, we will use the assumption that two images with the same signature are identical. Although it may seem surprising, many systems are based on this principle.

Goals

You will build an image server, in a version inspired and simplified of Haystack. During the first weeks, it will consist of implementing the basic functions of the system, which are:

list data (metadata, image list);
add a new image;
delete an image;
extract an image, in a chosen specific version: "original", "small" or "thumbnail".

During this first part, those functions will be exposed through a command line interface (CLI). Further on, you will build a true webserver to distribute the image over the network using the HTTP protocol.

Global description

Here, we will describe the main concepts and structures you will need for this project. Their implementation details will be specified later in the weekly handouts.

You will use a specific format -- let's call it "imgfs" -- to represent an "image file system". A file of type imgfs contains three distinct parts:

A header, of fixed size, which gathers the elements of configuration of the system; its content is created during the imgfs creation;
A metadata array; it is an array whose sized is specified by the max_files field of the header; each of its entry describe the metadata of a single image, especially their position in the file;
the images themselves; each image is stored in a contiguous part of the file, one after the other.

This file format will be used by the two tools that you will develop:

a command line interface allowing to manipulate those file by listing, reading, adding and removing images;
a webserver with the same capabilities, but usable over the network.

Data format

The three parts explained above consists of the following data structures:

struct imgfs_header: the header with the configuration data:
- name: a string of at most MAX_IMGFS_NAME characters, the name of the database;
- version: a 32-bits unsigned int; the version of the database, it is incremented after each insertion/deletion;
- nb_files: a 32-bits unsigned int; the current number of images in the system;
- max_files: a 32-bits unsigned int; the maximum number of images that the system can contain; this field is specified during the creation and must not be modified afterwards;
- resized_res: an array of 2 times (NB_RES - 1) elements, each of which is a 16-bits unsigned int; the resolutions of the "thumbnail" and "small" images (in order: "thumbnail width", "thumbnail height", "small width", "small height"); this field is specified during the creation and must not be modified afterwards; the handling of the original resolution is explained below;
- unused_32 and unused_64: two unsigned int (of 32 and 64 bits); unused (but intended for future evolutions or temporary information - it is often useful to include fields of this type in large-scale projects; this allows old data structures to be used directly in newer versions of the software);
struct img_metadata: image metadata:
- img_id: a string of at most MAX_IMG_ID characters, containing a unique identifier (name) for the image;
- SHA: an array of SHA256_DIGEST_LENGTH unsigned char; the image hash code, as explained above;
- orig_res: an array of two 32-bit unsigned int; the resolution of the original image;
- size: an array of 32-bit NB_RES unsigned int; memory sizes (in bytes) of images at different resolutions ("thumbnail", "small" and "original"; in this order, given by X_RES indices defined in imgfs.h);
- offset: an array of 64-bit NB_RES unsigned int; the positions in the "image database" file of images at the various possible resolutions (in the same order as for size; also use the X_RES indices defined in imgfs.h to access the elements of this array);
- is_valid: a 16-bit unsigned int; indicates whether the image is in use (value NON_EMPTY) or not (value EMPTY);
- unused_16: a 16-bit unsigned int; not used (but intended for future evolutions).
struct imgfs_file:
- file: a FILE* indicating the file containing everything (on disk);
- header: a struct imgfs_header; the general information ("header") of the image database;
- metadata: a dynamic array of struct img_metadata; the "metadata" of the images in the database.

Notes:

the size of the metadata array never changes; it is specified in the header and dynamically allocated to max_files;
to delete an image, all you have to do is change is_valid; there may therefore be "holes" in the metadata array, and unused parts in the file (since the images themselves are not deleted); the basic idea behind all this is to be prepared to lose a little space to save time; At a more complex level, we can imagine a "garbage collector" (or a "defrag") which, in parallel, when "there's time", effectively deletes images that are no longer in use, reorganizes metadata to reduce gaps, and so on. We won't go into such considerations in this project, but you may implement it as an extension.

(To check, whatever the architecture, sizeof(struct img_metadata) must give 216.)