OpenChanfix - database

1. Foreword

This document assumes the reader has a basic understanding of how Chanfix works. Its purpose is to describe the basic design of the database portion of the service, with a technical emphasis.

The goal of the OpenChanfix database was to develop a robust and incorruptible repository for user@host score data for a large number of channels. Reliability was the foremost criterion in the design, sometimes overriding performance and resource-allocation concerns. In its present v2 manifestation, the database pre-allocates memory in accordance with the demands of a large-scale network's data profile. Hobbyists running small networks might wish to retune the header file's parameters for a smaller and leaner Chanfix.

The OpenChanfix database design was never intended as optimal or perfect. It was designed to perform a specific task "well enough" to make it feasible for managing the channel op records of large-scale IRC networks. The best current estimate suggests the database design should be efficient enough to deal with a network with a typical usage profile of 200,000 channels and 500,000 concurrent users on a 1 GHz system with 512 megabytes of RAM, without risking resource starvation.

2. Design

2.1. Fundamentals

The database stores most structures in dynamically sized arrays. Because of the profile of the database, little regard is paid to memory fragmentation (preliminary profiling suggests there should be no more than 12-17% of wasted memory as a result of fragmentation in a theoretical worst case scenario). Whereas the database design imposes no hard limits on the number of channels or op user@hosts per channel, some safe limits have been imposed in the header to thwart denial of service attacks (in the form of resource starvation) on the database.

The structure OCF_channel contains key information about a given channel. This includes the channel name of MAXCHANNELNAME + 1 bytes (the database size could be streamlined further by allowing dynamic string lengths for the name, and this is indeed marked as a TO-DO item for future iterations of the database). Furthermore, the OCF_channel structure contains a history of timestamps of automatic and manual chanfixes, and support structures for managing the channel op and channel note records.

The OCF_chanop record is a dynamic array, which reserves memory for 8 to MAXOPCOUNT user@host records with associated scores. Each OCF_chanop record is little more than a repository for a given case-insensitive user@host and a finite number of scores covering DAYSAMPLES days. Every channel is initially created with memory space for 8 ops, and as the number of individual ops increases, more memory is allocated (in log 2 buffer increments). When the number of ops falls well below the critical threshold, the buffer is in turn reduced in size, but never below the fixed minimum of 8 chanop records.
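The grow-and-shrink behaviour described above can be sketched as a small capacity policy. This is only an illustration under stated assumptions: "log 2 buffer increments" is read here as doubling, the shrink threshold of one quarter is a guess at what "well below the critical threshold" means, MINOPCOUNT and next_capacity() are hypothetical names, and the MAXOPCOUNT value is a placeholder for whatever the header defines.

```c
#include <assert.h>
#include <stddef.h>

#define MINOPCOUNT 8    /* fixed floor of 8 records (hypothetical name) */
#define MAXOPCOUNT 256  /* placeholder; the real limit is set in the header */

/*
 * Given the current capacity and the number of records in use, return the
 * capacity the buffer should have next. Assumed policy: double when the
 * buffer is full, halve when less than a quarter of it is in use.
 */
static size_t next_capacity(size_t cap, size_t used)
{
    assert(cap >= MINOPCOUNT && cap <= MAXOPCOUNT);

    if (used >= cap && cap < MAXOPCOUNT) {
        cap *= 2;
        if (cap > MAXOPCOUNT)
            cap = MAXOPCOUNT;   /* never exceed the safety limit */
    } else if (cap > MINOPCOUNT && used < cap / 4) {
        cap /= 2;
        if (cap < MINOPCOUNT)
            cap = MINOPCOUNT;   /* never shrink below 8 records */
    }
    return cap;
}
```

A caller would compare the result with the current capacity and realloc() the record array only when the two differ. Keeping the shrink threshold well below the grow threshold avoids repeated reallocation when the op count hovers around a boundary.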

New OCF_channel records, just like OCF_chanop records, are allocated as needed. When a channel is registered with at least one chanop, note or flag, it begins its lifespan in the database. If and when a chanop's accumulated score over DAYSAMPLES days drops to zero (due to non-attendance), the chanop record is removed from the database. If a channel has no scoring ops (and, in the default configuration, no recorded notes or flags either), it is removed from the database. All allocated memory structures are freed, and the channel ceases to exist.
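The removal rules can be sketched as follows. The struct layout, the DAYSAMPLES value and all function names are hypothetical stand-ins rather than the actual definitions in cf_database.h, and the channel test reflects one reading of the default configuration (a channel survives while it has any op, note or flag).

```c
#include <assert.h>
#include <stddef.h>

#define DAYSAMPLES 14  /* placeholder; the real value is set in the header */

/* Hypothetical stand-in for an OCF_chanop record. */
struct chanop_sketch {
    char user_host[64];
    unsigned short score[DAYSAMPLES];
};

/* Sum of the per-day scores held for one user@host. */
static unsigned long total_score(const struct chanop_sketch *op)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < DAYSAMPLES; i++)
        sum += op->score[i];
    return sum;
}

/* Drop every record whose accumulated score is zero, keeping the order of
 * the survivors. Returns the number of records that remain. */
static size_t prune_ops(struct chanop_sketch *ops, size_t n)
{
    size_t kept = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (total_score(&ops[i]) != 0)
            ops[kept++] = ops[i];
    return kept;
}

/* Assumed default rule: the channel goes once nothing is left to keep. */
static int channel_is_dead(size_t ops_left, size_t notes, unsigned flags)
{
    return ops_left == 0 && notes == 0 && flags == 0;
}
```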

When a channel op registration is complete for a particular channel (or at any other instance where channel scores change, such as at the daily rotation), the database will internally sort the channel op records to guarantee that the highest scorer will always be the first chanop record, the next-highest scorer the second, and so on. This is done to dramatically speed up lookups and processing.
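As an illustration of the ordering guarantee, the following sketch sorts a channel's op records by accumulated score, highest first, using the standard qsort(). The record layout, the DAYSAMPLES value and the function names are assumptions made for the example; the document does not say which sorting routine the database actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DAYSAMPLES 14  /* placeholder; the real value is set in the header */

/* Hypothetical stand-in for an OCF_chanop record. */
struct chanop_sketch {
    char user_host[64];
    unsigned short score[DAYSAMPLES];
};

/* Sum of the per-day scores held for one user@host. */
static unsigned long total_score(const struct chanop_sketch *op)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < DAYSAMPLES; i++)
        sum += op->score[i];
    return sum;
}

/* qsort() comparator: higher accumulated score sorts first. */
static int by_score_desc(const void *a, const void *b)
{
    unsigned long sa = total_score(a);
    unsigned long sb = total_score(b);
    return (sa < sb) - (sa > sb);
}

/* Re-sort a channel's op records after any score change. */
static void sort_chanops(struct chanop_sketch *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    if (n > 1)
        qsort(ops, n, sizeof *ops, by_score_desc);
}
```

With the records kept in this order, a request for the top N scorers is a plain read of the first N elements. Note that qsort() is not guaranteed to be stable, so records with equal scores may appear in either order.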

2.2. Optimizations

Lookups are optimized by lookup tables which track the earliest possible occurrence of a given channel in the lexicographically sorted OCF_channel database. Internally, this is achieved by first allocating 256 sub-databases, with channel creation and lookups directed to a specific sub-database by the first (irccmp() case-insensitive) character in the name. Thus, all channels beginning with the letter 'b' (or 'B') go into one sub-database, 'c' (or 'C') into the next, etc. As the lexicographic distribution is highly biased toward the alphabetic part of the 256 possible byte values, this will result in unbalanced memory allocation, but performance trials suggest this is not a significant issue in itself.

Furthermore, each sub-database contains its own internal lookup table of 256 further sub-entries (subsort) which point any lookups to the earliest possible occurrence of the second character in a given lexicographic string. The 256-element subsort key is derived from the case-insensitive (irccmp()) second character of the channel name. Thus, a sub-database managing all channels beginning with the letter 'b' would consult the subsort record to find the earliest possible occurrence of channels beginning with 'ba', followed by the earliest possible occurrence of channels beginning with 'bb' (as a point of curiosity, channel names shorter than 1 or 2 effective characters will end up in sub-database 0 or subsort index 0, respectively).
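The two-level key derivation can be sketched as below. It is a minimal sketch under several assumptions: fold() uses plain tolower() as a stand-in for whatever case folding irccmp() applies, the name passed in is taken to be the "effective" name the text refers to, and bucket_key, bucket_for() and bucket_demo() are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical pair of indices: which of the 256 sub-databases to use,
 * and which of its 256 subsort entries to start searching from. */
struct bucket_key {
    unsigned char db;
    unsigned char subsort;
};

/* Stand-in for irccmp()-style folding; the real rules may differ. */
static unsigned char fold(unsigned char c)
{
    return (unsigned char)tolower(c);
}

/* Derive both indices from the first two characters of the name.
 * Names that are too short fall back to index 0, as described above. */
static struct bucket_key bucket_for(const char *name)
{
    struct bucket_key k = { 0, 0 };

    if (name == NULL || name[0] == '\0')
        return k;
    k.db = fold((unsigned char)name[0]);
    if (name[1] != '\0')
        k.subsort = fold((unsigned char)name[1]);
    return k;
}

/* Usage: names differing only in case share a sub-database and slot. */
static void bucket_demo(void)
{
    struct bucket_key a = bucket_for("Bar");
    struct bucket_key b = bucket_for("bAR");
    assert(a.db == b.db && a.subsort == b.subsort);
}
```

A lookup would then begin its scan of sub-database k.db at the position recorded in that sub-database's subsort entry k.subsort, rather than at the start of the array.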

Whereas the above optimizations are far from ideal (one might e.g. consider a more complex hash generation system distributing channel records in a more balanced fashion across the full 256-element spectrum), they have proven to be more than sufficient for dealing with a vast number of lookups in performance tests involving databases of over 200,000 channels.

3. Interface

This section is the most technical of the sections, detailing the basic interfacing philosophy from a programmer's point of view. Anyone wishing to develop applications around the database must observe these specifications.

Before the database is invoked at all, the DB_load() function must be called. This ensures a) that any previously stored channel information is properly reloaded into memory, and b) that memory structures are properly set up. Failure to call DB_load() on startup will result in undesirable effects.

Whenever one desires to write information into a particular channel's record, the database must be prepared for writing with the DB_set_write_channel() function. The function takes one argument only, a char* string to the channel name. If the channel exists in our database, its record is opened for updating. If it does not exist in our database, a new channel record is created. If a new channel is introduced to the database, but the process is too low on resources to create the necessary memory for it, the return value will be 0. If the char* is NULL or the string is too long (overflow condition) the return value will also be 0. Otherwise, DB_set_write_channel() will always return 1.

Whenever one desires to read information from a particular channel's record, the database must likewise be prepared. This is similarly achieved with the DB_set_read_channel() function. Again, the sole parameter is a char* string pointing to the channel name. If the channel is found in the database, the return value is 1. Otherwise, if the read command was issued for a nonexistent channel, or if the char* is otherwise malformed (NULL or too long), the return value will be 0.
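Both functions reject a NULL pointer and an over-long name with a return value of 0. The check below sketches that argument validation in isolation. It is not the database's own code: channel_arg_ok() and channel_arg_demo() are hypothetical names, the MAXCHANNELNAME value is a placeholder for the one in the header, and whether the real functions accept an empty name is not stated in this document.

```c
#include <assert.h>
#include <stddef.h>

#define MAXCHANNELNAME 50  /* placeholder; the real value is set in the header */

/*
 * Return 1 if the argument could be stored in a MAXCHANNELNAME + 1 byte
 * name buffer, 0 if it is NULL or too long. The scan stops as soon as the
 * limit is exceeded, so an unterminated or huge string is never walked
 * in full.
 */
static int channel_arg_ok(const char *channel)
{
    size_t len = 0;

    if (channel == NULL)
        return 0;
    while (channel[len] != '\0') {
        if (++len > MAXCHANNELNAME)
            return 0;   /* overflow condition */
    }
    return 1;
}

/* Usage: the same 0/1 convention the two DB_set_* functions follow. */
static void channel_arg_demo(void)
{
    assert(channel_arg_ok(NULL) == 0);
    assert(channel_arg_ok("#openchanfix") == 1);
}
```

Callers of the real functions should test the return value before issuing any read or write call, since a 0 means no channel record has been selected.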

The various basic interfaces which can be applied thereafter are concisely documented in the cf_database.h header. Essentially, a number of interfaces are provided for updating (writing) channel op scores (DB_register_op), manual chanfix timestamps (DB_register_manualfix), automatic chanfix timestamps (DB_register_autofix), channel notes (DB_channel_add_note), channel flags (DB_channel_set_flag) etc. Likewise, interfaces for filling caller-reserved structures with information about channels (read) are provided, including the somewhat self-explanatorily named DB_channel_get_note(), DB_channel_get_flag(), DB_get_op_scores(), DB_get_top_user_hosts() and DB_get_oplist().

Worthy of further elaboration is the DB_poll_channel() read function, which fills a caller-reserved OCF_topops structure with information pertinent to a basic SCORE request. For the database to know which user@hosts to associate with current ops and current nonops, two char** arrays of user & host pairs must be passed to the function. The integers num_opusers and num_nopusers tell the database how many user & host pairs to expect in the respective twin arrays.

To ensure proper behaviour by the database, the DB_update_day() function must be called once per day, with an incremented integer "daystamp" value. The daystamp can in theory be any sequentially incremented integer, but it is suggested that a reproducible standard, such as the number of days since the beginning of Unix time, be used. Based on the passed integer, op scores for the oldest day in the database are erased.
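The suggested daystamp can be derived from the system clock as shown below. The ring-buffer mapping from daystamp to score slot is an assumption made for illustration (the document only states that the oldest day's scores are erased), and DAYSAMPLES is a placeholder value; all function names are hypothetical.

```c
#include <assert.h>
#include <time.h>

#define SECONDS_PER_DAY 86400L
#define DAYSAMPLES 14  /* placeholder; the real value is set in the header */

/* Whole days elapsed since the start of Unix time. */
static long daystamp_from_time(time_t now)
{
    return (long)(now / SECONDS_PER_DAY);
}

/* Assumed layout: day d lives in slot d % DAYSAMPLES, so the slot for a
 * new day is the one that held the day DAYSAMPLES days earlier. */
static int slot_for_day(long daystamp)
{
    assert(daystamp >= 0);
    return (int)(daystamp % DAYSAMPLES);
}

/* Start a new day for one op record by erasing the oldest day's score. */
static void start_new_day(unsigned short score[DAYSAMPLES], long daystamp)
{
    score[slot_for_day(daystamp)] = 0;
}
```

A caller would pass daystamp_from_time(time(NULL)) to DB_update_day() once the value changes. Because the value is derived from the clock rather than from a counter kept in memory, it stays consistent across restarts.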

DB_save() stores the database on disk. It is recommended that DB_save() is called at least once per day. DB_save() will also create two sets of backup files to counter the risk of file corruption.

4. Concluding Comments

In summary, the OpenChanfix database module was designed primarily with robustness in mind. Speed and resource efficiency were also significant considerations, but were not optimized as religiously. It is believed that the current design is adequate for large-scale IRC network use as-is, without significant resource allocation concerns.