An alternative way of EAV modelling

I was reading this month’s php|architect. It has a nice article about EAV modeling. I had seen this db structure in other project, but didn’t know that it was called EAV. For those who don’t read php|architect, EAV describes a method of saving a large set of attributes, only some of which apply to an individual entity. Normally you would create a table, with a row for each entity and save each attribute in a column. With EAV you save each attribute as a row.

This makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this.

The common way

How php|architect describes it (and how it’s normally described) is to split out the values over different tables for the different data types. The DB structure proposed in php|architect look like:

CREATE TABLE field_names (
    fid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    field_name VARCHAR(50) NOT NULL DEFAULT '',
    field_type ENUM('VARCHAR', 'INTEGER', 'DOUBLE',
        'DATE', 'TEXT') NOT NULL DEFAULT 'VARCHAR',
    UNIQUE KEY (field_name)
);
CREATE TABLE varchar_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value VARCHAR(255) NOT NULL DEFAULT '',
    UNIQUE KEY (value)
);
CREATE TABLE integer_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value INT(11) NOT NULL DEFAULT 0,
    UNIQUE KEY (value)
);
CREATE TABLE double_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value DOUBLE NOT NULL DEFAULT 0,
    UNIQUE KEY (value)
);
CREATE TABLE date_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value DATE NOT NULL DEFAULT '0000-00-00',
    UNIQUE KEY (value)
);
CREATE TABLE text_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value TEXT NOT NULL DEFAULT '',
    UNIQUE KEY (value(100))
);

A small notice might be that using a unique key on the first 100 chars of a text field isn’t wise. You might have 2 texts where the first 100 chars are the same causing trouble.

The alternative way

The suggested previous makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this. Data types can be split in 2 categories: scalars and arrays. The difference is very clear when you look at a language like C. A scaler is x bytes of memory containing 1 value. An array is a x*y bytes of data, where x is the number of bytes needed for 1 value and y is the number of values. In this definition a string is an array of characters.

Normally you define the data type in MySQL. When adding data the right number the right number of bytes is allocated by the storage engine. When the field is used in ‘WHERE field=value’, the value in the SQL string so it cast to the byte value. It’s fantastic that this is done outside the user’s scope, but here it’s part of the reason that the data can’t be saved in a single field. We want the way that the data is seen to be based on per-row meta-data. That’s simply not how SQL works.

However, if we don’t mind some pre- and post-processing, we can find a way around this. Let’s see if we can find a data type, to which all the different scalar types can be transformed. It’s important to remember that the data type isn’t only used to display the data, but it also determined the way it’s ordered, indexed and reacts to operator like ‘>’. Since we don’t want to get into issues with this, we’ll choose a signed type. The representation of integers is fairly strait forward, so it seems like a good candidate.

Dates
Dates are often saved as integers using unix time. It represents the number of seconds since 1970. With an unsigned integer this can run up to 2038.
This can work fine for appointments, but will not work for something like the date of birth. However for birthdays, the time part isn’t important. We can repeat this trick, but instead of counting seconds we count the days since ’0000-00-00′.
For time we also count seconds running up to 86399 for 23:59:59.

Numbers
For floating points the point is decided for each value. However this is probably not necessary, since the scope of the values for a field should be pretty much the same. Since we’re deciding the data type per attribute type (the field table), we can make it a fixed point where the position of the point is decided by in the field table. Both fixed point and integer are numbers, with the only difference that an integer has got precision 0.

Text
The php|architect article distinguishes between text and varchar fields. Free text fields for names, etc are saved in varchar fields. Data that would be represented as selection box is saved in the same table. We will make a different division. In a normal db setup, values that aren’t free text would be saved in a second table and the ids would be stored in the original table. This will work fine with the integer set-up, so let’s stick with that.
All free text values will be saved in a TEXT field, which has a variable size. Searching on a text field is extremely slow. However by placing an index (of 255 chars) on the field, the index will always be used unless wildcards are used. This makes searching on this field as fast as it would be with a varchar field. Instead of using wildcards we could use fulltext indexing. More on this choice later.

The DB structure looks like:

CREATE TABLE `field` (
  `fid` int(10) unsigned NOT NULL auto_increment,
  `field_name` varchar(50) NOT NULL default '',
  `field_type` enum('NUMBER','DATE','TIME','ENUM','TEXT') NOT NULL default 'NUMBER',
  `precision` tinyint(4) NOT NULL default '0' COMMENT 'To make fixed point. 10^x',
  `date_format` VARCHAR (50) NULL, 
  PRIMARY KEY  (`fid`),
  UNIQUE KEY `field_name` (`field_name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `field_option` (
  `fid` int(10) unsigned NOT NULL,
  `value` int(11) NOT NULL,
  `description` varchar(255) NOT NULL,
  PRIMARY KEY  (`fid`,`value`),
  KEY `description` (`description`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `value` (
  `item_id` int(10) unsigned NOT NULL default '0',
  `fid` int(10) unsigned NOT NULL default '0',
  `value` int(11) NOT NULL default '0',
  `text` text,
  PRIMARY KEY  (`item_id`,`fid`),
  KEY `value` (`fid`,`value`),
  KEY `text` (`text`(255))
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Selecting data

This setup makes it easier to look through the data, though you need to convert filter values to integers in your script and place those integers in the query. The integer values aren’t good for displaying though. You can do this in the script, however it’s nicer to get the correct data from the DB.

By using a stored procedure we can cast the value into a string which can be displayed. This isn’t terribly fast, but for a single item, it’s surely fast enough.

delimiter |
CREATE FUNCTION `value_display` (`type` enum('NUMBER', 'ENUM', 'DATE', 'TIME', 'TEXT'), `value` INT, `option` VARCHAR(255), `text` TEXT, `precision` INT, `date_format` VARCHAR(50)) RETURNS VARCHAR(255) CHARACTER SET latin1 NO SQL
BEGIN
  CASE type
    WHEN 'NUMBER' THEN RETURN `value` / POW(10, `precision`);
    WHEN 'ENUM' THEN RETURN `option`;
    WHEN 'DATE' THEN RETURN DATE_FORMAT(FROM_DAYS(`value`), `date_format`);
    WHEN 'TIME' THEN RETURN FROM_UNIXTIME(`value`, `date_format`);
    WHEN 'TEXT' THEN RETURN `text`;
    ELSE RETURN NULL;
  END CASE;

  RETURN NULL;
END;
|

Other thoughts

You might have noticed that I’ve choosen InnoDB instead of my regular choice of MyISAM. I expect that the `value` table will be updated a lot. MyISAM only has table locking, which can causes delays. InnoDB doesn’t support fulltext indexes, but I use sphinx anyway, which is faster and more flexible.

Multi value fields, grouping and ranges are discussing in the follow-up article ‘EAV multi-value fields’.