MySQL 101 - Creating your first database

In our last episode we found out how to connect to a MySQL server.  This time we learn how to lay out a database and start creating it. For this, and following episodes, we will be looking at creating a database to support an online bookshop.

Creating the database

Using the mysql command line client, you can connect to the server and then create the database.  We need a name for the database, and in this case we'll call it 'bookshop'.   We'll also create a user who is specifically allowed to add and update the database, but not alter its structure:

mysql> CREATE DATABASE `bookshop`;
Query OK, 1 row affected (0.01 sec)

mysql> GRANT INSERT, SELECT, UPDATE, DELETE ON `bookshop`.* to 'bookuser'@'localhost' identified by 'bookpass';
Query OK, 0 rows affected (0.00 sec)

The first command created the database, but it is only accessible by users that have global database privileges. The second command gives the user 'bookuser' access when connecting locally to the four basic data management functions, often referred to as CRUD (Create, Read, Update, Delete).  These correlate to the SQL statements INSERT, SELECT, UPDATE and DELETE. For more information on the GRANT syntax, review the last episode.

Note that we have used two different types of quotes around different parts of the query.  When referring to data structures like the database, tables, columns and indexes, you should always surround them with the backtick (`) (sometimes called an opening single quote).  String data, like the username, the host and the password, should always be surrounded by the single quote character (').  This helps the query parser to identify the parts of the query correctly and result in less likliehood of a query syntax error if you inadvertently use a reserved word.

Tables, Columns and Indexes

Now we have the database we need to work out what data we want to store in it. Databases store related data in tables, structured as columns.  To be able to efficiently find things, indexes are also required.  Lets look at each element in turn.

Tables

A table allows you to store related data together. Data is stored in rows, and each row is constructed of columns. Because each row must have the same columns as every other row, the data must be structured and must be related for the table to make sense.  In our case we may have a table that defines all the books we want to sell.  It would make no sense to also try and store all of the sales in that same table.

Tables can have a storage engine associated with them.  This defines how the data will actually be stored on disk, or in memory.  MySQL supports several storage engines, with the most popular being InnoDB.  This is fast taking over from the original storage engine, MyISAM.  Other storage engines include ARCHIVE, CSV, BLACKHOLE and FEDERATED. If your database server is set up correctly, you should not need to worry about this for most uses, and we will cover some special use cases of storage engines later in this series.

Columns

A column is a collection of like information that makes up a single attribute of the data stored in a table. Columns are defined by a data storage type, which can be INTEGER, CHAR, DECIMAL, FLOAT, TEXT, BLOB and a number of variations on these.  For instance, VARCHAR is a variable length CHAR type and is often used for string data as it is more space efficient than the equivalent CHAR.  Columns are also defined by a length.  Defining too great a length results in wasted space and lowered performance, and too small means you may get truncation and losing information.

As an example, the price of a book is going to be DECIMAL as it will always contain decimal numbers, and since we are selling books that people should be able to afford, we probably don't need to be able to store a number greater than 999.99.  In MySQL we would therefore define the price column as DECIMAL(5,2).  The 5 refers to the total number of digits expressed (the precision), and the 2 is the number following the decimal point (the scale).

Indexes

One of the great benefits of using a database is the ability to find items by various attributes.  For instance a book will probably be searched for by title, but might also be searched for by author or even publisher. Without an index the database would need to match every record in the database to the search criteria you've supplied to find the right book. With a large number of books this could take a long time.  To solve this you could create an index on the title, or the author, or the publisher, or all three.  Indexes are structured in such a way that you don't have to search all of the records in a database to find a match, and the index tends to be a lot smaller than the data it is indexing, so even if it has to scan an entire index, this will be less work than scanning all of the data in the table.

A word of warning though.  Every time you create, update or delete a record in a table, every index associated with that record will also need to be updated.  So you may want to create indexes on every field just in case, but that would not be a good idea as it would make updates extremely slow.  It is far better to analyse what indexes are required than use a shotgun approach.

What is a book?

You know what a book is, right?  Lots of bits of paper bound together with words in it.  This is fine for humans, but if you wanted to describe a book in such a way that a computer could find information about it for you you would need to determine what the unique attributes of the book are.

So, for our books, seeing as we are trying to sell them, we might have the following attributes:

  • Title
  • Author
  • Publisher
  • Date publised
  • ISBN
  • Price
  • Format (paperback/hardback/trade)

Looks like we have the basis for our first table.  A possible table would be:

CREATE TABLE `book` (
`title` VARCHAR(200),
`author` VARCHAR(100),
`publisher` VARCHAR(50),
`published_date` DATE,
`isbn` DECIMAL(13),
`price` DECIMAL(5,2),
`format` VARCHAR(20)
);

Before hitting the Return key, we probably want to look a little closer at what we are storing here.

Looking at these attributes you may notice that some will be shared by more than one book.  For instance, an author may have written multiple books.  A publisher, unless the book is self-published, is likely to have published many books.  This gives us the opportunity to save a lot of storage space, and to improve our database design.  After all we are using a Relational Database system, so lets create some relationships.

What about Author?  Really an author has at a minimum a name.  They are likely to have a first name and a last name, and perhaps a middle initial.  For our purposes though, lets keep it simple with just a first and last name.  Why? If you are looking for a John Grisham novel, you would probably search for Grisham, and not John as that would match way too many authors not related to the John you are after.  So lets first create our author table:

CREATE TABLE `author` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`first_name` VARCHAR(50),
`last_name` VARCHAR(50) NOT NULL,
PRIMARY KEY (`id`),
KEY (`last_name`)
);

Whoa!  There is a lot more there than just a first and last name.  What's going on?  Taking it one step at a time:

`id` INTEGER NOT NULL AUTO_INCREMENT

This is one of the more important thing you need to understand about relationships.  Picking the right field to relate.  When we finally create the book table we need to be able to identify which author we are talking about.  So we need a common field, one that exists in both tables.  With just first name and last name, we are going to have problems.  For instance both Merrilee Moss and Tara Moss write books.  Both Peter Temple and Peter Corris write books.  So neither first name nor last name are unique.  This is where id comes in.  It is an automatically generated number (the AUTO_INCREMENT sees to that), that is independent of the other fields.  It can uniquely identify an author record. To ensure it must exist we also mark it as NOT NULL.

PRIMARY KEY (`id`)

This is the other requirement of a relation field.  It needs to be indexed, and it needs to be a unique index.  KEY is the keyword in SQL that identifies that this is an index, and the name of the field (or fields, we can have more than one) that forms the index is in brackets.  PRIMARY KEYs are special.  Not only are they unique, but as the name implies there can only be one per table.  Some storage engines make special arrangements for primary keys to improve the speed at which data is retrieved.  For instance, the InnoDB engine stores the primary key along with the full row of data for a record.  It doesn't create a separate index.  This means that primary key lookups are very fast as there is no need to go looking for the data associated with the key.

KEY (`last_name`)

This one creates another index on the last_name field.  Because we have left off the UNIQUE keyword, this will allow duplicates on the last name field.   Since the most common lookup of an author is going to be by last name, this index makes a lot of sense.  Note that the field last_name also has a NOT NULL declaration.  This ensures we can't have an empty last name, which wouldn't make a lot of sense.

Lets create a few more tables that we are likely to need:

CREATE TABLE `publisher` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`name` VARCHAR(100) NOT NULL,
`notes` TEXT,
PRIMARY KEY (`id`),
KEY (`name`)
);

CREATE TABLE `format` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`name` VARCHAR(20) NOT NULL,
`notes` TEXT,
PRIMARY KEY (`id`),
KEY (`name`)
);

Note how similar these two tables are.  Both have a name we can search on, both have a TEXT field that allows us to store (wait for it ...) text associated with the entry, and both have an id field.

This should be enough to start with.  We can refine our design later, but for now lets create the book table, using the relationships we've now defined with the above tables.

CREATE TABLE `book` (
`id` INTEGER NOT NULL AUTO_INCREMENT,
`title` VARCHAR(200) NOT NULL,
`author` INTEGER NOT NULL DEFAULT 0,
`publisher` INTEGER NOT NULL DEFAULT 0,
`format` INTEGER NOT NULL DEFAULT 0,
`published_date` DATE,
`isbn` DECIMAL(13),
`price` DECIMAL(5,2),
PRIMARY KEY(`id`),
KEY (`title`),
KEY (`author`),
KEY (`publisher`),
KEY (`format`)
);

Now we have a book table that uses integer values that match the id fields in the related tables.  In this example we've used the table name of the related table as the field name of the relation.  We could have used something like `author_id` to make it clear we are talking about the id field in the author table.  It doesn't really matter, provided you are consistent.  Just on consistency I've also made all the table names singular. Note that we have an id field in our book table as well.  This will be useful later.

The format of the fields that link tables should have the same format as the field they link to.  In all cases id is defined as an INTEGER field, so the fields in the book table that link to the id fields in the author, publisher and format tables must also be INTEGER.  You'll also notice that we have marked them as NOT NULL, which means we must have a value in these fields.  If we don't supply a value, the database will use the DEFAULT value.  In this case I've made this the number 0, but since this cannot be a valid id value, this is really a catch-all.  It might have been better creating 'UNKNOWN' entries in the author, publisher and format tables and using the id number of those records as the default.  Even better would be not to supply a default as this would force there to be a value that should match a valid record in the parent tables.

Checking out our work

Time to see what the database structure looks like.  We have a few options here.  The easiest is the "SHOW TABLES" query:

mysql> SHOW TABLES;
+--------------------+
| Tables_in_bookshop |
+--------------------+
| author |
| book |
| format |
| publisher |
+--------------------+
4 rows in set (0.00 sec)

Looks good.  We can use the 'DESCRIBE' query to determine how each table is set up:

mysql> DESCRIBE `author`;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| first_name | varchar(50) | YES | | NULL | |
| last_name | varchar(50) | NO | MUL | NULL | |
+------------+-------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

Here we can see that our id field is the PRIMARY key (PRI in the Key field), and that it is using AUTO_INCREMENT. This flag means that each new record that doesn't explicitly set the id field will have it set to the next sequential number greater than the highest used number in that field.

Adding data

Time to add some data.  First of all the reference tables.  These are the author, publisher and format that are referenced from the book table.

INSERT INTO `author` (`first_name`, `last_name`)
VALUES
('Peter', 'Temple'),
('Peter', 'Corris'),
('Tara', 'Moss'),
('Merrilee', 'Moss');

In just one query we've added four authors.  INSERT INTO expects the name of a table, and an optional field list.  If you don't supply the field list it will expect  the corresponding VALUES statements to include every field, in the order they appear in the DESCRIBE output.   Each bracketted VALUES entry provides the data matching the field names supplied (or implied) by the field list.

A few quick ones to get us rolling, note I'm not filling in the notes field yet, we'll deal with that in a later episode.

INSERT INTO `publisher` (`name`)  VALUES ('Random House'), ('Harper Collins'), ('Text'), ('Allen and Unwin'); 
INSERT INTO `format` SET `name` = 'Paperback';
INSERT INTO `format` SET `name` = 'Hardback';

Notice in the case of the format table I've used an alternate syntax, it does the same thing, but is often easier to understand than the VALUES syntax.

We have the basic table information here, so now lets put in a book or two, we'll use the SET syntax to make it a little easier to follow.  Before we can start though, we need to find out the ids that were generated when we added the above data, so we need a quick query. We'll use the SELECT statement to request information on what is held in the database.  This is the simplest form of the SELECT statement, where * means all fields.  Otherwise we would need to use a comma separated list of fields we wanted to see.

SELECT * FROM `author`;
+----+------------+-----------+
| id | first_name | last_name |
+----+------------+-----------+
| 1 | Peter | Temple |
| 2 | Peter | Corris |
| 3 | Tara | Moss |
| 4 | Merrilee | Moss |
+----+------------+-----------+
4 rows in set (0.00 sec)

SELECT * FROM `publisher`;
+----+-----------------+-------+
| id | name | notes |
+----+-----------------+-------+
| 1 | Random House | NULL |
| 2 | Harper Collins | NULL |
| 3 | Text | NULL |
| 4 | Allen and Unwin | NULL |
+----+-----------------+-------+
4 rows in set (0.00 sec)

SELECT * FROM `format`;
+----+-----------+-------+
| id | name | notes |
+----+-----------+-------+
| 1 | Paperback | NULL |
| 2 | Hardback | NULL |
+----+-----------+-------+
2 rows in set (0.00 sec)

Now we have everything we need to create some books.   Peter Temple wrote "The Broken Shore", which was published by Text in paperback under the ISBN 9781921656774 in September 2010.  Now we need to structure that so the database can store it:

INSERT INTO `book`
SET `title` = 'The Broken Shore',
`author` = 1,
`publisher` = 3,
`format` = 1,
`isbn` = 9781921656774,
`published_date` = '2010-09-27';

You'll be starting to see a pattern developing here. While I haven't stated it previously, all SQL statements must be correctly terminated.  In MySQL as with many other SQL derivatives, the semicolon ';' acts as a terminator.  Think of it like the full stop at the end of a sentence.  Also you'll note that numeric data doesn't need quotes around it, but dates do.

The date format used above is the default and is an ISO (International Standards Organisation) standard.  Often called International or European format, it is in year-month-day format.

In our next episode we'll look at queries to pull information out of the database in a meaningful manner and to search based on our relationships.

Disclaimer

I am employed by SkySQL Ab, which provides support, training and services for MySQL, MariaDB and Drizzle and is staffed by many ex-MySQL AB employees. While I try to be unbiased in all of my public opinions, and my opinions are all my own and not those of my employer, I work for SkySQL because I believe in what they are doing and therefore cannot be truly unbiased.

MySQL and 'MySQL Enterprise'  are registered trademarks of Oracle Corporation

Original post blogged on b2evolution.